Article:
  Emulating Analytic (AKA Ranking) Functions with MySQL
Subject:   Very good article
Date:   2007-04-01 06:27:03
From:   gmaxia
Hi.
Thanks for this very insightful article.
You said that you were using MySQL 5.0. Could you be more specific about the exact version?
If there is a public place where we can download the sample data used for your test, it would be greatly appreciated.


Best regards


Giuseppe Maxia

Full Threads Newest First

Showing messages 1 through 4 of 4.

  • Stephane Faroult photo MySQL version
    2007-04-02 00:24:19  Stephane Faroult | O'Reilly Author [View]

    If you want all the gory details the tests were performed with MySQL 5.0.27-standard and Oracle 10.2.0.1.0. Both were running on a double Xeon 2.4 GHz test machine under what initially was a Mandriva 2007.0 that is in a perpetual state of flux and that Oracle anyway believes to be a redhat-4.
    Sample data available on request :-).
    • MySQL version
      2007-04-04 19:51:13  jaypipes [View]

      Hi there! Great article, Stephane! Reminds me of the examples I put into Chapter 8 of "Pro MySQL"...

      Couple things I wanted to note, however.

      First, the number one reason you see such a performance difference with adding an index *isn't* actually the (correct) point you make about the subquery being fired once for each subset row. The reason is because you are using the InnoDB storage engine, and InnoDB has a known issue with doing a SELECT COUNT(*) when there is no WHERE condition on an indexed field. InnoDB is forced to do a full table scan, as you point out, but for a different reason, each time a SELECT COUNT(*) is used in this manner. See http://dev.mysql.com/doc/refman/5.0/en/innodb-restrictions.html for more info on that.

      Second, is that you can get *much* better performance from your MySQL SELECTs by avoiding correlated subqueries and rewriting as a derived table... It's a bug in the MySQL optimizer unfortunately. Even better performance would be from creating a temporary table in place of the derived table and having an appropriate index on the temporary table. Unfortunately, a derived table is created internally with no indexes.

      Here's your correlated subquery rewritten as a derived table. Let me know if you get much better performance from it, though I think the biggest performance benefit comes from adding an index because of the InnoDB limitation mentioned above:

      original SQL:

      select *
      from (select a.DEPTNO, a.EMPNO, a.LASTNAME, a.FIRSTNAME, a.SAL,
      (select 1 + count(*)
      from EMPLOYEES b
      where b.DEPTNO = a.DEPTNO
      and b.SAL > a.SAL) RANK
      from EMPLOYEES as a) as x
      where x.RANK <= 5
      order by x.DEPTNO, x.RANK;

      Rewritten SQL:

      select a.DEPTNO, a.EMPNO, a.LASTNAME, a.FIRSTNAME, a.SAL, x.emp_count + 1 as RANK
      from EMPLOYEES as a
      inner join (
      select count(*) as emp_count
      from EMPLOYEES
      ) as x
      on x.DEPTNO = a.DEPTNO
      and x.SAL > a.SAL
      where RANK <= 5
      order by a.DEPTNO, RANK;

      Do the comparison after adding that index, of course. :)

      Cheers,

      Jay Pipes
      jay at mysql dot com
      • Stephane Faroult photo MySQL version
        2007-04-06 10:16:23  Stephane Faroult | O'Reilly Author [View]

        Jay,

        Where did I say I was using InnoDB? I didn't want to compare Oracle to Oracle ;-). Actually, as my point was a purely SQL point, I haven't specified any storage engine - which means that my table is of the default MyISAM type. That said, the rewriting you suggest is indeed interesting. I have used the code of your second rewriting, added the missing condition on RANK and ran it twice, one without the index, and one with the index. There is a noted improvement WITHOUT the index on the query with the subquery in the SELECT list, as could indeed be expected:
        30 rows in set (1 min 48.16 sec)

        With the index, though, it's quite comparable:
        30 rows in set (33.57 sec)

        The problem with the index is the way I have generated my data ( ... randomly). As a result, employees from a same department are spread all over the table. We might expect greater efficiency if employees were physically clustered by DEPTNO (by "clustered" I mean of course no more than "if the rows were close from each other").


    • MySQL version
      2007-04-04 20:10:40  jaypipes [View]

      Hi there! Great article, Stephane! Reminds me of the examples I put into Chapter 8 of "Pro MySQL"...

      Couple things I wanted to note, however.

      First, the number one reason you see such a performance difference with adding an index *isn't* actually the (correct) point you make about the subquery being fired once for each subset row. The reason is because you are using the InnoDB storage engine, and InnoDB has a known issue with doing a SELECT COUNT(*) when there is no WHERE condition on an indexed field. InnoDB is forced to do a full table scan, as you point out, but for a different reason, each time a SELECT COUNT(*) is used in this manner. See http://dev.mysql.com/doc/refman/5.0/en/innodb-restrictions.html for more info on that.

      Second, is that you can get *much* better performance from your MySQL SELECTs by avoiding correlated subqueries and rewriting as a derived table... It's a bug in the MySQL optimizer unfortunately. Even better performance would be from creating a temporary table in place of the derived table and having an appropriate index on the temporary table. Unfortunately, a derived table is created internally with no indexes.

      Here's your correlated subquery rewritten as a derived table. Let me know if you get much better performance from it, though I think the biggest performance benefit comes from adding an index because of the InnoDB limitation mentioned above:

      original SQL:

      select *
      from (select a.DEPTNO, a.EMPNO, a.LASTNAME, a.FIRSTNAME, a.SAL,
      (select 1 + count(*)
      from EMPLOYEES b
      where b.DEPTNO = a.DEPTNO
      and b.SAL > a.SAL) RANK
      from EMPLOYEES as a) as x
      where x.RANK <= 5
      order by x.DEPTNO, x.RANK;

      Rewritten SQL:
      select a.DEPTNO, a.EMPNO, a.LASTNAME, a.FIRSTNAME, a.SAL, x.rank as RANK
      from EMPLOYEES as a
      inner join (
      # Get the ranks of each employee within the dept
      select e1.EMPNO, count(DISTINCT e2.SAL) as rank
      from EMPLOYEES e1
      inner join EMPLOYEES e2
      on e1.DEPTNO = e2.DEPTNO
      and e1.SAL <= e2.SAL
      group by e1.EMPNO
      ) as x
      on a.EMPNO = x.EMPNO
      order by a.DEPTNO, RANK;

      Do the comparison after adding that index, of course. :)

      Cheers,

      Jay Pipes
      jay at mysql dot com