select x.rn + p.NUM BATCH, p.NUM, x.DEPTNO, x.EMPNO, x.LASTNAME, x.HIREDATE, x.SAL from (select if (@dept = DEPTNO, @rn := @rn + 1, @rn := 1 + least(0, @dept := DEPTNO)) rn, e.DEPTNO, e.EMPNO, e.LASTNAME, e.HIREDATE, e.SAL from EMPLOYEES as e, (select (@dept := 0)) as z order by e.DEPTNO, e.HIREDATE) as x, (select 0 NUM union all select 1 NUM union all select 2 NUM union all select 3 NUM union all select 4 NUM) as p
Although Cartesian joins are usually carefully avoided, multiplying 10,000 rows by as small a number as 5 is very fast. Therefore, we are going to get each row five times, each time associated to a different value of
NUM (0 to 5). Then I compute a value that I call
BATCH, which is the sum of
NUM and of the initial row number
rn. Why this value? Because since
NUM varies, we will have some overlap for the values of
Suppose we have row numbers 21, 22, 23, 24, an 25. The first row will yield batch numbers 21, 22, 23, 24, and 25. The second one 22, 23, 24, 25, and 26. And so on. When I want to compute the average of those 5 rows and associate it with row 23, all I need to do is average values for batch 25, for which I will retrieve one instance of each of the rows. Now I also want the detail for each row, and I don't want a 50,000 but a 10,000 row result. I therefore doctor the result so I'll have NULL for the details, except for the "center row" in my batch, and use
MAX() to aggregate and squash five rows into one. And here it is:
select a.DEPTNO, max(case a.NUM when 2 then a.EMPNO else NULL end) EMPNO, max(case a.NUM when 2 then a.LASTNAME else NULL end) LASTNAME, max(case a.NUM when 2 then a.HIREDATE else NULL end) HIREDATE, round(max(case a.NUM when 2 then a.SAL else NULL end), 2) SAL, round(avg(SAL), 2) from (select x.rn + p.NUM BATCH, p.NUM, x.DEPTNO, x.EMPNO, x.LASTNAME, x.HIREDATE, x.SAL from (select if (@dept = DEPTNO, @rn := @rn + 1, @rn := 1 + least(0, @dept := DEPTNO)) rn, e.DEPTNO, e.EMPNO, e.LASTNAME, e.HIREDATE, e.SAL from EMPLOYEES as e, (select (@dept := 0)) as z order by e.DEPTNO, e.HIREDATE) as x, (select 0 NUM union all select 1 NUM union all select 2 NUM union all select 3 NUM union all select 4 NUM) as p) as a group by a.DEPTNO, a.BATCH having count(*) > 2 order by 1, 4, 2;
Two passing remarks about this code:
- The rather wild treatment creates some rounding errors. I have had to round not only the average salary, but the salary itself.
- We have a problem at the boundaries. For the computations to be valid, we need at least three rows in the aggregate. This is the reason for the
And now the verdict:
| 60 | 1903 | BARNES | 2006-12-15 | 4760.32 | 3435.31 |
| 60 | 3240 | FARRELL | 2006-12-21 | 4046.99 | 4260.59 |
| 60 | 5030 | POULIN | 2006-12-23 | 3942.44 | 5099.71 |
| 60 | 8345 | PADRON | 2007-01-03 | 5956.80 | 4551.25 |
| 60 | 3218 | LADD | 2007-01-12 | 6791.98 | 3920.99 |
| 60 | 3046 | WELLMAN | 2007-01-18 | 2018.05 | 4021.25 |
| 60 | 2787 | WILLIAMS | 2007-02-05 | 895.66 | 3263.33 |
| 60 | 3284 | ANDERSON | 2007-02-15 | 4443.75 | 3196.43 |
| 60 | 3336 | DUNBAR | 2007-02-16 | 2167.19 | 3191.85 |
| 60 | 4687 | LUO | 2007-02-18 | 6457.50 | 4594.29 |
| 60 | 6840 | JACKSON | 2007-02-28 | 1995.17 | 4599.79 |
| 60 | 10417 | NEILL | 2007-02-28 | 7907.86 | 4480.14 |
| 60 | 6232 | CARLSON | 2007-03-01 | 4471.24 | 3985.81 |
| 60 | 9697 | LYNCH | 2007-03-02 | 1568.95 | 4649.35 |
10000 rows in set (0.44 sec)
Wham, bang. Almost twice as fast as the Oracle analytic function on this example...
I hope this article has given you a good idea of what analytic functions are and how you can get similar results, both in terms of result set and response time, with MySQL. Analytic functions definitely have an advantage in terms of elegance and legibility, particularly when compared to the somewhat laborious usage of variables, or to the particularly wild last example.
One thing is certain: if you want to get good performance, don't restrict yourself to "clean" SQL. Analytic functions are not clean SQL. From a relational point of view, they deserve at least to fry in the sixth circle of hell. Performing the same computations as fast as analytic functions do requires using MySQL peculiarities and a good deal of lateral thinking. As General Grant said, "No war has ever been won by a slavish respect of the rules."
There is another thing that you must keep in mind: analytic functions apply to the result set defined by the query in which they appear. If you emulate the function with several subqueries, any additional restriction must be applied at several places. It prevents creating a view upon the query for fear of getting wrong results, compared to the true analytic query. I am not totally persuaded, however, that this is necessarily a bad thing. I fear that if you create a view
V_EMPLOYEES_WITH_SAL_RANK, it will only be a matter of time before someone writes:
select distinct e1.DEPTNO, e1.SAL from V_EMPLOYEES_WITH_SAL_RANK e1 where e1.RANK = (select max(e2.RANK) from V_EMPLOYEES_WITH_SAL_RANK e2 where e2.DEPTNO = e1.DEPTNO)
which, as you may have realized (but the optimizer won't), is just an inefficient
select DEPTNO, MIN(SAL) FROM EMPLOYEES group by DEPTNO
Stephane Faroult first discovered the relational model and the SQL language in 1983.
Return to ONLamp.com.