My name is Roland Bouman, and I’m a certification developer for MySQL AB. This is my first post on the O’Reilly database weblog, and I figured it would be nice to start with a technical article about MySQL cursors, a subject I have written about before on my blogger weblog.
The first part of this article explains why cursors are usually unnecessary.
A few common problems with cursors are briefly discussed.
Also, typical stored procedure pattern is described that uses a cursor,
and a demonstration is given that shows how it can be refactored
to an equivalent procedure that uses a single SQL statement instead.
In the second part of this article,
the negative performance implications of using cursors are illustrated with a few benchmarks,
and the cases where a cursor might be useful after all are briefly discussed.
Some time ago, I used to be quite active in some of the MySQL Forums, especially the ones that cover the features introduced in MySQL 5.0 (views, stored procedures, triggers and cursors).
After a little while it became quite obvious to me that most of the problems that were posted in the the cursor forum could be grouped into just a few classes.
The following lists them in ascending order according of abundance (as perceived by me) of posts mentioning that particular type of problem.
- Attempts to use unsupported features
In a minority of cases, people expected a feature that just isn’t supported by MySQL. To name a few useful ones: cursor
FOR UPDATEcursors and passing cursors to and from stored procedures as a parameter.
- Nested Cursor loops
Nesting cursors is usually a combined problem of the following two problems. If you really need nested cursors, then you have to be extra careful with loop handling. However, usually, nesting cursors is entirely unnecessary and inappropriate. (Many, many nested cursor loops can and should be written as simple
- Basic cursor loop handling problems
- This really is quite a common problem. To be frank, there are quite a few things you need to think of if you want to do proper cursor looping in MySQL, and I think I can safely call it a fact that a lot of people run into trouble there.
- Unnecessary usage of cursors
- This is by far the most ubiquitous problem with cursors. In maybe as many as 80% of the cases where a cursor is used, it isn’t needed at all. In quite a lot cases, the cursor can be written as a single SQL statement.
In the forum, I found myself explaining the same problem and the solution over and over again. So, I wrote little articles to illustrate basic loop handling (Want to write a CURSOR LOOP with MySQL?,
Why REPEAT and WHILE are usually not handy to handle MySQL CURSORs) and cursor nesting (Nesting MySQL Cursor Loops). These articles gave me a quick pointer to throw with in case I bumped into yet another thread requesting help regarding these types of problems.
On some occasions, I have briefly pointed out that cursors are often unnecessary.
This time, I’ll try to elaborate on this a little further. My objective is to illustrate how a typical case of using cursors can be refactored into a single SQL statement that uses no cursor at all. We’ll see how this allows us to discard about two thirds of the code and gain considerable performance at the same time.
Are you Sure you need a cursor?
In quite a few cases where I tried to help by explaining in what way the loop handling or the nesting was flawed, I pointed out that although the problem could be fixed, it was probably a better idea to do away with the cursor altogether. More often than not it was not too hard to think of a way to rewrite the entire cursor loop (including the body) as a single SQL statement.
However, it sometimes turned out to be quite hard to convince the person that originally wrote the cursor loop. In many cases, cursors are used by developers that are more familiar with a procedural or object-oriented language than with SQL.
I feel that a single SQL statement is nicer, more elegant. Usually, it involves less code, and often it is easier to understand. However, these are matters of aesthetics and preference, and thus do not count - not really. In my opinion, no matter how useful, creative, revolutionary or brilliant a piece of software is, it will never become a work of art (although it may demonstrate sublime craftmanship).
Luckily, there is an extra reason to avoid cursors in favour of single SQL statements. It’s a good one too, and it will often convince the most stubborn procedurally oriented programmer.
There extra reason is speed. I think it’s safe to say that a single SQL statement will always outperform the equivalent cursor loop.
A nearly Real World Example
My nearly Real World Example is some sort of reservation database. The particular nature of the reservation is not really important. We can pretend that this concerns a car-rental shop, or maybe a hotel. The important thing is that customers can make reservations and that the reservation lists the reserved products.
A product has a default value for the price (
value_prod), vat percentage (
vat_prod) and discount percentage (
discount_prod). The items that appear on a reservation have a reference to the reservation, and a reference to the product. Like a product, an item too has a price (
unit_value), vat percentage (
vat_value) and a discount percentage (
discount). In addition, the item also stores the number of rental days (
no_days), and a flag (
disc_bool) that determines whether a discount will be charged. The products are stored in the
product table, and the items are stored in the
At first it might seem as if the
discount columns in the
line_reserve table are redundant, because these values could be derived from the related
product table. However, this is not the case. The idea is that a reservation maybe made in advance. So, the vat and discount percentages as well as the price of the product could change after the reservation was made.
(Although it might be tempting to dicuss the database design, that is not the goal of this article. We won’t discuss the validity of the database design here - it just serves as the background for a stored procedure I want’t to describe.)
A procedure to update price, vat and discount for reserved items
The database also contains a
UpdatePriceAndVatAndDiscountForReservationItems procedure. The procedure is meant to do the actual legwork of calculating and storing
the vat, discount and price values for all items that make up a reservation. The reason why I’m higlighting this procedure is because it contains a pattern of improper cursor handling, and I hope to demonstrate how that pattern can be recognized and rewritten.
(I’m deliberately not discussing any issues concerning coding style, such as identifier conventions. Also, I won’t go into the detail concerning the chosen datatypes. Although improvements might be possible in this respect, this article tries to focus on the cursor loop)
Here’s the code:
CREATE PROCEDURE `UpdatePriceAndVatAndDiscountForReservationItems`( IN var_id_res INTEGER ) BEGIN -- the product record DECLARE var_unit_val DOUBLE; DECLARE var_vat DOUBLE; DECLARE var_discount DOUBLE; -- the line_reserve record DECLARE var_id_line INTEGER; DECLARE var_id_prod INTEGER; DECLARE var_disc TINYINT; DECLARE var_no_days INTEGER; -- the calculated vat DECLARE var_val_vat DOUBLE; -- cursor loop book-keeping DECLARE no_more_rows BOOLEAN; DECLARE num INTEGER; -- the line reserve curosr DECLARE cur_res CURSOR FOR SELECT id_line , id_prod , disc_bool , no_days FROM line_reserve WHERE id_res = var_id_res; -- more cursor loop book-keeping DECLARE CONTINUE HANDLER FOR NOT FOUND SET no_more_rows = TRUE; -- loop control OPEN cur_res; loop_cur_res: LOOP FETCH cur_res INTO var_id_line , var_id_prod , var_disc , var_no_days; IF no_more_rows THEN CLOSE cur_res; LEAVE loop_cur_res; END IF; -- get corresponding product SELECT value_prod , vat_prod , discount_prod INTO var_unit_val , var_vat , var_discount FROM product WHERE id_prod=var_id_prod; -- calculate vat SET var_val_vat = var_unit_val * ROUND((var_vat/100),2); -- calculate discount IF var_disc=1 THEN SET var_discount = ((var_unit_val+var_val_vat)*var_no_days) * ROUND((var_discount/100),2); END IF; -- update line_reserve with vat and discount UPDATE line_reserve SET unit_value = var_unit_val , vat_value = var_val_vat , discount = var_discount WHERE id_line = var_id_line; -- record the line number SET num = num+1; END LOOP loop_cur_res; -- return line count SELECT num as result; END;
(I admit I’m guilty of including some comments and indentation to aid the readability of the code. So, now you know why I call it a nearly real world example.)
The procedure uses the
var_id_res parameter to specify a particular reservation:
CREATE PROCEDURE `UpdatePriceAndVatAndDiscountForReservationItems`( IN var_id_res INTEGER ) ...
This procedure parameter is used to control a cursor to select only items that correspond to the reservation passed by the parameter:
-- the line reserve curosr DECLARE cur_res CURSOR FOR SELECT id_line , id_prod , disc_bool , no_days FROM line_reserve WHERE id_res = var_id_res;
The procedure loops through the cursor with an ordinary unstructured
LOOP, fetching a record from the cursor for each iteration:
-- loop control OPEN cur_res; loop_cur_res: LOOP FETCH cur_res INTO var_id_line , var_id_prod , var_disc , var_no_days; IF no_more_rows THEN CLOSE cur_res; LEAVE loop_cur_res; END IF; ... -- record the line number SET num = num+1; END LOOP loop_cur_res; -- return line count SELECT num as result;
The loop is controlled with the
no_more_rows flag. The value of the flag is switched inside a
NOT FOUND handler:
DECLARE CONTINUE HANDLER FOR NOT FOUND SET no_more_rows = TRUE;
This ensures the loop is terminated properly when the cursor is exhausted.
A little side-note. The end of the loop code contains explicit code to count the number of lines that are processed by the cursor: the
num variable is incremented for each iteration of the loop, and the value is selected and returned to the client caller right after the end of the loop. However, it is flawed. the procedure always returns a
NULL for the result column:
mysql> call UpdatePriceAndVatAndDiscountForReservationItems(76); +--------+ | result | +--------+ | NULL | +--------+ 1 row in set (0.01 sec) Query OK, 0 rows affected (0.01 sec)
That’s because of an easily overseen detail: the
num variable was never initialized. Therefore, the value will remain
NULL for evermore. The easiest way to prevent such trouble is to give such variables an explicit default value in their declaration:
DECLARE num INTEGER
So far, we’ve seen nothing special.
The loop body
Now for the body of the loop. Each iteration essentially executes three separate things:
- Fetch data from the product that corresponds to the current reservation item
- Calculate the vat and discount using data from both the reservation item and the product
- Update the reservation item with the calculated values
SELECT...INTO as a
For each record fetched from the cursor, the corresponding product record is retrieved with a
SELECT...INTO statement. It uses the
prod_id from the reservation item that was
FETCH-ed in the top of the loop to identify exactly one corresponding record from the product table:
-- get corresponding product SELECT value_prod , vat_prod , discount_prod INTO var_unit_val , var_vat , var_discount FROM product WHERE id_prod=var_id_prod;
This is a good moment to realize a couple of things.
SELECT...INTO statement can retrieve at most one record. In this case, it’s probably probably quite safe to assume that it does, because the reservation item has a mandatory foreign key that references the primary key of the product table.
But suppose we would use this type of statement to retrieve data for an optional relationship. Well,
val_id_prod could be
NULL, and in that case, the
SELECT...INTO statement won’t retrieve a record at all.
That would mean we’re in big trouble: the loop is controlled using a
NOT FOUND handler, and that’s exaclty the handler that will be fired when this
SELECT...INTO statement matches no records. Our loop would finish unintentionally, and most probably prematurely,
So, what can we do about it? Well, a lot of things. To name a few:
- If we can be sure that a record will be matched when
NULL, we can wrap a
IF...END IFaround it.
- We can wrap a
BEGIN...ENDblock around it, and declare a separate
NOT FOUNDhandler there. Because that handler is nearer than the one that controls our loop, this will capture the condition, and the loop will never be left. The nice thing about this approach is that it also offers an opportunity to handle cases where the statement matches more than one record.
However the real question is: what is a
SELECT...INTO doing inside this loop? If we are sure it will match a record for each record fetched for the loop, why not rewrite the cursor to a join?
DECLARE cur_res CURSOR FOR SELECT r.id_line , r.disc_bool , r.no_days , p.value_prod , p.vat_prod , p.discount_prod FROM line_reserve r JOIN product p ON r.id_prod = p.id_prod WHERE id_res = var_id_res;
So, this is a pattern we can regonize, and avoid. Any
SELECT...INTO that is executed for each iteration of a cursor loop, can be rewritten by joining it to the
SELECT-statement of the cursor. The
INTO fields are simply added to the
SELECT-list of the cursor query. In this particular case, we can also remove the id_prod column from the
SELECT-list. It was only used to retrieve the product data, and this is now already solved.
We should keep in the back of our minds that when we change the
SELECT-list we shoud also change the
FETCH, or else there will be a mismatch in the number of columns selected by the cursor and the number of available variables wherein these are fetched
It is important to realize that this will not result in extra iterations, at least, not if the
SELECT...INTO was a right choice in the first place.
SET assignments as part of
The loop body continues by calculating the vat and discount values:
-- calculate vat SET var_val_vat = var_unit_val * ROUND((var_vat/100),2); -- calculate discount IF var_disc=1 THEN SET var_discount = ((var_unit_val+var_val_vat)*var_no_days) * ROUND((var_discount/100),2); END IF;
If we settle for using a
JOIN in the cursor query instead of a
SELECT...INTO inside the loop body, we might just as well add the expressions that calculate the vat and the discount to the
DECLARE cur_res CURSOR FOR SELECT r.id_line , p.value_prod * ROUND((p.vat_prod/100),2) var_val_vat , if( r.disc_bool , ( p.value_prod * round(p.vat_prod/100,2) + p.value_prod ) * r.no_days * round(p.discount_prod/100,2) , NULL ) var_discount FROM line_reserve r JOIN product p ON r.id_prod = p.id_prod WHERE id_res = var_id_res;
Again, we can discard columns we don’t need anymore, and we must modify the
It seem trivial, but for completeless, we should note that this is a pattern too: any expressions that appear on the right-hand side of the assignment operator in a
SET statement, can and probably should be written as an expression in the
SELECT-list, and the actual assignment to variables is then handled in the
In the original example, the
var_discount variable is conditionally set using an
IF..THEN statement. If the
IF..THEN statement only contains
SET statements, then all of the corresponding expressions on the right hand side of the assignment operator can be written using the
IF function. This can be useful if the expression is expensive or if it performs some kind of side-effect.
Rewriting the single row
The final action that is performed inside the cursor loop is an
UPDATE of the
line_reserve table. In order to store the calculated vat and discount data, only the record that corresponds to the record fetched by the cursor is updated:
UPDATE line_reserve SET unit_value = var_unit_val , vat_value = var_val_vat , discount = var_discount WHERE id_line = var_id_line;
(There are rdbms-products that support a
FOR UPDATE clause in the cursor declaration. This lets you modify the values in the cursor record immediately. MySQL does not support this feature.)
We should realize that
UPDATE is perfectly capable of updating multiple records. In fact, any
UPDATE statement that does not include a
WHERE-clause, will update all the records in the table. To limit the update to only those
line_reserve records that corrsepond to a reservation, we will need to add a proper
In fact, we know exactly which
WHERE-clause we need, because we used it to write the cursor declaration:
UPDATE line_reserve SET unit_value = ? , vat_value = ? , discount = ? WHERE id_res = var_id_res;
Of course, we cannot just use the original variables for the column assignments, because the value of these variables potentially varies for each record touched by the
UPDATE-statement. Thinking of an expression that calculates them directly as part of the
UPDATE-statement poses a slight problem, because we have seen that the calculation of vat and discount depend on values from both the line_reserve record, as well as the corresponding product record.
So, somehow, we need to find a way for the corresponding product records to contribute their values to the
UPDATE statement. One way of solving this problem is using subqueries:
UPDATE line_reserve r SET unit_value = ( select p.value_prod from product p where p.id_prod = r.id_prod ) , vat_value = ( select p.value_prod * ROUND((p.vat_prod/100),2) from product p where p.id_prod = r.id_prod ) , discount = ( select if( r.disc_bool , ( p.value_prod * round(p.vat_prod/100,2) + p.value_prod ) * r.no_days * round(p.discount_prod/100,2) , p.discount ) from product p where p.id_prod = r.id_prod ) WHERE id_res = var_id_res;
Before explaining these subqueries, please take a moment to realize that this single statement does all the work the cursor loop used to do. So, this proves that the cursor loop really was unnecessary in the sense that we can perform the same task using a single SQL statement.
The subqueries are the parenthesis enclosed
SELECT expressions appearing on the right-hand side of the assignment operator in the
SET-clause. In order for a subquery to be used like this, the subquery needs to be scalar: they must select a single row, with a single column. Because of this special form, the ‘resultset’ can be interpreted as a scalar: a simple, singular value-expression.
Note that these three subqueries differ only as far as the
SELECT-list is concerned. They all select one single row from the
product table that corresponds to the current
line_reserve record. To select only the corresponding
product record, the subquery is bound to the main query using an appropriate
(It is no coincidence that the
WHERE-clause used here is identical to the one used in the
SELECT...INTO statement in the original code.)
Subqueries that use expressions from the surrounding query to limit the resultset are said to correlated.
Some people might feel uncomfortable about the repetition of code. Three times, essentially similar queries are performed. Even if there would be some kind of optimization that accounts for any performance issues that might rise from that, it is still objectionable that we must repeatedly write part of the code.
Some rdbms-products allow a a variation of this kind of syntax to address this issue:
UPDATE line_reserve r SET ( unit_value , vat_value , discount ) = ( select value_prod , p.value_prod * ROUND((p.vat_prod/100),2) , if( r.disc_bool , ( p.value_prod * round(p.vat_prod/100,2) + p.value_prod ) * r.no_days * round(p.discount_prod/100,2) , p.discount ) from product p where p.id_prod = r.id_prod ) WHERE id_res = var_id_res;
So, the update is performed by assigning a complete record. This kind of syntax is not supported by MySQL. But as we shall see, there is a much better way to handle this. MySQL and a few other rdbms products support the
JOIN syntax as part of an
We already discussed how a
JOIN could be used to fetch
product data corresponding to the
line_reserve records in the cursor declaration. Now all we have to do is to modify that statement to an update statement:
UPDATE line_reserve r INNER JOIN product p ON r.id_prod = p.id_prod SET r.unit_value = p.value_prod , r.vat_value = p.value_prod * round(p.vat_prod/100,2) , r.discount = if( r.disc_bool , ( p.value_prod * round(p.vat_prod/100,2) + p.value_prod ) * r.no_days * round(p.discount_prod/100,2) , r.discount ) WHERE r.id_res = var_id_res;
So in this case, the
JOIN is performed, and any assignments that are applied in the
SET-clause are somehow pushed through to the underlying table. At first, this might seem a bit awkward, but it is actually quite elegant once you get used to it. It is generally faster too than a solution with subqueries, and you can actually use this to modify multiple tables at once, something that can occasionally be quite useful.
Returning the rowcount
There is only one thing in the original that we did not account for yet. The original procedure returned the number of records that were actually touched by the procedure. For the single statement solution, we can use the built-in
ROW_COUNT() function in MySQL. This function returns the number of records that were affected by the last
We managed to rewrite the orignal 68 line stored procedure to this mere 23 lines of code:
CREATE PROCEDURE `sp_UpdateLineReserveVatAndDiscount`( IN p_id_res INTEGER ) BEGIN UPDATE line_reserve l INNER JOIN product p ON l.id_prod = p.id_prod SET l.unit_value = p.value_prod , l.vat_value = p.value_prod * round(p.vat_prod/100,2) , l.discount = if( l.disc_bool , ( p.value_prod * round(p.vat_prod/100,2) + p.value_prod ) * l.no_days * round(p.discount_prod/100,2) , l.discount ) WHERE r.id_res = p_id_res; SELECT ROW_COUNT(); END;
This clearly illustrates that the cursor was not necessary in this case. To me, it also illustrates how extraordinarily powerful pure SQL is, because there must be some kind of mechanism that actually does iterate through all the records to perform the join and update actions on a low-level. The SQL language allows us to express the actual data manipulation
without requiring that we specify anything about the workflow that is needed to retrieve and wade through the individual records. This is certainly one of the reasons why I like this refactored solution better.
What about speed?
Is there anything we can say about how fast these different procedures run? Sure!
I created a simple benchmarking procedures to repeatedly execute a modified version of the two procedures. I modified the procedures by discarding the final
SELECT statement that returns the number of affected records. For performing the benchmarks, I found it impractical to have the resultset returned and because the purpose was primarily to show the difference between the cursor and the single statement, I gathered that it was safe to leave it out.
I also generated series of data: a series of reservations with 1,10,100,250,500,750 and 1000 of corresponding
line_reserve records, all of them referencing just one product record. There was only a very small (5) amount of product records in the table as a whole at this point.
As far as the rest of environment is concerned: the benchmark was performed on a lenovo 3000 n100 1.8Gb/1024 Mb (centrino Duo) notebook with Ubuntu linux and MySQL 5.1.11-beta. Executing the procedures was done using the mysql commandline utility. The tables are InnoDB tables, and there were proper indexes on the relevant columns. MySQL configuration is completely standard, but autocommit was turned off during a single procedure execution.
I then had my benchmarking procedures execute the procedure 10000 times for each of the series. Running the benchmarking procedure was repeated 2 to 5 times. For each run of the benchmarking procedure, the time needed to complete the benchmarking procedure was recorded, and averaged.
Then, more data was generated for the product table as well so that it contained 1000 records. The
line_reserve table was updated, randomizing the product references.
So, to summarize: there are four different sets of measurements:
- the original procedure using the cursor
- few product records, and only one used product (cursor1)
- randomized usage acros 1000 product records (cursor1000)
- the refactored procedure using a single statement
- few product records, and only one used product (statement1)
- randomized usage acros 1000 product records (statement1000)
For each of these, a series of reservations with 1,10,100,250,500,750 and 1000 records was tested by executing the procedure 10000 times. The time required to do that was recorded. Elapsed time was then overaged over 2 to 5 runs of 10000 cycles.
The results are shown in this chart:
An important fact is that the single statement solution was faster than the cursor solution under all tested circumstances. Depending upon the exact situation, the single statement solution was at least 1.44 times as fast as the cursor solution, and at most 8.25 times as fast. The message is very clear if speed is all you care about: avoid cursors as much as you can in favour of single SQL statements.
In nearly all cases, the increase in the number of processed
line_reserve records appears to have a linear effect on the increase in execution time. The one exception is measurement of the cursor method for the 500
line_reserve records with randomized references to the available 1000 products. In that case, the execution time takes a drop and is only a little higher than the corresponding figure for 100
line_reserve records, but
considerably higher than the corresponding figure for 250
line_reserve records. It is unclear to me what could be causing this peculiar effect.
The differences between the two methods became bigger when more
line_reserve records were involved. This seems to indicate that the overhead of setting up and running a cursor loop is just a little bigger than executing a single statement, whereas the direct processing of the rows inside a single SQL statement is considerable faster than performing row-by-row operations explicitly with a cursor. This is not very surprising when we realize that opening the cursor loop requires the execution of an SQL statement in addition to processing the resultset row by row.
Surprisingly, the addition of extra product records sped up execution time for both the cursor as well as the single-statement solution. This was unexpected. It was expected that the execution time of the single statement solution would remain nearly the same, but that the execution time of the cursor solution would increase, possibly quite a great deal. This expectation was based on the idea that only a little increase in the time needed to find a single product record would have a quite large effect on the cursor loop execution time, because it fetches a single product record in each iteration of the loop. However, the assumption is clearly wrong.
Possibly this is caused by the total number of records in the product table:
One can imagine that in the initial situation (5 product records) a full table scan was performed to find the correct record,whereas 1000 records in the product table would result in using an index for the data access, speeding up the query. However, this would mean that a full table scan, even for 5 records, is an extremely expensive operation - something I cannot believe to be true,.
So, why use cursors at all?
The question remains why you would use cursors at all. In what cases do we really need one, or what benefit could a cursor have.
Well, I already pointed out a few reasons in one of my earlier articles. There are some things you just cannot do in a single SQL statement. For example, suppose you want execute SQL statements dynamically, and the statements are generated by another SQL statement. There is no way to do that without a cursor, so in this case you really need one. (If this example sounds artificial, check out this snippet on MySQLForge).