Batch Updates with PL/pgSQL
Pages: 1, 2, 3, 4, 5
Benchmarks
How does this refactoring translate into performance? It's hard to say really, because the performance of the previous iterations of the functions depends on how many IDs were in coll_ids. Nevertheless, I reworked my original benchmarking program into a new version that compares the new functions to the old, plus the original Perl version, just as a control. The results:
batch: 8.27 wallclock seconds (0.02 usr + 0.56 sys = 0.58 CPU) @ 36.28/s
func: 10.54 wallclock seconds (0.03 usr + 0.56 sys = 0.59 CPU) @ 28.47/s
perl: 16.46 wallclock seconds (0.11 usr + 2.77 sys = 2.88 CPU) @ 18.23/s
Assuming that my benchmarking script tests a typical
number of objects in the collection, the constant functions (with the label batch
) are about 27.5 percent faster than the looping versions (func
), while running nearly twice as fast as when all the work is done from within Perl (which also loops). Yay for batch updates!
Download the test to test the difference for yourself.
The Race Condition Again
In the previous article in this series, I wrote that I had eliminated the race condition between two processes updating the same collection. That was a little misleading. I almost eliminated it. One of the reviewers pointed out that there now exists a different--and much more subtle--race condition.
The issue is that, although locking the appropriate entry table row prevents inserts from executing until the collection update finishes, PostgreSQL checks foreign key constraints after executing an INSERT. This is the reverse of how the new functions behave: they lock the entry row before they do anything else.
Why is this a problem? Consider two connections to the database, one updating a collection via these functions and the other updating the same collection via manual queries.
- Manual connection inserts into entry_coll_tag.
- Function connection gets lock on entry row.
- Manual connection attempts to check the foreign key constraint, thus requesting a lock on entry. It blocks while it waits for the function connection to commit.
- Function connection makes a conflicting insert into the collection.
- Function connection waits on the manual connection as part of checking the foreign key constraint. Result: deadlock.
The key here is Step 2: If the function gets a lock after the manual connection has inserted the new row but before it checks the foreign key constraint, and makes a conflicting insert, there will be a deadlock; the connections wait for each other to finish, and therefore neither of them ever does. This is a very narrow race condition--much narrower than the original race condition--but it is still very real.
What's the solution to this problem? It's simple, really: never allow any connection to use anything other than your functions to update a collection. The functions always acquire the lock before doing anything else, so there's no way that one connection using the functions can get into a deadlock with another connection using them.
The best way to enforce this rule is with PostgreSQL permissions. For example, suppose that you let applications connect to your database using a PostgreSQL user named appuser
. First, revoke that user's permission to make changes to the entry_coll_tag table:
try=# REVOKE INSERT, UPDATE, DELETE ON entry_coll_tag
try-# FROM appuser;
REVOKE
Then, create the functions as a user that has these permissions, and use the SECURITY DEFINER at the end of each function declaration:
-- ...
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;
The SECURITY DEFINER clause allows the function to execute with the permissions of the user that defined the function, rather than the user calling it. Because a privileged user created the function, unprivileged users will be able to use it to make changes to the collection table, even if they can't change the collection table directly.
Conclusion
Functions that perform a finite number of tasks are nearly always preferable to those that perform a variable number of tasks based on the number of items to process. PostgreSQL kindly provides the tools to eliminate looping constructs in linear functions. By using the generate_series() function and PostgreSQL's batch update syntax, it's relatively straightforward to eliminate array-based looping constructs in PL/pgSQL functions. Now that those tools are at your disposal, get out there and take advantage of them!
Acknowledgments
My thanks to Josh Berkus for showing me how to use generate_series() and batch updates to eliminate the loops in my original collection functions. I learned a great deal in the process, and I am pleased to pass on the benefits of Josh's knowledge here. I'm also grateful to AndrewSN for pointing out the race condition and explaining it to me.
David Wheeler is a developer at Portland, Oregon-based Values of n, where he writes the code that makes Stikkit's little yellow notes think.
Return to O'Reilly Databases







