Batch Updates with PL/pgSQL

by David Wheeler
09/07/2006

The previous article in this series, Practical PL/pgSQL: Managing Ordered Sets, created four functions to simplify the management of ordered collections as many-to-many relationships. The two more complex functions, entry_coll_tag_set() and entry_coll_tag_add(), take an iterative approach to managing those relationships. By iterative, I mean that they use loops to iterate over an array of IDs in order to do the right thing for each.

The downside to this approach is that the performance of those functions is directly proportional to the number of IDs in the array (O(n)). It would be ideal to make the runtime of the functions constant, regardless of the number of IDs in the array (O(1)).

Fortunately, there is a way to do just that in PostgreSQL. Before I get to it, however, think back to the Fibonacci examples from the first article in this series, Introduction to PostgreSQL PL/pgSQL. Returning to those examples, I'll introduce some new concepts in a simpler format than the collection functions allow.

Set-Returning Functions

As I mentioned in the first article, PostgreSQL functions can return a value of any supported data type. I didn't mention that they can also return sets of a particular type. A set is a list of values of a particular type, but rather than returning those values as a list or an array (as you might expect in a dynamic programming language), PostgreSQL functions return them as rows of data.

Suppose that you need to get a list of Fibonacci numbers up to a particular place in the Fibonacci sequence. Writing such a function in PL/pgSQL is as simple as modifying the fib_fast() function to return each Fibonacci number as it's calculated. It does so using the PL/pgSQL RETURN NEXT statement. Here are fib_fast() and the new set-returning function fibs_to():


1  CREATE OR REPLACE FUNCTION fib_fast(
2      fib_for integer
3  ) RETURNS integer AS $$
4  DECLARE
5      ret integer := 0;
6      nxt integer := 1;
7      tmp integer;
8  BEGIN
9      FOR num IN 1..fib_for LOOP
10
11         tmp := ret;
12         ret := nxt;
13         nxt := tmp + nxt;
14     END LOOP;
15
16     RETURN ret;
17 END;
18 $$ LANGUAGE plpgsql;

1  CREATE OR REPLACE FUNCTION fibs_to(
2      max_num integer
3  ) RETURNS SETOF integer AS $$
4  DECLARE
5      ret integer := 0;
6      nxt integer := 1;
7      tmp integer;
8  BEGIN
9      FOR num IN 1..max_num LOOP
10         RETURN NEXT ret;
11         tmp := ret;
12         ret := nxt;
13         nxt := tmp + nxt;
14     END LOOP;
15
16     RETURN NEXT ret;
17 END;
18 $$ LANGUAGE plpgsql;

There are really only three differences aside from the function names. The first is on line three, where the fibs_to() declaration indicates that it returns a SETOF integer instead of simply an integer. The SETOF keyword tells PostgreSQL that this function returns a set.

The other two differences are that, rather than simply returning the value of the ret variable once, fibs_to() uses the RETURN NEXT statement on line 10 to return each Fibonacci number as the loop calculates it, and the final RETURN NEXT statement on line 16 returns the last Fibonacci number in the sequence.

Those are the only changes necessary to create a set-returning function. Because it returns a set, however, fibs_to() must be called in a different context. While you can call fib_fast() in a SELECT statement:

try=% select fib_fast(8);
 fib_fast 
----------
       21
(1 row)

fibs_to() essentially behaves like a table, and you must treat it as such by using it in a FROM clause:

try=% select * from fibs_to(8);
 fibs_to 
---------
       0
       1
       1
       2
       3
       5
       8
      13
      21
(9 rows)
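
In fact, because the function behaves like a table, you can alias it and aggregate over its rows just as you would a table's. Here's a quick illustration of my own, summing the first eleven Fibonacci numbers:

try=% select sum(n) from fibs_to(10) as f(n);
 sum 
-----
 143
(1 row)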

Be warned, however, that while fibs_to() looks and behaves like a continuation (and, for most practical purposes, can be treated as one), PostgreSQL actually buffers all of the values returned by RETURN NEXT and hands them to the calling context only after the function has calculated them all. That means that if you write a set-returning function that returns a lot of values, you need to make sure that your server's memory can handle it.
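
One practical consequence of this buffering: even a LIMIT won't spare the server the work of computing the entire set, because the truncation happens only after the function has returned all of its rows. For example:

try=% select * from fibs_to(40) limit 5;
 fibs_to 
---------
       0
       1
       1
       2
       3
(5 rows)

The query returns five rows, but the function still computed all 41 of them before the first one came back.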

That caveat aside, set-returning functions can be extremely useful.

Batch-Update Syntax

PostgreSQL supports inserting multiple rows at once using its INSERT INTO ... SELECT syntax, which essentially allows you to select a series of rows and then insert them elsewhere, all in a single statement. For example, suppose that you want to copy all of the rows in table foo into table bar (that is, all rows that aren't already in table bar). Here's how to do it in a single INSERT statement:

INSERT INTO bar (a, b, c)
SELECT d, e, f FROM foo
WHERE  foo.d not in (SELECT a FROM bar);

That's it. By using a SELECT statement in the expression part of the INSERT statement, you can insert multiple rows in a single statement, and even limit the rows through the use of the WHERE clause. Of course, I'm sure that you don't duplicate data like this unless you're deprecating table foo in favor of table bar, right? This syntax does have its uses, though.

As for batch updates, you likely already know that you can update multiple rows at once using a WHERE clause in an UPDATE statement. As with INSERT, UPDATE can also update a series of rows from an expression, generally a SELECT statement. This is possible because, in PostgreSQL, a FROM clause can contain an expression, and UPDATE supports the FROM clause. For example, if you want to update all of the rows in a bar table with values from the foo table, where bar.a has the same value as foo.d, you can write:

UPDATE bar
SET    b = foo.e, c = foo.f
FROM   (SELECT d, e, f FROM foo) AS foo
WHERE  a = foo.d;

Pretty simple, right? Not only can you use an expression in the FROM clause, but you can use AS to name it so that the remaining SQL code can treat it exactly like a table. Now how do you construct such SELECT statements within your functions?

From Arrays to Sets

To take advantage of bulk updates in collection functions, you need some way to create a SELECT statement that generates a row for every value in the array passed to the function. Then you can use that SELECT statement as part of the INSERT and UPDATE statements to update the collections with the same number of queries, regardless of how many IDs are in the array.

My first attempt at this problem was a function that converts an array into a set. Fortunately, this turned out to be pretty easy, given how set-returning functions work:

CREATE OR REPLACE FUNCTION array_to_set(
    arr anyarray
) RETURNS SETOF anyelement AS $$
BEGIN
    FOR idx IN array_lower(arr, 1)..array_upper(arr, 1) LOOP
        RETURN NEXT arr[idx];
    END LOOP;
END;
$$ LANGUAGE plpgsql;

PostgreSQL provides several pseudo-types, which are convenient for contexts that need to handle any number of different data types. Because this example code doesn't really care about the data type of the array it converts, I have declared that array_to_set() can take any kind of array--denoted as anyarray--as an argument, and that it returns a SETOF anyelement. Indeed, this function works pretty well:

try=% select * from array_to_set( ARRAY[3,4,10,56,2] );
 array_to_set 
--------------
            3
            4
           10
           56
            2
(5 rows)

Then I realized that I needed not only each value in the array, but also its position in the array, so that I could populate the ordering column in the collection table. That's trickier to manage with a function like array_to_set(), which can return only a single value per row.
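
For the record, it can be done: as of PostgreSQL 8.1, OUT parameters let a set-returning function return more than one column per row. Here's a minimal sketch--the function name and shape are my own invention, not part of the collection code:

CREATE OR REPLACE FUNCTION array_with_indexes(
    arr anyarray,
    OUT idx integer,
    OUT val anyelement
) RETURNS SETOF record AS $$
BEGIN
    FOR i IN array_lower(arr, 1)..array_upper(arr, 1) LOOP
        idx := i;
        val := arr[i];
        -- With OUT parameters, a bare RETURN NEXT returns their current values.
        RETURN NEXT;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

Fortunately, there is another approach that requires no custom function at all.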

Series Generation

PostgreSQL comes with a very nice set-returning function called generate_series(). This function takes a beginning number, an ending number, and an optional step number, and returns the set of numbers from the beginning to the end, incrementing by the step (which defaults to one). It's easiest to understand when you see it in action:

try=% select * from generate_series(2, 4);
 generate_series 
-----------------
               2
               3
               4
(3 rows)

try=% select * from generate_series(1, 10, 2);
 generate_series 
-----------------
               1
               3
               5
               7
               9
(5 rows)

Nice, eh? Combined with array subscripting, it's a powerful solution to the problem of creating a SELECT statement that generates a row for each value in an array, along with that value's position--without calling the array_to_set() function at all:

try=% SELECT gs.ser, coll.ids[gs.ser] as id
try-% FROM   (SELECT ARRAY[ 2,4,6,7,8 ]) AS coll(ids),
try-%        generate_series(1, 5) AS gs(ser);
 ser | id
-----+----
   1 |  2
   2 |  4
   3 |  6
   4 |  7
   5 |  8
(5 rows)

Because PostgreSQL FROM clauses may be SQL expressions, you can call generate_series() in the FROM clause and use its values to iterate over the array for each row. The key to being able to reference the series is the AS gs(ser) clause, which gives the call to generate_series() the table name gs and its lone column the name ser. Then simply reference the appropriate value in each row as gs.ser.

Thus, using the generate_series() set-returning function, you can get at each element in the array for each row, while also outputting the series number, all in a single query. This is exactly what you need in order to use batch updates in the collection-management functions.
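
One small assumption worth noting: these queries generate the series starting at 1, which works because PostgreSQL arrays are 1-based by default. If you ever handle arrays with a different lower bound, you can make the series match the array's subscripts exactly. A sketch, as it might appear inside a function that takes a coll_ids array:

-- Generate one row per element, wherever the array's subscripts start.
SELECT gs.ser, coll_ids[gs.ser]
FROM   generate_series(
           array_lower(coll_ids, 1),
           array_upper(coll_ids, 1)
       ) AS gs(ser);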

Improved Tag-Adding Function

Just as a reminder, here's how the original entry_coll_tag_add() function looked:


1  CREATE OR REPLACE FUNCTION entry_coll_tag_add (
2      obj_id   integer,
3      coll_ids integer[]
4  ) RETURNS VOID AS $$
5  DECLARE
6      last_ord smallint;
7      got_ord  boolean;
8      iord     integer  := 1;
9  BEGIN
10     PERFORM true FROM entry WHERE id = obj_id FOR UPDATE;
11
12     SELECT INTO last_ord COALESCE(max(tag_order), 0)
13     FROM   entry_coll_tag
14     WHERE  entry_id = obj_id;
15 
16     FOR iloop IN 1..array_upper(coll_ids, 1) LOOP
17         IF coll_ids[iloop] IS NULL THEN
18             CONTINUE;
19         END IF;
20 
21         SELECT INTO got_ord true 
22         FROM   entry_coll_tag
23         WHERE  entry_id = obj_id
24                AND tag_id = coll_ids[iloop];
25 
26         IF got_ord IS NULL THEN
27             INSERT INTO entry_coll_tag (entry_id, tag_id, tag_order)
28             VALUES (obj_id, coll_ids[iloop], last_ord + iord);
29             iord := iord + 1;
30         END IF;
31     END LOOP;
32 END;
33 $$ LANGUAGE plpgsql;

The bit that I want to eliminate is the loop. That's a bit tricky, because the loop executes a SELECT to see whether a record already exists, and performs the insert only if it doesn't. Such a per-row check won't work in a batch insert, but because the batch insert draws its rows from a SELECT expression, a WHERE clause on that expression can limit the array values that are actually inserted. Here's how that change looks:


1  CREATE OR REPLACE FUNCTION entry_coll_tag_add (
2      obj_id   integer,
3      coll_ids integer[]
4  ) RETURNS VOID AS $$
5  DECLARE
6      last_ord smallint;
7  BEGIN
8      PERFORM true FROM entry WHERE id = obj_id FOR UPDATE;
9
10     SELECT INTO last_ord COALESCE(MAX(tag_order), 0)
11     FROM   entry_coll_tag
12     WHERE  entry_id = obj_id;
13 
14     INSERT INTO entry_coll_tag (entry_id, tag_id, tag_order)
15     SELECT obj_id, coll_ids[gs.ser], gs.ser + last_ord
16     FROM   generate_series(1, array_upper(coll_ids, 1)) AS gs(ser)
17     WHERE  coll_ids[gs.ser] NOT IN (
18         SELECT tag_id FROM entry_coll_tag
19         WHERE entry_id = obj_id
20     );
21 
22 END;
23 $$ LANGUAGE plpgsql;

The first difference to note is the reduced number of declared variables. Using generate_series() to generate the order numbers, and dispensing with the per-ID existence check, makes the got_ord and iord variables unnecessary (although there's more on iord in a bit).

The first two statements in the body of the function are the same as before: the PERFORM statement locks the entry record, and the SELECT statement determines the highest existing value in the tag_order column for the entry ID stored in obj_id. In place of the loop, there is a single INSERT statement. Its SELECT expression generates one row for each position in coll_ids--consisting of the entry ID, the tag ID at that position, and that position plus last_ord for the tag_order column--while its WHERE clause filters out any tag IDs already associated with the entry, just as the SELECT inside the old loop did.

That's it. Really! No matter how many tag IDs you pass in the coll_ids array, it will execute no more than three queries total. Given that the previous version of the function would execute four queries when coll_ids had but a single ID (potentially adding two more queries for every additional ID), it's clear that this version, with its constant number of queries, is a big win. The fact that it actually takes less code to write doesn't hurt, either.

That said, this function does not behave identically to the original: it may not insert the new IDs with perfectly sequential values in the tag_order column. iord served that purpose in the previous version of the function, but there's no place for it in the new version. For example, suppose that you called the function with:

SELECT entry_coll_tag_add(1, '{5,13,65,12}');

... where entry ID 1 already had tag IDs 13 and 65 associated with it. In such a case, the result might look something like:

try=% SELECT * FROM entry_coll_tag WHERE entry_id = 1 ORDER BY tag_order;
 entry_id | tag_id | tag_order
----------+--------+-----------
        1 |     13 |         1
        1 |     65 |         2
        1 |      5 |         3
        1 |     12 |         6
(4 rows)

Note how the tag_order jumps from 3 to 6. This is because IDs 13 and 65 would have been ordered 4 and 5, but because they weren't inserted, neither were their order numbers. In truth, this situation could already arise, thanks to the entry_coll_tag_del() function created in the previous article, since that function deletes tag IDs without resetting the tag_order of the remaining rows. In most applications this shouldn't matter, as long as you use tag_order purely for ordering rather than as an ordinal position: it represents relative order, not array indexes.
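
If an application ever did need gapless values, a one-off compaction query could renumber them. Here's a sketch that renumbers the rows for a single entry--the entry ID is hard-coded for illustration, and I assume it runs while holding the entry row lock, so that no other connection can interleave:

-- Compact tag_order to 1..N for entry 1 by counting, for each row, how
-- many of the entry's rows sort at or below it.
UPDATE entry_coll_tag
SET    tag_order = (
    SELECT COUNT(*)
    FROM   entry_coll_tag AS ect
    WHERE  ect.entry_id   = entry_coll_tag.entry_id
      AND  ect.tag_order <= entry_coll_tag.tag_order
)
WHERE  entry_id = 1;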

Improved Tag-Setting Function

What of entry_coll_tag_set()? Remember the original:


1  CREATE OR REPLACE FUNCTION entry_coll_tag_set (
2      obj_id   integer,
3      coll_ids integer[]
4  ) RETURNS VOID AS $$
5  BEGIN
6      PERFORM true FROM entry WHERE id = obj_id FOR UPDATE;
7
8      UPDATE entry_coll_tag
9      SET    tag_order = -tag_order
10     WHERE  entry_id = obj_id;
11 
12     FOR iloop IN 1..array_upper(coll_ids, 1) LOOP
13         IF coll_ids[iloop] IS NULL THEN
14             CONTINUE;
15         END IF;
16 
17         UPDATE entry_coll_tag
18         SET    tag_order = iloop
19         WHERE  entry_id = obj_id
20                AND tag_id = coll_ids[iloop];
21 
22         IF FOUND IS false THEN
23             INSERT INTO entry_coll_tag (entry_id, tag_id, tag_order)
24             VALUES (obj_id, coll_ids[iloop], iloop);
25         END IF;
26     END LOOP;
27 
28     DELETE FROM entry_coll_tag
29     WHERE  entry_id = obj_id AND tag_order < 0;
30 END;
31 $$ LANGUAGE plpgsql;

The number of queries run by this function is even more variable than the number run by entry_coll_tag_add(), as each ID in coll_ids will trigger the execution of either one or two queries: one if the UPDATE statement updates a row, and two if it doesn't. Here is the batch-updating version:


1  CREATE OR REPLACE FUNCTION entry_coll_tag_set (
2      obj_id   integer,
3      coll_ids integer[]
4  ) RETURNS VOID AS $$
5  BEGIN
6      PERFORM true FROM entry WHERE id = obj_id FOR UPDATE;
7
8      UPDATE entry_coll_tag
9      SET    tag_order = -tag_order
10     WHERE  entry_id = obj_id;
11 
12     IF FOUND IS false THEN
13         INSERT INTO entry_coll_tag (entry_id, tag_id, tag_order)
14         SELECT obj_id, coll_ids[gs.ser], gs.ser
15         FROM   generate_series(1, array_upper(coll_ids, 1))
16                AS gs(ser)
17         WHERE  coll_ids[gs.ser] IS NOT NULL;
18     ELSE
19         UPDATE entry_coll_tag SET tag_order = ser
20         FROM (
21             SELECT gs.ser, coll_ids[gs.ser] as move_tag
22             FROM   generate_series(1, array_upper(coll_ids, 1)) AS gs(ser)
23             WHERE  coll_ids[gs.ser] IS NOT NULL 
24         ) AS expansion
25             WHERE move_tag = entry_coll_tag.tag_id
26               AND entry_id = obj_id;
27 
28         INSERT INTO entry_coll_tag (entry_id, tag_id, tag_order)
29         SELECT obj_id, coll_ids[gs.ser], gs.ser
30         FROM   generate_series(1, array_upper(coll_ids, 1)) AS gs(ser)
31         WHERE  coll_ids[gs.ser] NOT IN (
32             SELECT tag_id FROM entry_coll_tag
33             WHERE  entry_id = obj_id
34         );
35 
36         DELETE FROM entry_coll_tag
37         WHERE  entry_id = obj_id AND tag_order < 0;
38     END IF;        
39 END;
40 $$ LANGUAGE plpgsql;

I admit that, at first glance, this looks more complicated than the original, but that's only because I added an optimization that I could have included in the original, had I noticed it before: the IF FOUND check on line 12. The initial UPDATE negates the tag_order of any existing rows for the entry. If it touched no rows, the collection was empty, so a single batch INSERT (lines 13-17) adds every non-NULL ID in coll_ids, with its array position as its tag_order, and the function is done. Otherwise, the ELSE branch does the full job in three statements: an UPDATE (lines 19-26) that joins against the generate_series() expansion of the array to assign each existing tag its new order, an INSERT (lines 28-34) that adds any IDs not already in the collection, and a DELETE (lines 36-37) that removes the rows whose tag_order is still negative--that is, the tags no longer in the array.

Phew! That was a lot to cover, but the result is that this function now also runs a constant number of queries for each execution. Or nearly constant. If there are no existing rows in the collection table, it will run only three queries. Otherwise, it will run five. In neither case will it ever run any more than five.
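
To see it in action, pick up the earlier example, in which entry ID 1 ended up with tags 13, 65, 5, and 12. Setting the collection to just {12,5} should update those two rows with their new order and delete the other two, leaving something like:

try=% SELECT entry_coll_tag_set(1, '{12,5}');
 entry_coll_tag_set 
--------------------
 
(1 row)

try=% SELECT * FROM entry_coll_tag WHERE entry_id = 1 ORDER BY tag_order;
 entry_id | tag_id | tag_order
----------+--------+-----------
        1 |     12 |         1
        1 |      5 |         2
(2 rows)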

Benchmarks

How does this refactoring translate into performance? It's hard to say, really, because the performance of the previous versions of the functions depends on how many IDs are in coll_ids. Nevertheless, I reworked my original benchmarking program into a new version that compares the new functions to the old, plus the original Perl version as a control. The results:

batch:  8.27 wallclock seconds (0.02 usr + 0.56 sys = 0.58 CPU) @ 36.28/s
 func: 10.54 wallclock seconds (0.03 usr + 0.56 sys = 0.59 CPU) @ 28.47/s
 perl: 16.46 wallclock seconds (0.11 usr + 2.77 sys = 2.88 CPU) @ 18.23/s

Assuming that my benchmarking script tests a typical number of objects in the collection, the constant functions (with the label batch) are about 27.5 percent faster than the looping versions (func), while running nearly twice as fast as when all the work is done from within Perl (which also loops). Yay for batch updates!

Download the benchmark script to test the difference for yourself.

The Race Condition Again

In the previous article in this series, I wrote that I had eliminated the race condition between two processes updating the same collection. That was a little misleading. I almost eliminated it. One of the reviewers pointed out that there now exists a different--and much more subtle--race condition.

The issue is that, although locking the appropriate entry table row prevents other connections' inserts into the collection from completing until the collection update finishes, PostgreSQL checks foreign key constraints only after executing an INSERT. This is the reverse of how the new functions behave: they lock the entry row before they do anything else.

Why is this a problem? Consider two connections to the database, one updating a collection via these functions and the other updating the same collection via manual queries.

  1. Manual connection inserts into entry_coll_tag.
  2. Function connection gets lock on entry row.
  3. Manual connection attempts to check the foreign key constraint, thus requesting a lock on entry. It blocks while it waits for the function connection to commit.
  4. Function connection makes a conflicting insert into the collection.
  5. Function connection's insert blocks on the manual connection's uncommitted, conflicting row, since it must wait to see whether that row commits. Result: deadlock.

The key here is Step 2: if the function gets its lock after the manual connection has inserted the new row but before the foreign key constraint is checked, and then makes a conflicting insert, there will be a deadlock; the connections wait for each other until PostgreSQL's deadlock detector notices and aborts one of them. This is a very narrow race condition--much narrower than the original race condition--but it is still very real.

What's the solution to this problem? It's simple, really: never allow any connection to use anything other than your functions to update a collection. The functions always acquire the lock before doing anything else, so there's no way that one connection using the functions can get into a deadlock with another connection using them.

The best way to enforce this rule is with PostgreSQL permissions. For example, suppose that you let applications connect to your database using a PostgreSQL user named appuser. First, revoke that user's permission to make changes to the entry_coll_tag table:

try=# REVOKE INSERT, UPDATE, DELETE ON entry_coll_tag
try-# FROM   appuser;
REVOKE

Then, create the functions as a user that has these permissions, and add the SECURITY DEFINER clause to the end of each function definition:

-- ...
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;

The SECURITY DEFINER clause causes the function to execute with the permissions of the user that defined it, rather than those of the user calling it. Because a privileged user created the function, unprivileged users can use it to make changes to the collection table, even though they can't change that table directly.
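
Depending on how locked down your database is, you may also need to grant the application user permission to execute the functions. PostgreSQL grants EXECUTE to PUBLIC by default, so this step applies only if you've revoked that default:

try=# GRANT EXECUTE ON FUNCTION entry_coll_tag_add(integer, integer[]) TO appuser;
GRANT
try=# GRANT EXECUTE ON FUNCTION entry_coll_tag_set(integer, integer[]) TO appuser;
GRANT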

Conclusion

Functions that perform a constant amount of work are nearly always preferable to those whose work grows with the number of items they process, and PostgreSQL kindly provides the tools to make the conversion. By using the generate_series() function and PostgreSQL's batch-update syntax, it's relatively straightforward to eliminate array-based looping constructs in PL/pgSQL functions. Now that those tools are at your disposal, get out there and take advantage of them!

Acknowledgments

My thanks to Josh Berkus for showing me how to use generate_series() and batch updates to eliminate the loops in my original collection functions. I learned a great deal in the process, and I am pleased to pass on the benefits of Josh's knowledge here. I'm also grateful to AndrewSN for pointing out the race condition and explaining it to me.

David Wheeler is a developer at Portland, Oregon-based Values of n, where he writes the code that makes Stikkit's little yellow notes think.

