Published on O'Reilly (http://oreilly.com/)
 See this if you're having trouble printing code examples

Using More Perl in PostgreSQL

by Andrew Dunstan

The first article in this series examined the use of PL/Perl to create triggers. The trigger inserted a row into a database table for audit purposes using a new PL/Perl method called spi_exec_query(). This article looks in more detail at uses of that function and its new cousin, as well as other features for handling bulk data and composite types.

Getting Data out of the Database and Issuing Database Commands

Pl/Perl now allows the programmer to run an arbitrary query against the database. The query can be any statement that PostgreSQL will recognize as legal SQL. The mechanism used is PostgreSQL's Server Programming Interface, an API that makes it relatively easy for modules to run database commands. While SPI can do more than connect to programming language modules, this is probably its principal use.

Most of the SPI is at a level too low to make it appropriate to expose to the user in PL/Perl. Instead, the glue code does most of the work for you. PostgreSQL 8.0 therefore provides only one function to allow you to talk to the database: spi_exec_query().

This function takes either one or two arguments. The first is a string containing the SQL that you want the database to run. The second is a limit on the number of rows returned. This argument is always optional and makes sense only if the SQL statement is a SELECT. You can always put a LIMIT clause on the SQL statement anyway, so I always just leave it off.

A common request is for the ability to grant select on all tables. That's easy enough in PL/Perl. The function attached to this article will do it all nicely for you. The function takes three arguments: a schema name, a grantee name, and a set of privileges as a string. Inside, it runs a query that gets a list of all the table, view, and sequence names in the named schema that belong to the person running the function. It then iterates over that list, granting the set of specified privileges on each object to the grantee.

In case you think making up such a function was hard, because the query to get the list of objects is a bit complex, I have a confession to make: I cheated. The PostgreSQL command-line client psql has an extremely nifty mode in which it will tell you the SQL that it runs to execute its slash commands. I asked it to show me the SQL that corresponded to the \d command, cut and pasted the result into my query, and adjusted it slightly.

Having created the function, you can use it any time you like. Just call it in a select statement:

select grantall('public','Fred','select');

and the user Fred will get select privileges on all tables, views, and sequences in the public schema that you own.

Note: version 8.1 of PostgreSQL has an enhanced privileges regime, and you will probably need a slight modification of the function. This version works with version 8.0 of PostgreSQL.

Now dig a little deeper inside the function. Each query to run is a plain Perl string with some SQL inside. Pass that as the sole argument to spi_exec_query(), and it returns a hashref. The hashref has three possible members. The one called status is always present. It's a string, corresponding to the names of the SPI return code, as set out in the SPI section of the PostgreSQL documentation. Usually, if the status starts with SPI_OK_, the command worked. If not, it failed. There are several SPI_OK_ return codes. You can either look them up, or do what I did and do a regular expression match against the beginning of the string.

If the SQL statement processed any table rows, then an additional member, processed, is present with a count of the rows. Finally, if the query was a SELECT, then there is an additional member called rows, which is an arrayref. The contents of the array are hashrefs, one per row returned by the query. The keys of each hash are the column names, and the values are the column values for the row.

No matter the data type of the column returned by the query, the value is set in the row hash as a string. This is true if it's a date, a record, an array, or any other type--even numeric types. This can be inconvenient. Either you need to get the data from PostgreSQL in a form that Perl can digest, or you might need to get Perl to parse the strings. Luckily, Perl is good at parsing strings. One of the items on the TODO list for version 8.2 of PostgreSQL is to have row values of composite types returned to PL/Perl as arrayrefs or hashrefs, corresponding to PostgreSQL's row and array types.

Handing Data Back to the Database

On the other side of the coin, you can also now return records to PostgreSQL, either singly or in sets. The case of a set of records is probably the most interesting and useful. The simple way to return a set of records is to hand them back to PostgreSQL via the return statement, as an arrayref of hashrefs, with the column names as keys of the hashes, one hash per row. Notice that this is exactly the way that PostgreSQL hands rowsets to PL/Perl via spi_exec_query(). In fact, you could take the rows member of the result from spi_exec_query() and just return it, as long as it had the right keys in each hash.

Any missing field in a return value will become a NULL, as will any field whose value is undef.

You can also hand back a single record to PostgreSQL as a single hashref. PostgreSQL 8.1 allows the use of arrayrefs too.

How is this useful? Consider a task that is not easy in SQL but fairly simple using PL/Perl. Suppose that you have a table of exam scores. The table contains the name of the person with the score, the class he or she is in, and the actual score. The sample script contains some necessary type definitions as well as a PL/Perl function to populate the table with some test data. The goal is to augment that information with the ranking of the people, but only within their class. The function get_ranks() in the sample script will do that. Notice that the SQL orders the results by class and then descending score, in order to do control break processing.

In general, you need to use a PL function for any processing that needs to keep track of state, as this does. This example keeps track of the class so that it can reset the rank counter when it changes. The processing simply gets each row, looks to see whether there's a change in class and resets the counter if there is, stores the rank in the row, and stores the row on an arrayref to return at the end of processing. The ordering breaks ties in score (that's why it sorts by name), but implementing a tie-keeping mechanism would be a fairly simple alteration.

You can use a function like this one that returns a set of records as a table expression in the FROM clause of a SELECT query, just as you would a function returning a single value in the SELECT or WHERE clauses. The result looks like:

andrew=# select * from get_ranks();
     name  | class  | score | class_rank 
     Ellen | green  |    91 |          1
     Anne  | green  |    37 |          2
     Bill  | green  |    30 |          3
     May   | green  |    23 |          4
     Joe   | green  |    11 |          5
     Fred  | green  |     1 |          6
     Joe   | red    |    93 |          1
     Fred  | red    |    69 |          2
     Bill  | red    |    53 |          3
     May   | red    |    51 |          4
     Anne  | red    |    45 |          5
     Ellen | red    |    14 |          6
     Anne  | yellow |    88 |          1
     Ellen | yellow |    68 |          2
     May   | yellow |    25 |          3
     Joe   | yellow |    21 |          4
     Bill  | yellow |    11 |          5
     Fred  | yellow |     8 |          6
(18 rows)

Memory Problems with Lots of Data

Fetching data into PL/Perl using the above method requires storing the whole resultset in Perl's main memory. Similarly, if you just return an array of hashrefs, PL/Perl first creates these in Perl's memory all at once, before handing them off to PostgreSQL. Because databases are often about handling huge amounts of data, which can frequently exceed the size of available memory, this is not a good idea. In version 8.0 there's no real solution--you just have to make sure you don't SELECT or return huge data sets that will use lots of memory. Version 8.1 of PostgreSQL provides some much better ways of doing things that do not suffer from the memory bloat problem. The credit for this goes to Abhijit Menon-Sen, who has done some great work on Perl and PostgreSQL.

The first tip is that when fetching a large data set, you should use spi_query() and spi_fetchrow() instead of spi_exec_query(). spi_query() returns a handle to use as the parameter to spi_fetchrow(), which returns one row at a time, or undef when there are no more rows. So you never need to get all the rows into Perl memory at one time. Just fetch one row at a time in a loop.

To return values, use a similar mechanism--in fact, to return large amounts of data your function should never return it directly at all--the return value of the function should be undef! Instead, use a new callback function called return_next(), which you give a single return value.

The function get_ranks_revised() in the sample script uses the these new mechanisms to produce the same results as above. Transforming the function was quite trivial, and the new version not only uses less memory but runs faster too.

Next Article

The final article in this series will examine sharing data between PL/Perl functions, turning on strict mode, Unicode string handling, and the road map for further enhancements to PL/Perl.

Andrew Dunstan works for a small consulting and software company in the Triangle area of North Carolina, and contributes to PostgreSQL as an enthusiastic hobbyist as well as a sometime professional user.

Return to the Databases DevCenter

Copyright © 2009 O'Reilly Media, Inc.