Using Regular Expressions in PostgreSQL
Pages: 1, 2, 3
Using Regular Expressions to Extract Substrings
As cool as regular expressions are for returning strings, they really shine when used for parsing text strings. PostgreSQL implements an overloaded substring() function that returns strings based on character position using integer values and, more importantly, by the use of regular expressions against strings and their substrings. You now have the awesome ability of returning formatted substrings from large, complex strings with very little coding instruction.
I'm going to demonstrate the power of regular expressions by generating a formatted query. The query will parse firewall syslog records and return the date, the time, the remote IP address, the originating source port, and the port that it's attempting to connect on the attacked server.
Create a single-column table called log:
CREATE TABLE log(record text);
Populate this table by executing this SQL file from your psql session:
\i log.sql
First things first, try to retrieve return one record using a generic regular expression. (Note that there is a space at the end of the class definition.)
SELECT substring(record, '[a-zA-Z0-9:. ]{1,}') FROM log LIMIT 1;
This next query returns a valid date stamp. The date function serves the double function of formatting the date as well as validating it, thus catching possible errors:
SELECT date(substring(record, '[a-zA-Z ]{1,}[0-9]{1,}') || ' 2005')
AS "Date" FROM log LIMIT 1;
Now test this new query to return the time:
SELECT substring('Nov 3 07:37:51 localhost', '[:0-9]{2,}') AS "Time";
But wait--it doesn't work if you were to add 0 to the 3, thus making the date November 30:
SELECT substring('Nov 30 07:37:51 localhost', '[:0-9]{2,}') AS "Time";
Edit the query to correct the problem by insisting that it looks for three sequence character patterns instead of two, which the day of the month will never be (one hopes, anyway):
SELECT substring('Nov 30 07:37:51 localhost', '[:0-9]{3,}') AS "Time";
If that works, test one real record:
SELECT substring(record, '[:0-9]{3,}') AS "Time" FROM log LIMIT 1;
People recognize the remote IP address by scanning the syslog line and locating the character string SRC= and the concatenated IP address. This query emulates the same process by looking for a sequence of at least two characters in length that includes a period and a digit. This sequence immediately follows the characters SRC=.
SELECT substring(record, 'SRC=*([.0-9]{2,})') AS "IP Address" FROM log LIMIT 1;
That's almost too easy! Repeat the process to obtain the IP address for the source port (SPT=) and destination port (DPT=), respectively:
SELECT substring(record, 'SPT=*([.0-9]{2,})') AS "Remote Source Port"
FROM log LIMIT 1;
SELECT substring(record, 'DPT=*([.0-9]{2,})') AS "Destination Port"
FROM log LIMIT 1;
Here's the complete SQL statement:
SELECT
date(substring(record, '[a-zA-Z ]{1,}[0-9]{1,}') || ' 2005') AS "Date",
substring(record, '[:0-9]{3,}') AS "Time",
substring(record, 'SRC=*([.0-9]{2,})') AS "Remote IP Address",
substring(record, 'SPT=*([.0-9]{2,})') AS "Remote Source Port",
substring(record, 'DPT=*([.0-9]{2,})') AS "Destination Port"
FROM log;
It's important to experiment with your expressions before putting them into a production environment; otherwise, you may be surprised by unintended results.
Conclusion
Guess what? There's still more that I've haven't covered! Among other things, PostgreSQL also provides class shortcuts, as per the Posix standard, including digits, alphanumeric characters, and control characters. PostgreSQL provides escape sequences for characters and Unicode too.
You'll find that the biggest hurdle using regular expressions is being comfortable with them, as they are so badly taught. Therefore, people often have to learn them without support. Sometimes they're lucky and they can live with a simple understanding, or perhaps they found a good book.
I hope you can consider this article as one such example of a good book.
Chapter 9.7 of the PostgreSQL manual, Functions and Operators, explains more about matching functions and string functions.
Robert Bernier is the PostgreSQL business intelligence analyst for SRA America, a subsidiary of Software Research America (SRA).
Return to O'Reilly Databases
You must be logged in to the O'Reilly Network to post a talkback.
Showing messages 1 through 2 of 2.
-
They can also be used in check constraints
2006-02-02 23:30:35 Joseph Scott [Reply | View]
Another handy thing that you can do with regular expressions in PostgreSQL is use them in check constraints. I wrote an example of this at:
http://joseph.randomnetworks.com/archives/2004/05/24/postgresql-check-constraint-supports-regular-expressions/






select name from user wher name !~* '[a-z0-9.!@] ;
I want to find all the data on my database that has got chars like '^c' ect.
Thanks