O'Reilly Databases

oreilly.comSafari Books Online.Conferences.

We've expanded our coverage and improved our search! Search for all things Database across O'Reilly!

Search Search Tips

advertisement
AddThis Social Bookmark Button

Listen Print Discuss Subscribe to Databases Subscribe to Newsletters

Using Regular Expressions in PostgreSQL
Pages: 1, 2, 3

Using Regular Expressions to Extract Substrings

As cool as regular expressions are for returning strings, they really shine when used for parsing text strings. PostgreSQL implements an overloaded substring() function that returns strings based on character position using integer values and, more importantly, by the use of regular expressions against strings and their substrings. You now have the awesome ability of returning formatted substrings from large, complex strings with very little coding instruction.



I'm going to demonstrate the power of regular expressions by generating a formatted query. The query will parse firewall syslog records and return the date, the time, the remote IP address, the originating source port, and the port that it's attempting to connect on the attacked server.

Create a single-column table called log:

CREATE TABLE log(record text);

Populate this table by executing this SQL file from your psql session:

\i log.sql

First things first, try to retrieve return one record using a generic regular expression. (Note that there is a space at the end of the class definition.)

SELECT substring(record, '[a-zA-Z0-9:. ]{1,}') FROM log LIMIT 1;

This next query returns a valid date stamp. The date function serves the double function of formatting the date as well as validating it, thus catching possible errors:

SELECT date(substring(record, '[a-zA-Z ]{1,}[0-9]{1,}') || ' 2005')
    AS "Date" FROM log LIMIT 1;

Now test this new query to return the time:

SELECT substring('Nov  3 07:37:51 localhost', '[:0-9]{2,}') AS "Time";

But wait--it doesn't work if you were to add 0 to the 3, thus making the date November 30:

SELECT substring('Nov  30 07:37:51 localhost', '[:0-9]{2,}') AS "Time";

Edit the query to correct the problem by insisting that it looks for three sequence character patterns instead of two, which the day of the month will never be (one hopes, anyway):

SELECT substring('Nov  30 07:37:51 localhost', '[:0-9]{3,}') AS "Time";

If that works, test one real record:

SELECT substring(record, '[:0-9]{3,}') AS "Time" FROM log LIMIT 1;

People recognize the remote IP address by scanning the syslog line and locating the character string SRC= and the concatenated IP address. This query emulates the same process by looking for a sequence of at least two characters in length that includes a period and a digit. This sequence immediately follows the characters SRC=.

SELECT substring(record, 'SRC=*([.0-9]{2,})') AS "IP Address" FROM log LIMIT 1;

That's almost too easy! Repeat the process to obtain the IP address for the source port (SPT=) and destination port (DPT=), respectively:

SELECT substring(record, 'SPT=*([.0-9]{2,})') AS "Remote Source Port"
    FROM log LIMIT 1;
SELECT substring(record, 'DPT=*([.0-9]{2,})') AS "Destination Port"
    FROM log LIMIT 1;

Here's the complete SQL statement:

SELECT
    date(substring(record, '[a-zA-Z ]{1,}[0-9]{1,}') || ' 2005') AS "Date",
    substring(record, '[:0-9]{3,}') AS "Time",
    substring(record, 'SRC=*([.0-9]{2,})') AS "Remote IP Address",
    substring(record, 'SPT=*([.0-9]{2,})') AS "Remote Source Port",
    substring(record, 'DPT=*([.0-9]{2,})') AS "Destination Port"
FROM log;

It's important to experiment with your expressions before putting them into a production environment; otherwise, you may be surprised by unintended results.

Conclusion

Guess what? There's still more that I've haven't covered! Among other things, PostgreSQL also provides class shortcuts, as per the Posix standard, including digits, alphanumeric characters, and control characters. PostgreSQL provides escape sequences for characters and Unicode too.

You'll find that the biggest hurdle using regular expressions is being comfortable with them, as they are so badly taught. Therefore, people often have to learn them without support. Sometimes they're lucky and they can live with a simple understanding, or perhaps they found a good book.

I hope you can consider this article as one such example of a good book.

Chapter 9.7 of the PostgreSQL manual, Functions and Operators, explains more about matching functions and string functions.

Robert Bernier is the PostgreSQL business intelligence analyst for SRA America, a subsidiary of Software Research America (SRA).


Return to O'Reilly Databases


Have a question about the finer points of regexes? Ask Robert here.
You must be logged in to the O'Reilly Network to post a talkback.
Post Comment
Full Threads Oldest First

Showing messages 1 through 2 of 2.

  • Selecting non-displayable chars using regex
    2006-06-27 04:11:41  btv-ucs [Reply | View]

    I would like to know how can I use regex to retrieve data on my system that has characters in that are not normal keyboard chars. is there a simpler way of doing than hardcoding each char like below :
    select name from user wher name !~* '[a-z0-9.!@] ;

    I want to find all the data on my database that has got chars like '^c' ect.

    Thanks
  • They can also be used in check constraints
    2006-02-02 23:30:35  Joseph Scott [Reply | View]

    Another handy thing that you can do with regular expressions in PostgreSQL is use them in check constraints. I wrote an example of this at:

    http://joseph.randomnetworks.com/archives/2004/05/24/postgresql-check-constraint-supports-regular-expressions/


Tagged Articles

Be the first to post this article to del.icio.us

Sponsored Resources

  • Inside Lightroom

Related to this Article

Understanding Oracle Clinical Understanding Oracle Clinical
by Joan M. Johnson
June 2009
$9.99 USD

New Features in Oracle 9i New Features in Oracle 9i
by Howard J. Rogers
June 2009
$5.95 USD

Advertisement
O'Reilly Media

©2009, O'Reilly Media, Inc.
(707) 827-7000 / (800) 998-9938
All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
About O'Reilly
Academic Solutions
Authors
Contacts
Customer Service
Jobs
Newsletters
O'Reilly Labs
Press Room
Privacy Policy
RSS Feeds
Terms of Service
User Groups
Writing for O'Reilly
Content Archive
Business Technology
Computer Technology
Google
Microsoft
Mobile
Network
Operating System
Digital Photography
Programming
Software
Web
Web Design
More O'Reilly Sites
O'Reilly Radar
Ignite
Tools of Change for Publishing
Digital Media
Inside iPhone
makezine.com
craftzine.com
hackszine.com
perl.com
xml.com

Partner Sites
InsideRIA
java.net
O'Reilly Insights on Forbes.com