O'Reilly Databases

oreilly.comSafari Books Online.Conferences.

We've expanded our coverage and improved our search! Search for all things Database across O'Reilly!

Search Search Tips

advertisement
AddThis Social Bookmark Button

Print Subscribe to Databases Subscribe to Newsletters

Building a Data Warehouse with MySQL and Perl
Pages: 1, 2

ETL

If you want to build a data warehouse, your first job is to come up with a dimensional model for your data. This task is largely outside the scope of this article, but I'll point out some good references in Going Further.

Once you have a data schema for your warehouse, you'll need to fill it with data. This process is known as extract, transform, and load, or ETL for short. The first step, extraction, is simply the process of selecting all the data of interest from the operational database. Then the data must be transformed into the format needed by the warehouse. This could be as simple as renaming some of the fields or as complex as cleaning dirty data and computing new fields. Finally the data must be loaded into the data warehouse.

ETL is typically a batch process--often running on a nightly basis. Some systems do real-time ETL, updating the warehouse as new data appears in the operational databases.

Extract

Our ETL process uses MySQL's SELECT ... INTO OUTFILE system to create CSVs containing data from our operational database. Many tables are dumped as-is, but for the more complex cases we do mix some transformations into the SELECT statement. For example, here's a statement that extracts all the data from the link table that has corresponding records in person_link or campaign_link_click:

    SELECT l.link_id, l.link_type_id, l.url, l.from_url, 
           l.position, l.message_id, l.message_part_id, 
           COALESCE(l.campaign_id, m.campaign_id) AS campaign_id
    FROM link l 
      LEFT OUTER JOIN mailing_message m ON (l.message_id = m.message_id) 
    WHERE EXISTS (     
      SELECT 1 FROM person_link WHERE link_id = l.link_id   
    ) OR EXISTS (     
      SELECT 1 FROM campaign_link_click WHERE link_id = l.link_id   
    )
    INTO OUTFILE ?
    FIELDS TERMINATED BY ','
    ENCLOSED BY '"'
    LINES TERMINATED BY '\n'

In the process the link is assigned a single campaign_id from either the mailing_message or the link record itself. The details of why this is done aren't important, but suffice it to say that doing this in the SELECT statement is easier and faster than it would be in the transform step proper.

You may be wondering what kind of strain extraction places on the operational database. The answer is quite a lot! To deal with this problem we use MySQL's replication system to produce a low-cost copy of the operational database on the warehouse machine. This means the warehouse loader can run without affecting operational performance.

Transform

The end result of the extraction process is a CSV file containing data destined for the warehouse. We use the excellent Text::CSV_XS module to read these files, and apply any additional transformations needed in Perl code.

For example, we compute data about the most recent, highest, and first contribution given by each person in our database. These values are frequently queried, and queries can run much faster with these values in the person dimension.

Another example is geo-coding. We feed address data to Geo::Coder::US and Geo::PostalCode to produce a latitude and longitude for each person. These values go into the person dimension and power "within X miles of" queries. The warehouse transform step provides an ideal time to run potentially expensive data-sweetening operations like geo-coding.

The transform step ends by writing out another CSV file with the transformed data. We use Text::CSV_XS to format the data for writing, too.

Load

Finally, the data needs to get into the warehouse. Here we use MySQL's LOAD DATA INFILE command to load CSV data directly into the database. To speed up the load, we first disable indexes with ALTER TABLE foo DISABLE KEYS, and after the load, we re-enable them with ALTER TABLE foo ENABLE KEYS. Each table needs to be cleared before loading, which we do with the MySQL TRUNCATE command.

You may be wondering what happens to clients using the warehouse while an ETL process is running. In our case, nothing at all! This magic is achieved by actually having two warehouse databases, one in use and the other free for loading. All the data goes into the loading database, and when it's full we swap it into place with a big ol' RENAME:

   RENAME TABLE
    live_db.table_foo          TO loading_db.table_foo_tmp,
    loading_db.table_foo       TO live_db.table_foo,
    loading_db.table_foo_tmp   TO loading_db.table_foo
    live_db.table_bar          TO loading_db.table_bar_tmp,
    loading_db.table_bar       TO live_db.table_bar,
    loading_db.table_bar_tmp   TO loading_db.table_bar
    ...

This produces an atomic switch of all tables in the loading database with the tables in the live database. It will wait for any running queries in the warehouse to finish before performing the swap, which is exactly what we want.

ETL Observations

CSV Isn't a Standard

One major problem we had in constructing our ETL process was in finding a way to speak MySQL's rather perverse CSV syntax (as used by SELECT ... INTO OUTFILE and LOAD DATA INFILE) with Text::CSV_XS. MySQL uses \N to indicate NULLs in CSV data output, while also using the backslash as a quote character in string. Thus, when Text::CSV_XS reads \N it gets turned into N before we can process it!

Worse, trying to get Text::CSV_XS to write a literal \N proved impossible. Fortunately LOAD DATA INFILE will also treat the string "NULL" as NULL.

An ETL process based on something better standardized like XML wouldn't have these problems, although it might not perform as well due to formatting overhead.

Transforms Are Not Always Needed

Many tables in our operational database don't need any transformation to become dimensions in the warehouse. We have plans to provide a straight database-to-database extract-and-load for these tables using MySQL's SELECT ... INTO TABLE syntax. This should perform much better by entirely skipping the intermediate CSV files.

Incremental Loading Is Highly Desirable

As we developed more tools that use the warehouse, it became more and more desirable to have it kept up-to-date more frequently. We are in the process of designing a system that will keep the warehouse up-to-date in real time. The system will be incremental, moving data as it is added and changed in the main database, likely using triggers to help us know what needs moving without burdening application code.

Using MySQL as a Data Warehouse

Any relational database system can be used to build a data warehouse. At Plus Three we chose MySQL v4.1.x for our warehouse system. This choice was natural for us as we're largely a MySQL shop. We chose v4.1.x because it was the stable version when we started development.

Our operational database uses MySQL's InnoDB backend, providing referential integrity and transactions. However, we chose MySQL's MyISAM backend for our warehouse. Warehouse applications are read-only, and as such, have little need for run-time referential integrity checks and transactions. MyISAM can also be faster than InnoDB in some circumstances.

An upside to using MySQL for the warehouse is high capacity and decent performance with little tuning. Our larger installations handle tables with more than one hundred million rows without major difficulties. Although we've spent some time optimizing for particular warehouse queries, the majority were fast enough out of the gate.

All that said, there are features we'd like to have that MySQL is lacking. Perhaps chief among them is support for bitmap indexes. Bitmap indexes are ideal for the kind of low-cardinality data that is commonly used in data warehouses. PostgreSQL supports bitmap indexes as of version v8.1, as do a number of commercial database systems.

We also found MySQL's subquery optimization to be of poor quality. Subqueries are relatively new to MySQL, so there's a good chance this will improve when we upgrade to MySQL v5.1.x soon.

A nontraditional alternative is a column-oriented database, like Vertica and MonetDB (which has a DBI driver). Column-oriented database systems store data by column rather than by row, and as such, offer performance advantages for typical data-warehouse usage patterns.

Going Further

Data warehousing is a deep topic, and I've only really scratched the surface. If this article has piqued your interest, I suggest you pick up a copy of The Data Warehouse Toolkit by Ralph Kimball and Margy Ross. This book is invaluable in presenting dimensional modeling, the key to getting the most out of a data warehouse.

Sam Tregar is employed as a Perl programmer for Plus Three LP and is an active participant in the Perl development community.


Return to Databases.


Comments on this article
Full Threads Newest First

Showing messages 1 through 3 of 3.

  • Matthew Russell photo Awesome Article!
    2007-05-01 09:40:11  Matthew Russell | O'Reilly AuthorO'Reilly Blogger [View]

    I just wanted to say that this is one of the best articles I've ever read on the O'Reilly Network. Great job of introducing this topic and providing a great example of how it can actually be used in real world applications. Thanks again!
  • delimiters
    2007-07-25 10:55:16  awillis [View]

    I've found that not using csv at all when loading data into and out of mysql is often easier. SELECT INTO OUTFILE and LOAD DATA INFILE default to tab separated output. If you have text fields that contain tabs, and you can't convert them to spaces, then perhaps I would use a comma delimiter.
  • Data Warehousing using MySQL and Perl
    2008-09-06 06:28:33  DutchAlbatros [View]

    Great article.
    There is only one thing missing that is to visualize the progress of the ETL process.

    What would be greater than to have the ability to monitor the whole ETL process with the help of a Perl programme. The TK module from Perl would be an excellent tool to visualize the process. Even better would be to define different OO objects in Perl for different tasks that have to be carried out in the ETL process. For each ETL OO object specific properties could be defined. For example to carry out and sql task, it has to have a name, a SQL command, and so forth.

    When this is compiled as a Perl service the ETL process can be run continously.

    When at the end of the process a email will be sent to confirm that a process has been carried out succesfully or has gone wrong the whole programme is completely selfsufficient.

    DutchAlbatros


Tagged Articles

Be the first to post this article to del.icio.us

Sponsored Resources

  • Inside Lightroom

Related to this Article

Access 2013 For Dummies Access 2013 For Dummies
March 2013
$24.99 USD

PHP, MySQL, JavaScript & HTML5 All-in-One For Dummies PHP, MySQL, JavaScript & HTML5 All-in-One For Dummies
by Steve Suehring
March 2013
$44.99 USD

Advertisement
Sign up today to receive special discounts,
product alerts, and news from O'Reilly.
Privacy Policy >
View Sample Newsletter >
  • Youtube
  • http://www.youtube.com/OreillyMedia
  • Twitter
  • Subscribe
  • View All RSS Feeds >
O'Reilly Media

800-889-8969 or 707-827-7019
Monday-Friday 7:30am-5pm PT
©2011, O'Reilly Media, Inc.
All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
  • About O'Reilly
  • Academic Solutions
  • Contacts
  • Customer Service
  • Careers
  • Press Room
  • Privacy Policy
  • Terms of Service
  • Writing for O'Reilly
  • Community
  • Authors
  • Forums
  • Membership
  • Newsletters
  • RSS Feeds
  • User Groups
  • Partner Sites
  • makezine.com
  • makerfaire.com
  • craftzine.com
  • igniteshow.com
  • PayPal Developer Zone
  • O'Reilly Insights on Forbes.com