Building a Data Warehouse with MySQL and Perl
If you want to build a data warehouse, your first job is to come up with a dimensional model for your data. This task is largely outside the scope of this article, but I'll point out some good references in Going Further.
Once you have a data schema for your warehouse, you'll need to fill it with data. This process is known as extract, transform, and load, or ETL for short. The first step, extraction, is simply the process of selecting all the data of interest from the operational database. Then the data must be transformed into the format needed by the warehouse. This could be as simple as renaming some of the fields or as complex as cleaning dirty data and computing new fields. Finally the data must be loaded into the data warehouse.
ETL is typically a batch process--often running on a nightly basis. Some systems do real-time ETL, updating the warehouse as new data appears in the operational databases.
Our ETL process uses MySQL's SELECT ... INTO OUTFILE system to create CSVs containing data from our operational database. Many tables are dumped as-is, but for the more complex cases we do mix some transformations into the SELECT statement. For example, here's a statement that extracts all the data from the link table that has corresponding records in person_link or campaign_link_click:
SELECT l.link_id, l.link_type_id, l.url, l.from_url, l.position,
       l.message_id, l.message_part_id,
       COALESCE(l.campaign_id, m.campaign_id) AS campaign_id
FROM link l
LEFT OUTER JOIN mailing_message m ON (l.message_id = m.message_id)
WHERE EXISTS (SELECT 1 FROM person_link WHERE link_id = l.link_id)
   OR EXISTS (SELECT 1 FROM campaign_link_click WHERE link_id = l.link_id)
INTO OUTFILE ?
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
In the process the link is assigned a single campaign_id from either the mailing_message or the link record itself. The details of why this is done aren't important, but suffice it to say that doing this in the SELECT statement is easier and faster than it would be in the transform step proper.
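Queries like this are easy to drive from Perl with DBI. Here's a rough sketch of the idea; the connection details, output path, and the link_type table used for a dumped-as-is case are all invented, and it relies on DBD::mysql's default client-side placeholder handling so the output filename can be passed as a bind value. Note that INTO OUTFILE writes the file on the database server's host and refuses to overwrite an existing file.

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Connect to the replica used for extraction (hypothetical DSN and credentials).
my $dbh = DBI->connect('dbi:mysql:database=operational;host=localhost',
                       'etl_user', 'secret', { RaiseError => 1 });

my $outfile = '/tmp/etl/link_type.csv';   # hypothetical dump location on the DB host

# A simple table dumped as-is; the output filename is supplied as a bind value.
my $sql = q{
    SELECT * FROM link_type
    INTO OUTFILE ?
    FIELDS TERMINATED BY ',' ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
};

$dbh->do($sql, undef, $outfile);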
You may be wondering what kind of strain extraction places on the operational database. The answer is quite a lot! To deal with this problem we use MySQL's replication system to produce a low-cost copy of the operational database on the warehouse machine. This means the warehouse loader can run without affecting operational performance.
The end result of the extraction process is a CSV file containing data destined for the warehouse. We use the excellent Text::CSV_XS module to read these files, and apply any additional transformations needed in Perl code.
For example, we compute data about the most recent, highest, and first contribution given by each person in our database. These values are frequently queried, and queries can run much faster with these values in the person dimension.
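Here's a rough sketch of how such a transform might be coded with Text::CSV_XS; the input file name and the column layout (person_id, contribution date, amount) are invented for illustration.

use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new({ binary => 1 })
    or die "Cannot create CSV parser: " . Text::CSV_XS->error_diag;

open my $fh, '<', 'contribution.csv' or die "open: $!";

# Accumulate first, highest, and most recent contribution per person.
# Assumes columns: person_id, contribution_date (YYYY-MM-DD), amount.
my %summary;
while (my $row = $csv->getline($fh)) {
    my ($person_id, $date, $amount) = @$row;
    my $s = $summary{$person_id} ||= {};

    $s->{first_date}  = $date   if !defined $s->{first_date}  || $date lt $s->{first_date};
    $s->{recent_date} = $date   if !defined $s->{recent_date} || $date gt $s->{recent_date};
    $s->{highest}     = $amount if !defined $s->{highest}     || $amount > $s->{highest};
}
close $fh;

The per-person values collected this way would then be folded into the person dimension rows before the transformed CSV is written out.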
Another example is geo-coding. We feed address data to Geo::PostalCode to produce a latitude and longitude for each person. These values go into the person dimension and power "within X miles of" queries. The warehouse transform step provides an ideal time to run potentially expensive data-sweetening operations like geo-coding.
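The lookup itself is tiny. This sketch assumes the Geo::PostalCode database files have already been built in a local directory (./geo_db here) and that the lookup is keyed on the person's postal code, which is what Geo::PostalCode works with:

use strict;
use warnings;
use Geo::PostalCode;

# Directory of Geo::PostalCode database files, built ahead of time.
my $gp = Geo::PostalCode->new(db_dir => './geo_db')
    or die "Cannot open postal code database";

my $record = $gp->lookup_postal_code(postal_code => '10001');
if ($record) {
    my ($lat, $lon) = ($record->{lat}, $record->{lon});
    # $lat and $lon go into the person dimension to power "within X miles of" queries.
}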
The transform step ends by writing out another CSV file with the transformed data. We use Text::CSV_XS to format the data for writing, too.
Finally, the data needs to get into the warehouse. Here we use MySQL's LOAD DATA INFILE command to load CSV data directly into the database. To speed up the load, we first disable indexes with ALTER TABLE foo DISABLE KEYS, and after the load we re-enable them with ALTER TABLE foo ENABLE KEYS. Each table needs to be cleared before loading, which we do with the MySQL TRUNCATE TABLE command.
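Put together, the load for a single table might be sketched like this; the table name, file path, and connection details are invented, and the handle points at the database currently being loaded:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:mysql:database=loading_db;host=localhost',
                       'etl_user', 'secret', { RaiseError => 1 });

my $table = 'person_dimension';                 # hypothetical dimension table
my $file  = '/tmp/etl/person_dimension.csv';    # transformed CSV from the previous step

$dbh->do("TRUNCATE TABLE $table");              # clear out last run's data
$dbh->do("ALTER TABLE $table DISABLE KEYS");    # defer index maintenance

$dbh->do(
    "LOAD DATA INFILE " . $dbh->quote($file) . " INTO TABLE $table
     FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
     LINES TERMINATED BY '\\n'"
);

$dbh->do("ALTER TABLE $table ENABLE KEYS");     # rebuild indexes in one pass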
You may be wondering what happens to clients using the warehouse while an ETL process is running. In our case, nothing at all! This magic is achieved by actually having two warehouse databases, one in use and the other free for loading. All the data goes into the loading database, and when it's full we swap it into place with a big ol' RENAME TABLE statement:
RENAME TABLE live_db.table_foo TO loading_db.table_foo_tmp,
             loading_db.table_foo TO live_db.table_foo,
             loading_db.table_foo_tmp TO loading_db.table_foo,
             live_db.table_bar TO loading_db.table_bar_tmp,
             loading_db.table_bar TO live_db.table_bar,
             loading_db.table_bar_tmp TO loading_db.table_bar
This produces an atomic switch of all tables in the loading database with the tables in the live database. It will wait for any running queries in the warehouse to finish before performing the swap, which is exactly what we want.
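In practice a statement like this covers every table in the warehouse, so it's the kind of thing you'd generate rather than write by hand. A sketch of that, using the database names from the example above:

use strict;
use warnings;

# Hypothetical helper: build a single RENAME TABLE statement that swaps
# every table between the live and loading databases in one atomic step.
sub build_swap_sql {
    my ($live_db, $loading_db, @tables) = @_;
    my @renames;
    for my $t (@tables) {
        push @renames,
            "$live_db.$t TO $loading_db.${t}_tmp",
            "$loading_db.$t TO $live_db.$t",
            "$loading_db.${t}_tmp TO $loading_db.$t";
    }
    return 'RENAME TABLE ' . join(",\n             ", @renames);
}

# e.g. $dbh->do(build_swap_sql('live_db', 'loading_db', 'table_foo', 'table_bar'));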
CSV Isn't a Standard
One major problem we had in constructing our ETL process was finding a way to speak MySQL's rather perverse CSV syntax (as used by SELECT ... INTO OUTFILE and LOAD DATA INFILE) with Text::CSV_XS. MySQL uses \N to indicate NULLs in CSV output, while also using the backslash as an escape character in strings. Thus, when Text::CSV_XS reads a \N, the backslash gets stripped and we see a plain N before we can process it!
Worse, trying to get Text::CSV_XS to write a literal \N proved impossible. Fortunately, LOAD DATA INFILE will also treat a bare, unquoted NULL in the file as a NULL value, and Text::CSV_XS has no trouble producing that.
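One way to handle the writing side, then, is to map undefined values to the bare word NULL before handing each row to Text::CSV_XS; with the default settings the module only quotes fields that need it, so the word comes through unquoted and LOAD DATA INFILE reads it as a NULL. A small sketch, with an invented output file and row:

use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new({ binary => 1, eol => "\n" });
open my $out, '>', 'person_dimension.csv' or die "open: $!";

# One transformed record; undef fields become the bare word NULL,
# which Text::CSV_XS leaves unquoted and LOAD DATA INFILE reads as NULL.
my @row = (42, 'Jane Example', undef);
$csv->print($out, [ map { defined $_ ? $_ : 'NULL' } @row ]);

close $out;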
An ETL process based on something better standardized like XML wouldn't have these problems, although it might not perform as well due to formatting overhead.
Transforms Are Not Always Needed
Many tables in our operational database don't need any transformation to become dimensions in the warehouse. We have plans to provide a straight database-to-database extract-and-load for these tables using MySQL's INSERT INTO ... SELECT syntax. This should perform much better by skipping the intermediate CSV files entirely.
Incremental Loading Is Highly Desirable
As we developed more tools that use the warehouse, it became more and more desirable to have it kept up-to-date more frequently. We are in the process of designing a system that will keep the warehouse up-to-date in real time. The system will be incremental, moving data as it is added and changed in the main database, likely using triggers to help us know what needs moving without burdening application code.
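As a flavor of the idea, trigger-based change capture might look something like the sketch below. Everything in it is hypothetical (the change_log table, the person table's key column), and it requires MySQL 5.0 or later, since the 4.1 series we run today has no triggers:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:mysql:database=operational;host=localhost',
                       'etl_user', 'secret', { RaiseError => 1 });

# A queue of changed rows for the incremental loader to pick up (hypothetical schema).
$dbh->do(q{
    CREATE TABLE IF NOT EXISTS change_log (
        table_name VARCHAR(64)  NOT NULL,
        row_id     INT UNSIGNED NOT NULL,
        changed_at TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP
    )
});

# Record every update to person without touching application code.
$dbh->do(q{
    CREATE TRIGGER person_after_update AFTER UPDATE ON person
    FOR EACH ROW
    INSERT INTO change_log (table_name, row_id) VALUES ('person', NEW.person_id)
});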
Using MySQL as a Data Warehouse
Any relational database system can be used to build a data warehouse. At Plus Three we chose MySQL v4.1.x for our warehouse system. This choice was natural for us as we're largely a MySQL shop. We chose v4.1.x because it was the stable version when we started development.
Our operational database uses MySQL's InnoDB backend, providing referential integrity and transactions. However, we chose MySQL's MyISAM backend for our warehouse. Warehouse applications are read-only, and as such, have little need for run-time referential integrity checks and transactions. MyISAM can also be faster than InnoDB in some circumstances.
An upside to using MySQL for the warehouse is high capacity and decent performance with little tuning. Our larger installations handle tables with more than one hundred million rows without major difficulties. Although we've spent some time optimizing for particular warehouse queries, the majority were fast enough out of the gate.
All that said, there are features we'd like to have that MySQL is lacking. Perhaps chief among them is support for bitmap indexes, which are ideal for the kind of low-cardinality data commonly found in data warehouses. PostgreSQL gained in-memory bitmap index scans in version 8.1, and a number of commercial database systems offer true bitmap indexes.
We also found MySQL's subquery optimization to be of poor quality. Subqueries are relatively new to MySQL, so there's a good chance this will improve when we upgrade to MySQL v5.1.x soon.
A nontraditional alternative is a column-oriented database such as Vertica or MonetDB (the latter has a DBI driver). Column-oriented database systems store data by column rather than by row, and as such offer performance advantages for typical data-warehouse usage patterns.
Going Further
Data warehousing is a deep topic, and I've only really scratched the surface. If this article has piqued your interest, I suggest you pick up a copy of The Data Warehouse Toolkit by Ralph Kimball and Margy Ross. This book is invaluable in presenting dimensional modeling, the key to getting the most out of a data warehouse.
Sam Tregar is employed as a Perl programmer for Plus Three LP and is an active participant in the Perl development community.