Data warehousing is the practice of building databases specifically to support complex ad-hoc querying by end-users. It isn't a new concept, but it doesn't seem to be a common technique in the Perl database community. In this article, I'll explain the basic parts of a data warehouse and describe how we created one with Perl and MySQL.
At my place of employment, Plus Three LP, we designed a data warehouse as part of our Arcos product to provide efficient access to large databases of supporters for our clients, who are mainly political campaigns and nonprofit groups engaged in fundraising. The warehouse is used for reporting and for list segmentation, which is the process of finding a subset of people to which to send a particular email. The chief architect of our data warehouse at Plus Three is Perrin Harkins, who also helped me with this article. Thanks, Perrin!
If you have experience with database programming, then most likely you've spent your time working on operational databases. Operational databases serve as order entry systems for online stores, content storage for content management systems, post storage for blogs, and much more. The key attribute of an operational database is that it is designed to be modified frequently and with maximum ease. A business will often have more than one operational database; Arcos has two: Krang's content-management database and a database used by our front-end applications.
Operational databases are most commonly designed using normalized modeling, often using third-normal form or entity-relationship modeling. Normalized database schemas are tuned to support fast updates and inserts by minimizing the number of rows that must be changed when recording new data.
Consider, for example, Figure 1--a simple normalized representation of an order-management database. In this normalized layout, changing a category name or a SKU type means altering only a single row.
Figure 1. A simple normalized schema
Data warehouses differ from operational databases in the way they are designed: they are optimized for efficient querying rather than for updating. A data warehouse provides a read-only copy of the data in the operational databases, restructured for query performance. The kind of modeling most commonly used in warehouse design is called dimensional modeling, and the schemas it produces are known as star schemas. In dimensional modeling, a database is organized around a small number of fact tables. Each row in a fact table is a single measurable event: a single sale, a single hit on a web page, and so on.
Surrounding each fact table is a set of dimension tables. Dimension tables add descriptive detail to the facts in the fact table. An important aspect of dimensional modeling is that dimension tables are expected to be both verbose and simple in their construction. Denormalized data is the rule in dimension table design.
Figure 2 shows the previous example in a possible dimensional layout. The normalized data from Figure 1 has been flattened into a single fact table with three dimension tables. If this is your first exposure to dimensional modeling, you're probably recoiling in horror, particularly at the sight of the "Date_Dimension." However, consider how much easier it is to write reporting SQL for this schema--finding all the orders in a particular category, say, or the total of all orders made on holidays. Then consider how much greater the improvement is for the kind of complexity regularly found in real-world normalized schemas.
Figure 2. A simple dimensional schema
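To make the payoff concrete, here is the kind of reporting query the star schema in Figure 2 makes easy. The table and column names (order_fact, product_dimension, date_dimension, and so on) are assumptions based on the figure, not an actual schema:

```sql
-- Total order amounts per category, counting only orders placed on
-- holidays. Every descriptive attribute is exactly one join away from
-- the fact table, so the query keeps the same simple shape no matter
-- which dimension we slice by.
SELECT p.category_name,
       SUM(f.order_amount) AS total_amount
  FROM order_fact f
  JOIN product_dimension p ON (f.product_id = p.product_id)
  JOIN date_dimension d    ON (f.date_id    = d.date_id)
 WHERE d.is_holiday = 1
 GROUP BY p.category_name;
```

Adding a filter on another dimension--a date range, a customer region--is just one more join and one more WHERE clause, rather than a trek through a chain of normalized tables.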
Dimensional modeling is a complex topic, and not one that I'm even qualified to teach. If you decide to build a data warehouse yourself, you'll need to learn more about it.
Data warehouses provide for fast, simple access to the fundamental data of an organization. There are many valid uses for data warehousing, but the most common uses focus on consolidation and simplification of data.
Consolidation of data is necessary when you have multiple database systems and need the ability to write queries that examine data from all of them. These operational databases may be physically separate and may run on different database software. Providing a central place where all of an organization's data is collected and made accessible can be of very high value.
Simplification of data is a more subtle benefit. Data in normalized form is often highly complex. Normalization tends to produce a "spider web" pattern, in which many tables link to many other tables. The central tenet of normalized data design, that each piece of information should be present in exactly one row, necessitates the creation of many tables to store each unique type of data in an application.
Dimensional modeling offers a chance to reduce the level of complexity in your database. By reducing complex chains of tables into dimension tables, the schema becomes smaller and performance tends to improve. For example, our main operational database has 79 tables. Our warehouse has just 18 tables! The two databases contain almost all the same information, but the warehouse schema stores the data in denormalized dimension and fact tables.
If you want to build a data warehouse, your first job is to come up with a dimensional model for your data. This task is largely outside the scope of this article, but I'll point out some good references in Going Further.
Once you have a data schema for your warehouse, you'll need to fill it with data. This process is known as extract, transform, and load, or ETL for short. The first step, extraction, is simply the process of selecting all the data of interest from the operational database. Then the data must be transformed into the format needed by the warehouse. This could be as simple as renaming some of the fields or as complex as cleaning dirty data and computing new fields. Finally the data must be loaded into the data warehouse.
ETL is typically a batch process--often running on a nightly basis. Some systems do real-time ETL, updating the warehouse as new data appears in the operational databases.
Our ETL process uses MySQL's SELECT ... INTO OUTFILE facility to create CSV files containing data from our operational database. Many tables are dumped as-is, but for the more complex cases we mix some transformations into the SELECT statement. For example, here's a statement that extracts all the data from the link table that has corresponding records in either person_link or campaign_link_click:

    SELECT l.link_id,
           l.link_type_id,
           l.url,
           l.from_url,
           l.position,
           l.message_id,
           l.message_part_id,
           COALESCE(l.campaign_id, m.campaign_id) AS campaign_id
      FROM link l
      LEFT OUTER JOIN mailing_message m ON (l.message_id = m.message_id)
     WHERE EXISTS (SELECT 1 FROM person_link
                    WHERE link_id = l.link_id)
        OR EXISTS (SELECT 1 FROM campaign_link_click
                    WHERE link_id = l.link_id)
      INTO OUTFILE ?
           FIELDS TERMINATED BY ',' ENCLOSED BY '"'
           LINES TERMINATED BY '\n'
In the process, each link is assigned a single campaign_id, taken from either the mailing_message record or the link record itself. The details of why this is done aren't important; suffice it to say that doing it in the SELECT statement is easier and faster than it would be in the transform step proper.
You may be wondering what kind of strain extraction places on the operational database. The answer is quite a lot! To deal with this problem we use MySQL's replication system to produce a low-cost copy of the operational database on the warehouse machine. This means the warehouse loader can run without affecting operational performance.
The end result of the extraction process is a set of CSV files containing data destined for the warehouse. We use the excellent Text::CSV_XS module to read these files and apply any additional transformations in Perl code.
For example, we compute data about the most recent, highest, and first contribution given by each person in our database. These values are frequently queried, and queries can run much faster with these values in the person dimension.
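As a sketch of how such pre-computation might look during the transform, the per-person contribution summary can be built with a single aggregate query. The table and column names here are illustrative, not the real Arcos schema:

```sql
-- Hypothetical pre-aggregation: one summary row per person, computed
-- once at load time so reports never have to scan the full
-- contribution table at query time.
SELECT c.person_id,
       MAX(c.amount)            AS highest_contribution,
       MIN(c.contributed_at)    AS first_contribution_date,
       MAX(c.contributed_at)    AS latest_contribution_date,
       COUNT(*)                 AS contribution_count
  FROM contribution c
 GROUP BY c.person_id;
```

The resulting columns land directly in the person dimension, trading a little load-time work for much faster reporting queries.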
Another example is geo-coding. We feed address data to Geo::PostalCode to produce a latitude and longitude for each person. These values go into the person dimension and power "within X miles of" queries. The warehouse transform step is an ideal time to run potentially expensive data-sweetening operations like geo-coding.
The transform step ends by writing out another CSV file with the transformed data. We use Text::CSV_XS to format the data for writing, too.
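A minimal sketch of the parse-transform-reassemble cycle with Text::CSV_XS, working on a single row in memory. The column layout and the lowercase-email transform are illustrative only, not the actual Arcos code:

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Parse one extracted CSV line, normalize a field, and re-assemble the
# row for the load step's CSV. Column layout (person_id, email, zip)
# is a made-up example.
my $csv = Text::CSV_XS->new({ binary => 1 });

my $extracted = qq{42,"Sam\@Example.COM","10001"};
$csv->parse($extracted) or die "parse failed";

my ($person_id, $email, $zip) = $csv->fields;
$email = lc $email;    # normalize email addresses for matching

$csv->combine($person_id, $email, $zip) or die "combine failed";
print $csv->string, "\n";    # prints: 42,sam@example.com,10001
```

In the real pipeline the same getline/print pair runs over whole files via filehandles, but the per-row logic has this shape.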
Finally, the data needs to get into the warehouse. Here we use MySQL's LOAD DATA INFILE command to load the CSV data directly into the database. To speed up the load, we first disable indexes with ALTER TABLE foo DISABLE KEYS, and after the load, we re-enable them with ALTER TABLE foo ENABLE KEYS. Each table needs to be cleared before loading, which we do with MySQL's TRUNCATE TABLE command.
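Put together, the load step for a single table looks roughly like this (foo and the file path are placeholders):

```sql
-- Clear the table, load the transformed CSV with indexes disabled,
-- then rebuild the indexes in one pass at the end.
TRUNCATE TABLE foo;
ALTER TABLE foo DISABLE KEYS;
LOAD DATA INFILE '/tmp/foo_transform.csv'
     INTO TABLE foo
     FIELDS TERMINATED BY ',' ENCLOSED BY '"'
     LINES TERMINATED BY '\n';
ALTER TABLE foo ENABLE KEYS;
```

Rebuilding the MyISAM indexes once at the end is far cheaper than maintaining them row by row during the load.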
You may be wondering what happens to clients using the warehouse while an ETL process is running. In our case, nothing at all! This magic is achieved by keeping two warehouse databases, one in use and the other free for loading. All the data goes into the loading database, and when the load is complete we swap it into place with a big ol' RENAME TABLE:

    RENAME TABLE
        live_db.table_foo TO loading_db.table_foo_tmp,
        loading_db.table_foo TO live_db.table_foo,
        loading_db.table_foo_tmp TO loading_db.table_foo,
        live_db.table_bar TO loading_db.table_bar_tmp,
        loading_db.table_bar TO live_db.table_bar,
        loading_db.table_bar_tmp TO loading_db.table_bar
This produces an atomic switch of all tables in the loading database with the tables in the live database. It will wait for any running queries in the warehouse to finish before performing the swap, which is exactly what we want.
One major problem we had in constructing our ETL process was finding a way to speak MySQL's rather perverse CSV syntax (as used by SELECT ... INTO OUTFILE and LOAD DATA INFILE) with Text::CSV_XS. MySQL uses \N to indicate a NULL in its CSV output, while also using the backslash as an escape character in strings. Thus, when Text::CSV_XS parses a \N, it gets turned into a plain N before we can process it!
Worse, trying to get Text::CSV_XS to write a literal \N proved impossible. Fortunately, LOAD DATA INFILE will also treat the unquoted string NULL as a NULL value, which Text::CSV_XS has no trouble producing.
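The output side of that workaround can be sketched as follows: undefined Perl values are mapped to the bare word NULL before the row is formatted, and since the word contains no characters that force quoting, Text::CSV_XS emits it unquoted, which LOAD DATA INFILE then reads as a NULL. The row data is illustrative:

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Map undef fields to the bare word NULL before formatting the row,
# so LOAD DATA INFILE sees an unquoted NULL and stores a real NULL.
my $csv = Text::CSV_XS->new({ binary => 1 });

my @row = (42, undef, 'Brooklyn');
my @out = map { defined $_ ? $_ : 'NULL' } @row;

$csv->combine(@out) or die "combine failed";
print $csv->string, "\n";    # prints: 42,NULL,Brooklyn
```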
An ETL process based on something better standardized like XML wouldn't have these problems, although it might not perform as well due to formatting overhead.
Many tables in our operational database don't need any transformation to become dimensions in the warehouse. We plan to provide a straight database-to-database extract-and-load for these tables using MySQL's INSERT INTO ... SELECT syntax. This should perform much better by skipping the intermediate CSV files entirely.
As we developed more tools that use the warehouse, it became more and more desirable to have it kept up-to-date more frequently. We are in the process of designing a system that will keep the warehouse up-to-date in real time. The system will be incremental, moving data as it is added and changed in the main database, likely using triggers to help us know what needs moving without burdening application code.
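One way to capture changes for incremental loading is a change-log table populated by triggers (available in MySQL 5.0 and later). A hypothetical sketch for a person table, with all names invented for illustration:

```sql
-- Record the primary key of every changed row. The incremental loader
-- reads person_changes, moves only the affected rows into the
-- warehouse, and then clears the log.
CREATE TABLE person_changes (
    person_id  INT      NOT NULL,
    changed_at DATETIME NOT NULL
);

CREATE TRIGGER person_after_update
AFTER UPDATE ON person
FOR EACH ROW
    INSERT INTO person_changes (person_id, changed_at)
    VALUES (NEW.person_id, NOW());
```

Matching AFTER INSERT and AFTER DELETE triggers would complete the picture, all without touching application code.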
Any relational database system can be used to build a data warehouse. At Plus Three we chose MySQL v4.1.x for our warehouse system. This choice was natural for us as we're largely a MySQL shop. We chose v4.1.x because it was the stable version when we started development.
Our operational database uses MySQL's InnoDB backend, providing referential integrity and transactions. However, we chose MySQL's MyISAM backend for our warehouse. Warehouse applications are read-only, and as such, have little need for run-time referential integrity checks and transactions. MyISAM can also be faster than InnoDB in some circumstances.
An upside to using MySQL for the warehouse is high capacity and decent performance with little tuning. Our larger installations handle tables with more than one hundred million rows without major difficulties. Although we've spent some time optimizing for particular warehouse queries, the majority were fast enough out of the gate.
All that said, there are features we'd like to have that MySQL is lacking. Perhaps chief among them is support for bitmap indexes. Bitmap indexes are ideal for the kind of low-cardinality data that is commonly used in data warehouses. PostgreSQL supports bitmap indexes as of version v8.1, as do a number of commercial database systems.
We also found MySQL's subquery optimization to be of poor quality. Subqueries are relatively new to MySQL, so there's a good chance this will improve when we upgrade to MySQL v5.1.x soon.
A nontraditional alternative is a column-oriented database, like Vertica and MonetDB (which has a DBI driver). Column-oriented database systems store data by column rather than by row, and as such, offer performance advantages for typical data-warehouse usage patterns.
Data warehousing is a deep topic, and I've only really scratched the surface. If this article has piqued your interest, I suggest you pick up a copy of The Data Warehouse Toolkit by Ralph Kimball and Margy Ross. This book is invaluable in presenting dimensional modeling, the key to getting the most out of a data warehouse.
Sam Tregar is employed as a Perl programmer for Plus Three LP and is an active participant in the Perl development community.
Copyright © 2009 O'Reilly Media, Inc.