If you are a DBA or a database programmer, your manager may have asked you to prepare a breakdown of employees by location and department, or a breakdown of products by category and provider. If your boss is a techie, instead of a breakdown she might have mentioned a pivot table, or, if she likes to use "the right words," a cross-tabulation.
Cross-tabulations, also known as "pivot tables" or crosstabs, are statistical reports that group data by one field, creating one column for each distinct value of another field. In colloquial terms, this way of representing data is called "breaking down the data by X and Y," where X and Y are the names of two columns in the dataset.
In the good old days before computers were on every desk, you would take the list of employees and manually count how many males and females were in each department. After spreadsheets became a common working tool, you would use some cute formulas to count the values. Advanced spreadsheets available since the early 1990s could do the whole report for you from a bare list.
In its basic form, a crosstab is as simple as this: starting from a list of values, we want to group them by the values of field A and create a column for each distinct value of field B. The desired result is a table with one column for field A and several columns for each value of field B.
Now, you would think that a database engine, being more powerful than a spreadsheet, would solve this problem quite easily. When your dataset is so large that it won't fit into a spreadsheet, you start digging for an SQL solution. Unfortunately, in standard SQL there is no predefined function to accomplish this task. Nonetheless, you can get the job done with some creativity. Let's assume that your bare data resembles this table.
+------------------------------+
| person |
+----+--------+--------+-------+
| id | name | gender | dept |
+----+--------+--------+-------+
| 1 | John | m | pers |
| 2 | Mario | m | pers |
| 7 | Mary | f | pers |
| 8 | Bill | m | pers |
| 3 | Frank | m | sales |
| 5 | Susan | f | sales |
| 6 | Martin | m | sales |
| 4 | Otto | m | dev |
| 9 | June | f | dev |
+----+--------+--------+-------+
Given the column gender, having distinct values m
and f, a crosstab between dept and gender would generate a row
for each dept, with a column for the dept, one for m, and one
for f.
The task is not trivial, even when the distinct values are known in advance and are relatively scarce. A solution that should work for most DBMS engines from this simple table is:
SELECT dept,
COUNT(CASE WHEN gender ='m' THEN id ELSE NULL END) AS 'm',
COUNT(CASE WHEN gender ='f' THEN id ELSE NULL END) AS 'f'
FROM
person
GROUP BY
dept
It's easy to get discouraged when the possible values increase by a dozen, especially if they may change.
The easy way of performing a cross-tabulation, as I mentioned before, is to export your data locally, load it into a spreadsheet, and use its pivot table capabilities (also called "data pilot," depending on the flavor of your spreadsheet). This solution, affordable when you have to perform the task just once and with a limited dataset, becomes intolerable when you can count your records by the millions. In such cases, a spreadsheet is not the proper tool, and you should turn your attention to DBMS calculation skills.
A hybrid solution would be to export only aggregated data and finish the job with a spreadsheet. Again, this is acceptable when you have to do it once or twice, but if you need those tables as support for your data warehouse, then you are putting too much effort on the client side, and perhaps devoting too much time dealing with macros and supporting code, to accomplish your task in a manner that looks smooth to the final user. There are, of course, specialized tools that take the burden off of your shoulders to do the dirty work for you. The good ones are on the server side and give you the final product without sweating. The bad ones are often colorful and rich in features but they shift the workload to the client machine, resulting in a slow process that makes everybody unhappy.
A viable alternative, then, is a server-side cross-tabulation calculated by the database server. For programmers unaccustomed to SQL, it may seem a Herculean task, and even the smart programmer may look at the complexity ahead with a wary eye. What is needed is a tool that reduces the complexity of the task — building a sometimes-huge SQL statement — to a few parameters that will be translated into the actual query.
DBIx::SQLCrosstab is a Perl module that can do exactly that. But before plunging into details, it would be better to analyze what a crosstab report exactly is, how complex it can get, and what you can ask the DBMS to do.
Using the same raw data seen earlier, let's look at some trivial possibilities.
A simple crosstab between dept and gender gives us:
+-------------------------+
| dept by gender |
+-------+----+----+-------+
| dept | m | f | total |
+-------+----+----+-------+
| dev | 1 | 1 | 2 |
| pers | 3 | 1 | 4 |
| sales | 2 | 1 | 3 |
+-------+----+----+-------+
The query to create this result is:
SELECT
dept,
COUNT(CASE WHEN gender = 'm' THEN id ELSE NULL END) AS m,
COUNT(CASE WHEN gender = 'f' THEN id ELSE NULL END) AS f,
COUNT(*) AS total
FROM
person
GROUP BY
dept
This crosstab is trivial for several reasons:
Before going on to explore further reasons for complexity, I want to explore
the basic method of cross-tabulating with SQL, assisted by Perl. Suppose that
we don't know in advance which values are stored for gender. In
that case, we need to scan the dataset first.
SELECT DISTINCT
gender
FROM
person
Then we can take the result values to create the COUNT lines in
the query. Here is when a supporting high-level programming language comes in
handy. If your DBMS supports stored procedures, you can do that directly on the
server, but its portability is questionable, given the variety of dialects in
that field.
A Perl implementation of this task should be:
my $columns = $dbh->selectcol_arrayref(
"SELECT DISTINCT gender FROM person"
) or die "query failed: $DBI::errstr\n";
my $query = "SELECT dept\n";
for (@$columns) {
$query .= ",COUNT(CASE WHEN gender = '$_' THEN id ELSE NULL END) AS '$_'\n";
}
$query .= ", COUNT(*) AS total\n"
$query .= "FROM person\n"
. "GROUP BY dept";
The resulting query would be:
SELECT dept
,COUNT(CASE WHEN gender = 'm' THEN id ELSE NULL END) AS 'm'
,COUNT(CASE WHEN gender = 'f' THEN id ELSE NULL END) AS 'f'
,COUNT(*) AS total
FROM person
GROUP BY dept
This is almost the same as the one we did manually — minus some stylistic tidbits perhaps, but it gets the job done.
You must have realized that there are some issues with this already. For
example, the header column is hardcoded, as is the total line and the
GROUP BY clause. This code, while saving you some keystrokes, is
not useful in a general case. You may be tempted to make a template out of it,
but wait! There is more to be seen before coding a solution.
Cross-tabulations can get complex in several ways, especially when the level of headers increases, and when several operations are performed in the same result. The level of headers can be further complicated when the values in the column headers have internal dependencies. Consider the case where you have a first level of countries and a second level of locations, where a given location can only belong to a single country. I will explain each issue in detail.
If single-level cross-tabulations haven't worried you, multiple level tables should give you something to ponder. In addition to everything already mentioned, multiple-level crosstabs have:
Query-composition complexity. Each column is the combination of several conditions, one for each level.
Column subtotals. To be inserted after the appropriate section.
Row subtotals. To be inserted after the relevant rows.
An explosive increase of column numbers. For a three-level crosstab where each level has three values, you get 27 columns. If you include subtotals, your number rises to 36. If you have just a few levels with five or six values, you may be counting rows by the hundreds.
Visualization problems. While the result set from the DBMS is a simple matrix, the conceptual table has a visualization tree at the top (for columns) and a visualization tree at the left side (for rows).
| A | B | C1 | C2 | Total | column header 1 | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| D1 | D2 | Total | D1 | D2 | Total | column header 2 | |||||||||||
| E1 | E2 | Total | E1 | E2 | Total | E1 | E2 | Total | E1 | E2 | Total | column header 3 | |||||
| A1 | B1 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | |
| B2 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | ||
| Total | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | row sub total | |
| A2 | B1 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | |
| B2 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | ||
| Total | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | row sub total | |
| Total | --- | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | row total |
| row header 1 |
row header 2 |
col sub total |
col sub total |
col sub total |
col sub total |
col sub total |
col sub total |
col total |
|||||||||
Solving this type of cross-tabulation follows the general guidelines for single-level tabulations. First we collect the values for each column header, and then we produce the lines by combining the distinct values for each row.
Consider the following test database. (A sample DBIx::SQLCrosstab database schema is also available.)
+-------------------+--------------------+
| categories | countries |
| ================= | ================== |
| cat_id category | country_id country |
| ------ ---------- | ---------- ------- |
| 1 employee | 1 UK |
| 2 contractor | 2 Germany |
| 3 consultant | 3 Italy |
+-------------------+--------------------+
+-------------------+--------------------------+
| depts | locs |
| ================= | ======================== |
| dept_id dept | loc_id loc country_id |
| ------- -------- | ------ ------ ---------- |
| 1 pers | 1 Rome 3 |
| 2 sales | 2 London 1 |
| 3 dev | 3 Munich 2 |
| 4 research | 4 Berlin 2 |
| | 5 Bonn 2 |
+-------------------+--------------------------+
+----------------------------------------------+
| person |
| ============================================ |
| id name dept_id loc_id gender salary cat_id |
| -- ----- ------- ------ ------ ------ ------ |
| 1 John 1 2 m 5000 2 |
| 2 Mario 1 1 m 6000 1 |
| 3 Frank 2 5 m 5000 1 |
| 4 Otto 3 4 m 6000 1 |
| 5 Susan 2 3 f 5500 3 |
| 6 Martin 2 2 m 5500 2 |
| 7 Mary 1 4 f 5500 1 |
| 8 Bill 1 3 m 5000 1 |
| 9 June 3 1 f 6000 3 |
+----------------------------------------------+
We want to get the number of persons by crossing country and location with department and gender.
SELECT DISTINCT dept FROM depts; # results in 'pers', 'sales, 'dev'
SELECT DISTINCT gender FROM person; # results in 'm', 'f'
SELECT
country, loc AS location,
COUNT(CASE WHEN dept = 'pers' AND gender = 'f' THEN id ELSE NULL END)
AS 'pers-f',
COUNT(CASE WHEN dept = 'pers' AND gender = 'm' THEN id ELSE NULL END)
AS 'pers-m',
COUNT(CASE WHEN dept = 'sales' AND gender = 'f' THEN id ELSE NULL END)
AS 'sales-f',
COUNT(CASE WHEN dept = 'sales' AND gender = 'm' THEN id ELSE NULL END)
AS 'sales-m',
COUNT(CASE WHEN dept = 'dev' AND gender = 'f' THEN id ELSE NULL END)
AS 'dev-f',
COUNT(CASE WHEN dept = 'dev' AND gender = 'm' THEN id ELSE NULL END)
AS 'dev-m',
COUNT(*) AS total
FROM
person
INNER JOIN
depts ON (person.dept_id=depts.dept_id)
INNER JOIN
locs ON (locs.loc_id=person.loc_id)
INNER JOIN
countries ON (locs.country_id=countries.country_id)
GROUP BY
country, loc
You can see that making this task automatic is a bit trickier than the previous one. Notice that now the column names can be composite, so we need to add a character to separate their components. The choice of such a character can be difficult if the column values already contain non-alphabetic letters. Notice I said "values," not "names," since the values from a column become column names in the crosstab. In addition, if you want subtotals by column, you need to add a count by department in the appropriate places, and our query would become:
SELECT
country, loc AS location,
COUNT(CASE WHEN dept = 'pers' AND gender = 'f' THEN id ELSE NULL END)
AS 'pers-f',
COUNT(CASE WHEN dept = 'pers' AND gender = 'm' THEN id ELSE NULL END)
AS 'pers-m',
COUNT(CASE WHEN dept = 'pers' THEN id ELSE NULL END)
AS 'pers',
COUNT(CASE WHEN dept = 'sales' AND gender = 'f' THEN id ELSE NULL END)
AS 'sales-f',
COUNT(CASE WHEN dept = 'sales' AND gender = 'm' THEN id ELSE NULL END)
AS 'sales-m',
COUNT(CASE WHEN dept = 'sales' THEN id ELSE NULL END)
AS 'sales',
COUNT(CASE WHEN dept = 'dev' AND gender = 'f' THEN id ELSE NULL END)
AS 'dev-f',
COUNT(CASE WHEN dept = 'dev' AND gender = 'm' THEN id ELSE NULL END)
AS 'dev-m',
COUNT(CASE WHEN dept = 'dev' THEN id ELSE NULL END)
AS 'dev',
COUNT(*) AS total
FROM
person
INNER JOIN
depts ON (person.dept_id=depts.dept_id)
INNER JOIN
locs ON (locs.loc_id=person.loc_id)
INNER JOIN
countries ON (locs.country_id=countries.country_id)
GROUP BY
country, loc
The result is here (don't worry about the formatting, for now. We'll come to it soon).
| country | location | pers | sales | dev | total | ||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| f | m | total | f | m | total | f | m | total | |||
| Germany | Berlin | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 2 |
| Bonn | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | |
| Munich | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | |
| Italy | Rome | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 2 |
| UK | London | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 2 |
What about row subtotals and total? You can calculate them in the client,
but assuming that you want them all at once, you can use a UNION query, such as this monster:
SELECT
country, loc AS location,
COUNT(CASE WHEN dept = 'pers' AND gender = 'f' THEN id ELSE NULL END)
AS 'pers-f',
COUNT(CASE WHEN dept = 'pers' AND gender = 'm' THEN id ELSE NULL END)
AS 'pers-m',
COUNT(CASE WHEN dept = 'pers' THEN id ELSE NULL END)
AS 'pers',
COUNT(CASE WHEN dept = 'sales' AND gender = 'f' THEN id ELSE NULL END)
AS 'sales-f',
COUNT(CASE WHEN dept = 'sales' AND gender = 'm' THEN id ELSE NULL END)
AS 'sales-m',
COUNT(CASE WHEN dept = 'sales' THEN id ELSE NULL END)
AS 'sales',
COUNT(CASE WHEN dept = 'dev' AND gender = 'f' THEN id ELSE NULL END)
AS 'dev-f',
COUNT(CASE WHEN dept = 'dev' AND gender = 'm' THEN id ELSE NULL END)
AS 'dev-m',
COUNT(CASE WHEN dept = 'dev' THEN id ELSE NULL END)
AS 'dev',
COUNT(*) AS total
FROM
person
INNER JOIN
depts ON (person.dept_id=depts.dept_id)
INNER JOIN
locs ON (locs.loc_id=person.loc_id)
INNER JOIN
countries ON (locs.country_id=countries.country_id)
GROUP BY
country, location
UNION
SELECT
country, 'zzzz' AS location,
COUNT(CASE WHEN dept = 'pers' AND gender = 'f' THEN id ELSE NULL END)
AS 'pers-f',
COUNT(CASE WHEN dept = 'pers' AND gender = 'm' THEN id ELSE NULL END)
AS 'pers-m',
COUNT(CASE WHEN dept = 'pers' THEN id ELSE NULL END)
AS 'pers',
COUNT(CASE WHEN dept = 'sales' AND gender = 'f' THEN id ELSE NULL END)
AS 'sales-f',
COUNT(CASE WHEN dept = 'sales' AND gender = 'm' THEN id ELSE NULL END)
AS 'sales-m',
COUNT(CASE WHEN dept = 'sales' THEN id ELSE NULL END)
AS 'sales',
COUNT(CASE WHEN dept = 'dev' AND gender = 'f' THEN id ELSE NULL END)
AS 'dev-f',
COUNT(CASE WHEN dept = 'dev' AND gender = 'm' THEN id ELSE NULL END)
AS 'dev-m',
COUNT(CASE WHEN dept = 'dev' THEN id ELSE NULL END)
AS 'dev',
COUNT(*) AS total
FROM
person
INNER JOIN
depts ON (person.dept_id=depts.dept_id)
INNER JOIN
locs ON (locs.loc_id=person.loc_id)
INNER JOIN
countries ON (locs.country_id=countries.country_id)
GROUP BY
country, location
UNION
SELECT
'zzzz' AS country, 'zzzz' AS location,
COUNT(CASE WHEN dept = 'pers' AND gender = 'f' THEN id ELSE NULL END)
AS 'pers-f',
COUNT(CASE WHEN dept = 'pers' AND gender = 'm' THEN id ELSE NULL END)
AS 'pers-m',
COUNT(CASE WHEN dept = 'pers' THEN id ELSE NULL END)
AS 'pers',
COUNT(CASE WHEN dept = 'sales' AND gender = 'f' THEN id ELSE NULL END)
AS 'sales-f',
COUNT(CASE WHEN dept = 'sales' AND gender = 'm' THEN id ELSE NULL END)
AS 'sales-m',
COUNT(CASE WHEN dept = 'sales' THEN id ELSE NULL END)
AS 'sales',
COUNT(CASE WHEN dept = 'dev' AND gender = 'f' THEN id ELSE NULL END)
AS 'dev-f',
COUNT(CASE WHEN dept = 'dev' AND gender = 'm' THEN id ELSE NULL END)
AS 'dev-m',
COUNT(CASE WHEN dept = 'dev' THEN id ELSE NULL END)
AS 'dev',
COUNT(*) AS total
FROM
person
INNER JOIN
depts ON (person.dept_id=depts.dept_id)
ORDER BY
country, location
The first UNION creates the subtotals by country. I replaced
the location name with zzzz to be sure that it will get sorted at
the end of the normal rows for each country. Similarly, the second
UNION has both country and location replaced by zzzz,
so that they will be listed at the very end of the result set. The first
UNION subquery will create one row for each country. The last
subquery will create only one row, for the grand total. Notice that the last
UNION subquery doesn't have a GROUP BY clause. The
ORDER BY clause at the end of the query it will affect the whole
result set.
The result, nicely formatted, with the zzzz translated into a
more readable total, is this:
| country | location | pers | sales | dev | total | ||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| f | m | total | f | m | total | f | m | total | |||
| Germany | Berlin | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 2 |
| Bonn | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | |
| Munich | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | |
| Total | 1 | 1 | 2 | 1 | 1 | 2 | 0 | 1 | 1 | 5 | |
| Italy | Rome | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 2 |
| Total | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 2 | |
| UK | London | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 2 |
| Total | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 2 | |
| Total | Total | 1 | 3 | 4 | 1 | 2 | 3 | 1 | 1 | 2 | 9 |
Automating this task is much harder than a simple crosstab, and the full explanation is beyond the scope of this article. Stay tuned to Perl Monks if you are interested in the implementation details.
|
A crosstab can do much more. When I worked for a large international
organization, I was often asked to provide counts, summaries, and averages
in the same table. If you are doing this manually, it just means
cutting and pasting the column calculation lines in the query, replacing
COUNT with SUM or whichever function is suitable
(MIN, MAX, AVG, STD) and
adding an appropriate item to the column name.
Here is an example of such a table.
| country | location | count | sum | ||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| pers | sales | dev | total | pers | sales | dev | total | ||||||||||||||
| f | m | total | f | m | total | f | m | total | f | m | total | f | m | total | f | m | total | ||||
| Germany | Berlin | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 2 | 5,500 | 0 | 5,500 | 0 | 0 | 0 | 0 | 6,000 | 6,000 | 11,500 |
| Bonn | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 5,000 | 5,000 | 0 | 0 | 0 | 5,000 | |
| Munich | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 0 | 5,000 | 5,000 | 5,500 | 0 | 5,500 | 0 | 0 | 0 | 10,500 | |
| Total | 1 | 1 | 2 | 1 | 1 | 2 | 0 | 1 | 1 | 5 | 5,500 | 5,000 | 10,500 | 5,500 | 5,000 | 10,500 | 0 | 6,000 | 6,000 | 27,000 | |
| Italy | Rome | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 2 | 0 | 6,000 | 6,000 | 0 | 0 | 0 | 6,000 | 0 | 6,000 | 12,000 |
| Total | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 2 | 0 | 6,000 | 6,000 | 0 | 0 | 0 | 6,000 | 0 | 6,000 | 12,000 | |
| UK | London | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 2 | 0 | 5,000 | 5,000 | 0 | 5,500 | 5,500 | 0 | 0 | 0 | 10,500 |
| Total | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 2 | 0 | 5,000 | 5,000 | 0 | 5,500 | 5,500 | 0 | 0 | 0 | 10,500 | |
| Total | Total | 1 | 3 | 4 | 1 | 2 | 3 | 1 | 1 | 2 | 9 | 5,500 | 16,000 | 21,500 | 5,500 | 10,500 | 16,000 | 6,000 | 6,000 | 12,000 | 49,500 |
Of course, we keep adding more and more lines to our query to achieve our goal. Even for our simple example database, using a three-level column header with all the available functions creates a query with more than 200 columns. That's not something you'd like to do manually.
There is more to the crosstab complexity. The method we've seen for
combining multiple-level column values won't work well if two or more levels
are linked together by some dependency. Let's suppose we want to partially transpose one of the examples above, using the department and category as row
headers, and country and location as column headers. A brute-force permutation
of the values would produce impossible combinations, such as
Germany#Rome, Italy#London, or
UK#Munich, which are both a waste of calculation resources and a
nuisance in the result set.
For such cases, a different strategy is necessary. Instead of gathering the columns separately, we need to collect them all at once:
SELECT DISTINCT
country, loc AS location, gender
FROM
person
INNER JOIN locs ON (person.loc_id=locs.loc_id)
INNER JOIN countries ON (locs.country_id=countries.country_id)
ORDER BY
country, location, gender
+---------+----------+--------+
| country | location | gender |
+---------+----------+--------+
| Germany | Munich | f |
| Germany | Munich | m |
| Germany | Bonn | f |
| Germany | Bonn | m |
| Germany | Berlin | f |
| Germany | Berlin | m |
| Italy | Rome | f |
| Italy | Rome | m |
| UK | London | m |
+---------+----------+--------+
Each row in the result represents a combination of values to create a column in our query. If subtotals are required, then the result is first loaded into a tree, which is the main mechanism used by DBIx::SQLCrosstab to produce the necessary permutations with subtotals as needed.
With so many issues to consider, you can still do your queries manually. Perhaps, having good organizational skills and a powerful editor, you may be able to create your queries in a few minutes. However, if your application needs crosstabs on demand, as a data warehouse would, then you should consider using a specialized tool that can create queries at runtime.
To use DBIx::SQLCrosstab, you need to download it from CPAN and install it in your system. Its only prerequisite is Tree::DAG_Node. If you want some specialized output, then you may also optionally install YAML and Spreadsheet::WriteExcel.
A DBIx::SQLCrosstab object requires an argument containing the parameters to build the query, passed as a hash reference.
my $xtab = DBIx::SQLCrosstab->new($params)
or die "error building object ($DBIx::SQLCrosstab::errstr)\n";
For every method in the module that is returning something, you can check
the variable $DBIx::SQLCrosstab::errstr, which will contain the
reason for failure.
A few parameters are required to build our query. Let's create a set of them to build the crosstab seen in the above "Multi-level cross-tabulations" section.
my $params = {
dbh => $dbh, # a database handler
op => 'COUNT', # the operation to perform
op_col => 'id', # which column to count
from => qq{
person
inner join locs ON (locss.loc_id=person.loc_id)
inner join countries ON (countries.country_id=locs.country_id)
},
# ...
};
The database handler identifies the database where we operate.
op is the operation to perform, such as COUNT,
SUM, or AVG. op_col is the column to
count. It could be any non-null column for a count, or an appropriate numeric
column for summaries and averages. The from item is the same thing
you would indicate in the FROM clause in a SQL statement. You
need to indicate any INNER or OUTER join as
necessary, so that all of the values you later indicate for columns or row headers
are properly referenced. Next comes the definition of rows and column headers.
Each one is an array reference, as follows:
my $params = {
# ...
rows => [
{ col => 'country', alias => 'country' },
{ col => 'loc', alias => 'location' }
],
};
For each header, only one item (col) is mandatory, while
alias is just for readability purposes. Let me remind you that such
columns are related to the from item, which must have the
appropriate JOIN parts to include the table containing such
columns.
$params = {
# ...
cols => [
{ id => 'dept_id', value => 'dept', from => 'depts' },
{ id => 'cat_id', value => 'category', from => 'categories' },
{
id => 'gender',
from => 'person'
col_list => [
{ id => 'f', value => 'f' },
{ id => 'm', value => 'm' }
],
}
],
};
For column headers, the mandatory fields are id and
from. Any value is used to create the final column
name. You may notice that while depts and categories are in the
column list, they are not in the main from description. This is an
optimization that is further described in "Inner JOINs" later. The description
of gender has another peculiarity. It provides a list of values to
be used by the engine without querying the database. This is also an
optimization, useful for such cases when you know the values in advance (for
example, when a CONSTRAINT in your query limits the values to a
given set).
Now, to get the exact result as our example below, we need to add a few optional parameters:
$params = {
# ...
col_total => 1, # adds a total column at the right end of the table
col_sub_total => 1, # adds sub-totals columns where appropriate
row_total => 1, # adds a grand total at the bottom end of the table
row_sub_total => 1, # adds sub-totals rows where appropriate
remove_if_zero => 1, # remove any column with all zeroes
add_colors => 1, # display the table with default colors
commify => 1, # adds thousand-separating commas to numbers
};
Spending a few minutes at the drawing desk before actually writing your query parameters could spare you time and headaches afterwards. If you are at liberty to choose, try to designate those columns with larger numbers of values as row headers, since the DBMS will handle them better. Assigning them as row headers will increase the number of final columns to generate, requiring more calculation effort and making the table less readable. It's also worth it to assign columns as row headers when the column depends on other columns. Although the module can handle these cases just fine, the strain on the DBMS is higher, because in several cases, you need to query the main data set twice: once to find the value and once to get the final results.
Sometimes, though, you need a JOINed query to find the
appropriate values and save resources. Consider the case of a lookup table for
categories, where you have a few hundred values. If you only need to use a
dozen, it would be better to define the column header with a from
clause, such as:
$params = {
# ...
cols => [
{
id => 'cat_id',
value => 'category',
from => qq{
categories
INNER JOIN person
ON (categories.cat_id=person.cat_id')
}
},
# ...
],
};
This way, only the relevant values are chosen, sparing the DBMS engine the useless calculation of a few hundred unused categories.
When the values of a column header come from the main data set, you may
consider running the query once, finding the distinct values, and then
providing them as a col_list for the subsequent runs. This
practice is also useful when you only need to crosstab a specific subset.
There are many parameters that can be passed to your DBIx::SQLCrosstab object, and you can use them to keep control of what is going to happen.
The most important options are where and having.
They will limit the recordset to the records you want. There is no limitation
to what you can require in these options, provided that they are legal SQL
syntax that apply to the tables being queried. You may want, for instance, to limit your query
for employees to the ones hired in the last month. If there is a field
identifying the start of contract date, and your database engine supports date
calculation, you can add a where clause saying "start_of_contract
>= now() - interval 30 days" or simply "start_of_contract >=
'YYYY-MM-DD' ", depending on the particular SQL dialect of your DBMS. With
having clauses, you can limit the result set by stating which
criteria the aggregated values must have, so you can say something like "having
=> pers > 10000" to define that you want only those rows where a sum of
salaries for pers is bigger than 10000.
$params = {
# ...
where => 'start_of_contract >= now() - interval 30 days',
having => 'pers > 10000',
};
When defining the column headers, we have already seen that we can explicitly define a list of values. Or we can explicitly exclude one or more values, to instruct the engine to retrieve all of them from the database, minus the ones that we list:
$params = {
# ...
cols => [
{ id => 'dept_id', value=> 'dept',
exclude_values => [ 'dev', 'pers' ]
},
#...
]
};
If we modify the previous request with this definition for the
dept header, the engine will only retrieve values for
sales department. Another method of excluding columns is to use
the general option col_exclude, which removes one column from the
query. You need to figure out how the query names would be created, and then
you may say:
$params = {
# ...
col_exclude => [ 'pers#contractor#f', "sales#consultant#m" ]
# ...
};
These rows were present in our example query, but their results were thrown out because all the values were zeroes. If we know that for sure, then we can improve the query execution speed by removing them in the first place.
To exclude a value for a row header, use a where clause. For
example, "where => country != 'Germany'".
When using a wrapper module to develop database applications, you usually trade ease of use for performance. DBIx::SQLCrosstab is not a true wrapper, because it is not a replacement for the whole database interface. Instead, it's a query builder — a very specific one, but a builder nonetheless. Most builders are peculiar in that they create queries according to some general rules, and the resulting statements often perform poorly. This module is no exception. You can use all of the options offered, but you need to understand the possible drawbacks.
A sensitive issue surrounds subtotals. Regarding column headers, a subtotal is not much of a problem, since it is only a few more calculated fields in the query. If your query grows to several hundred columns, though, it may become a problem. Keep this in mind. Also consider the purpose of your result. If you need to feed the result to a chart generator, you probably don't need subtotals. They could be misleading, anyway, if they are treated as normal columns and therefore generate wrong sums. The same principle applies if you want to export your results to a spreadsheet to perform further calculations.
Row totals and subtotals are a different case. While the same caveats apply
when you want to export to a chart, the performance penalty of calculating
subtotals can be huge, because DBIx::SQLCrosstab generates a
UNION query for each row header (if there are more than one), plus
another UNION query for the grand total. Any decent database
server can handle this task for a few hundred thousand records. When it comes
to millions, however, you may be kept waiting for much longer than you
expected. If you have very large databases, make some measurements (even better, do it when normal users are disconnected), and decide if you want to trade waiting time for the additional hassle of dealing with the subtotals outside of the server. I
have successfully tested subtotals in a four-million-record database, and the
results came in acceptable times. It could be that the design of my database is
better than average, or my DBMS is inherently optimized for these tasks. Either
way, I can't predict the effects of using subtotal queries on a huge database
with ten or twenty million records.
I'd like giving a piece of general advice for dealing with large
crosstabs, and this is to index all of the columns involved in the operation -- both the headers and the one being calculated. The optimization in most database
engines can take advantage of indexes in most of the operations related to
cross-tabulation: DISTINCT queries, GROUP BY clauses,
summaries, and JOINs.
JOINsIn the example given when explaining the parameters, I mentioned an internal
optimization in DBIx::SQLCrosstab to minimize the number of JOINs
necessary to perform the query. This feature is related to column headers whose
values come from a lookup table. If you indicate a key column that is in both
the main table and the lookup table, then you don't need to join the table in
the final query.
Let me explain better. In our example, we used a column header description for departments including:
$params = {
# ...
cols => [
{id => 'dept_id', value ='dept' },
# ...
],
};
The resulting line in the query is:
COUNT(CASE WHEN dept_id = 1 THEN id ELSE NULL END) AS 'pers',
Since dept_id is in both person and
depts, you don't need to add depts to your
from clause, because the values for depts are already stored in
the crosstab engine. I call this an "hidden join," because it performs the same
link as a JOIN but without its burden.
On the contrary, if your definition was:
$params = {
# ...
cols => [
{id => 'dept' },
# ...
],
};
Then the resulting SQL would have been:
COUNT(CASE WHEN dept = 'pers' THEN id ELSE NULL END) AS 'pers',
Since dept is not a column of person, the you
should have been forced to include depts in your from
clause, thus querying the table twice. Therefore, your final query, the
expensive one, would be slower. Think about this possibility whenever
circumstances allow it.
When talking about column interdependency, I said that DBIx::SQLCrosstab can
deal with these cases appropriately. You can instruct the module to use this
method by adding a group option to the first column definition.
When this option is selected, the crosstab engine uses the from
clause of the first column definition to build the entire headers tree.
$params = {
# ...
cols => [
{
id => 'country',
from => q{
person INNER JOIN locs
ON (person.loc_id=locs.loc_id)
INNER JOIN countries
ON (countries.country_id = locs.country_id)
},
group => 1,
orderby => 'country, loc, gender',
},
{ id => 'loc', from => '1' },
{ id => 'gender', from => '1' },
],
};
When the crosstab engine finds this definition, instead of trying to
retrieve the column values separately, it will create a single query. Notice
that the from clause in the second and third column definitions is
just 1, as a placeholder, because only the first definition is
used for that purpose.
SELECT DISTINCT
country, loc, gender
FROM
person
INNER JOIN locs ON (person.loc_id=locs.loc_id)
INNER JOIN countries ON (locs.country_id=countries.country_id)
ORDER BY
country, loc, gender
It will use the result as a list of permutations, from which it will create the different combinations.
The drawback of this method is that you might possibly query the main data
set twice. You need to balance the benefits of having just the right column
calculation against calculating all of the combinations and then removing the
unwanted ones using the option remove_if_zero.
You can also use this method when there is no dependency between column headers, but you just found out that the engine is creating more combinations than necessary.
The only useful results you can get from DBIx::SQLCrosstab are the query and the result set.
my $query = $xtab->get_query()
or die "Error $DBIx::SQLCrosstab::errstr\n";
my $records = $xtab->get_recs()
or die "Error $DBIx::SQLCrosstab::errstr\n";
Sometimes, this is just what you need. If you plan to use your query in a static environment, all you want is to create the query and then paste it into your application code. Or you may want to do some checking in the records and display what you want in a custom routine.
In most cases, though, you may need more than that. Therefore, instead of using DBIx::SQLCrosstab, you can use a derived class, DBIx::SQLCrosstab::Format, which has exactly the same interface, with additional capability to format your results in many different ways.
Thus, instead of creating the object as in the previous example, you'd do it using the child class:
my $xtab = DBIx::SQLCrosstab::Format->new( $params )
or die "...";
Then, you have the choice about the format of your results.
# first, be sure that everything is OK
if ( $xtab->get_query() and $xtab->get_recs() )
{
my $html = $xtab->as_html() or die "error ...";
print $html;
}
The method as_html() returns a well-formatted HTML table,
properly enriched with column and row spans, optionally colored with the colors
of your choice (see the module documentation for details).
HTML is not the only format. This module can also handle XML, YAML, XLS (Excel spreadsheet), and CSV formats, with the relative methods:
as_xml: produces a rich XML output. For a simpler
one, you could just do:
use XML::Simple;
my $simple_xml = XMLout(
$xtab->as_perl_struct('hoh'),
keeproot => 1,
noattr => 1 );as_yaml: requires YAML
as_xls: produces direct and transposed tables (requires Spreadsheet::WriteExcel)
as_csv
Being a Perl module, it is also quite easy to export Perl data structure, so
you can use the method as_perl_struct() with one of the following
parameters:
lol: a list of lists (as would be returned from the
get_recs() method)
losh: a list of simple hashes (one key per column)
loh: a list of tree-like hashes, with the tree from the
column headers for each row
hoh: a hash of hashes, or a full result tree that can
be easily passed to YAML or XML::Simple
The actual list of formats is over, but the possibilities are much wider. The module provides some handles that a skilled programmer can use to create more specialized views. I will describe these issues in another article.
To get an idea of the module potential, you can play with an interactive example, which can show you the parameters for complex crosstabs and some interesting output formats.
Then it's your turn. Do your own testing and put DBIx::SQLCrosstab to good use!
Thanks to Stefano Rodighiero (a.k.a. larsen) for proofreading this article and for his constructive criticism. Thanks also to chromatic for making a true article from a difficult draft.
Giuseppe Maxia is a QA developer in MySQL community team. A system analyst with 20 years of IT experience, he has worked as a database consultant and designer for several years. He is a frequent speaker at open source events and he's the author of many articles. He lives in Sardinia (Italy).
Return to ONLamp.com.
Copyright © 2007 O'Reilly Media, Inc.