|
Comparing Data Sets with SQLby Jonathan Gennick, co-author of Transact-SQL Cookbook04/08/2002 |
SQL is a set-oriented language. That's one of its great strengths.
Rather than write lengthy code to manipulate large data sets one record at a time, you can treat those data sets as individual units and accomplish a great deal with a small amount of code. An interesting set-oriented problem that I'll explore in this article is the comparing of two sets of rows, whether from the same table or from different tables, to see whether the two row sets are the same.
For this article I'll use a library-based scenario, and we'll examine the question of how to determine whether two branch libraries stock the same set of books. To begin with, let's assume that our library system uses the following table to track the books held by each branch:
CREATE TABLE branch_book_list (
branch_name CHAR(10),
book_ISBN CHAR (13),
book_name CHAR(40),
PRIMARY KEY (branch_name, book_ISBN)
)
I could easily have used a separate table for each branch, and, in fact, the techniques I show in this article apply equally to the case where you need to compare two tables for equality. A table, in this context, is merely a way to define a set of rows. For this article though, I'll compare two row sets from the same table.
Following is some data I've used to populate the branch_book_list table. Pay particular attention to the differences in the books held by the two branches:
branch_name book_ISBN book_name
----------- ------------- ----------------------------------------
Branch A 1-56592-401-0 Transact-SQL Programming
Branch A 1-56592-578-5 Oracle SQL*Plus: The Definitive Guide
Branch A 1-56592-756-7 Transact-SQL Cookbook
Branch B 1-56592-401-0 Transact-SQL Programming
Branch B 1-56592-756-7 Transact-SQL Cookbook
Branch B 1-56592-948-9 Oracle SQL*Loader: The Definitive Guide
The branch_book_list table has a primary key on the combination of the branch_name and book_ISBN columns. This primary key ensures that no two rows in the table are identical, and that constraint has a direct bearing on the solution that we can apply to compare two row sets.
First we'll look at a solution that applies when a primary key (or a unique index) prevents row duplication. Then we'll remove the primary key and see how the possibility of duplicate rows complicates our task.
To begin with, we might want to think about ways in which the two row sets can be different. Our example table has a primary key on branch_name and book_ISBN, and so we might think about differences in the following manner:
At first glance it appears we have three conditions to think about. In reality though, condition three is really a special case that combines the first two. Considering the following two rows:
branch_name book_ISBN book_name
----------- ------------- ---------------------------------------
Branch A 1-56592-756-7 Transact-SQL Cookbook
Branch B 1-56592-756-7 SQL Cookbook
Ignore for a moment that two different books can't share the same ISBN. One way to think about these rows is that each branch has the same book, but with a different name. Another way of thinking, and one that works better for the problem of comparing two row sets, is to think of each branch having a book that the other does not have. Branch A has the Transact-SQL Cookbook, and Branch B does not. Conversely, Branch B has the SQL Cookbook, and Branch A does not.
The bottom line is that when comparing two row sets we need only to look for rows in one set that aren't in the other.
Let's begin by writing a query to find all books held by Branch A that are not also held by Branch B. Using SQL Server, we can achieve this goal through the use of a subquery, as follows:
SELECT bbl1.*
FROM branch_book_list bbl1
WHERE branch_name = 'Branch A'
AND NOT EXISTS (
SELECT bbl2.*
FROM branch_book_list bbl2
WHERE branch_name = 'Branch B'
AND bbl1.book_ISBN = bbl2.book_ISBN
AND bbl1.book_name = bbl2.book_name)
branch_name book_ISBN book_name
----------- ------------- ----------------------------------------
Branch A 1-56592-578-5 Oracle SQL*Plus: The Definitive Guide
The outer query returns books held by Branch A. The subquery compares those books to the books held by Branch B. The branch_name column is used to identify the two sets we are comparing. The subquery's WHERE clause then checks all other columns for equality between the two sets, thus defining "equality" to mean that all columns except the branch name must be equivalent in order for two branches to have the same book.
Next, we turn our query around to identify those books held by Branch B that are not also held by Branch A:
SELECT bbl1.*
FROM branch_book_list bbl1
WHERE branch_name = 'Branch B'
AND NOT EXISTS (
SELECT bbl2.*
FROM branch_book_list bbl2
WHERE branch_name = 'Branch A'
AND bbl1.book_ISBN = bbl2.book_ISBN
AND bbl1.book_name = bbl2.book_name)
branch_name book_ISBN book_name
----------- ------------- ----------------------------------------
Branch B 1-56592-948-9 Oracle SQL*Loader: The Definitive Guide
If both queries return zero rows, there are no differences between the two sets, meaning that Branch A has the exact same set of books as Branch B. By taking advantage of SQL's UNION operator, we can combine the two queries into one as follows:
SELECT bbl1.*
FROM branch_book_list bbl1
WHERE branch_name = 'Branch A'
AND NOT EXISTS (
SELECT bbl2.*
FROM branch_book_list bbl2
WHERE branch_name = 'Branch B'
AND bbl1.book_ISBN = bbl2.book_ISBN
AND bbl1.book_name = bbl2.book_name)
UNION
SELECT bbl1.*
FROM branch_book_list bbl1
WHERE branch_name = 'Branch B'
AND NOT EXISTS (
SELECT bbl2.*
FROM branch_book_list bbl2
WHERE branch_name = 'Branch A'
AND bbl1.book_ISBN = bbl2.book_ISBN
AND bbl1.book_name = bbl2.book_name)
branch_name book_ISBN book_name
----------- ------------- ----------------------------------------
Branch A 1-56592-578-5 Oracle SQL*Plus: The Definitive Guide
Branch B 1-56592-948-9 Oracle SQL*Loader: The Definitive Guide
If this union query returns no rows, then the two sets of books are equivalent. If the two sets of books are not the same, this query will return the list of books not held by both branches. In this case, the two sets are not equivalent, and the result is the list of books not common to both Branch A and Branch B.
It's unfortunate that SQL Server (at least not as of SQL Server 2000) does not implement the ANSI SQL standard EXCEPT operator. EXCEPT greatly simplifies the task of finding the difference between two sets. Let's revisit the problem of finding books held by Branch A that are not also held by Branch B, but this time using EXCEPT:
SELECT *
FROM branch_book_list
WHERE branch_name = 'Branch A'
EXCEPT
SELECT *
FROM branch_book_list
WHERE branch_name = 'Branch B'
See how much easier this is to understand than the subquery solution shown earlier? The full solution for finding the difference between the two row sets then becomes as follows:
(SELECT *
FROM branch_book_list
WHERE branch_name = 'Branch A'
EXCEPT
SELECT *
FROM branch_book_list
WHERE branch_name = 'Branch B')
UNION
(SELECT *
FROM branch_book_list
WHERE branch_name = 'Branch B'
EXCEPT
SELECT *
FROM branch_book_list
WHERE branch_name = 'Branch A')
Not only is this solution for comparing two sets easier to write and understand at a glance, it won't fail if we add or remove a column from the branch_book_list table; the subquery solution requires that each column be listed in the subquery's WHERE clause. It's really too bad that Microsoft doesn't support the full range of ANSI standard set operators.
What if branches have multiple copies of the same book? The solution shown up to this point assumes no duplicates in the set of rows for a given branch. The primary key constraint in the branch_book_list table includes book_ISBN, making the assumption of no duplicates a valid one. Let's now remove that constraint, and assume that we have the following information in our branch_book_list table:
branch_name book_ISBN book_name
----------- ------------- ----------------------------------------
Branch A 1-56592-756-7 Transact-SQL Cookbook
Branch A 1-56592-401-0 Transact-SQL Programming
Branch A 1-56592-401-0 Transact-SQL Programming
Branch B 1-56592-756-7 Transact-SQL Cookbook
Branch B 1-56592-756-7 Transact-SQL Cookbook
Branch B 1-56592-401-0 Transact-SQL Programming
Branch A has just one copy of the Transact-SQL Cookbook, while Branch B has two copies. Likewise, Branch B has one copy of Transact-SQL Programming, whereas Branch A has two. The query shown earlier in this article that tests for equality will show the two sets of books as being equal because that query does not take into account the quantity of each book.
Because duplicates are now a possibility, we must write a query to check not only for the existence of each book in each set, but also to check the quantity of each book in each set. We can do this using GROUP BY to summarize each set by book. For example, the following query returns the list of books held by Branch A that are not held in identical quantities by Branch B:
SELECT bbl1.*, COUNT(*) DupeCount
FROM branch_book_list bbl1
WHERE bbl1.branch_name = 'Branch A'
GROUP BY bbl1.branch_name, bbl1.book_ISBN, bbl1.book_name
HAVING NOT EXISTS (
SELECT bbl2.*, COUNT(*)
FROM branch_book_list bbl2
WHERE branch_name = 'Branch B'
GROUP BY bbl2.branch_name, bbl2.book_ISBN, bbl2.book_name
HAVING bbl1.book_ISBN = bbl2.book_ISBN
AND bbl1.book_name = bbl2.book_name
AND COUNT(*) = COUNT(ALL bbl1.book_ISBN))
branch_name book_ISBN book_name DupeCount
----------- ------------- ---------------------------------------- -----------
Branch A 1-56592-401-0 Transact-SQL Programming 2
Branch A 1-56592-756-7 Transact-SQL Cookbook 1
Rows are now summarized by branch, ISBN, and name, allowing us to use COUNT(*) to get the quantity of each book at each branch. The WHERE clause still defines our two sets, restricting the outer query to Branch A books and the inner query to Branch B books. However, the work of comparing two sets for equality has been moved
into the HAVING clause of the subquery, and that work now includes a
comparison of the quantity held by each of the two branches.
Of course, the full solution to the problem of comparing books held by both branches involves both the previous query and its converse:
SELECT bbl1.*, COUNT(*) DupeCount
FROM branch_book_list bbl1
WHERE bbl1.branch_name = 'Branch A'
GROUP BY bbl1.branch_name, bbl1.book_ISBN, bbl1.book_name
HAVING NOT EXISTS (
SELECT bbl2.*, COUNT(*)
FROM branch_book_list bbl2
WHERE branch_name = 'Branch B'
GROUP BY bbl2.branch_name, bbl2.book_ISBN, bbl2.book_name
HAVING bbl1.book_ISBN = bbl2.book_ISBN
AND bbl1.book_name = bbl2.book_name
AND COUNT(*) = COUNT(ALL bbl1.book_ISBN))
UNION
SELECT bbl1.*, COUNT(*) DupeCount
FROM branch_book_list bbl1
WHERE bbl1.branch_name = 'Branch B'
GROUP BY bbl1.branch_name, bbl1.book_ISBN, bbl1.book_name
HAVING NOT EXISTS (
SELECT bbl2.*, COUNT(*)
FROM branch_book_list bbl2
WHERE branch_name = 'Branch A'
GROUP BY bbl2.branch_name, bbl2.book_ISBN, bbl2.book_name
HAVING bbl1.book_ISBN = bbl2.book_ISBN
AND bbl1.book_name = bbl2.book_name
AND COUNT(*) = COUNT(ALL bbl1.book_ISBN))
branch_name book_ISBN book_name DupeCount
----------- ------------- ---------------------------------------- -----------
Branch A 1-56592-401-0 Transact-SQL Programming 2
Branch A 1-56592-756-7 Transact-SQL Cookbook 1
Branch B 1-56592-401-0 Transact-SQL Programming 1
Branch B 1-56592-756-7 Transact-SQL Cookbook 2
SQL is a powerful set-manipulation language. The techniques shown in this article can be applied not only to two sets of rows from the same table, but also to two sets of rows from different tables, and beyond that to comparing all rows from two different tables. When unit-testing, for example, you might want to compare the results of a process with a set of known good results held in a table maintained for that purpose.
Many other set-oriented SQL solutions are possible. The comparison of two row sets is but one example of what you can do, and do easily, when you make the mental leap from using SQL as just a row-retrieval mechanism to using SQL as a set definition and manipulation mechanism.
Jonathan Gennick is an O'Reilly Media, Inc. editor specializing in database and programming titles.
Return to the O'Reilly Network.
Copyright © 2007 O'Reilly Media, Inc.