How many open source relational databases can you name? My friend Gabrielle recently sent me the links to two of them. However, if you’re like most technical people, you probably don’t know any — just as I didn’t until recently. I can already imagine many of you saying “bulls**t”, what about MySQL and PostgreSQL?” (to name just two), but those are just databases, not relational databases. Noted (and controversial) database experts Chris Date and Hugh Darwen, building on the work of the founder of relational theory, Edgar F. Codd, have tried to educate people about the actual relational model but with little success. It’s a strange world of relations, relvars, attributes and other things which sound familiar, but when you look closely, they’re not.
Recently I found myself quite proud of taking a moderately complex query and optimizing it so that it took about 5 seconds to run instead of 15 minutes. In running it on a small subset of data, it worked perfectly. When running it on a larger subset, though, it turned out to be broken. One number which should have been around 1000 was actually 22,000. Another programmer and I sat down with it and worked through everything and spent a couple of hours debugging it. After a while, the bug was embarrassingly obvious. One of the INNER JOIN statements was a one to many relationship instead of a one to one relationship. It was returning and summing duplicate rows. We’ll come back to this in a moment.
In the relational model, instead of tables, you have relations. There are a number of differences, but the key one here is that a relation does not allow duplicate tuples (analogous to “rows”). In SQL, you can also create views. Views sort of look like tables, but depending on the implementation, you have various limitations to how you can update, insert, and delete with them.
(I’ll start using the normal “table”, “row” and “column” terms now, but keep in mind that they’re not really the same thing)
In the relational model, based on set theory, a view is simply another table. In fact, when you select data, you’re simply returning a new table. This means that you can do anything with a view that you can with any other table. You can update, insert and delete from them and these changes will automatically propagate to other tables. What this means is that when you select data, you’re guaranteed not to get duplicates. My bug in the SQL I mentioned earlier? It would not be possible.
Interestingly, I’m told that Microsoft once tried to implement this with their Access database by having an implicit DISTINCT on all queries. As it turns out, this was a frequent source of bugs. Just because values are the same doesn’t mean that they represent the same thing. It’s not true that two guys named “John” are the same guy. Microsoft’s scheme might have worked if they had automatically removed duplicate values if and only if they came from the same row, but apparently they didn’t.
Another interesting feature of the relational model is that all columns must have their data types defined. You would have to specify their domain of values and the operators which can operate on them. What does this mean? Consider the following SQL statements.
SELECT first_name FROM customers WHERE age = id; SELECT first_name FROM customers c, orders o WHERE c.id = o.id;
Any of us can glance at those and know they probably don’t make a lot of sense. As queries get large, however, those types (no pun intended) of bugs can creep in. One system I worked on had exactly that type of bug in SQL which was generated on the fly from metadata and it was extremely painful to track down. We frequently had auto-generated 200 line SQL statements and wading through them was an exercise in masochism.
In the relational model, statements like those wouldn’t even compile. The age and id columns would all have their own data types and you simply wouldn’t define operators allowing them to be compared (c.id and o.id would also be different types). However, you could do the following because o.customer_id would have the same data type as c.id. You would define the ‘=’ operator but not ‘>’ or ‘<’.
SELECT first_name, last_name FROM customers c, orders o WHERE c.id = o.customer_id;
In normal SQL, there’s a good chance that query would return duplicate rows and if that’s a subquery, it could cause all sorts of hard-to-detect problems. But you already know what I’m going to say: that’s not possible in the relational model because duplicate rows can’t happen.
The two open source relational databases I mentioned earlier are Rel and LEAP RDB. Since SQL is not relational, neither supports it, but LEAP has plans to support a subset of SQL at some point in the future. There’s also a commercial product named Dataphor, offered by Alphora. I’ve heard good things about it but have not had the chance to use it even though they offer a fully functional evaluation copy.
The more I work with databases, the more I grow frustrated with not having the relational model available. Allegedly, when IBM finally started trying to follow this model, they built a query language named SEQUEL (sound familiar?) but the developers didn’t know much about Codd’s work and had little contact with him. Larry Elison then borrowed their ideas for Oracle and this has led us to the sorry state we’re in today. Fortunately, it seems that a number of people are realizing that the industry standard SQL query language and the databases they are built on top of them have significant limitations, so competitors are starting to sprout up. Competition is Good Thing.
For extra credit, you can read about why NULL values are a source of bugs in databases.