My friend Richard, who runs the Business Intelligence SIG for SDForum, is putting together a slightly larger event; it’s a panel discussion on relational databases.


He’s got a lot of interesting ideas in his head, and they got me to wondering: is the role of the relational database going to change dramatically in the next 10 years?


Right now, relational databases are used in almost every enterprise application. That’s a slight exaggeration, of course, but lots and lots of applications use them. And they’re convenient enough, and cheap enough, that they’re surreptitiously included in lots of client applications as well.


The key benefits that relational databases bring are persistence, transactions, reliability, and indexing. The price you pay is that your data has to be shoehorned into the relational model. Sometimes the shoehorning is gentle; sometimes it’s an act of violence. On top of that, you’ve either got to write the queries yourself or use some sort of object-persistence tool, and you’ve got all the overhead of the RDBMS itself.


I like using databases enough that, on occasion, I’ve advocated using them everywhere. For example, here’s a snippet I posted to the original wiki in 1997:


I also think the use of embedded databases is going to skyrocket. Every time I look at them, I start drooling. Not because I have a compelling new application that requires them, but because I think they make a lot of currently-complex tasks a little easier. It’s going to require a mindset-change on the part of a lot of developers (using an RDBMS for persistent data is a lot different from using an embedded database within an application for non-persistent data), and it’s emphatically a step away from OO onto a much more declarative path, but I think the potential is mind-blowing.


What I was doing there was looking at Moore’s law, looking at how relational databases simplify my life in the enterprise application universe, and thinking “oooh. Moore’s law says that I can get me one of them database things in every process. Cool!”


But here’s what Moore’s law really says:


You can get you one of them indexing and persistence things in every process. And, as time goes on, you’ll be able to spend more and more CPU cycles on the indexing and persistence thing.

That is, Moore’s law gives me permission to use indexing and persistence engines in every process, but it doesn’t insist that I use relational databases. And I’m starting to think I don’t really want to, for five reasons:

  • The fields in structured data are often disjoint sets (Peter Norvig first pointed this out to me). That is, a value that shows up in one column rarely shows up in any other. Suppose you have a database table for books, and it has the following columns: PUBLISHER, YEAR_PUBLISHED, AUTHOR, and PRICE. It’s fairly clear that “SELECT * WHERE PRICE = $19.99” returns exactly the same rows as “SELECT * WHERE [any column] = $19.99,” because $19.99 isn’t going to turn up as a publisher, a year, or an author.


    Side-note: does anyone have any references for this? It seems true enough, but have any empirical studies been done?

  • The fields in structured data are often enumerated types. Sometimes they’re numbers, and can take on any value. But they’re often one of a small list of nouns.

  • The world is full of semi-structured data (I count XML in here, but there’s lots more) that’s hard to fit into the relational model.

  • Text indexing systems like Lucene are pretty much available for free in any programming language you want to use. And they do a bang-up job.

  • It sounds an awful lot like the next generation of operating systems is going to offer much better indexing into the file system as a matter of course.
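To make the first reason a little more concrete, here’s a minimal sketch of the disjoint-sets observation using Python’s built-in sqlite3 module. The table layout follows the books example above, but the rows, the table name, and the variable names are all made up for illustration. It only shows that the two queries agree on this toy data; it isn’t evidence about real-world schemas.

```python
import sqlite3

# In-memory database; the schema mirrors the books example, the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE books ("
    "publisher TEXT, year_published INTEGER, author TEXT, price REAL)"
)
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?, ?)",
    [
        ("Example Press", 1997, "A. Author", 19.99),
        ("Example Press", 2003, "B. Writer", 34.50),
        ("Sample House", 2001, "C. Scribe", 19.99),
    ],
)

# The "structured" query: we name the column we care about.
by_column = conn.execute(
    "SELECT * FROM books WHERE price = ? ORDER BY rowid", (19.99,)
).fetchall()

# The "unstructured" query: match the value against every column.
any_column = conn.execute(
    "SELECT * FROM books "
    "WHERE publisher = :v OR year_published = :v OR author = :v OR price = :v "
    "ORDER BY rowid",
    {"v": 19.99},
).fetchall()

# Because no publisher, year, or author equals 19.99, the results coincide.
same_result = by_column == any_column
```

If a value could plausibly live in two columns (say, a year that is also a price), the two queries would start to diverge, which is exactly the case the side-note above is asking about.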


If you’re reading this and thinking “all he’s saying is that Lucene is useful for indexing XML fragments,” you’re halfway there. And if you’re an XML lunatic who then says “Hey! Wow! And the world, in its entirety, is entirely composed of XML (or possibly RDF) fragments,” then you’ve gone way too far. What I know is that the world is mostly made up of semi-structured data, and I know that database schemas often evolve at a ferocious rate because, when we impose more structure, we often get it wrong.


And so now what I’m wondering is if I was completely off base in 1997. That is, I’m wondering if Moore’s law really says that relational databases are going to become vastly less important over time, because for most applications there’s a less-structured (and less efficient) way to do things that’s more convenient for the programmers.


In much the same way that we moved to “higher level” and “scripting” languages, I’m starting to think we’re going to move backwards, towards “more primitive” indexing systems where we just toss all the documents into the indexer and then pull out things based on text search.
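Here’s a rough sketch of what “toss all the documents into the indexer and then pull things out by text search” might look like at its most primitive. This is a toy inverted index in plain Python, not Lucene and not how Lucene works internally; the class name `TinyIndex` and its methods are hypothetical, and the tokenizer is deliberately naive.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: maps each lowercase word to the set of doc ids containing it."""

    def __init__(self):
        self._docs = {}
        self._postings = defaultdict(set)

    def add(self, doc_id, text):
        # No schema: any blob of text (XML fragment, email, log line) goes in as-is.
        self._docs[doc_id] = text
        for token in re.findall(r"\w+", text.lower()):
            self._postings[token].add(doc_id)

    def search(self, query):
        # Return ids of documents containing every word in the query (simple AND search).
        tokens = re.findall(r"\w+", query.lower())
        if not tokens:
            return []
        hits = set.intersection(*(self._postings.get(t, set()) for t in tokens))
        return sorted(hits)


index = TinyIndex()
index.add(1, "<book><author>A. Author</author><publisher>Example Press</publisher></book>")
index.add(2, "Meeting notes: panel discussion on relational databases")
index.add(3, "<book><author>B. Writer</author><publisher>Sample House</publisher></book>")

press_hits = index.search("example press")
panel_hits = index.search("relational panel")
```

Notice that nothing here knows what a “publisher” is. The XML tags are just more words, which is both the convenience and the inefficiency of the approach.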


The embarrassing thing about this little essay (it’s too long to call it an entry) is that I think I might have just understood Perl for the first time.

What do you think? What’s the role of the relational database in 2006?