Also in aboutSQL:
We've spent quite a few columns working through the SQL
SELECT statement, and you may be thinking it's time for a break -- you're right!
For the next few columns, I'm going to back up and introduce you to database design -- specifically database keys and table relationships. After we're done with that, we'll jump back into ... you guessed it! The
SELECT statement. But by then you'll have the tools to seriously take a look at how to join tables together. So sit back, relax, and get ready to be given the "key" to understanding SQL.
Even though the focus of this column is SQL, it's difficult to discuss SQL without at least touching on basic database design and structure.
We've discussed using SQL
WHERE clauses to filter
DELETE statements, but we haven't spent a lot of time discussing how to operate on single database records using those statements.
There's no fundamental difference between using SQL to filter a single database record and using a broader filter, something like
WHERE STATE='VA'. In either case, the
WHERE clause is used to indicate which records we're interested in using for the SQL operation.
But what criteria differentiate any one database record from another? The answer is not only database table-dependent, but even data-dependent! For example, in a database of computer book authors, I'm most likely the only Ashenfelter that turns up if you searched by my last name. In this database, my last name can serve to uniquely identify me.
But if the database contains all authors of books in print, you'd also find several other Ashenfelters listed (the economist Orley Ashenfelter, in particular). As a fallback, you may try to use first and last names together to uniquely identify me in a
WHERE clause, but you can see that this method quickly breaks down when you choose other names like "John Smith" or "Anonymous".
The only way to make sure that each record in a given database table has a unique value is to designate a database field to contain a value that is unique across all of the records in that table. In some cases, you may choose an existing field in the database which you are guaranteed will be unique -- a social security number would work for a U.S. citizen and an ISBN would work for a book.
But in many cases, the best idea is to add a new field to the database to serve as the unique identifier. And in some cases, you may use a combination of two or more fields that, while not individually unique, are guaranteed to be unique when used together. Regardless of which method you choose, what you've created is called a database key.
Database keys are essential to good database design. In most cases, each table in the database should have its own key field. Any arbitrarily-created key field for a database table is typically called a primary key.
In the cases where there is another value that is also thought to be unique, such as a social security number or a order number, that value may be referred to as a secondary key.
And in the rare situation where you're using two or more fields in combination to represent a key, the resulting combination of fields is called a composite key.
You'll also hear about foreign keys, but we'll delay discussing them until next week. For our purposes today, we'll be focusing on primary keys.
Most database tools have ways to help you define a particular column as a primary key. Databases often have special fields or datatypes to help ensure that the field always has a unique, non-NULL value. Don't be afraid to use those tools!
In fact, the most common way to generate a primary key field is to choose an integer datatype for the column and use the "autonumber" feature of the database to increment the key field value as each record is added. More sophisticated databases guarantee that numbers are never reused, even if the record with that key value is deleted from the database.
Numeric, particularly integer, fields are the preferred primary key datatype because they require little storage space (typically 4-8 bytes) and can be searched much faster than text. Unless you have an extremely good idea about the size of your data, it is often a good idea to use an 8-byte
LONG INTEGER datatype which can handle around 2-4 billion records without running out of possible integer values. (The 4-byte integer datatype will max out at around 64,000 records or so depending on implementation.)
In most routine situations, you should automatically add a 8-byte integer primary-key field to your database table without even thinking about it, even if you have a text field that would work -- the storage space is relatively small and the performance of integer comparisons is much faster than text.
There are also situations where you need database values to be universally unique. One good example is in a system where there are two separate physical databases containing human resource data from two different physical locations of a company. To ensure that records are uniquely identified regardless of which database they are in, you can use a datatype called a
UUID or just
UID) which stands for Globally Unique Identifier.
These are 128-bit values that are generated from some sort of magical combination of starting values that are different for each computer in the world. The values are then represented as a 30-some character string of hexadecimal values and dashes. (Those of you who are Windows or ASP programmers will recognize them as CLSID values.) The downside of using
GUIDs is that they require more storage space and take more time to analyze when used in a
WHERE clause compared to integer values. But the upside is that your primary key will be unique both within and between any database anywhere, ever.
One final note about keys -- don't be stingy with them! They take up minimal space, and you'll save lots of time in the future if you already have them incorporated in the design. Don't believe me? Let's take something that seems like a really good primary key for a database full of people -- a social security number.
In a health insurance database, everyone should have a social security number, right? Maybe. But what about newborns? Or illegal aliens who bought health insurance? Or foreign nationals? Or even the folks that demand an identifier instead of their Social Security Number? You database will be a lot more robust if you just assign everyone a "customer number" or some other arbitrary value and use the SSN as a secondary key.
Now that we've got keys securely in our back pocket, the next step is to take a look at how to use those keys to relate database tables to each other. We'll then be using
JOINs to get some serious work done. Make sure you take a look at your database of choice this week to see what kind of primary-key datatypes and functions you can use. In the meantime, feel free to contact me with comments, questions, and criticisms aboutSQL.
John Paul Ashenfelter is president and CTO of TransitionPoint.com, a technology development and analysis group focusing on web database applications.
Return to ONLamp.com.
Copyright © 2009 O'Reilly Media, Inc.