Imagine you’re programming an application and you run across the following (pseudocode) method:

  double foo (int num) {
      return someNum/num;
  }

Any programmer who’s been programming for more than, oh, 3 days, is going to ask “what happens when you pass a zero?” What’s remarkable about this is not just that the original programmer forgot to consider this case, but that many times the original programmer will have all sorts of fascinating arguments about why they deliberately ignored this case. Why is it that many serious programmers would be aghast at these arguments but somehow accept them when it comes to programming a database?

Before considering the sad state of database programming, let’s take a closer look at the arguments for the above function’s skipping argument validation. If the programmer who wrote that function defends his or her decision to not test for zero, the argument is generally along the lines of “I wrote the entire system and I know it’s not possible for a zero to get passed”. This is sometimes followed up with “this function is called a lot and I don’t want the overhead of the extra conditional.”

Years ago, I was writing some COBOL for a system I had just started working on. When I was done, I handed it off to the lead programmer and she deleted a large chunk of my program before returning it to me. What did she delete? Virtually all of my data validation. Her reasoning was that no data could enter the system except through some very carefully designed “control” programs which would not allow any bad data through. As a result, I was testing for things which could never happen. This idea, which seemed reasonable to me at the time, has some serious potential problems.

First, it assumed that whoever was writing software for the system knew about the control programs and would ensure that they were always up to date. This was a large system, she was transferred to a different department and, as a result, I didn’t always know which control program was responsible for validating which data. The second problem is even more insidious. It means that whenever a maintenance programmer looks at a given function, they have to hope that the data validation is in place for it. Further, they have to hope that their understanding of the data validation matches what’s in the control programs. If they’re unsure, they need to wade through the code and hope they find all of the places where the data validation occurs to verify that it’s correct. For large systems, and particularly for programmers who are less familiar with the system, this can be a huge waste of programmer time and is fraught with errors.

The second argument, the one which cites performance concerns, can be a tough one to argue about. Personally, I’d much rather have a slower system which gives correct answers than a faster system which doesn’t. More than once I’ve asked a programmer if they’ve profiled their code to find out if the validation is really a problem only to be greeted with a blank stare. Make sure your software is correct and maintainable first. Then, and only if you have a known performance problem, should you start profiling your code and benchmarking alternate solutions.

What this all boils down to is a simple rule of thumb: keep your data validation close to your data. It’s easier to maintain and is more likely to be correct. If you start duplicating the validation, abstract it out into shared code which is easy to find and reuse.

This brings us back to the aforementioned link where the author concludes:

I’m sure there are a number of arguments that can be made for using foreign keys etc etc, but it’s just stuff that gets in the way in a framework like rails. Do we test w/ foreign keys in place before milestones(what I’m leaning towards)? Do we test w/o the fks in place(obviously not :])? Maybe somebody can come up with a better way to handle them, but I don’t really think you need fks if you’re using rails associations correctly. If you are forced to use them though, the [approach I’ve outlined] might be a good place to start.

While the author is specifically referring to foreign keys not being needed for Rails, the problem is more widespread. I’ve listened to people tell me that they prefer to handle data integrity code at the app level rather than the database level. Certainly there are times where this is more appropriate, but not for something as basic as foreign keys. Concluding that foreign keys are not necessary is very dangerous and short-sighted.

First, keep in mind that a database like MySQL or PostgreSQL is effectively a programming platform with a declarative language (SQL) sitting on top of it. If you have an SQL statement inserting data, you want the receiving table to have the data validation in the form of foreign keys, custom data types, enums, triggers and so on. If you’re writing code for personal use or you control an open-source project, skip this if you want. It’s your code; it’s your choice. However, if you’re writing code for a business, you don’t know who will be maintaining it tomorrow. Maybe they don’t know about your clever data validation routines which allow you to forego foreign keys. Maybe they’re finding your code too slow so they decide to write directly to the database themselves. Whatever the reason, by putting the data validation as close as possible to the data munging, you make it harder to circumvent.