I have been recently involved in trying to improve the performance of an application that primarily handles messages (more and more of them). It is a kind of state machine. Three different types of entities are managed (one of the types happens to be sets of another type), and messages are received that tell what is the most recent status of each entity. So far so good, except that everything revolves around the current status of each entity, and that the main bottleneck seems to be queries that hit repeatedly a table
MESSAGE_HISTORY(msg_id, entity1_id, entity2_id, entity3_id, status, changed_by, timestamp), finding out which is the most recent status for a given
entity3_id (as you may have guessed, only one of those contains a value in each row), associated with the greatest
msg_id for the enity in question.
The solution that looked best was what was presented to me as “denormalizing”, and associating its current status to each and every entity.
Oh, yes, but denormalizing somewhat implies that the database is normalized, isn’t it? As it happens,
MESSAGE_HISTORY is not normalized; the fact that only one of
entity3_id is not null violates the third normal form, that says that attributes only depend on the key, and cannot be derived from each other. If you tell me that
entity2_id is not null, then
entity3_id are null. Ooops.
What is the key, by the way?
msg_id? A typical identity/auto-increment/sequence generated (depending on your favorite DBMS flavor) column, that has no business-related meaning? To me, the real key should be an
entity_id, an entity type, and the timestamp (or
msg_id if the timestamp isn’t more precise than one second and a same entity can change status several times in under one second - but in that case
msg_id takes a totally different meaning). Give me a properly defined
MESSAGE_HISTORY, and it would suddenly become much easier to build indexes that would allow to get the current status really faster.
But the most interesting thing is probably that in the application the only status of interest is the current status! The historical data is a mere audit trail, just in case. Denormalizing? Not really … Just putting the focus where it belongs. The current status should really be an attribute of each entity.