Reading Jack Herrington’s article reminded me of an article I have been wanting to write for some time about improving scalability for data driven web sites. When I worked
a viewer, there are a lot of things that simply do not change from viewer to viewer. From the rss feed boxes to the polls to the stories, the number of common elements way
outnumber the customized components. But when taken as a full page, there is so much changing from user to user that it really doesn’t make a lot of sense to do much in
the way of page caching due to both the dynamic nature of the site and the number of user customizations that exist.
But when you look at how any one page is put together for a great number of viewers of any website, you see that most of the queries being presented to the database are
largely the same. Thanks to innovative server side caching by databases like mySQL and the rest, you find that these repetitive, non-unique queries are not as bad as they
could be if the db had to reexecute the queries every time rather than pull them from memory, so you might imagine that caching SQL on the client side would be a bad idea.
That is, unless of course you want your site to scale better horizontally. The problem with doing your caching on the SQL server is that you really want to leave the DB
alone as much as possible if you want your site to scale, and really only access that machine in the event of change in a table or row. This becomes all the more
important if you are using a commercial DB in which scaling matches with increased licensing costs.
How it works could almost be left as an exercise for the O’Reilly reader. Basically what you are doing is adding a “smart” memoization function to your SQL access library
in whichever language you are using. Including this on the client also means you’ll need to create a tracking table in the database to keep track of which database caches
have become dirty and when.
When constructing such a system, you can create pages that at most take 1 query per page (plus any logging queries which you can choose whether or not to dirty the caches
over) using the following method:
First thing, before any actual real queries happen, the page construction logic should query the tracking table to get a list of tables that have become dirty since the
last time a page was constructed, this can commonly become the only query that a page might see.
In the event that a list of “dirtied” tables has been returned, the routine should remove the cached tables from whatever storage you’re using. I choose disk based SQL
query caching as modern operating systems are better than I am at setting up disk caches for mostly read only data, a programmer could use a
HREF="http://www.danga.com/memcached/">memcachedkind of solution, but I don’t know how much benefit it would give you, a programmer should look at how the page logic
uses memory before choosing the exact method of storing data.
It wouldn’t be that much harder to implement a row level solution to this as well, but the less granular solution is much simpler, and since we’re talking about scaling
and not performance (although you do see some performance gains in this kind of thing) I don’t think it is worth spending the time implementing row level caching, although
if you did that would make a good case for doing it via a memory only mechanism as I can see running out of file descriptors very quickly.
Anyhow, in the event that the list is empty, you are in good shape. The page logic would continue on with its business, until a query occurs, at this point, the SQL access
level would check and see if the query has been cached before, and thus not erased previously, and would loaded that pickled data from the file and return that as the
query’s result, with no actual SQL query occurring.
In the event that the cached results do not exist, the query is run normally through the DB, and then the result is pickled onto the drive via whatever mechanism, then the
result is returned to the page assembly logic.
The DB is then given an explain query , or alternatively, a programmer can inspect the query via traditionally text processing methods, and then the pickled data is tagged
as having to do with those tables, so that when those tables are written to, they can be invalidated properly. Part of this analysis stage is used to examine whether the
query is the kind of query that would dirty the cache, like an Insert or Delete and if so those tables are marked as dirty in the tracking table, the nice thing about this
is that on the next page load this very same head would then go about the invalidation step, so don’t worry about calling the cache invalidation routines here.
Important things to keep in mind when implementing:
Allow for queries to assert that they cannot invalidate the cache. For instance, some queries are less important to the overall appearance of the site. Logging information
may be less important for the average end user to care about, for instance.
Allow for some queries to bypass the query mechanism. When I wrote my first implementation of this, I felt this was very important. In time I found that I used this option
less and less, for good reason, the dirtying step worked fine.
Allow for some queries to not drop caches in the filesystem, for initial user logins for instance, or cookie queries. In the event the machine goes down or enters into a
state where the DB is unavailable, this would make the machine continue to serve anonymous pages no problem, but unable to serve user specific data.
You might think this would only be good for generic, non-customized results. This is not the case as most queries are based on parameters that users commonly share, but
more importantly, there is little penalty for using the system. For instance, suppose you do a query that has no cached result, you are then looking at 1 extra query which
is bad, but when taken against the overall savings that such a system brings, is well worth it, as are the extra file descriptor searches.
The very cool thing about this is that you can derive some level of safe degradation of a web site with this mechanism. In the event that a DB goes down, the caching
system continues to serve pages from the cache, which can be desirable and even outrageously useful. It is worth pointing out before you even think about it that it is not wise to nfs share these cached data files, nor should you be tempted to assume that I’ve gone into every detail here, I mean I haven’t mentioned locking the files to ensure you are getting a good copy of the cached data, etc…
When I implemented such a system, I made some pretty ridiculous pages and only noticed real delays when I had over 300 simulated queries per page (with 2 real queries, 1 tracking and 1 log). An additional benefit of the system is that the tracking queries are really simple (something like: select tablename from sitetracking where trackid > lastidchecked) and are readily cached at the server.
I wrote some implementation code for a cms I was writing some time ago, but abandoned it and you are welcome to email me for a copy.
I should point out that I wrote a php version when I was not very familiar with that language, so no promises as to its overall phpness in the event you look at it. The
only real problem is that implementing this in your code can be very specific to your schema, the real answer is for me (or you) to reimplement this system as a shim ODBC
driver, so I’ve been looking up that spec, but in the meantime I thought I’d post it here for people to consider when constructing web sites and services with a mind
towards robust scalability.
In which Chris talks about caching methods designed to scale large sites without loading your database servers.