Ten Tips for Building Cache-Friendly Web Sites
by Duane Wessels
06/20/2001
In computer science lingo, caching refers to the process of storing
frequently used information. Caching is prevalent in many aspects
of computer systems and networks. CPUs, disks, file systems, and
routers all use caching. A Web cache is a server dedicated to the
task of storing Web pages as people surf the Internet.
Internet service providers, corporations, and universities often use Web
caches--also known as caching proxies--on their local networks to increase
download speeds and reduce network traffic. When people surf the
Web through these proxies, the cached pages load significantly
faster than pages that are not found in the cache. A 1999 study by Zona
Research concluded that e-commerce sites may have been losing
up to $362 million per month (USD) due to page-loading delays and
network failures. Content providers and server administrators
should try to build cache-friendly Web sites. Your customers
and visitors will thank you.
This list describes the top ten steps you can take to build a
cache-friendly Web site. Don't feel like you have to implement all of
these steps. It is still helpful if you put just one or two of these
into practice. The most beneficial and practical ideas are listed first.
Avoid using CGI, Active Server Pages, and server-side includes unless absolutely necessary.
In general, these techniques are bad for caches because they usually
produce dynamic content. Dynamic content is not a bad thing per se, but it
may be abused. CGI and ASP can also generate cache-friendly, static
content, but doing so requires special effort by the author and seems to
be rare in practice.
The main problem with CGI scripts is that many caches simply do not store
a response when the URL includes cgi-bin or even cgi. The
reason for this heuristic is perhaps historical. When caches first came
into use, matching on the URL was the easiest way to identify dynamic
content. Today, with HTTP/1.1, we
only need to look at the response headers to determine what may be cached.
Even so, the heuristic remains, and some caches might be hardwired
to never store CGI responses.
From a cache's point of view, Active Server Pages (ASP) are very similar to
CGI scripts. Both are generated by the server, on the fly, for each
request. As such, ASP responses usually have neither a Last-Modified
nor an Expires header. On the plus side, it is uncommon to
find special cache heuristics for ASP (unlike CGI), probably because ASP
was invented well after caching was in widespread use.
Finally, you should avoid server-side includes (SSI) for the same reasons.
SSI is a feature of some HTTP servers that parses HTML at request time and
replaces certain markers with generated text. For example, with Apache you
can insert the current date and time or the current file size into an
HTML page. Because the server generates new content, the
Last-Modified header is either absent in the response, or set to the
current time. Both cases are bad for caches.
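For scripts that truly cannot be avoided, the "special effort" mentioned above mostly amounts to sending the two headers a cache looks for. The following is only a minimal sketch in Python, not a recipe for any particular server. The function names, and the idea of passing in the modification time of the underlying data, are assumptions made for illustration.

```python
from email.utils import formatdate


def cache_friendly_headers(source_mtime, lifetime_seconds, now):
    """Return validator and expiration headers for a generated page.

    source_mtime and now are seconds since the Unix epoch;
    lifetime_seconds is a duration. All names here are hypothetical.
    """
    return {
        # When the underlying data last changed, not when the script ran.
        "Last-Modified": formatdate(source_mtime, usegmt=True),
        # How long caches may reuse the response without asking again.
        "Expires": formatdate(now + lifetime_seconds, usegmt=True),
    }


def cgi_response(body, source_mtime, lifetime_seconds, now):
    """Assemble CGI-style output: header lines, a blank line, then the body."""
    headers = {"Content-Type": "text/html"}
    headers.update(cache_friendly_headers(source_mtime, lifetime_seconds, now))
    head = "".join(f"{name}: {value}\r\n" for name, value in headers.items())
    return head + "\r\n" + body
```

The important design choice is that Last-Modified reflects the data behind the page rather than the moment the script ran. Otherwise every response looks new to a cache.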
Use the GET method instead of the POST method, if possible.
Both methods are used for HTML forms and query-type requests. With the
POST method, query terms are transmitted in the request body. A
GET request, on the other hand, puts the query terms in the URI
(Uniform Resource Identifier). It's easy to see the difference in your
browser's Location box. A GET query has all the terms in the box,
with lots of & and = characters. This means
POST is somewhat more private, because the query terms travel in the
message body rather than in the visible URL. They are not encrypted, however.
However, this difference also means that POST responses cannot
be cached unless specifically allowed. POST requests may have
side effects on the server (e.g., updating a database), and those side
effects would not be triggered if a cache answered with a stored response.
Section 9.1 of RFC 2616 explains the important differences between
GET and POST. In practice, it is rare to find a
cachable POST response, so I doubt that most caching products
store POST responses at all. If you want
cachable query results, you should certainly
use GET instead of POST.
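To make the difference concrete, here is a small Python sketch that builds the same query both ways. The host www.example.com, the /search path, and the query terms are made up for the example. With GET the terms become part of the URL, which is what a cache uses as its lookup key. With POST the URL is identical for every query.

```python
from urllib.parse import urlencode


def build_get(path, terms, host="www.example.com"):
    """Query terms go in the URI, so each distinct query has its own URL."""
    return f"GET {path}?{urlencode(terms)} HTTP/1.1\r\nHost: {host}\r\n\r\n"


def build_post(path, terms, host="www.example.com"):
    """Query terms go in the message body; the URL never changes."""
    body = urlencode(terms)
    return (
        f"POST {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Content-Type: application/x-www-form-urlencoded\r\n"
        f"Content-Length: {len(body.encode('ascii'))}\r\n"
        "\r\n"
        f"{body}"
    )


terms = {"q": "web cache", "page": "2"}
print(build_get("/search", terms))
print(build_post("/search", terms))
```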
Avoid renaming Web site files; use unique filenames instead.
This might be difficult or impossible for some situations, but consider
this example: A Web site lists a schedule of talks for a conference. For
each talk there is an abstract stored in a separate HTML file. These files
are named to match the order of their presentation during the conference,
for example talk01.html, talk02.html, talk03.html,
and so on. At some point, the schedule changes and the filenames are no
longer in order. If the files are renamed, so that they match the new
order of the presentation, Web caches are likely to become confused.
Renaming usually does not update the file-modification time, so an
If-Modified-Since request for a renamed file can have
unpredictable consequences. Renaming files in this manner is similar to
cache poisoning.
In this example, it is better to use a file-naming scheme that does not
depend on the order; perhaps base the file naming on the presenter's name.
Then, if the order of presentation changes, the HTML file with the
schedule must be rewritten, but the other files can still be served from the
cache. Another solution is to touch the renamed files so that their
modification times are updated.
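The renaming problem can be simulated in a few lines. The Python sketch below models a hypothetical origin server that answers If-Modified-Since requests by comparing time stamps, which is roughly what real servers do. The file names come from the example above; the time stamps and abstract texts are invented.

```python
def respond(files, name, if_modified_since=None):
    """Hypothetical origin server. files maps name -> (mtime, body)."""
    mtime, body = files[name]
    if if_modified_since is not None and mtime <= if_modified_since:
        return 304, None, mtime  # Not Modified: cache keeps its copy
    return 200, body, mtime


files = {
    "talk01.html": (100, "Abstract A"),
    "talk02.html": (90, "Abstract B"),
}

# A cache fetches talk01.html and remembers its Last-Modified value.
_, cached_body, cached_mtime = respond(files, "talk01.html")

# The schedule changes and the two files are swapped by renaming.
# The modification times travel with the contents.
files["talk01.html"], files["talk02.html"] = (
    files["talk02.html"],
    files["talk01.html"],
)

# Revalidation: the "new" talk01.html looks older than the cached copy.
status_after_rename, _, _ = respond(files, "talk01.html", cached_mtime)

# Touching the file gives it a fresh time stamp and fixes the problem.
files["talk01.html"] = (300, files["talk01.html"][1])
status_after_touch, fresh_body, _ = respond(files, "talk01.html", cached_mtime)

print(status_after_rename, cached_body)  # cache keeps serving Abstract A
print(status_after_touch, fresh_body)    # cache now receives Abstract B
```

In this run the cache is told its copy of talk01.html is still good, even though the origin server now holds a different abstract under that name. A cache holding talk02.html would have seen the opposite result, which is why the outcome is unpredictable.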
Give your content a default expiration time, even if it is very short.
If your content is relatively static, adding an Expires header can
significantly speed up access to your site. The explicit expiration time
means clients know exactly when they should issue revalidation requests.
An expires-based cache hit is almost always faster than a validation-based
near hit.
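The reason is easy to see in a simplified model of what a cache does on each request. This Python sketch is an assumption-laden illustration: real caches also apply heuristics when no Expires header is present, and the entry layout here is made up.

```python
def lookup(entry, now):
    """Decide how a cache can answer a request.

    entry is None (nothing stored) or a dict with an 'expires' key holding
    seconds since the epoch, or None if the origin sent no Expires header.
    """
    if entry is None:
        return "miss"      # fetch the whole object from the origin server
    expires = entry.get("expires")
    if expires is not None and now < expires:
        return "hit"       # answer immediately, no network round trip
    return "validate"      # ask the origin server with If-Modified-Since


print(lookup({"expires": 200}, now=100))   # hit
print(lookup({"expires": 200}, now=250))   # validate
print(lookup({"expires": None}, now=100))  # validate
```

Only the first case avoids contacting the origin server entirely. A validation still costs a round trip, even when the answer is a short 304 response.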
With Apache, you can use the mod_expires module to add expiration
times to your responses. After configuring and compiling the server with
mod_expires, you'll need to add the ExpiresActive
directive to your httpd.conf file:
ExpiresActive on
Then, you can use the ExpiresByType and ExpiresDefault
directives to control expiration values for different responses. For
example:
ExpiresByType text/html "access plus 12 hours"
ExpiresByType image/jpeg "access plus 1 day"
ExpiresDefault "access plus 6 hours"
If you have content that changes at regular intervals (for example, daily),
you can base the expiration time on the file-modification time:
ExpiresByType image/gif "modification plus 1 day"
For more information on the mod_expires module, take a look at
the Apache documentation.
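The difference between the "access" and "modification" bases is just a question of where the clock starts. The Python sketch below mimics the arithmetic only; it is not how mod_expires is implemented, and the function name and example times are invented.

```python
from datetime import datetime, timedelta, timezone


def expires_at(base, access_time, modification_time, lifetime):
    """Compute an expiration time the way the two bases describe it."""
    if base == "access":
        return access_time + lifetime        # counted from each request
    if base == "modification":
        return modification_time + lifetime  # counted from the last change
    raise ValueError(f"unknown base: {base!r}")


# A file regenerated every day at 06:00 UTC and requested at 20:00 UTC.
modified = datetime(2001, 6, 20, 6, 0, tzinfo=timezone.utc)
accessed = datetime(2001, 6, 20, 20, 0, tzinfo=timezone.utc)
one_day = timedelta(days=1)

print(expires_at("modification", accessed, modified, one_day))
print(expires_at("access", accessed, modified, one_day))
```

With the modification base, every copy expires at 06:00 the next morning, just as the new version appears. With the access base, a copy fetched in the evening would still be considered fresh 14 hours after the file had changed.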
If you have a mixture of static and dynamic content, it is helpful to have a separate HTTP server for each.
This way, you can set server-wide defaults to improve the cachability
of your static content, without affecting the dynamic data. Since the
entire server is dedicated to static objects, you only need to maintain one
configuration file. A number of large Web sites have taken this approach.
Yahoo serves all its images from a server at images.yahoo.com, as
does CNN with images.cnn.com. Wired serves advertisements and
other images from static.wired.com, and Hotbot uses a server named
static.hotbot.com.
Don't use content negotiation.
Occasionally, people like to create pages that are customized for the
user's browser. For example, Netscape may have a nifty feature that
Internet Explorer does not have. An origin server can examine the
User-Agent request header and generate special HTML to take advantage of
a browser feature. To use the terminology from HTTP, an origin server may
have any number of variants for a single URI. The mechanism for selecting
the most appropriate variant is known as content negotiation, and it
has negative consequences for Web caches.
First of all, if either the cache or the origin server does not correctly
implement content negotiation, a cache client might receive the wrong
response. For example, if an HTML page has something specific to Internet
Explorer and gets cached, the cache might send it to a Netscape user. To
prevent this from happening, the origin server is supposed to add a
response header telling caches that the response depends on the
User-Agent value:
Vary: User-Agent
If the cache ignores the Vary header, or if the origin server
does not send it, cache users can get incorrect responses.
Even when content negotiation is correctly implemented, it reduces the
number of cache hits for the URL. If a response varies on the
User-Agent header, a cache must store a separate response for every
User-Agent value it encounters. Note that the value is more than just
"Netscape" or "MSIE". It is a full string such as Mozilla/4.05 [en] (X11; I; FreeBSD
2.2.5-RELEASE i386; Nav). Thus, when a response varies on the
User-Agent header, we can only get a cache hit for clients running the
exact same version of the browser, on the same operating system.
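A toy cache shows why the hit rate suffers. In this Python sketch the class, its methods, and the second User-Agent string are hypothetical; the point is only that the lookup key must include the value of every header named in Vary.

```python
class VaryingCache:
    """Toy cache that honors the Vary response header."""

    def __init__(self):
        self.store = {}  # (url, header values...) -> body
        self.vary = {}   # url -> header names the response varies on

    def _key(self, url, request_headers):
        names = self.vary.get(url, ())
        return (url,) + tuple(request_headers.get(n, "") for n in names)

    def put(self, url, request_headers, vary_names, body):
        self.vary[url] = tuple(vary_names)
        self.store[self._key(url, request_headers)] = body

    def get(self, url, request_headers):
        return self.store.get(self._key(url, request_headers))


freebsd = {"User-Agent": "Mozilla/4.05 [en] (X11; I; FreeBSD 2.2.5-RELEASE i386; Nav)"}
linux = {"User-Agent": "Mozilla/4.05 [en] (X11; I; Linux 2.0.32 i586; Nav)"}

cache = VaryingCache()
cache.put("http://www.example.com/", freebsd, ["User-Agent"], "<html>...</html>")

print(cache.get("http://www.example.com/", freebsd) is not None)  # same string: hit
print(cache.get("http://www.example.com/", linux) is not None)    # same browser, other OS: miss
```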
Synchronize your system clocks with a reference clock.
Doing so ensures that your server sends accurate Last-Modified
and Expires time stamps in its responses. Even though newer versions of
HTTP use techniques that are less susceptible to clock skew, many Web clients
and servers still rely on the absolute time stamps. ntpd implements
the Network Time Protocol (NTP) and is widely used to keep clocks
synchronized on Unix systems. You can get the software and installation
tips from the Time Synchronization Server Web site.
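A little arithmetic shows what clock skew does to an absolute Expires value. The Python sketch below uses made-up numbers and assumes the cache's own clock is correct; the function names are hypothetical.

```python
def expires_sent(true_now, server_skew, lifetime):
    """Expires value computed by a server whose clock is off by server_skew."""
    return (true_now + server_skew) + lifetime


def appears_fresh(expires, cache_clock_now):
    """A cache with a correct clock compares Expires against its own time."""
    return cache_clock_now < expires


true_now = 1_000_000  # seconds since the epoch, arbitrary
lifetime = 30 * 60    # the server intends a 30-minute lifetime

# Server clock one hour slow: the response looks expired on arrival.
slow = expires_sent(true_now, -3600, lifetime)
print(appears_fresh(slow, true_now))

# Server clock one hour fast: the response stays fresh for 90 minutes.
fast = expires_sent(true_now, +3600, lifetime)
print(fast - true_now)
```

In the first case the cache gets no benefit from the Expires header at all. In the second it serves the object three times longer than the author intended.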
Avoid using address-based authentication.
Most caching proxies hide the addresses of clients. An origin
server sees connections coming from the proxy's address, not the client's,
so an address-based rule ends up judging the proxy rather than the user
behind it. Furthermore, there is no standard and safe way to convey the
client's address in an HTTP request.
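The effect on an address-based access list can be sketched in Python. The network and addresses below are reserved documentation ranges chosen for the example, and both function names are hypothetical.

```python
import ipaddress

# Hypothetical network whose users are supposed to have access.
ALLOWED = ipaddress.ip_network("192.0.2.0/24")


def is_allowed(peer_address):
    """Address-based check as an origin server might perform it."""
    return ipaddress.ip_address(peer_address) in ALLOWED


def address_seen_by_origin(client_address, proxy_address=None):
    """The origin server sees the proxy's address whenever one is used."""
    return proxy_address if proxy_address is not None else client_address


# A legitimate user connecting directly is accepted.
print(is_allowed(address_seen_by_origin("192.0.2.10")))

# The same user going through an outside proxy is rejected.
print(is_allowed(address_seen_by_origin("192.0.2.10", "198.51.100.1")))

# An outsider using a proxy inside the allowed network is accepted.
print(is_allowed(address_seen_by_origin("203.0.113.5", "192.0.2.1")))
```

Either way, the check says nothing reliable about who is actually making the request.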
Address-based authentication can also deny legitimate users access to
protected information when they use a proxy cache. Many organizations use a
DMZ network for the firewall between the Internet and their internal
systems. A cache that runs on the DMZ network is probably not allowed to
access internal Web servers. Thus, the users on the internal network cannot
simply send all of their requests to a cache on the DMZ network. Instead,
the browsers must be configured to make direct connections for the internal
servers.
Think Different.
Sometimes, those of us in the United States forget about Internet users in
other parts of the world. In some countries, Internet bandwidth is so
constrained that we would find it appalling. What takes seconds or minutes
to load in the U.S. may take hours or even days in some locations. I
strongly encourage you to remember bandwidth-starved users when designing
your Web sites, and remember that improved cachability speeds up your Web
site for such users.
Even if you think shared proxy caches are evil, consider allowing single-user browser caches to store your pages.
There is a simple way to accomplish this with HTTP/1.1. Just add the
following header to your server's replies:
Cache-Control: private
This header allows only browser caches to store responses. The browser may
then perform a validation request on the cached object as necessary.
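Here is roughly how a cache might interpret that header when deciding whether to keep a response. This Python sketch is a simplification: it looks only at the private and no-store directives, treats any form of private as covering the whole response, and ignores everything else in the HTTP/1.1 caching rules. The function name is made up.

```python
def may_store(cache_control, shared_cache):
    """Return True if a cache of the given kind may store the response."""
    directives = {
        part.strip().lower().split("=", 1)[0]
        for part in cache_control.split(",")
        if part.strip()
    }
    if "no-store" in directives:
        return False  # nobody may keep a copy
    if shared_cache and "private" in directives:
        return False  # intended for a single user only
    return True


print(may_store("private", shared_cache=True))   # proxy cache: False
print(may_store("private", shared_cache=False))  # browser cache: True
```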
Duane Wessels discovered Unix and the Internet as an
undergraduate student of physics at Washington State University. After
playing system administrator for a few years, he moved to Boulder,
Colorado to attend graduate school. In late 1994, he joined the Harvest
project, where he worked on searching, indexing, and caching. From
1996 until 2000, he was co-principal investigator of the NLANR Information
Resource Caching project (IRCache), which operated several large
caches throughout the U.S. During this time he and others developed and
supported the Squid caching proxy.
Currently, he is co-owner and president of The Measurement
Factory, a company that specializes in evaluating the performance and
behavior of HTTP-aware devices. Like many other Colorado residents, he enjoys
hiking, bicycling, and snowboarding.
