A New Visualization for Web Server Logsby Raju Varghese
There are well over a hundred web server log analyzers (Google Directory for Log Analysis) or web statistics tools ranging from commercial offerings such as WebTrends to open source ones such as AWStats. These take web server logfiles and display numbers such as page views, visits, and visitors, as well as graphs over various time ranges. This article presents the same data in those logfiles in a very different way: as a 3D plot. By the end of this article, I hope you will agree with me that the visualization described herein is a novel and useful way to view the content of logfiles.
The logfiles of web servers record information on each HTTP request they receive, such as the time, the sender's IP address, the request URL, and the status code. The items in each request are fairly orthogonal to one another. The IP address of a client has no relation to the URL that it requests, nor does the status code of the request to the time of the request. If that is the case, what could be a better way to display these n columns from the logfiles than an n-dimensional plot?
When an administrator observes anomalous behavior on a web server, she reaches out for web statistics reports, as they are usually all there is as a record of past activity. These often prove fruitless, mainly because web statistics is primarily a marketing-oriented view of web server activity. The next step is to take the raw logfiles apart with ad hoc scripts. The sheer mass of data makes it difficult to reduce it to a few numbers that reveal the cause of the problem. Another complication is that you may not quite know what you are looking for other than that it is abnormal behavior. The path this article takes is to provide a visualization of raw data such that cause or causes make themselves visible. This comes from the real-life experience of a client, where crippling performance problems appeared out of nowhere.
The scatter plot in Figure 1 shows more than half a million HTTP page requests (each request is a dot) in 3D space. The axes are:
- X, the time axis--a full day from midnight to midnight of November 16.
- Y, the requester's IP address, with the conventional dotted decimal format sorted and given an ordinal number between 1 and 120,000, representing the number of clients that accessed the web server.
- Z, the URL (or content) sorted by popularity. Of the approximately 60,000 distinct pages on the site, the most popular URLs are near the zero point of the Z-axis and the least popular ones at the top.
Figure 1. Scatter plot showing HTTP requests
If the plotted parameters were truly orthogonal, you could expect a random distribution: a flat featureless plot. The parameters, however, are not completely independent of one another. For example, the IP ranges for Italy may prefer the Italian pages on the website. Therefore instead of a random plot, there are clusters in the 3D space. If you think about it, that does not seem unreasonable: the home page is probably the most visited page on a website. Studies (especially Jakob Nielsen on website popularity and Jakob Nielsen on traffic log patterns) argue convincingly that popularity closely follows Zipf's law: a log curve with a long tail. Hence the dense horizontal layer at the bottom in Figure 1. The vertical rectangular planes are search crawlers. They request pages over the whole content space from a small number of IP addresses and do that over the whole day. Therefore, clustering along each of the three dimensions is common.
The Case Study
The website of a client grew inexplicably sluggish one day. Since the web server, CMS, and auxiliary servers had run well for the preceding months, the only rational explanation pointed to an unusual request pattern. The web log-analysis reports showed nothing out of the ordinary. Command-line scripts (with awk, sort, grep, and friends) running over the logfiles also revealed no anomalies. I used Gnuplot to graph the requests in 3D space. (See also an excellent Gnuplot introduction) Some time later, the 3D plot made the culprit evident.
Figure 2. Scatter plot of a bad day.
The thick pillar in the plot stands out like a sore thumb. This is a dense set of requests in a short time (about 100 minutes on the X-axis, which represents 24 hours) from a single IP address (Y-axis) and going over the whole content space (Z-axis). Why should it cause trouble? Large-scale CMS servers generate content on-the-fly from a database. Caches usually handle most requests, so only the small number of requests that are not currently in the cache should require database activity. On this particular CMS, the caches keep content for 15 minutes. When the client requested all of the pages in a short time, the high number of cache misses placed a heavy load on the database. This resulted in deteriorated performance. Search crawlers such as Yahoo Slurp and Googlebot do pretty much the same thing, but they spread the load over a much longer period.
Now that you have seen the output, here's how to generate it. The input is, of course, an access logfile that has lines of data, one per HTTP request. A typical line from an Apache server conforms to the NCSA combined access logfile standard. (See the Combined Log Format description at apache.org.) Note that I've wrapped the long line:
220.127.116.11 - - [15/Jan/2006:21:12:29 +0100] "GET /index.php?level=2 HTTP/1.1" 200 5854 "http://www.sphere.li/index.php" "Mozilla/5.0 (X11; U; Linux i6 86; en-US; rv:1.7.3) Gecko/20040914"
The Perl script at the end of the article takes sequences of these lines and condenses them to just what Gnuplot needs. Run it with an access logfile and redirect it to an output file, such as gnuplot.input, from the command line:
$ perl prepare-for-gnuplot.pl access_log > gnuplot.input
The output will be a series of lines matching those of the access logfile. For the previous line from the access log, the corresponding output is:
15/Jan/2006:21:12:29 906 41 200
The fields in gnuplot.input, the output file of the Perl script, are date/time, ip rank (906), url rank (41), and status code.
To display the sequence of lines in Gnuplot, give it the commands:
$ gnuplot set style data dots set xdata time set timefmt "%d/%b/%Y:%H:%M:%S" set zlabel "Content" set ylabel "IP address" splot "gnuplot.input" using 1:2:3
Pages: 1, 2