In my last article I showed how web server logfiles can be visualized as a 3D plot with the help of Perl and gnuplot. In this article we will enhance the plot in several ways. The two main additions are color and a more even distribution of the points in the plot.
Access logfiles from a web server need to be filtered before the data is passed on to gnuplot. Listing 5, the Perl script that was used in the previous article, can be used in this one as well. Each line in the access logfile produces one line of output; of the many items in a line of the access logfile, four items are extracted: the timestamp, URL, IP address, and status code of the request. The URL in the output is not the actual URL but its rank in the list of URLs in the file. Similarly, the IP address is the rank of the actual IP address in a sorted list. Both of these are integer numbers. The output file so created can be read directly by gnuplot, as you will see later in this article.
The 3D plots in the previous article were bland monochrome; the version of gnuplot at that time could not handle multiple colors for scatter plots. With the release of gnuplot 4.2 on March 3, 2007 the possibilities have increased. We will display the status code as the fourth dimension of the data in color.
Figure 1. Color scatter plot showing HTTP requests
Code Listing 1 shows the gnuplot commands used to generate Figure 1. All commands except the last two should be self-explanatory. The penultimate command defines a function that returns the color of the dot depending on the status code. It is a nested ternary conditional statement in the syntax needed for gnuplot.
rgb(r) = (r<200)? (000000): (r<300)? (12632256): (r==304)? (10526880): (r<400)? (238): (r<500)? (15631086): (16711680)
In pseudo-code it could be written as:
    if (statusCode < 200)          # 1XX
        return black
    else if (statusCode < 300)     # 2XX
        return gray
    else if (statusCode == 304)    # Not modified
        return darkgray
    else if (statusCode < 400)     # other 3XX or redirects
        return blue
    else if (statusCode < 500)     # 4XX including the infamous page-not-found
        return violet
    else                           # 5XX
        return red
    end
The status code 304 (Not Modified) deserves special treatment because even though it is in the 3XX group it is not a redirect. It states that the content was not modified and that the client can continue to use the cached copy. I have therefore considered it similar to the 2XX status code but given it a different shade of gray. The table below shows the HTTP status codes and the corresponding color codes as integer and hex numbers.
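|Status code|Color|Color as integer|Color as hex|Comment|
|---|---|---|---|---|
|1XX|black|0|000000|Informational|
|2XX|gray|12632256|C0C0C0|Success|
|304|dark gray|10526880|A0A0A0|Not Modified; treated like 2XX|
|other 3XX|blue|238|0000EE|Redirects|
|4XX|violet|15631086|EE82EE|Client errors, including page-not-found|
|5XX|red|16711680|FF0000|Server errors|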
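gnuplot also accepts hexadecimal integer constants with a 0x prefix, so the color function can be written with the hex equivalents of the integers used in rgb, which may be easier to check against the colors. The following is only a sketch of an equivalent definition, not the listing used for the figures; the name rgb_hex is hypothetical.

```gnuplot
# Equivalent color lookup written with hex constants (0xRRGGBB).
# rgb_hex is a hypothetical name; the article's own function is rgb().
rgb_hex(r) = (r<200)  ? 0x000000 : \
             (r<300)  ? 0xC0C0C0 : \
             (r==304) ? 0xA0A0A0 : \
             (r<400)  ? 0x0000EE : \
             (r<500)  ? 0xEE82EE : 0xFF0000

# Spot checks: each line should print the same integer twice
print rgb_hex(200), 12632256
print rgb_hex(304), 10526880
print rgb_hex(404), 15631086
```

Either form can be passed to splot through lc rgb variable; the choice is purely one of readability.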
The last line in Code Listing 1, at the end of this article, is the actual command to draw the scatter plot. It specifies the input file (gnuplot.inp20070123.txt), which holds the four values for each dot, and the order in which those values are to be used. The fourth dimension, the color, is calculated by the function rgb defined above:
splot 'gnuplot.inp20070123.txt' using 1:2:3:(rgb($4)) with dots lc rgb variable
For the benefit of those who have not read the first article, each dot in Figure 1 corresponds to a line (i.e., one HTTP request) from the access logfile of a web server. The three axes are time, IP address, and content. The status code, which is also in every line of the access logfile, is represented by the color of the dot in 3D space. This particular plot looks featureless, but Figure 2 looks sinister and could give a sysadmin sleepless nights. It shows a spider attack; the tall pillar is a concentrated salvo of requests over the whole content space—one that is guaranteed to make the database, where the content is stored, break into sweat.
Figure 2. Spider attack in color
The left flank of the pillar is gray because all the components could handle the onslaught. The barrage of requests, however, soon brings the database to its knees and the status code changes to red (5XX). It affects all requests at that time, as the long shadow of the pillar shows. Nevertheless, the system recovers quite quickly after the attack and the red color fades away with time (increasing x-axis).
The representations of these plots were inspired by Edward Tufte's credo: simple design, intense content (Reference 1). I hope that this visualization is in that spirit. Three dimensions for three columns of the logfile and color for the fourth; about a million data points. The simplicity of the representation requires, however, that the interpretation of the plots be relegated to the viewer. Clustering, as we have seen in Figure 2, is a sure sign of forces at work, and this behooves attention.
We will now polish the plot in two ways: by jittering each point by a random value in the range ±0.5 to give a nonintegral value for the z-axis (see the splot command in Listing 3) and by using a logarithmic scale with base 2. Both change how the data is distributed along the z-axis. Converting integer values to real numbers spreads the data points from a single value such as 12.0 over the range 11.5 to 12.5. The log scale compresses the data where it begins to thin out at high z values. In the previous article we touched upon Zipf's Law and noted that hits on a web server follow it. Spreading out the data near the origin and compressing it farther away evens out the dense part of the plot near the origin. The disadvantage is that less prominent features may hide others, as can be seen by comparing the visibility of the red pillar in Figures 2 and 3.
Figure 3. Improved plot by spreading the data
Specifying a log scale for an axis is accomplished by the simple gnuplot command set logscale z 2. Adding a random offset to the integer values is done by replacing the plain, unadorned column indicator 3 in the splot command with ($3+rand(0)-0.5). The expression adds a random value in the range of 0 to 1 and then subtracts 0.5, which in effect adds a random value between -0.5 and +0.5 to the third dimension.
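Put together, the two changes might look like the following sketch. It reuses the input file name and the rgb function from earlier in the article, and it assumes that the ranks in column 3 start at 1, so that the jittered values stay positive as a log scale requires.

```gnuplot
# Color lookup from the article (integer RGB value per status class)
rgb(r) = (r<200) ? 0 : (r<300) ? 12632256 : (r==304) ? 10526880 : (r<400) ? 238 : (r<500) ? 15631086 : 16711680

# Base-2 logarithmic z-axis; assumes every z value is positive,
# i.e. ranks in column 3 start at 1 so the jittered value is at least 0.5
set logscale z 2

# Column 3 is shifted by a uniform random offset between -0.5 and +0.5
splot 'gnuplot.inp20070123.txt' using 1:2:($3+rand(0)-0.5):(rgb($4)) with dots lc rgb variable
```

Because rand(0) yields a new value each time it is evaluated, every replot places the dots slightly differently; the overall shape of the plot is unaffected.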
Authors of articles on 3D plots face the dilemma of showing a three-dimensional object through print or web media. Though constrained by the characteristics of the Web, we can use animation to convey the 3D structure. With the GIF animation feature of gnuplot 4.2 one can rotate the 3D plot and make it visible from several angles. Listing 4 has the commands to move the viewpoint and replot the graph. Each plot command results in one frame of the animated sequence.
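As a rough illustration of the idea, and not a copy of Listing 4, a rotation could be scripted as shown below. The output file name rotate.gif, the viewing angles, and the frame delay are assumptions chosen for the example. The do for loop requires gnuplot 4.6 or later; with gnuplot 4.2 the set view and splot pair would have to be repeated by other means.

```gnuplot
# Requires a gnuplot build with GIF support
set terminal gif animate delay 10    # delay is given in hundredths of a second
set output 'rotate.gif'              # hypothetical output file name

# Color lookup from the article (integer RGB value per status class)
rgb(r) = (r<200) ? 0 : (r<300) ? 12632256 : (r==304) ? 10526880 : (r<400) ? 238 : (r<500) ? 15631086 : 16711680

# Rotate about the z-axis in 10-degree steps; each splot adds one frame
do for [angle = 0:350:10] {
    set view 60, angle
    splot 'gnuplot.inp20070123.txt' using 1:2:3:(rgb($4)) with dots lc rgb variable notitle
}

set output    # close the file so the GIF is written completely
```

With about a million points per frame, rendering may take a while and the resulting file can be large; a coarser angle step reduces both.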
Figure 4. Rotating the 3D plot
Before concluding this article, I would like to motivate you to use the methods described here for your own needs. For example, you could take other items from the access logfile instead of the ones I chose. Likely candidates are bytes transferred, time to serve a request, etc. You are not restricted to web server logs; syslog (or event logs on Windows), mail logs, performance data, or mixtures of these are suitable candidates for such analysis. A picture is worth a thousand words.
The basic tenet of my two articles has been to plot three or four seemingly unrelated parameters on orthogonal axes to visually ascertain clustering and, therefore, relations among those parameters. This, I believe, is one useful way to visualize web server logs that can consist of hundreds of thousands of lines, each of which has a number of items. Other logfiles, when tortured in a similar fashion, may squeal about the inner workings of otherwise hidden processes.
Raju Varghese has a Bachelors in Electrical Engineering from BITS, Pilani (India) and a Masters in Computer Science from the University of Texas, San Antonio.
Copyright © 2009 O'Reilly Media, Inc.