 Published on O'Reilly (http://oreilly.com/)


A New, Improved Visualization for Web Server Logs

by Raju Varghese
03/29/2007

Introduction

In my last article I showed how web server logfiles can be visualized as a 3D plot with the help of Perl and gnuplot. In this article we will enhance the plot in several ways. The two main additions are color and a more even spread of the data points across the plot.

Access logfiles from a web server need to be filtered before the data is passed on to gnuplot. Listing 5, the Perl script that was used in the previous article, can be used in this one as well. Each line in the access logfile produces one line of output; of the many items in a line of the access logfile, four items are extracted: the timestamp, URL, IP address, and status code of the request. The URL in the output is not the actual URL but its rank in the list of URLs in the file. Similarly, the IP address is the rank of the actual IP address in a sorted list. Both of these are integer numbers. The output file so created can be read directly by gnuplot, as you will see later in this article.
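The filtering step described above can be sketched as follows. This is not the Perl script from Listing 5; it is a minimal Python illustration under several assumptions: the logfile is in Apache Common Log Format, URL rank 1 is the most requested URL, IP addresses are ranked by their position in a plain string sort, and the timestamp is written as seconds since the epoch. The function name filter_log and the column order (time, IP rank, URL rank, status) are likewise assumptions.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed Common Log Format:
#   ip ident user [timestamp] "METHOD url PROTOCOL" status bytes
LOG_LINE = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "\S+ (\S+)[^"]*" (\d{3})\b'
)


def filter_log(lines):
    """Return one (epoch, ip_rank, url_rank, status) tuple per parsable line."""
    records = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        ip, stamp, url, status = match.groups()
        when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
        records.append((int(when.timestamp()), ip, url, int(status)))

    # Assumption: rank 1 is the most requested URL (ties broken by name).
    hits = Counter(url for _, _, url, _ in records)
    ordered_urls = sorted(hits, key=lambda u: (-hits[u], u))
    url_rank = {u: i for i, u in enumerate(ordered_urls, start=1)}

    # Assumption: IP rank is the position in a plain string sort.
    unique_ips = sorted({ip for _, ip, _, _ in records})
    ip_rank = {ip: i for i, ip in enumerate(unique_ips, start=1)}

    return [(t, ip_rank[ip], url_rank[url], s) for t, ip, url, s in records]
```

Writing each tuple as one whitespace-separated line would give a four-column file of the kind gnuplot can read directly.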

Technicolor

The 3D plots in the previous article were bland monochrome; the version of gnuplot available at that time could not handle multiple colors for scatter plots. With the release of gnuplot 4.2 on March 3, 2007, the possibilities have increased. We will display the status code as the fourth dimension of the data, in color.

Figure 1. Color scatter plot showing HTTP requests

Code Listing 1 shows the gnuplot commands used to generate Figure 1. All commands except the last two should be self-explanatory. The penultimate command defines a function that returns the color of the dot depending on the status code. It is a chain of nested ternary conditionals, written in gnuplot's expression syntax.

rgb(r) = (r<200)? (000000): (r<300)? (12632256): (r==304)? (10526880): (r<400)? (238): (r<500)? (15631086): (16711680)

In pseudo-code it could be written as:

if (statusCode < 200) # 1XX
   return black
else if (statusCode < 300) # 2XX
   return gray
else if (statusCode == 304) # Not modified 
   return darkgray
else if (statusCode < 400) # other 3XX or redirects
   return blue
else if (statusCode < 500) # 4XX including the infamous page-not-found 
   return violet
else # 5XX
   return red
end
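For readers who want to check the mapping outside gnuplot, the same logic can be written as a small Python function. This is only an illustrative mirror of the gnuplot function; the name status_color is an assumption, and the hex constants are the same color values that the gnuplot function writes as decimal integers.

```python
def status_color(status_code):
    """Return the dot color as a 24-bit RGB integer for an HTTP status code."""
    if status_code < 200:      # 1XX
        return 0x000000        # black
    if status_code < 300:      # 2XX
        return 0xC0C0C0        # gray
    if status_code == 304:     # Not Modified
        return 0xA0A0A0        # dark gray
    if status_code < 400:      # other 3XX
        return 0x0000EE        # blue
    if status_code < 500:      # 4XX
        return 0xEE82EE        # violet
    return 0xFF0000            # 5XX: red
```

As in the gnuplot version, the order of the tests matters: 304 has to be checked before the general 3XX case.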

The status code 304 (Not Modified) deserves special treatment because even though it is in the 3XX group it is not a redirect. It states that the content was not modified and that the client can continue to use the cached copy. I have therefore considered it similar to the 2XX status code but given it a different shade of gray. The table below shows the HTTP status codes and the corresponding color codes as integer and hex numbers.

Status code  Color     Color as integer  Color as hex  Comment
1XX          Black     0                 0x000000      Informational
2XX          Gray      12632256          0xC0C0C0      Successful
304          DarkGray  10526880          0xA0A0A0      Not Modified
3XX          Blue      238               0x0000EE      Redirection
4XX          Violet    15631086          0xEE82EE      Client Error
5XX          Red       16711680          0xFF0000      Server Error
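The integer and hex columns are two spellings of the same 24-bit value: red occupies the top 8 bits, green the middle 8, and blue the lowest 8. The following Python sketch shows the arithmetic; the helper names pack_rgb and unpack_rgb are made up for illustration.

```python
def pack_rgb(red, green, blue):
    """Combine three 0-255 components into one 24-bit 0xRRGGBB integer."""
    return (red << 16) | (green << 8) | blue


def unpack_rgb(color):
    """Split a 24-bit 0xRRGGBB integer into its (red, green, blue) components."""
    return (color >> 16) & 0xFF, (color >> 8) & 0xFF, color & 0xFF
```

For example, violet is pack_rgb(0xEE, 0x82, 0xEE), which is 0xEE82EE, the integer 15631086 used in the gnuplot function.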

The last line in Code Listing 1, at the end of this article, is the actual command to draw the scatter plot. It names the input file (gnuplot.inp20070123.txt), which holds the four values for each dot, and gives the order in which those values are used. The fourth dimension, the color, is computed from the fourth column by the function rgb.

splot 'gnuplot.inp20070123.txt' using 1:2:3:(rgb($4)) with dots lc rgb variable

For the benefit of those who have not read the first article: each dot in Figure 1 corresponds to one line (that is, one HTTP request) in the access logfile of a web server. The three axes are time, IP address, and content. The status code, which also appears in every line of the access logfile, is represented by the color of the dot in 3D space. This particular plot looks featureless, but Figure 2 looks sinister and could give a sysadmin sleepless nights. It shows a spider attack; the tall pillar is a concentrated salvo of requests over the whole content space, one that is guaranteed to make the database, where the content is stored, break into a sweat.

3D plot of a bad day (spider attack)
Figure 2. Spider attack in color

The left flank of the pillar is gray because all the components could still handle the onslaught. The barrage of requests, however, soon brings the database to its knees and the dots turn red (5XX status codes). This affects all requests at that time, as the long shadow of the pillar shows. Nevertheless, the system recovers quite quickly after the attack and the red color fades away with time (increasing values on the x-axis).

The representations of these plots were inspired by Edward Tufte's credo: simple design, intense content (Reference 1). I hope that this visualization is in that spirit: three dimensions for three columns of the logfile, color for the fourth, and about a million data points. The simplicity of the representation means, however, that interpreting the plots is left to the viewer. Clustering, as we have seen in Figure 2, is a sure sign of forces at work and deserves attention.

Other Improvements

We will now polish the plot in two ways: by jittering each point by a random value in the range ±0.5 to give a nonintegral value for the z-axis (see the splot command in Listing 3), and by using a base-2 logarithmic scale for that axis. Both change how the data is distributed along the z-axis. Jittering converts integer values to real numbers, which spreads the data points that sat exactly at 12.0 (for example) over the range 11.5 to 12.5. The log scale compresses the data at high z values, where the data begins to thin out. In the previous article we touched upon Zipf's Law and noted that hits on a web server follow it. Spreading out the data near the origin and compressing it farther away evens out the dense region near the origin. The disadvantage is that some features become less prominent and may be hidden by others, as a comparison of the red pillar in Figures 2 and 3 shows.

Figure 3. Improved plot by spreading the data

A log scale for an axis is specified with the gnuplot command set logscale z 2. The random offset is added by replacing the plain column indicator 3 in the splot command with ($3+rand(0)-0.5). The expression adds a random value between 0 and 1 and then subtracts 0.5, which in effect adds a random value between -0.5 and +0.5 to the third dimension.
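The effect of the two adjustments can be checked numerically. The Python sketch below is only an illustration of the arithmetic, not part of the gnuplot listings; the function names jitter and log2_position are assumptions.

```python
import math
import random


def jitter(rank, rng=random):
    """Spread an integer rank uniformly over [rank - 0.5, rank + 0.5)."""
    return rank + rng.random() - 0.5


def log2_position(z):
    """Return where the value z lands on a base-2 logarithmic axis."""
    return math.log2(z)


# On a log2 axis, the step from rank 1 to rank 2 takes up as much room
# as the step from rank 1024 to rank 2048.
near_origin = log2_position(2) - log2_position(1)
far_out = log2_position(2048) - log2_position(1024)
print(near_origin, far_out)
```

Because the smallest rank is 1, a jittered value never falls below 0.5, so it stays positive and remains valid on a logarithmic axis.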

Animation

Authors of articles on 3D plots face the dilemma of how to show a three-dimensional structure in print or on the Web. Though constrained by the characteristics of the Web, we can use animation to convey the 3D structure. With the GIF animation feature of gnuplot 4.2, one can rotate the 3D plot and make it visible from several angles. Listing 4 has the commands to move the viewpoint and replot the graph. Each plot command results in one frame of the animated sequence.

Figure 4. Rotating the 3D plot
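The repetitive part of such a script, one viewpoint change and one replot per frame, can be generated rather than typed. The Python sketch below is not Listing 4; it is a hypothetical helper that only prints gnuplot's set view and replot commands. It assumes that an animated GIF terminal (for example set terminal gif animate), an output file, and the initial splot command have already been set up in the gnuplot script, and that a tilt of 60 degrees is wanted.

```python
def rotation_commands(frames, rot_x=60):
    """Return gnuplot commands that turn the plot once around the z-axis."""
    commands = []
    for frame in range(frames):
        rot_z = (360 * frame) // frames  # evenly spaced angles from 0 up to 360
        commands.append(f"set view {rot_x},{rot_z}")
        commands.append("replot")        # each replot adds one animation frame
    return commands


for line in rotation_commands(36):
    print(line)
```

With 36 frames the viewpoint advances by 10 degrees per frame; more frames give a smoother but larger animation.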

Further Experimentation

Before concluding this article, I would like to encourage you to use the methods described here for your own needs. For example, you could take other items from the access logfile instead of the ones I chose. Likely candidates are bytes transferred and the time taken to serve a request. You are not restricted to web server logs; syslog (or event logs on Windows), mail logs, performance data, or mixtures of these are suitable candidates for such analysis. A picture is worth a thousand words.

Conclusion

The basic tenet of my two articles has been to plot three or four seemingly unrelated parameters on orthogonal axes to visually ascertain clustering and, therefore, relations among those parameters. This, I believe, is one useful way to visualize web server logs that can consist of hundreds of thousands of lines, each of which has a number of items. Other logfiles, when tortured in a similar fashion, may squeal about the inner workings of otherwise hidden processes.

References

  1. Scientific American April 2005

Code Listings

  1. Gnuplot commands for Figure 1
  2. Gnuplot commands for Figure 2
  3. Gnuplot commands for Figure 3
  4. Gnuplot commands for Figure 4
  5. Perl script to convert access logfiles to input suitable for gnuplot for Figure 4; unchanged from the previous article

Raju Varghese has a bachelor's degree in Electrical Engineering from BITS, Pilani (India) and a master's degree in Computer Science from the University of Texas, San Antonio.



Copyright © 2009 O'Reilly Media, Inc.