A New, Improved Visualization for Web Server Logs
by Raju Varghese
In my last article I showed how web server logfiles can be visualized as a 3D plot with the help of Perl and gnuplot. In this article we will enhance the plot in several ways. The two main additions are color and a way to even out the plot.
Access logfiles from a web server need to be filtered before the data is passed on to gnuplot. Listing 5, the Perl script that was used in the previous article, can be used in this one as well. Each line in the access logfile produces one line of output; of the many items in a line of the access logfile, four items are extracted: the timestamp, URL, IP address, and status code of the request. The URL in the output is not the actual URL but its rank in the list of URLs in the file. Similarly, the IP address is the rank of the actual IP address in a sorted list. Both of these are integer numbers. The output file so created can be read directly by gnuplot, as you will see later in this article.
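To make the filtering step concrete, here is a minimal sketch in Python rather than the Perl of Listing 5. It is not the author's script. It assumes the logfile is in Common Log Format, that ranks are 1-based positions in a plain string sort, that the timestamp is written as seconds since the epoch, and that the output columns are ordered time, IP rank, URL rank, status. The names `LOG_LINE`, `parse`, `rank_table`, and `to_gnuplot_rows` are made up for this illustration.

```python
import re
from datetime import datetime

# Assumed Common Log Format prefix:
#   IP identity user [timestamp] "METHOD URL PROTOCOL" status ...
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?:\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3})'
)

def parse(lines):
    """Return (epoch_seconds, ip, url, status) for each line that matches."""
    records = []
    for line in lines:
        m = LOG_LINE.match(line)
        if m is None:
            continue  # skip lines that do not look like a request
        ts = datetime.strptime(m["ts"], "%d/%b/%Y:%H:%M:%S %z")
        records.append((int(ts.timestamp()), m["ip"], m["url"], int(m["status"])))
    return records

def rank_table(values):
    """Map each distinct value to its 1-based position in sorted order."""
    return {v: i for i, v in enumerate(sorted(set(values)), start=1)}

def to_gnuplot_rows(lines):
    """Produce one whitespace-separated row per request.

    Assumed column order: time, IP rank, URL rank, status code.
    """
    records = parse(lines)
    ip_rank = rank_table(r[1] for r in records)
    url_rank = rank_table(r[2] for r in records)
    return [
        f"{t} {ip_rank[ip]} {url_rank[url]} {status}"
        for t, ip, url, status in records
    ]
```

Calling `to_gnuplot_rows(open("access.log"))` (the filename is a placeholder) and writing the rows to a text file would give gnuplot four numeric columns per request. Sorting IP addresses as strings is the simplest choice here; a numeric sort would keep neighboring addresses closer together on the axis.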
The 3D plots in the previous article were bland and monochrome; the version of gnuplot available at that time could not handle multiple colors for scatter plots. With the release of gnuplot 4.2 on March 3, 2007, the possibilities have increased. We will display the status code as the fourth dimension of the data, in color.
Figure 1. Color scatter plot showing HTTP requests
Code Listing 1 shows the gnuplot commands used to generate Figure 1. All commands except the last two should be self-explanatory. The penultimate command defines a function that returns the color of the dot depending on the status code. It is a nested ternary conditional statement in the syntax needed for gnuplot.
rgb(r) = (r<200)? (000000): (r<300)? (12632256): (r==304)? (10526880): (r<400)? (238): (r<500)? (15631086): (16711680)
In pseudo-code it could be written as:
if (statusCode < 200)          # 1XX
    return black
else if (statusCode < 300)     # 2XX
    return gray
else if (statusCode == 304)    # Not modified
    return darkgray
else if (statusCode < 400)     # other 3XX or redirects
    return blue
else if (statusCode < 500)     # 4XX including the infamous page-not-found
    return violet
else                           # 5XX
    return red
end
The status code 304 (Not Modified) deserves special treatment because even though it is in the 3XX group it is not a redirect. It states that the content was not modified and that the client can continue to use the cached copy. I have therefore considered it similar to the 2XX status code but given it a different shade of gray. The table below shows the HTTP status codes and the corresponding color codes as integer and hex numbers.
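|Status code|Color|Color as integer|Color as hex|Comment|
|1XX|black|0|000000|Informational|
|2XX|gray|12632256|C0C0C0|Success|
|304|dark gray|10526880|A0A0A0|Not Modified; treated like 2XX|
|Other 3XX|blue|238|0000EE|Redirects|
|4XX|violet|15631086|EE82EE|Client errors, including 404 Not Found|
|5XX|red|16711680|FF0000|Server errors|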
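The mapping in the rgb function can also be checked outside gnuplot. The following Python sketch mirrors the nested conditional and prints each color both as the decimal integer used in the gnuplot function and as the more familiar six-digit hex value. The function name `status_color` is an assumption made for this illustration; the color values are the ones from the rgb definition above.

```python
def status_color(status):
    """Return the 24-bit RGB color, as an integer, for an HTTP status code."""
    if status < 200:
        return 0x000000  # 1XX: black
    if status < 300:
        return 0xC0C0C0  # 2XX: gray
    if status == 304:
        return 0xA0A0A0  # 304 Not Modified: dark gray
    if status < 400:
        return 0x0000EE  # other 3XX (redirects): blue
    if status < 500:
        return 0xEE82EE  # 4XX: violet
    return 0xFF0000      # 5XX: red

# Print status code, decimal color, and hex color side by side.
for code in (100, 200, 304, 302, 404, 500):
    color = status_color(code)
    print(f"{code}  {color:>8d}  0x{color:06X}")
```

The order of the tests matters: 304 must be checked before the general 3XX range, just as in the gnuplot version, or it would be colored blue like a redirect.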
The last line in Code Listing 1, at the end of this article, is the actual command to draw the scatter plot. It specifies the input file (gnuplot.inp20070123.txt), which holds the four values for each dot, and the order in which those values are used. The fourth dimension, the color, is calculated from the status code in column 4 by the rgb function defined above:
splot 'gnuplot.inp20070123.txt' using 1:2:3:(rgb($4)) with dots lc rgb variable
For the benefit of those who have not read the first article, each dot in Figure 1 corresponds to a line (i.e., one HTTP request) from the access logfile of a web server. The three axes are time, IP address, and content. The status code, which is also in every line of the access logfile, is represented by the color of the dot in 3D space. This particular plot looks featureless, but Figure 2 looks sinister and could give a sysadmin sleepless nights. It shows a spider attack; the tall pillar is a concentrated salvo of requests over the whole content space, one that is guaranteed to make the database, where the content is stored, break into a sweat.
Figure 2. Spider attack in color
The left flank of the pillar is gray because all the components could handle the onslaught. The barrage of requests, however, soon brings the database to its knees and the status code changes to red (5XX). It affects all requests at that time, as the long shadow of the pillar shows. Nevertheless, the system recovers quite quickly after the attack and the red color fades away with time (increasing x-axis).
The representations of these plots were inspired by Edward Tufte's credo: simple design, intense content (Reference 1). I hope that this visualization is in that spirit: three dimensions for three columns of the logfile and color for the fourth, with about a million data points. The simplicity of the representation requires, however, that the interpretation of the plots be left to the viewer. Clustering, as we have seen in Figure 2, is a sure sign of forces at work and deserves attention.