The idea is to look at how often groups of consecutive lines repeat in a text
file. This is a script that I slapped together to do just that:

#!/usr/bin/env python
'''Silly log parsing script

usage:
    silly_log_parser.py [options] [logfilename]

If logfilename is not specified, this script reads the log from stdin.

options:
    -s numsplits - defaults to 3
        numsplits is the number of whitespace separated "columns" this will
        split off the front of each line of the log file
    -m max_chain - defaults to 2
    -t - disable displaying summary stats
    -l - disable displaying line stats
    -h - print this help message
'''
import sys
import getopt

num_splits = 3
# Stored as one past the longest chain length, because it is used as the
# upper bound of range(); 3 here means chains of 1 and 2 lines.
max_chain = 3
display_summary_stats = True
display_line_stats = True

def usage():
    print(__doc__)

try:
    opts, args = getopt.getopt(sys.argv[1:], "s:m:tlh")
except getopt.GetoptError:
    usage()
    sys.exit(2)

for o, a in opts:
    #print(o, a)
    if o == "-h":
        usage()
        sys.exit()
    if o == "-s":
        num_splits = int(a)
    if o == "-m":
        max_chain = int(a) + 1
    if o == "-t":
        display_summary_stats = False
    if o == "-l":
        display_line_stats = False

try:
    infile = open(args[0], "r")
    raw_lines = infile.readlines()
except IndexError:
    # no filename given: read the log from stdin
    raw_lines = sys.stdin.readlines()
except IOError:
    print("File does not exist")
    usage()
    sys.exit()

# Drop blank lines and split the first num_splits columns off each line.
# Note: this assumes every non-blank line has more than num_splits columns.
lines = [l.split(None, num_splits)[num_splits] for l in [ll.strip() for ll in raw_lines] if not l == ""]

# Count every chain of 1 .. max_chain - 1 consecutive lines.
chain_dict = {}
for i in range(len(lines)):
    try:
        for j in range(1, max_chain):
            chain_tuple = tuple(lines[i:i+j])
            if (i + j) > len(lines):
                continue
            chain_dict[chain_tuple] = chain_dict.setdefault(chain_tuple, 0) + 1
    except IndexError:
        continue
#print(chain_dict)

if display_line_stats:
    for i in range(len(lines)):
        #display line stats
        chain_list = []
        try:
            try:
                for j in range(1, max_chain):
                    chain_tuple = tuple(lines[i:i+j])
                    if (i + j) > len(lines):
                        continue
                    #chain_list += [(chain_dict[chain_tuple], chain_tuple)]
                    chain_list += [chain_dict[chain_tuple]]
                print("%s-> %s" % (chain_list, lines[i]))
            except IndexError:
                print("%s-> %s" % (chain_list, lines[i]))
                continue
        except KeyboardInterrupt:
            sys.exit()
        except IOError:
            sys.exit()
        except OSError:
            sys.exit()

if display_summary_stats:
    # score = (chain length - 1) * (occurrences - 1), highest first
    chain_stats = [((len(c) -1) * (chain_dict[c] - 1), c) for c in chain_dict]
    chain_stats.sort()
    chain_stats.reverse()
    try:
        #display summary stats
        for c in chain_stats:
            print("*" * 40)
            print(chain_dict[c[1]])
            print("-" * 40)
            print("\n".join(c[1]))
            print("*" * 40)
    except KeyboardInterrupt:
        sys.exit()
    except IOError:
        sys.exit()
    except OSError:
        sys.exit()

Basically, it takes in a text file, splits the leading whitespace-separated
columns off each line (three by default, which gets rid of the date), iterates
through the lines in the file, groups together the current line by itself,
then with the next line, then with the next two lines, up to the maximum chain
length, and counts how many times each of those groups of lines appears in the
file. I call this a quasi-Markov analysis, because it reminds me of a couple
of the Markov chain programs that I’ve written to piece together words and
groups of words from a text file. It will then, depending on command line
arguments, iterate through the lines in the file again and print out each line
with a list of sequence counts in front of it, print out the most frequently
occurring groups of lines, sorted in descending order, or do both (the
default).
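
The counting step on its own can be sketched in a few lines. This is a minimal sketch, not the script above: it assumes the leading columns have already been stripped, and the helper name `count_chains`, the `max_len` parameter, and the use of `collections.Counter` are my own choices for illustration. The sample lines are made up.

```python
from collections import Counter

def count_chains(lines, max_len=2):
    """Count every run of 1..max_len consecutive lines (hypothetical helper)."""
    counts = Counter()
    for i in range(len(lines)):
        for j in range(1, max_len + 1):
            if i + j > len(lines):
                break  # not enough lines left for a chain this long
            counts[tuple(lines[i:i + j])] += 1
    return counts

# Made-up sample input, purely for illustration.
sample = ["login", "error", "login", "error", "logout"]
counts = count_chains(sample)
```

Here `counts` maps each tuple of consecutive lines to the number of times it occurs, which is the same shape as `chain_dict` in the script.
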
The “most occurring groups of lines” sort key is the product of the length of
the sequence (the number of consecutive lines in the group) minus one and the
number of occurrences of the group of lines minus one. I figured that the
longer a sequence was and the more times it appeared, the more interesting it
might prove to be. I also figured that single lines were probably not so
interesting. But maybe they are. Maybe I’ll change it so that the sort key is
the product of the length of the sequence and the number of occurrences of the
group of lines minus one. I think I will still leave the number of occurrences
minus one; I’m not too interested in sequences that only show up once.
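
To make the two scoring options concrete, here is a small sketch. The function name `score` and the `count_singles` flag are hypothetical, introduced only for this example, and the counts are made up.

```python
def score(chain, occurrences, count_singles=False):
    """Sort key for a chain of lines (hypothetical helper).

    Default:            (length - 1) * (occurrences - 1)
    count_singles=True:  length      * (occurrences - 1)
    """
    length = len(chain) if count_singles else len(chain) - 1
    return length * (occurrences - 1)

# Made-up counts, purely for illustration.
counts = {("a",): 5, ("a", "b"): 3, ("a", "b", "c"): 1}

# Highest score first, as in the summary output.
ranked = sorted(counts, key=lambda c: score(c, counts[c]), reverse=True)
```

With the default key, single lines and chains seen only once both score zero. With `count_singles=True`, a frequently repeated single line can rank alongside longer chains, while chains seen only once still score zero.
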
I’m still not really sure how useful this will prove and I do have just a
little more work to do with it. Another feature I’m going to add is the
ability to specify a regex and have it print summaries only for blocks that
contain a line matching that regex. This could prove useful to see the block
of recurring lines surrounding, for example, error messages.
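
One way the regex filter might look, as a rough sketch only: the feature does not exist in the script yet, so the function name `chains_matching` and its arguments are assumptions, and the sample data is invented.

```python
import re

def chains_matching(chain_counts, pattern):
    """Keep only chains where at least one line matches pattern (hypothetical helper)."""
    regex = re.compile(pattern)
    return {
        chain: count
        for chain, count in chain_counts.items()
        if any(regex.search(line) for line in chain)
    }

# Made-up chain counts, purely for illustration.
chain_counts = {
    ("connect", "ERROR timeout"): 3,
    ("connect", "ok"): 7,
    ("ERROR timeout",): 4,
}
errors = chains_matching(chain_counts, r"ERROR")
```

The filtered dictionary could then be fed to the same summary printing code, so only blocks that include a matching line are shown.
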
This is an example of the beauty of Python. This code isn’t particularly
beautiful. Maybe it will be more so when I clean it up. What’s beautiful is
that I was able to throw this together in no time. I had an idea, started
coding, and with very little effort and in very little time, I had a working
piece of code. And, it was fun!
Comments, suggestions, criticisms of the code? Any examples of being able to
slap together a piece of Python code quickly that saved you time?

