Wednesday, March 24, 2010

AWK'ward logs: Finding errors by type or time with bash

Just in case I ever need to do this again, a couple of notes about quickly summarizing error logs using Bash. Script examples all run on Fedora 6.

Error By Time
Recently I ran into a couple of log files (multiple nodes and log rolling) with more errors than I cared to examine one by one.

2010-03-10 06:32:57,982 ERROR CasLoginView:0 - CAS Login failedin btnLogin_Click. Root error = Could not send Message.
  System.Web.Services.Protocols.SoapHeaderException: Could not send Message.
  at System.Web.Services.Protocols.SoapHttpClientProtocol.ReadResponse(SoapClientMessage message, WebResponse response, Stream responseStream, Boolean asyncCall)
  at System.Web.Services.Protocols.SoapHttpClientProtocol.Invoke(String methodName, Object[] parameters)
  at ...etc...
2010-03-10 07:47:21,900 ERROR HBXTaggingControl:0 - Error initializing
  System.Collections.Generic.KeyNotFoundException: The given key was not present in the dictionary.
  at System.ThrowHelper.ThrowKeyNotFoundException()
  at System.Collections.Generic.Dictionary`2.get_Item(TKey key)
  at ActiveFoundation.ConsumerUI.PageTypeCore.HBXTaggingControl.PrepareHBXScript()
  at ActiveFoundation.ConsumerUI.PageTypeCore.HBXTaggingControl.OnInit(EventArgs e)
2010-03-10 07:50:23,759 ERROR HBXTaggingControl:0 - Error initializing
  System.Collections.Generic.KeyNotFoundException: The given key was not present in the dictionary.
  at System.ThrowHelper.ThrowKeyNotFoundException()
  at System.Collections.Generic.Dictionary`2.get_Item(TKey key)
  at ...etc...

In order to start making sense of it all, I quickly found I needed summaries: errors by time, errors by type, and so on.

The first and most basic need was to filter down to just the errors per hour for a specific day. The desired result is date, hour, and number of errors, with an entry for every hour, not just the hours that have errors:

2010-03-10 00 0
2010-03-10 01 12

If we just wanted an entry for every hour with more than zero errors we could simply grep and count. To also get entries for hours that have no errors in the log, the easiest approach is to loop through the days and hours we care about, which is easy enough in a Bash script.

The dates in the log are padded with leading zeros, so to build matching patterns in Bash we can use printf to zero-pad the day and hour for us.
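
For example (just an illustrative shell session, not part of the scripts below), the %02d format pads single digits out to two characters:

[build@localhost dm-logs]$ printf "%02d %02d\n" 3 7
03 07

Putting this together, we start by writing a simple loop script that builds up roughly the statement we want and stores it into a variable: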

#!/bin/bash

# Count ERROR lines per hour for each day of interest.
wd=`pwd`
for day in {23..23}
do
   for hr in {0..23}
   do
      # Build the grep pipeline as a string, zero-padding the day and hour.
      cmd=`printf "grep \"^2010-03-%02d %02d\" ${wd}/*.log  | grep ERROR | wc -l" ${day} ${hr}`
      echo ${cmd}
   done
done

Note that the cmd= line uses backticks (the old-school syntax for command substitution) to execute printf and capture the result in a variable; it would be more modern to use $(printf ...) instead.
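
For the record, the same line written with the newer syntax would look something like this (the behaviour should be identical, though I have not swapped it into the script above):

cmd=$(printf "grep \"^2010-03-%02d %02d\" ${wd}/*.log  | grep ERROR | wc -l" ${day} ${hr})

If we run this script it will print the series of commands we wish to execute to find the error count for each hour: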


[build@localhost dm-logs]$ ./blog.sh 
grep "^2010-03-23 00" /home/build/dm-logs/ConsumerUI1.log /home/build/dm-logs/ConsumerUI2.log /home/build/dm-logs/ConsumerUI3.log | grep ERROR | wc -l
grep "^2010-03-23 01" /home/build/dm-logs/ConsumerUI1.log /home/build/dm-logs/ConsumerUI2.log /home/build/dm-logs/ConsumerUI3.log | grep ERROR | wc -l
...etc...


So far so good, but we'd much prefer to actually execute the commands. It also turns out we'd like to print the day and hour followed by the grep output (i.e. the error count for that hour), so we'll modify the command a little while we're at it:

#!/bin/bash

# Print "date hour error-count" for every hour, executing the grep this time.
wd=`pwd`
for day in {23..23}
do
   for hr in {0..23}
   do
      # Build the echo/grep statement with printf, then run it with eval.
      eval $(printf "echo \"2010-03-%02d %02d \" \$(grep \"^2010-03-%02d %02d\" ${wd}/*.log  | grep ERROR | wc -l)" ${day} ${hr} ${day} ${hr})
   done
done

What's going on here is that we're building up a command using printf, capturing the result by way of command substitution $(printf ...blah...), then executing the result using eval. The statement we run eval on will look like this:

echo "2010-03-23 23 " $(grep "^2010-03-23 23" /home/build/dm-logs/ConsumerUI1.log /home/build/dm-logs/ConsumerUI2.log /home/build/dm-logs/ConsumerUI3.log | grep ERROR | wc -l)

We echo the date and hour (echo "2010-03-23 23 ") and follow it, courtesy of command substitution, with the output of the command that counts up the error entries for that hour: $(grep "^2010-03-23 23" /home/build/dm-logs/ConsumerUI1.log /home/build/dm-logs/ConsumerUI2.log /home/build/dm-logs/ConsumerUI3.log | grep ERROR | wc -l).
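
Incidentally, the printf/eval gymnastics are not strictly required; padding the numbers up front and relying on ordinary variable expansion should give the same output. Something like this inside the inner loop (a sketch I have not run against the original logs) would do it:

      d=$(printf "%02d" ${day})
      h=$(printf "%02d" ${hr})
      echo "2010-03-${d} ${h} " $(grep "^2010-03-${d} ${h}" ${wd}/*.log | grep ERROR | wc -l)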

The result might look something like this:

2010-03-23 00  0
2010-03-23 01  0
2010-03-23 02  0
2010-03-23 03  0
2010-03-23 04  0
2010-03-23 05  0
2010-03-23 06  54
2010-03-23 07  304
2010-03-23 08  236
2010-03-23 09  320
2010-03-23 10  175
2010-03-23 11  199
2010-03-23 12  109
2010-03-23 13  0
2010-03-23 14  0
2010-03-23 15  0
2010-03-23 16  0
2010-03-23 17  0
2010-03-23 18  0
2010-03-23 19  0
2010-03-23 20  0
2010-03-23 21  0
2010-03-23 22  0
2010-03-23 23  0

From such a result we might infer that we had an exciting morning! This type of output works very nicely in Excel as a series so we can get a pretty picture out of it, graph and compare multiple series, and so on.

Error By Type
In addition to knowing what is failing by time, we may wish to know where the failures are coming from. Our log format gives this to us, so let's dig up the source components and count how many errors each one is filing. The source is the component name right after ERROR (CasLoginView:0 in the line below):

2010-03-10 06:32:57,982 ERROR CasLoginView:0 - CAS Login failedin btnLogin_Click. Root error = Could not send Message.

So, the first step is pulling out the source for each error. Our tokens are roughly space-separated, so they fit perfectly into awk. For our purposes it is sufficient to yank out the fourth token, then collapse the duplicates into counts:

grep ^2010-03-23 *.log | grep ERROR | awk '{ print $4 }' | sort | uniq -c

This gives us a nice list of error counts by source, similar to:

      7 CasLoginView:0
     34 SomeOtherComponent:0
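
If the list gets long, tacking a numeric sort onto the end floats the noisiest sources to the top (just a variation on the same one-liner):

grep ^2010-03-23 *.log | grep ERROR | awk '{ print $4 }' | sort | uniq -c | sort -rn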

This is nice, but what we'd really like is the count of each exception message from each source. We can loop through the sources, grab the errors for each source, strip off the leading date/time fields (by substituting "" for them with awk), then count it all up:

#!/bin/bash

# For each distinct error source, count how many times each message appears.
wd=`pwd`
for source in `grep ^2010-03-23 *.log | grep ERROR | awk '{ print $4 }' | sort | uniq`
do
  # Strip the leading "date time ERROR source - " prefix, then count the duplicates.
  grep "^2010-03-23 " *.log  | grep "ERROR ${source}" | awk '{v=$0; gsub($1" "$2" "$3" "$4" - ", "", v); print $4" "v; }' | sort | uniq -c
done

This will give us something like:

      3 LoginControl:0 Some clever failure message
     54 LoginControl:0 Another clever failure message
      1 LoginControl:0 A rare but still fascinating failure

Like most things in Bash, there are nearly endless ways to make such little scripts faster, shorter, and so on.
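
For instance, the whole errors-by-source-and-message report could probably be collapsed into a single awk pass over the dated lines, something like this sketch (not tested against the original logs):

grep "^2010-03-23 " *.log | awk '$3 == "ERROR" { v=$0; sub(/^[^ ]+ [^ ]+ ERROR [^ ]+ - /, "", v); count[$4" "v]++ } END { for (k in count) printf "%7d %s\n", count[k], k }'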
