SANS Digital Forensics and Incident Response Blog

Forensics and Perl-Fu: Reducing Data and Cleaning Up Log Files

By: Mike Worman

Perl's simplicity and its raw power may seem paradoxical, but this is simply a clever ruse. There is a lot going on behind the scenes when using Perl, which has often been described as the scripting language that attempts to figure out exactly what the developer wants in as little code as possible, and it usually succeeds. Even when it doesn't, another possible approach is usually immediately apparent. Never forget the Perl motto: TIMTOWTDI!

My pieces won't dwell on the deeper mysteries of Perl, which are way beyond the scope of our blog (or my own brain). My intention is to start small and illustrate how I have been using Perl to meet some very puzzling challenges over the course of my career. Before we start, keep in mind there will always be people who can do what I am doing in a single, complex set of shell commands or in command-line Perl (using -e). These articles are not for you, but if there is interest in more advanced command-line stuff, please let me know.

Consider the following code, a basic script I wrote called LogCleaner.pl. When dealing with log files, especially large ones, you occasionally run into the "this log file is full of blank lines" problem. The following snippet simply opens a file given on the command line (e.g., C:\LogCleaner.pl test.txt) or exits with a proper usage message ("Usage: LogCleaner.pl <filename>") if no such file exists. It then opens a second, new file with the name of the first and ".cleaned.txt" appended. Then, for each line in the old file, it writes that same line into the new, clean file, as long as that line is not simply a blank line (signified by a single newline, "\n"). Finally, the script closes both the old and the new files, ensuring synchronization with the file system.

#!/usr/bin/perl
# LogCleaner.pl
# Removes blank lines from text logs

# Open the log named on the command line, or exit with a usage message.
open INPUT_FILE, "<", $ARGV[0] or die "Usage: LogCleaner.pl <filename>\n";
# Create the cleaned copy alongside the original.
open OUTPUT_FILE, ">", "$ARGV[0].cleaned.txt" or die "Cannot create output file: $!\n";
# Copy every line except blank lines (a lone newline).
while ($line = <INPUT_FILE>) { print OUTPUT_FILE $line unless ($line eq "\n"); }
# close() takes one filehandle at a time, so close each handle explicitly.
close INPUT_FILE;
close OUTPUT_FILE;

It may not look like much, but I have used very similar code to clean hundreds of cluttered, malformed, and otherwise borked log files full of whitespace and extra lines since the 1990s. For some reason Cisco log files come to mind most readily. This script (which works on both Linux and Windows) can form the basis of almost any data processing program you can think up. Instead of creating a new file without blank lines, you could create a new file that contains only certain columns, or certain characters, or certain patterns of characters (regexps) from the old file, as sketched below. To some people this will seem like old hat, but others may see the ramifications. This script can pull a few bits of useful data out of a 1TB file otherwise filled with junk, such as newlines or whitespace. Whitespace in log file processing is a REAL PROBLEM for analytics, especially when that data is moved around by different import/export functions such as those found in Microsoft Excel. The popular "strings" program works in a similar fashion: if I have a 1TB file of binary data, how can I pull out things that resemble ASCII, stuff that a human can actually read and comprehend?
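For example, a lightly modified version of the loop can keep only the lines and columns you care about. The sketch below is just for illustration and is not part of the original script; the pattern, the column positions, and the LogFilter.pl name are all made up.

#!/usr/bin/perl
# LogFilter.pl (hypothetical example) - keep only matching lines and selected columns.
open INPUT_FILE, "<", $ARGV[0] or die "Usage: LogFilter.pl <filename>\n";
open OUTPUT_FILE, ">", "$ARGV[0].filtered.txt" or die "Cannot create output file: $!\n";
while ($line = <INPUT_FILE>) {
    next unless $line =~ /denied/i;        # keep only lines matching a pattern of interest
    @fields = split /\s+/, $line;          # break the line into whitespace-separated columns
    print OUTPUT_FILE "$fields[0] $fields[2]\n" if @fields >= 3;   # keep the 1st and 3rd columns
}
close INPUT_FILE;
close OUTPUT_FILE;

Swap the pattern or the split for whatever your log format calls for; the skeleton stays the same.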

The answer is to go through the file, line by line, attempting to match patterns (regular expressions) that look like ASCII, in the same way the short script above goes through a file line by line and builds a brand-new file, skipping over lines of data that are not useful or wanted. While the conditional above ($line eq "\n") is not a standard Perl regular expression, it does the same job: it defines what we want to discard from our working set.
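To make that concrete, here is a rough, strings-style sketch (again, an illustration rather than part of the original script): it reads a file in binary mode, in fixed-size chunks, and prints any run of four or more printable ASCII characters. Runs that straddle a chunk boundary will be split, which is good enough for a sketch.

#!/usr/bin/perl
# AsciiStrings.pl (hypothetical example) - a crude strings(1)-style extractor.
open INPUT_FILE, "<", $ARGV[0] or die "Usage: AsciiStrings.pl <filename>\n";
binmode INPUT_FILE;                             # treat the input as raw bytes
while (read(INPUT_FILE, $chunk, 65536)) {       # read 64KB at a time
    while ($chunk =~ /([\x20-\x7e]{4,})/g) {    # match runs of printable ASCII
        print "$1\n";                           # one recovered string per line
    }
}
close INPUT_FILE;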

The basic premise here (and Perl is by no means an exclusive solution to these problems, just a powerful one) is that when dealing with very large sets of data, before you even begin analytics you should ask yourself two questions: "Do I need all this data to solve my problem?" and "If not, how can I reduce the size of my working set?" How these questions are answered can mean worlds of difference in forensics and IR, from whether or not you have the time to complete lengthy keyword searches to whether or not you can afford enough disk space to store the data at all. You will hopefully find, as I have, that Perl is a key tool for answering both questions. Perl can help you use the data you already have more effectively, and it can help you reduce or reshape your data set into a more practical form. More practical translates into faster and more timely searching, faster transfer from point A to point B, lower storage requirements, and ultimately, less burnout among IR and forensics professionals.

Mike Worman, GCFA Gold #124, GCIA Gold #282, is an incident response, forensics, and information security subject matter expert for an international telecommunications carrier. He holds a BS in Computer Systems Engineering from the University of Massachusetts, an MS in Information Assurance from Norwich University, and is CISSP-certified.

2 Comments

Posted April 23, 2009 at 2:54 PM

larrymcd

How about a nice simple "grep -v "^$" test.txt>test.txt.out"

Posted April 23, 2009 at 3:13 PM

mworman

Hi Larry. Grep certainly works for the purpose of the tiny example I gave, but it ends there for all but the Unix command-line crowd. Inline regular expression matching can be performed with grep or with perl -e, and of course with all of the various mutations of grep out there (egrep, etc.). My example above is just the starting point for something a little more elaborate, but you're following the Perl motto even with grep: TMTOWTDI!
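For what it's worth, a roughly equivalent perl one-liner for that grep would be something like:

perl -ne 'print unless /^$/' test.txt > test.txt.out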
That said, there are some major differences to point out with regard to your solution:
First, grep gives you no direct control over I/O streams (opening, closing, and manipulating bytes through file handles in the operating system). This affects things like processing compressed files, streamed encryption, and so on. You may be able to "chain" UNIX commands together to accomplish what you want, but you are always more limited at the command shell than you are with a full scripting language. That is kind of where I am heading with all of this.
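As a quick illustration (a sketch only, with a made-up firewall.log.gz filename, and assuming gzip is on the PATH), Perl can read a compressed log through a pipe file handle and filter it in a single pass, without ever writing a decompressed copy to disk:

#!/usr/bin/perl
# Sketch: stream a gzip-compressed log through a pipe and filter it in one pass.
open GZ_PIPE, "-|", "gzip", "-dc", "firewall.log.gz" or die "Cannot start gzip: $!\n";
while ($line = <GZ_PIPE>) {
    print $line if $line =~ /denied/i;   # keep only the lines of interest
}
close GZ_PIPE;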
The other issue is that while grep has a strong regular expression interface, Perl's is much more powerful (Perl includes, by the way, a full implementation of the grep() function). While I have done many a regexp in grep, there are always limitations you run into at the command line, most notably surrounding issues of scale. Recursive regular expression matching (finding a pattern inside a pattern inside a pattern), for instance, is much easier to do in Perl than in grep.
Thanks for the comment!