SANS Digital Forensics and Incident Response Blog

Perl-Fu: Regexp log file processing

Remember that Perl's key benefit is the ability to implement almost any input/output processing system you might need or conceive, without a lot of code or development time. When you are faced with massive amounts of data and a small amount of analytical time, this agility is critical. I will not be teaching regular expression syntax; there are countless primers and resources on the web for that, and they apply almost universally to languages and interpreters beyond Perl, including our favorite command line tool, grep. Consider the following code:

# Creates user-specific files from a single log file based on the field "User="
my $logfile = $ARGV[0] or die "Usage: $0 <logfile>\n";
open(LOG, "<", $logfile) or die "Cannot open $logfile: $!\n";
print "Processing $logfile...\n";
while (<LOG>) {
    if (/User=(\w+)/i) {
        # Append this line to a per-user file, e.g. mworman.logfile.20090511.log
        open(USERFILE, ">>", "$1.$logfile.log") or die "Cannot open $1.$logfile.log: $!\n";
        print USERFILE $_;
        close USERFILE;
    }
}
close LOG;

It accepts a log file at the command line (e.g. "logfile.20090511"), and each time the string "User=" appears in the file it appends the original log entry to a new, per-user log file, creating that file on first use. For example, if my username in these logs were "mworman", a file would be created named mworman.logfile.20090511.log. This new file would contain every line of the original log file in which "User=mworman" appears, along with everything after it on that line. I've now accomplished multiple things at once:

  • I've counted the number of unique usernames in the file, i.e. the number of new files created.
  • I've created a re-usable, "greppable" file for each user, allowing me to perform calculations on any subset of them. For example, I can immediately see which users had the most/least activity based on file sizes, I can use "diff" to compare any subset, etc.
  • I've ensured date synchronization between the original log file and the new set of files by re-using the date. When grinding data down from the petabyte and terabyte levels to something more manageable, this kind of thing becomes really important for maintaining your sanity as well as the integrity of your analysis.
  • I can reuse this code for patterns other than "User=" simply by altering the regular expression.

It may not look like much, but this little script is very useful; I wrote it to separate the User fields in the logs of a Symantec appliance into a set of user-specific activity files. Other than the regular expression in the IF statement, it is very similar to the script I posted a few weeks back. One reader correctly pointed out that that script could have been replaced with a single grep command, and depending on your command line powers that is often possible, though not always practical or wise. This script is just as simple but far more powerful and extensible for analytical purposes. Again, the matching pattern ("User=") could literally be changed on the fly to any regular expression, including

  • IP addresses:

\b(?:\d{1,3}\.){3}\d{1,3}\b

  • Visa Credit Card Numbers:

\b4\d{12}(?:\d{3})?\b

  • Valid Social Security Numbers:

(^(?!000)([0-6]\d{2}|7([0-6]\d|7[012]))([ -]?)(?!00)\d\d\3(?!0000)\d{4}$)
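To sketch the pattern-swapping idea, here is a variant of the script's matching logic that takes the regular expression at run time instead of hard-coding it. The simple, permissive IPv4 pattern and the sample log lines are illustrative assumptions, not taken from the original post:

```perl
#!/usr/bin/perl
# Sketch: the same skeleton as the script above, with the match
# pattern supplied at run time instead of hard-coded.
use strict;
use warnings;

sub extract_matches {
    my ($pattern, @lines) = @_;
    my $re = qr/$pattern/i;          # compile once, match case-insensitively
    return grep { /$re/ } @lines;
}

# A simple (permissive) IPv4 pattern -- an illustrative assumption
my $ip_pattern = '\b(?:\d{1,3}\.){3}\d{1,3}\b';

my @log = (
    'May 26 10:45:01 host sshd[123]: Accepted password from 10.1.2.3',
    'May 26 10:45:02 host cron[456]: job started',
    'May 26 10:45:03 host sshd[789]: Failed password from 192.168.0.9',
);

my @hits = extract_matches($ip_pattern, @log);
print "$_\n" for @hits;   # prints the two lines containing IP addresses
```

Pointing the same loop at credit card or SSN patterns is just a matter of changing the string passed to extract_matches.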

What else can we do? The sky is the limit, folks. Does the input to this process have to be a static text file? No; in fact I have a similar script (barely twice the size of this one) that scans a given list of IP addresses/hostnames and generates a text file of hostname->Windows username pairs for each system with a currently logged-on user (this uses a Perl NetBIOS module available from CPAN, your best one-stop repository for Perl development).

Adding simple TCP/IP functionality to scripts like this starts to move us into the area of "banner grabbing" and network scanning, and sure enough, many popular network scanners (e.g. Nessus) began as glorified Perl scripts that iterated pattern matching across lists of network hosts or subnets.
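As a minimal sketch of that banner-grabbing idea — using the core IO::Socket::INET module, with the banner text and port choice being illustrative assumptions — grabbing the first line a TCP service volunteers takes only a few lines. The demo below spins up a throwaway local "daemon" so the script is self-contained:

```perl
#!/usr/bin/perl
# Sketch: grab the first line ("banner") a TCP service sends on connect.
use strict;
use warnings;
use IO::Socket::INET;

sub grab_banner {
    my ($host, $port) = @_;
    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => $port,
        Proto    => 'tcp',
        Timeout  => 5,
    ) or return undef;
    my $banner = <$sock>;            # first line the service volunteers
    close $sock;
    $banner =~ s/\r?\n\z// if defined $banner;
    return $banner;
}

# Throwaway local "service" so the demo needs no network access.
my $server = IO::Socket::INET->new(
    LocalAddr => '127.0.0.1',
    LocalPort => 0,                  # let the OS pick a free port
    Proto     => 'tcp',
    Listen    => 1,
) or die "listen: $!";
my $port = $server->sockport;

my $pid = fork();
die "fork: $!" unless defined $pid;
if ($pid == 0) {                     # child: act like a chatty daemon
    my $client = $server->accept;
    print $client "SSH-2.0-ExampleBanner\r\n";
    close $client;
    exit 0;
}

my $banner = grab_banner('127.0.0.1', $port);
waitpid($pid, 0);
print "banner: $banner\n";
```

Iterating grab_banner over a host list and regexp-matching the results is exactly the "glorified Perl script" pattern the early scanners used.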

Once you get the basics of Perl and regular expressions down, a trip to CPAN will show you just how much re-usable code is out there: modules for everything from Windows to Samba to UNIX, to things like

"Provides an interface for turning the front LED on Apple laptops running Linux on and off. The user needs to open /dev/adb and print either ON or OFF to it to turn the led on and off, respectively."

Whether or not turning the LED on Apple laptops is a forensic necessity is an exercise left to the reader. Or maybe someone now sees Perl as an instrument they can use to access and analyze devices in ways they hadn't thought possible before. The sky is the limit, folks.

A gracious thanks to GCFA Rod Caudle, who just reminded me of the BEST tool for regular expression development (which is really an art of its own) I have ever used. RegexCoach is a tool someone introduced to me years ago, and it is priceless for playing with regexps and tweaking them until you get the one you are looking for. It lets you provide test strings to ensure your regexp matches, and it syntax-colors the portions of the test string that do match, greatly speeding up development time. Having easily spent more time figuring out complex regular expressions than actually writing the Perl code wrapping them, I can't plug this utility enough, even though I'd forgotten about it for the last ten years or so. Thanks Rod!

Mike Worman, GCFA Gold #124, GCIA Gold #282, is an incident response, forensics, and information security subject matter expert for an international telecommunications carrier. He holds a BS in Computer Systems Engineering from the University of Massachusetts, an MS in Information Assurance from Norwich University, and is CISSP-certified.


Posted May 26, 2009 at 10:45 AM | Permalink | Reply


hi Michael,
Nice posting. I wrote about regex tools previously as well.
Regex is a real tool when it comes to performing analysis; it's a subject anyone who wants to be a capable analyst must learn.

Posted June 1, 2009 at 3:32 AM | Permalink | Reply


I believe slightly more is needed to process log files effectively. Different daemons have different output formats, and some are even multi-line. Some unix daemons writing to syslog prepend a PID, which can be used to track a specific session rather than just single entries.
And a logfile containing CC#s and SSNs? Let's hope people don't do that anymore; in most logging setups these would be going over the wire without any form of obfuscation or encryption.
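The PID-tracking idea above fits the same skeleton as the post's script: capture the daemon name and PID with one regex and bucket lines by that pair. A sketch, with invented sample syslog lines:

```perl
#!/usr/bin/perl
# Sketch: group syslog lines into sessions keyed by "daemon[PID]".
use strict;
use warnings;

my @syslog = (
    'May 26 10:45:01 host sshd[4242]: Accepted password for alice',
    'May 26 10:45:02 host sshd[9999]: Invalid user bob',
    'May 26 10:45:07 host sshd[4242]: session opened for user alice',
    'May 26 10:45:09 host sshd[9999]: Connection closed',
);

my %sessions;
for my $line (@syslog) {
    # daemon name, then its PID in square brackets
    if ($line =~ /\s(\w+)\[(\d+)\]:/) {
        push @{ $sessions{"$1\[$2\]"} }, $line;
    }
}

for my $key (sort keys %sessions) {
    print "$key => ", scalar @{ $sessions{$key} }, " lines\n";
}
```

Each hash value is the full, ordered line set for one daemon session, ready for the same per-file treatment the original script gives usernames.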

Posted June 1, 2009 at 3:58 PM | Permalink | Reply


Sure thing; this example was specific to the log format of a particular Symantec appliance (whose logging is relatively simple). These entries are meant to be simple skeleton scripts to build more elaborate ones from.
As for logfiles containing sensitive data: they do exist, and whether you want them to or not, they abound in almost any environment you look at. But again, the given regexps are meant to be examples of data an investigator might search for. Phone numbers, email addresses, employee ID formats... the sky is the limit :)

Posted June 1, 2009 at 4:06 PM | Permalink | Reply


Also, the title of the article is a bit misleading, because my focus is on applying Perl not just to log files but to any data file. The script above will work on (or could be modified for) many items in /proc as well as /var, and of course /dev, where most software/hardware interaction occurs.
It's also relatively simple to enhance popular programs like md5 and dd by wrapping them together in a little Perl glue to perform some function. Typically I find Perl does its best work when you have a monumental amount of data to sift through and need to pare it down somehow. There are plenty of programs written to deal with certain types of files, but when faced with a data format that has no specialized software, Perl lets you make a tool of your own in a short period of time.
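As a hedged sketch of that "Perl glue" idea — using the core Digest::MD5 module in place of an external md5 binary, with the demo file created on the fly rather than taken from any real case — a tiny manifest builder looks like this:

```perl
#!/usr/bin/perl
# Sketch: hash files with Digest::MD5, a core-module stand-in for md5(1).
use strict;
use warnings;
use Digest::MD5;
use File::Temp qw(tempfile);

sub md5_of_file {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    binmode $fh;                       # hash raw bytes, not translated text
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}

# Demo with a throwaway file so the sketch is self-contained.
my ($fh, $tmpname) = tempfile(UNLINK => 1);
print $fh "hello";
close $fh;

my $sum = md5_of_file($tmpname);
print "$sum  $tmpname\n";   # md5("hello") = 5d41402abc4b2b76b9719d911017c592
```

Looping md5_of_file over a directory tree gives you a greppable hash manifest in a dozen more lines.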

Posted June 9, 2009 at 4:14 PM | Permalink | Reply


Very useful, excellent article!
Thank you.
thank you

Posted June 12, 2009 at 4:00 PM | Permalink | Reply


Terse. Makes the point with minimal overhead and includes useful examples. Excellent.