SANS Digital Forensics and Incident Response Blog

Perl Fu: Email Discovery

Hal Pomeranz, Deer Run Associates

I hope Mike Worman doesn't hate on me for stealing his "Perl Fu" idea, but I've recently been dealing with a task that is perfect for Perl. One of my customers is working through a laborious discovery process on a huge email archive stored in "Unix mailbox format", meaning large text files with the email messages all concatenated together. They need to search hundreds of gigabytes of these files for any one of a list of relevant keywords and output the entire text of each matching email message.

Unix mailbox format is a file format that I've dealt with a lot, and I've written many scripts to parse these kinds of files. So it probably took me less time to write the script to do this than it's going to take me to write this blog post. But I figured this is a task that other readers of the blog might encounter from time to time, so here's the code:

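#!/usr/bin/perl
# mgrep -- match patterns and output messages from Unix mailbox files
# Usage: mgrep [-i] [-f file] [pattern] file1 ...

use strict;
use Getopt::Std;

# -i: case-insensitive matching; -f file: read patterns from a file
my %opts = ();
getopts('if:', \%opts);

my $pattern = undef;
if (defined($opts{'f'}) && length($opts{'f'})) {
    # One pattern per line, combined into "(pattern1|pattern2|...)"
    open(my $fh, '<', $opts{'f'}) ||
        die "Can't open pattern file $opts{'f'}: $!\n";
    my @lines = <$fh>;
    close($fh);
    chomp(@lines);
    $pattern = '(' . join('|', @lines) . ')';
}
else {
    $pattern = shift(@ARGV);
}
# An empty pattern would make Perl reuse the last successful match,
# so refuse to run without one
die "Usage: mgrep [-i] [-f file] [pattern] file1 ...\n"
    unless (defined($pattern) && length($pattern));
$pattern = "(?i)$pattern" if ($opts{'i'});

# Collect lines until the next "From " line, then print the message
# we just finished if it matches
my $message = undef;
while (<>) {
    if (/^From\s/) {
        print $message if ($message =~ /$pattern/s);
        $message = undef;
    }
    $message .= $_;
}
# The last message has no "From " line after it, so check it here
print $message if ($message =~ /$pattern/s);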

The actual meat of the program is the "while (<>) ..." loop down in the bottom third of the code. We spend more code processing arguments and setting up the pattern match than on actually processing the input files. But here are some notes to help you make sense of what's happening in the program:

  1. First we "use strict" to have Perl help us enforce good programming practice in our script, like pre-declaring variables with "my" to help prevent typos and other errors.
  2. Then we incorporate the standard Perl command line argument processing library ("use Getopt::Std") and call getopts() to process the command line arguments. Here we're specifying that our program accepts both "-i" (case insensitive matching) and "-f" to specify a file name containing a list of patterns to match against. The ":" after the "f" in the getopts() string means that "-f" expects an argument, namely the file name. Any options that getopts() finds will be stored in the "%opts" hash.
  3. Next our "if" block checks to see if the "-f" option was set. If so, then we attempt to open the specified file name and read in its contents ("die" causes the program to abort if the file can't be opened). We use chomp() to remove the newlines from the lines we read in and then we concatenate all of the patterns together to form a pattern string like "(pattern1|pattern2|...)" ("pattern1 or pattern2 or ..."). Note that if "-f" was not set, then we just read the pattern in from the command line like the normal Unix grep program (that's the "else { ... }" block).
  4. Next we check to see if the "-i" (case-insensitive match) option is set. If so, then we add "(?i)" at the front of our pattern. In a Perl pattern match, this is one way to express case-insensitive matching.
  5. Now we're finally ready to start processing our input files. The "while (<>) { ... }" construct is a useful bit of Perl shorthand that emulates the standard Unix command-line processing. Specifically it means that if there are any remaining command-line arguments, they should be treated as file names and opened sequentially and all lines processed one at a time from each file. If there are no unused arguments on the command line after our argument processing, then the program should look for its input from the standard input.
  6. Within the body of the loop, we're processing our input one line at a time. At the end of the loop we're simply concatenating the lines we read into the "$message" variable that holds our message text. "$_" is the magic Perl variable that represents the text of the line we're currently processing, and "$message .= $_" means "append $_ to the text already in $message".
  7. Now for the uninitiated, Unix mailbox format is nothing but a large text file with messages concatenated one after the other. You can recognize the start of each new mail message when you find a line that begins "From<whitespace>". Our "if { ... }" block at the top of the loop matches this pattern as an indication that we've reached the end of one message and are starting in on another. If the message we've collected so far matches the pattern specified by the user, then we print the entire contents of the mail message. Then we empty our "$message" variable so we can start collecting the next mail message.
  8. After we've processed all of our input files, we still need to determine whether or not we should output the last message from the last file we processed. That's why there's one more print statement after the end of the loop.
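
If you want to experiment with the loop described in notes 5 through 8 without a real mail archive handy, it can be sketched as a small subroutine that reads from any filehandle. Treat this as an illustration rather than a replacement for the script above: the name matching_messages() and the sample mailbox text are made up for this example, and the sub returns the matching messages as a list instead of printing them as it goes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of mgrep): read mailbox-format text from
# a filehandle and return every message that matches $pattern.
sub matching_messages {
    my ($fh, $pattern) = @_;
    my @matches = ();
    my $message = '';
    while (my $line = <$fh>) {
        # A line starting with "From<whitespace>" begins a new message,
        # so decide what to do with the one collected so far.
        if ($line =~ /^From\s/) {
            push(@matches, $message) if ($message =~ /$pattern/s);
            $message = '';
        }
        $message .= $line;
    }
    # Don't forget the final message in the input.
    push(@matches, $message) if ($message =~ /$pattern/s);
    return @matches;
}

# Made-up sample data: two tiny messages in one "mailbox".
my $mailbox = <<'EOF';
From alice@example.com Fri Jul 24 10:00:00 2009
Subject: lunch

Want to grab lunch today?

From bob@example.com Fri Jul 24 11:00:00 2009
Subject: quarterly numbers

The merger paperwork is attached.

EOF

# Perl can open a filehandle on an in-memory string via a scalar reference.
open(my $fh, '<', \$mailbox) || die "Can't open in-memory mailbox: $!\n";
my @found = matching_messages($fh, '(?i)MERGER');
close($fh);

print scalar(@found), " matching message(s)\n";
print @found;
```

Returning a list is convenient for testing, but on hundreds of gigabytes of mail you would probably want to print each match immediately, as the original script does, rather than hold them all in memory.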

Whew! That's a lot of words for a simple script, but I hope it helps you wrap your head around some of the more obscure bits of Perl syntax and gives you some ideas for writing your own scripts. By the way, because I chose to use Perl for this task, one of the happy accidents is that we can actually use the Perl regular expression syntax for the patterns we give as input to the program (whether we put them in a file or specify them on the command line). This is good news because Perl's pattern matching syntax is much more flexible and expressive than the one used by the regular Unix grep command.
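
As a rough illustration of what that flexibility buys you, here is a sketch of building the alternation from a keyword list with a few extras that are awkward with plain grep patterns. The build_pattern() helper and the sample keywords are assumptions made up for this example; the script above does none of this itself. The sketch skips blank lines (a blank line in a pattern file would otherwise create an empty alternative that matches every message), wraps the alternation in \b word boundaries, and can optionally run each keyword through quotemeta() when the keywords should be treated as literal text rather than as regular expressions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of mgrep): turn a list of keywords into
# one compiled, case-insensitive pattern.
sub build_pattern {
    my ($literal, @keywords) = @_;
    # Drop blank lines so we never create an empty alternative.
    @keywords = grep { /\S/ } @keywords;
    die "No usable keywords\n" unless (@keywords);
    # quotemeta() escapes regex metacharacters for literal matching.
    @keywords = map { quotemeta($_) } @keywords if ($literal);
    my $alternation = join('|', @keywords);
    # \b anchors the match at word boundaries; (?: ) groups without capturing.
    return qr/\b(?:$alternation)\b/i;
}

# Made-up keywords: a regex ("invoices?"), a blank line, and a plain word.
my $regex_pattern = build_pattern(0, 'invoices?', '', 'merger');
# The same idea with literal matching, so "." only matches a real dot.
my $literal_pattern = build_pattern(1, 'example.com');

foreach my $text ('Two INVOICES attached', 'the submergers',
                  'mail example.com now', 'examplexcom') {
    my $regex_hit   = ($text =~ $regex_pattern)   ? 'yes' : 'no';
    my $literal_hit = ($text =~ $literal_pattern) ? 'yes' : 'no';
    print "$text: regex=$regex_hit literal=$literal_hit\n";
}
```

Whether word boundaries are appropriate depends on your keyword list: they keep "merger" from matching inside "submergers", but they also assume each keyword starts and ends with a word character.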

Happy email hacking!

Hal Pomeranz is an independent IT/Computer Security Consultant and a SANS Faculty Fellow. He is available as a strolling Perl programmer for weddings and bar mitzvahs.

1 Comment

Posted July 24, 2009 at 11:33 PM

Mike Worman

Hal, I would never challenge your Perl Fu. You taught several of my GCIA classes in 2000 and I count you amongst my very best teachers.