SANS Digital Forensics and Incident Response Blog

Perl scripts for parsing PDFs, MACs, IPs, URLs, etc.

By Michael Cloppert

I hoped to be writing to you about how I found a great chi-square technique to identify trojaned PDFs (we've certainly seen our share - 8.1, 8.1.1, and now 8.3/9.0...). Sadly, it's not so. I couldn't even get as far as rejecting my null hypothesis, since component bytes, as random variables, are - no surprise - not normally distributed, and therefore chi-square isn't really applicable. Should've seen that one coming a mile away. Stand by, we'll keep trying to find some sort of classification technique to identify "interesting" PDFs for manual inspection with a low false-positive rate irrespective of exploit.

In the meantime, I threw together a few neat Perl scripts I figured I'd share here that may be of broader general interest. I'm also going to include a few unrelated gems that have proven helpful for me over the years. I know hating on Perl scripts is a common pastime for some with nothing better to do, so I'll issue this disclaimer: these scripts have been accurate enough, fast, elegant, and readable for my purposes. As with anything Perl, there are 1,000 ways to accomplish anything. If you've found / know of a better way, good for you. There are certainly conceivable conditions in which these scripts may not function as designed - input is a terrifying thing sometimes. This is just food for thought, not an authoritative entry. That said, let's get to it.

Decompress streams in a PDF file (from STDIN)

Yes, I know pdftk will accomplish this just peachy, but I needed to decompress only - pdftk will normalize data. Possibly a limited application, but hey, maybe someone can use it :-)

#!/usr/bin/perl -w
# Blindly decompresses all PDF streams

use strict;
use Compress::Zlib;

binmode STDIN;
binmode STDOUT;
my $buf = '';
my $block;

# Slurp the whole file so a stream is never split across reads
while (read STDIN, $block, 4096) {
    $buf .= $block;
}

# Replace the body of every stream with its inflated form
$buf =~ s/stream\r\n(.*?)\r\nendstream/"stream\r\n".&inflate("$1")."\r\nendstream"/gems;

print $buf;

sub inflate {
    my $input = shift;
    my $output;
    my $status;
    my $x = inflateInit() or die "Cannot create an inflation stream\n";
    ($output, $status) = $x->inflate($input);
    return $output;
}
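
Because the script above inflates blindly, a stream that is not zlib data, or a file that ends its lines with a bare LF instead of CRLF, will not come out as expected. What follows is only a sketch of a more cautious variant, not a replacement: the helper names try_inflate and inflate_streams are made up for illustration, and it assumes a stream should be left untouched unless zlib reports a complete stream (Z_STREAM_END).

```perl
use strict;
use warnings;
use Compress::Zlib;   # provides inflateInit, compress, Z_OK, Z_STREAM_END

# Inflate one stream body; hand it back untouched if it is not a
# complete zlib stream (hypothetical helper, not part of the original).
sub try_inflate {
    my ($data) = @_;
    my ($z, $init) = inflateInit();
    return $data unless defined $z && $init == Z_OK;
    my $copy = $data;                       # inflate() consumes its argument
    my ($out, $status) = $z->inflate($copy);
    return $data unless $status == Z_STREAM_END;
    return $out;
}

# Rewrite every stream ... endstream pair, accepting CRLF or bare LF.
sub inflate_streams {
    my ($pdf) = @_;
    $pdf =~ s{(stream\r?\n)(.*?)(\r?\n?endstream)}{$1 . try_inflate($2) . $3}ges;
    return $pdf;
}

# Demo on a made-up stream object
my $demo = "stream\r\n" . compress("BT (hello) Tj ET") . "\r\nendstream";
print inflate_streams($demo), "\n";
```

To run it over a real file, slurp STDIN with binmode set, as the original script does, and print the result of inflate_streams. Streams using filters other than Flate simply pass through unchanged.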

Create histogram of a file's constituent bytes (from STDIN)

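# Output a histogram of byte frequencies

binmode STDIN;
my %histogram;
my $byte;

# Read one byte at a time and count each value (0-255)
while (read STDIN, $byte, 1) { $histogram{unpack "C", $byte} += 1; }

# Print "value,count" pairs in ascending byte order
foreach (sort { $a <=> $b } keys %histogram) {
    print "$_,$histogram{$_}\n";
}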
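
Reading one byte per call is fine for small files but slow on large ones. As a rough sketch of an alternative, the same counts can be produced by unpacking a whole buffer at once. The name byte_histogram is hypothetical, and the code assumes its argument is a byte string (read with binmode), not decoded text.

```perl
use strict;
use warnings;

# Count byte values in a string; returns a 256-element list of counts
# (byte_histogram is a hypothetical helper name).
sub byte_histogram {
    my ($data) = @_;
    my @count = (0) x 256;
    $count[$_]++ for unpack "C*", $data;
    return @count;
}

# Demo on a literal string; for a real file, read it with binmode first
my @h = byte_histogram("aab\x00");
for my $value (0 .. 255) {
    print "$value,$h[$value]\n" if $h[$value];
}
```

Keeping all 256 slots, including the zeros, makes it easier to compare histograms from different files line by line.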

Convert 6-byte integer into MAC address string

$ echo 256136729009152 |perl -e 'print unpack "H12", pack "Q", <>'

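Note: 'Q' requires a Perl built with 64-bit integer support, and it packs in the host's native byte order, so the order of the hex digits depends on the machine's endianness; YMMV.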
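
If 'Q' is not available, one possible workaround is to go through Math::BigInt, which ships with Perl. This is a sketch only: int_to_mac is a made-up name, and it assumes the integer holds the first MAC octet in its least-significant byte, which is what the one-liner above produces on a little-endian host.

```perl
use strict;
use warnings;
use Math::BigInt;

# Format an integer as a colon-separated MAC address. Assumes the first
# MAC octet sits in the least-significant byte (little-endian layout).
sub int_to_mac {
    my ($n) = @_;
    die "expected a non-negative integer\n" unless $n =~ /^\d+\z/;
    my $hex = lc Math::BigInt->new($n)->as_hex;   # e.g. "0xe8f47abf1c00"
    $hex =~ s/^0x//;
    die "value does not fit in 6 bytes\n" if length($hex) > 12;
    $hex = ("0" x (12 - length $hex)) . $hex;     # left-pad to 12 hex digits
    my @octets = $hex =~ /(..)/g;                 # most-significant byte first
    return join ":", reverse @octets;
}

print int_to_mac("256136729009152"), "\n";        # prints 00:1c:bf:7a:f4:e8
```

Drop the reverse if your integers store the MAC most-significant byte first.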

Convert 4-byte integer into dotted-decimal string

$ echo 3232253438 |perl -e 'print unpack "C4", pack "N", <>'

Okay, not quite as readable as a 6-byte hex value without a separator, per above. Put some dots in there.

$ echo 3232253438 |perl -e 'print join ".", unpack "C4", pack "N", <>'


Convert dotted-decimal string back into 4-byte integer

$ echo "" |perl -e 'print unpack "N", pack "C4", split /\./, <>'
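
For use inside a larger script, the two one-liners can be wrapped as a pair of functions. This is a minimal sketch; int_to_ip and ip_to_int are names chosen here for illustration, and the validation rules (exactly four octets, each 0-255) are an assumption about what counts as well-formed input.

```perl
use strict;
use warnings;

# 32-bit integer -> dotted-decimal string
sub int_to_ip {
    my ($n) = @_;
    return join ".", unpack "C4", pack "N", $n;
}

# dotted-decimal string -> 32-bit integer, or undef if malformed
sub ip_to_int {
    my ($ip) = @_;
    my @octets = split /\./, $ip, -1;       # -1 keeps trailing empty fields
    return undef unless @octets == 4;
    for (@octets) { return undef unless /^\d{1,3}\z/ && $_ <= 255 }
    return unpack "N", pack "C4", @octets;
}

print int_to_ip(3232253438), "\n";          # prints 192.168.69.254
print ip_to_int("192.168.69.254"), "\n";    # prints 3232253438
```

Remember to chomp input lines before passing them to ip_to_int, since a trailing newline makes the last octet fail validation.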

Rip a space-separated well-formed URL out of a line of arbitrary text

$ echo 'a b c asdfaf3243$$#[ ] fdaajjjf' \
|perl -pe 's/.*\W([a-z]+:\/\/[^\s\t]+)\W.*/$1/g'
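
The substitution above keeps only one URL per line. If a line may hold several, a list-context match can collect them all. The sketch below is one way to do that; extract_urls is a hypothetical name, the example.com and example.org addresses are placeholders, and "URL" here just means a scheme, "://", and a run of non-whitespace.

```perl
use strict;
use warnings;

# Return every scheme://... token found in a line of text.
sub extract_urls {
    my ($line) = @_;
    return $line =~ m{\b([a-z][a-z0-9+.-]*://\S+)}gi;
}

# Placeholder input with two URLs buried in noise
my $line = 'noise $$# [ http://example.com/a?b=1 ] more ftp://example.org/f noise';
print "$_\n" for extract_urls($line);
```

Like the one-liner, this relies on whitespace to delimit the URL, so trailing punctuation glued to a URL will be captured along with it.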

Finally, as a postscript, I'd be remiss if I took full credit for all of the above. Special thanks to Eric, Zach, and Jason for various Perl insights that helped me build these.

Michael is a senior member of an incident response team for a large defense contractor. He has lectured for various audiences from IEEE to DC3, and teaches an introductory class on cryptography. His current work consists of security intelligence analysis and development of new tools and techniques for incident response. Michael holds a BS in computer engineering and has earned GCIA (#592) and GCFA (#711) gold certifications alongside various others.


Posted March 4, 2009 at 10:48 PM


Great post mate. I do have a question: I tried using your Perl implementation to decompress PDFs but had no luck. Like you mentioned, pdftk will do it, but I find it more useful to be able to do it in Perl.
I just copied your script and ran it with < filename.pdf.
The problem is when the stream is shown it's all garbled and unreadable. I am guessing this is not what you are seeing on your end?
Here is a sample of the output that I am seeing. I was wondering if you had seen this before or if you could hopefully shed some light/point me in the right direction?
Thanks again and great blog!!
7$`X]oB61"Zng&mdash;jUIx9 Qb aqw2

Posted March 5, 2009 at 11:00 PM


Thanks for the feedback!
"I just copied your script and ran it with < filename.pdf."
Since I don't have a copy of the file you're trying to parse, I can only guess, but my first suggestion would be to cat the PDF and pipe it to the script, e.g. `cat filename.pdf |`. Also, remember that only properly-formed blocks of zip-compressed data in streams will be affected. If you're not sure, you can carve this block out of the file with 'dd' and see if it will decompress on its own using a tool that supports LZW or DEFLATE algorithms, whichever is applicable.
Finally, if you're interested in reading more, Adobe has a great reference to the PDF file format on their website. Compression is discussed in §7.4.4.