SANS Digital Forensics and Incident Response Blog

Data reduction redux and map-reduce

A few days ago I wrote a post about applying the principle of least frequent occurrence to string searches in forensics. This post looks at how long that process can take and, at the end, shows how to speed it up significantly.

In the previous post I used the following compound command line statement to create a list of strings ordered from most frequently occurring to least frequently occurring:

cat sda1.dd.asc | awk '{$1=""; print}' | sort | uniq -c | sort -gr > sda1.dd.asc.lfo

One thing worth noting is that the sort command is invoked twice. As you can imagine, sorting the strings from a large disk image takes time, to say nothing of doing it twice, though the second sort is much faster: by that point the data has been deduplicated, and it is sorting primarily on the count value that uniq -c tacks onto the front of each line of text.
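
If it helps to picture what each stage hands to the next, here's a toy run of the sort | uniq -c | sort -gr portion of the pipeline with a few made-up strings (the awk step simply blanks out the first field of each line before any of this happens):

printf 'foo\nbar\nfoo\nbaz\nfoo\nbar\n' | sort | uniq -c | sort -gr
      3 foo
      2 bar
      1 baz

The second sort only has to order count-prefixed, deduplicated lines, which is why it finishes so much faster than the first.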

For example, I've got an ASCII strings file that is approximately 19GB. Below I've split out the command above and used time to measure the time required to complete each step in the process on my i7 quad-core, 64-bit Ubuntu VM with 7GB of RAM:

time cat sda1.dd.asc | awk '{$1=""; print}' | sort > sda1.dd.asc.sorted

real    126m57.159s
user    112m59.670s
sys     10m51.150s

Running uniq against that file, again with time:

time uniq -c sda1.dd.asc.sorted > sda1.dd.asc.sorted.uniq

real    6m51.088s
user    2m4.820s
sys     4m45.170s

My 19GB strings file has been reduced to 2.9GB by removing duplicate lines of text. We can do better.

I can further reduce it by removing the garbage strings using the technique outlined in the previous post, namely grep and a large dictionary file. (Remember to clean up your dictionary file first: no blank lines, no short strings.)
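
Something along these lines handles the cleanup; this is a rough sketch, and the four-character minimum and output file name are my own choices, so adjust to taste and substitute the cleaned copy into the grep command below if you use it:

# Drop blank lines and words shorter than four characters (the threshold is a judgment call)
awk 'length($0) >= 4' /usr/share/dict/american_english > dict.cleaned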

time grep -iFf /usr/share/dict/american_english sda1.dd.asc.lfo > sda1.dd.asc.lfo.words

real    4m9.105s
user    3m56.350s
sys     0m12.250s

Now my 2.9GB strings file is 574MB.

We're getting somewhere, albeit slowly, and that's to say nothing of the time it will take to review 574MB of text.

How can we speed this up? My initial thought was to try creating a ramdisk to see if that made things any faster. I tried that using a smaller file, of course. Much to my surprise, the difference between using a ramdisk and not using one was hardly noticeable. I mentioned this on Twitter and strcpy stepped into the conversation. He's a brilliant developer whom I've talked to a few times online over the last year or so.
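
(For anyone who wants to repeat the ramdisk experiment, a ramdisk on Linux can be as simple as a tmpfs mount; the mount point, size, and test file below are made up for illustration.)

sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=4g tmpfs /mnt/ramdisk
cp smaller.asc /mnt/ramdisk/ && time sort /mnt/ramdisk/smaller.asc > /dev/null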

The next day, he sent me the following tweet:

Figure 1: I owe strcpy lunch

I grabbed a copy of strcpy's script and looked it over. It was brilliant and simple, and I was disappointed that I hadn't thought of it myself.

In a nutshell, the script counts the number of processors in the system, divides the data set to be sorted into one chunk per processor (leaving one processor free), and assigns each of those processors the task of sorting its chunk. It also tells the sort command how much of the system's memory it can safely use for buffering; by default sort will use 1MB, which can result in the creation of lots of temporary files (more than 1000 for my 19GB file). strcpy's script specifies that 30% of RAM can be used by sort, reducing the number of temporary files created. Lastly, the script merges the sorted chunks, three in my case, back into one file.
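
For the curious, here is a minimal sketch of the same approach. This is not strcpy's script, and the specific options (GNU split -n for line-based chunks, sort -S for the buffer size, taskset for CPU pinning, sort -m for the merge) are my own stand-ins for whatever his script actually does:

#!/bin/bash
# Minimal parallel-sort sketch: split the input on line boundaries into one
# chunk per spare core, sort the chunks in parallel with a larger memory
# buffer, then merge the already-sorted chunks.
infile="$1"
procs=$(( $(nproc) - 1 ))                         # leave one core free
split -n l/"$procs" -d "$infile" "$infile".part.  # GNU split: line-based chunks
i=0
for part in "$infile".part.*; do
    # Pin each sort to its own core and give it a generous buffer (the post's
    # script uses 30% of RAM) so it spills fewer temporary files to disk.
    taskset -c "$i" sort -S 30% "$part" -o "$part.sorted" &
    i=$((i + 1))
done
wait
sort -m "$infile".part.*.sorted > "$infile".sorted   # merging pre-sorted chunks is cheap
rm -f "$infile".part.*

The final sort -m only has to interleave inputs that are already sorted, which is why the merge step is quick compared to the chunk sorts.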

Here is a screen shot of the System Monitor that I took during one of the previous sort commands shown above:

Figure 2: System Monitor: single processor sorting

Notice that only one processor is being used for the heavy lifting of sorting the data set.

Here is a screen shot of System Monitor that I took while using strcpy's script:

Figure 3: System Monitor: Three processors pegged

Notice that three of the four system processors are pegged, or nearly so, and the amount of system memory in use has increased dramatically. Memory utilization is configurable from within strcpy's script.

How much faster was it to run the sort this way than the previous way? Here's the output from time:

time ~/ellzey_sort.sh sda1.asc
19717183333
6572394445.0
Splitting sda1.asc into 3 pieces.
sda1.asc.001  6572394445 bytes
sda1.asc.002  6572394445 bytes
sda1.asc.003  6572394443 bytes
Done!
Running sort on sda1.asc.001 on cpu 0x00000001
Running sort on sda1.asc.002 on cpu 0x00000002
Running sort on sda1.asc.003 on cpu 0x00000004
Finalizing sort into sda1.asc.sorted
Removing temporary files

real    72m28.051s
user    52m32.580s
sys     75m5.760s

If my math is correct, this sort runs about 42% faster than the single processor version shown above. Why only 42% faster if it's using three times the processing power? This is due to the overhead of having to split the data set and then merge it. For systems with six cores and beyond, the speed increase will be even greater. This is the future of digital forensics. Today many of our tools do not take advantage of multi-core systems. Tomorrow they will.
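
For anyone checking the math, the 42% figure comes from the wall-clock (real) times above: 126m57s is roughly 7,617 seconds and 72m28s is roughly 4,348 seconds.

echo "scale=2; (7617 - 4348) / 7617" | bc
.42
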
I was praising strcpy's script on Twitter and codeslack smartly replied, "That's the essence of map-reduce."

strcpy had said as much to me, "... you've been introduced to mapreduce 101."

One day in the not-too-distant future we'll be able to distribute these CPU-intensive processes across dozens of cores.

Check out strcpy's script. It's easy to understand and can be easily modified. One thing worth noting: in my testing, the script rounded down to the nearest byte when splitting files, resulting in the loss of data (i.e. each piece of a 100-byte file split into three parts should be 33.33333 bytes; strcpy's script rounds this down to 33 bytes, so you'll end up losing a byte of data). I modified line 62 of the script and notified strcpy of my change; here's what I ended up with:

...
split_size=`echo "scale=10; $file_size / $proc_count" | bc | python -c "import math; print math.ceil(float(raw_input()))"`
...

This change fixes the rounding issue.
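
To see the difference on the 100-byte example, compare plain bc against the modified line's rounded-up result (the same Python 2 one-liner the script now uses):

echo "scale=10; 100 / 3" | bc
33.3333333333
echo "scale=10; 100 / 3" | bc | python -c "import math; print math.ceil(float(raw_input()))"
34.0

Rounding up means the chunks cover the whole file and the last chunk simply comes up a little short, which is what the split sizes in the output above show: two chunks of 6572394445 bytes and a final chunk of 6572394443 bytes.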

Go forth and sort across all cores.

Dave Hull is a wannabe when compared to the likes of strcpy and codeslack, but he's learning. He's an incident responder and digital forensics practitioner with Trusted Signal. He'll be teaching SANS Forensics 508: Advanced Computer Forensic Analysis and Incident Response in Baltimore, MD, May 15 - 20, 2011. If you'd like to learn powerful forensics techniques like these, sign up for the course.