SANS Digital Forensics and Incident Response Blog

Least frequently occurring strings?

My phone rang. It was a small business owner looking for some help. He had a system he wanted me to take a look at, but was light on specifics. I asked to speak to his IT person. He laughed and said he was the IT person and that he knew next to nothing about computers. An hour later I was sitting in his office filling out a chain of custody form and trying to get more information out of him.

"I can't really tell you much about it or what to look for, the system was just acting strange," he said.

"How so?" I asked.

"It seems slower than normal and I notice the lights on the back are blinking more than usual," he offered.

"When did this start?" I asked.

"Hard to say for sure, last week maybe," he said.

The conversation wasn't progressing as I'd hoped. Nevertheless, I told him I'd take a look and see what I could find out. I figured I'd start with a timeline, paying careful attention to file system activity from the last few weeks, and go from there.

Nothing in the timeline stood out, and I had no keywords or phrases of interest, no indicators of compromise to search for. I scanned an image of the drive with a couple of different anti-virus tools and found nothing. Maybe there wasn't anything to find. Maybe I needed a new approach. I could build a Live View image of the system, boot it up, and monitor the network traffic for anything noteworthy, a method I'd used before in similar situations.

I collected the strings from the image using the old standby:

strings -a -t d sda1.dd > sda1.dd.asc

This collected ASCII strings and their byte offsets in the disk image.

Then I ran

strings -a -t d -e l sda1.dd > sda1.dd.uni

to gather 16-bit little-endian (Unicode) strings and their byte offsets.

I quickly took a look at each file with less:

9000 /dev/null
9096 1j :1j :1j :1j :
9224 !j :1j :
9235 81j :
9352 !j :1j :
9363 81j :
264224 lost+found
264244 boot
264256 homeA/
264292 proc
264388 root
264400 sbinR
264412 floppy
264428 .bash_history


I knew paging through looking for evil would be inefficient. Then it occurred to me that I could apply the Least Frequency of Occurrence principle that Peter Silberman spoke about a few years ago at the SANS Forensics Summit. It would at least reduce the size of the data I was looking at, and data reduction is a good strategy for digital forensics practitioners.

So I ran the following commands:

cat sda1.dd.asc | awk '{$1=""; print}' | sort | uniq -c | sort -gr > sda1.dd.asc.lfo
cat sda1.dd.uni | awk '{$1=""; print}' | sort | uniq -c | sort -gr > sda1.dd.uni.lfo

Let me explain the purpose of this compound command. The cat command dumps the contents of a file to standard output (usually your screen). The pipe (|) that follows passes that output to the awk command. AWK is a powerful utility for processing text, and while I'm far from an expert in awk, I know enough to get some useful things done. In this case, awk is removing the first field from the .asc and .uni files; the first field is the byte offset where the subsequent string occurs in the original image file. Awk refers to each field by position ($1, $2, and so on), so the byte offset is $1, and setting $1 to "" effectively removes that value. The print command sends the rest of the line to standard output, where it is piped to sort, which does what you'd expect. Next, uniq -c collapses duplicate lines and counts the number of occurrences of each. That output is piped to sort once more, this time with flags telling it to do a numeric sort and to reverse the order. The results are redirected to new files called sda1.dd.asc.lfo and sda1.dd.uni.lfo, respectively.
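
For readers following along, here is the same ASCII pipeline written out one stage per line with comments; it is functionally identical to the first one-liner above:

cat sda1.dd.asc |              # dump the strings file to standard output
awk '{$1=""; print}' |         # drop field 1, the byte offset
sort |                         # group identical strings together
uniq -c |                      # collapse duplicates, prefixing each line with a count
sort -gr > sda1.dd.asc.lfo     # numeric sort on the count, largest values first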

Now when I look at sda1.dd.asc.lfo, I see something like this:

3703 GCC: (GNU) egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)
1268 return;
1116 else
757 done.
755 Disabling CPUID Serial number...
734 .text

Field one in this file is the number of times the string after it occurs in the disk image. So far so good. If an attacker has placed code on this system, I would expect it to be among the least frequently occurring items on the system. Let's jump to the bottom of the file and look at the least frequently occurring strings:

1 `|~\
1 `<%
1 `~+<

Hm, well this is certainly ugly and not very useful. I needed to further reduce the data set. I sent out a call for a new command like strings, called english, that would be smart enough to discern English text from garbage.

Within minutes people were replying that I could use grep and a dictionary file. On my Ubuntu box there are several dictionary files, including one in /usr/share/dict called american-english that contained 98K+ words. I decided to clean it up a little by running strings against it to remove short words that might yield false positives, and I removed all words containing apostrophes, just because. The result was a file containing just under 73K words.
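
A rough sketch of that cleanup, assuming the Ubuntu word list; dict.tmp is just an illustrative intermediate name, and american_english_short is the file referenced in the grep commands further down:

strings /usr/share/dict/american-english > dict.tmp   # strings' default 4-character minimum drops the short words
grep -v "'" dict.tmp > american_english_short         # drop any words containing apostrophes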

I was afraid that grep would take forever to search for matches in my .asc and .uni files, but figured I'd try it and leave it running overnight. To my surprise, it finished quickly, but it didn't appear to eliminate the garbage at the end of the file. I thought grep might be silently failing due to the size of the dictionary file, but after a little trial and error, I discovered the problem was that the first line of the dictionary file was blank. Blank lines in your indicators of compromise or keyword files are a known problem; get rid of them.
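
Stripping the blank lines out of the word list is a one-liner, for example with GNU sed:

sed -i '/^$/d' american_english_short   # delete empty lines in place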

Because the dictionary file is a list of "fixed strings" and not regular expressions, I used the -F flag (nod to ubahmapk for that) to tell grep to interpret the strings as such. This dramatically improves performance, and that is a huge understatement. The commands I used were:

grep -iFf american_english_short sda1.dd.asc.lfo > sda1.dd.asc.lfo.words
grep -iFf american_english_short sda1.dd.uni.lfo > sda1.dd.uni.lfo.words

The -i tells grep to ignore case when matching and the -f tells grep that the "patterns" are to be read from a file.
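
To jump straight to the rare end of the filtered output, something like the following works (the line count is arbitrary):

tail -n 25 sda1.dd.asc.lfo.words   # show the 25 least frequently occurring strings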

Now the least frequently occurring lines in sda1.dd.asc.lfo.words looked like this:

1 %02(hour{date}):%02(min{date}) \
1 02 4 * * * root run-parts /etc/cron.daily
1 01 * * * * root run-parts /etc/cron.hourly
1 # 0 1 0 1 1200 baud

Gone was the garbage. Granted, there was still plenty of useless info to wade through, but at least now there was less of it. Within minutes of careful review of the least frequently occurring text, I noted the following:

1 else if $HISTFILE has a value, use that, else use ~/.bash_history.
1 } else if (!(get_attr_handle(dcc[idx].nick) & (USER_BOTMAST | USER_MASTER))) {
1 } else if (get_assoc(par) != atoi(s)) {

That middle line looked awfully suspicious. I went back to my original strings file and grepped for USER_BOTMAST:

grep USER_BOTMAST sda1.dd.asc
8819144 if ((atr & USER_BOTMAST) && (!(atr & (USER_MASTER | USER_OWNER)))
8820316 if ((get_attr_handle(dcc[idx].nick) & USER_BOTMAST) &&
8823918 if ((get_attr_handle(dcc[idx].nick) & USER_BOTMAST) &&

Now I had the byte offsets where the USER_BOTMAST string occurred in the disk image. I recovered the file using the techniques we teach in SANS Forensics 508 and saw the following:

This file is part of the eggdrop source code
copyright (c) 1997 Robey Pointer
and is distributed according to the GNU general public license.
For full details, read the top of 'main.c' or the file called
COPYING that was distributed with this code.

#include "eggdrop.h"
#include "users.h"
#include "chan.h"
#include "tclegg.h"
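
For readers who want a concrete path from a byte offset back to file content, here is one possible route using The Sleuth Kit. This is only a sketch, not necessarily the method used in the case; the 4096-byte block size is assumed, and the inode number in the icat line is a placeholder for whatever ifind reports:

fsstat sda1.dd               # note the file system block size (assume 4096 here)
echo $((8819144 / 4096))     # convert the byte offset to a block number, 2153 in this example
ifind -d 2153 sda1.dd        # map that block to the inode that allocated it, if any
icat sda1.dd 12345           # extract the file by inode; 12345 is a placeholder
blkcat sda1.dd 2153          # or dump the raw block if the space turns out to be unallocated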

Of course this approach will not be appropriate in many cases. Thankfully, we almost always have more useful information to go on, but it's fun to explore new techniques and think about new ways of tackling cases, and who knows, you may be faced with a situation where looking for least frequently occurring artifacts will yield useful information. The other point of this post is, ahem, pedagogical. That is to say, the information presented here is not meant to be applied exactly as it appears in this post; it is meant to expose less experienced Linux users to some powerful command line tools and to spur thought and conversation about unorthodox approaches to investigations.

I'll follow this post in a few days with another that is orthogonal. It will be less about forensics specifically, but for those who use the Linux command line for forensics, it may prove useful.

Dave Hull is an incident responder and forensics practitioner for Trusted Signal. He'll be teaching SANS Forensics 508: Advanced Computer Forensic Analysis and Incident Response at Cyber Guardian, May 15 - 20, 2011 in Baltimore, MD. Forensics 508 takes practitioners beyond commercial tools, giving them a greater understanding of the forensics process and moving them beyond push-button forensics.


Posted April 23, 2011 at 3:45 PM


excellent post!

Posted April 23, 2011 at 4:13 PM

Andrew C

Was the system still on when you arrived? Also you mentioned Live View, but did you actually use it? Seems like memory analysis would have quickly found the eggdrop process and the connections it had open.

Posted April 23, 2011 at 4:42 PM

Dave Hull

Andrew, you are absolutely correct, memory analysis would have quickly found the malware.
The case data presented in this post is theoretical, but the technique is one I recently used in an actual case. The actual case involved a dead drive and very little information; in fact, the attacker's running code was installed in a tmpfs file system that went away when the system was shut down, but the source was recoverable from unallocated space elsewhere on the system. Live View is something I use often and with great results; I wanted to mention it in the post in case people aren't aware of it or aren't using it. In the real-life case that inspired this post, I was unable to get the image running under Live View in a timely manner and thus pursued other methods.
But yeah, memory analysis and Live View are both for the win.

Posted April 27, 2011 at 12:38 PM


this is great stuff.
I just ran a test against an image on my Debian machine and got the following error message:
awk: program limit exceeded: maximum number of fields size=32767
FILENAME="-" FNR=3208668117 NR=3208668117
so my lfo.words output is not complete. Haven't figured out why, though.

Posted April 27, 2011 at 1:45 PM

Dave Hull

I have run into this on my 32-bit system, but not on my 64-bit system. I think awk on 32-bit systems is hitting its upper limit on the number of records it can handle. You may be able to split the strings file using lxsplit, then parse each split file independently and recombine the results later. Our 32-bit systems are nearing end of life in terms of being useful for forensics; our data sets are just getting too big.

Posted April 28, 2011 at 12:59 PM


Dave, actually the system is a brand new (really big) 64-bit system, so that's not the cause of the problem. Since I haven't been able to solve the problem, I used 'cut' instead of 'awk'. Cheers, Stefan.

Posted April 28, 2011 at 1:31 PM

Dave Hull

Stefan, from what I'm seeing in other threads, looks like this record limit is in awk, but not in gawk. If I get some time to play around with it, I'll look into it more, but since you found a solution that works, I'm not too concerned.

Posted April 28, 2011 at 1:25 PM

Dave Hull

Ahh, very nice. I've been digging into this a little further. I see other people running into the same error with awk and I'm still not exactly clear on what's causing it. Nice that you found another solution that works. Can you post your command line using cut?

Posted April 29, 2011 at 9:43 AM


I simply replaced
awk '{$1=""; print}'
with
cut -c11-
Cheers, Stefan.

Posted May 1, 2011 at 3:19 AM

Dave Hull

I played around with this after you mentioned it. I had good results with cut -d " " -f 2-. This tells cut to use a space character as the delimiter and then to output fields 2 through the end of the line. Field one will be the byte offset where the string occurs in the original image, and we don't want that for calculating frequency of occurrence. Thanks for the heads up on using cut.

Posted May 2, 2011 at 12:51 AM

Tom Webb

Great article. I've been playing with this idea for a while, but haven't had the time to test it. One thing that will greatly speed up your grep is to use fgrep instead. It uses the same syntax as grep. I've had great success with it when speeding up my analysis on large multi-gig files.

Posted May 2, 2011 at 9:27 PM

Dave Hull

Great point. On my system, fgrep is a symbolic link to grep. However, running grep with -F, as mentioned in the article, causes grep to interpret the patterns supplied as "fixed strings" rather than as regular expressions. The performance increase from using -F is dramatic. I tried running it without, but gave up on waiting for it to finish; there's simply no comparison. So yeah, use fgrep or grep -F, or wait forever.