SANS Digital Forensics and Incident Response Blog

Outlier analysis in digital forensics

In my previous post, Atemporal time line analysis in digital forensics, I talked about using the inodes of a known piece of attacker code as a pivot point to discover previously unknown attacker code on a system. In this post, I want to point out another interesting thing about these inodes.

Recall that I'm using the word "inode" in a generic, non-filesystem specific way, to refer to the numeric value that is assigned to a file's metadata attributes (i.e. time stamps, permissions, cluster/block runs, etc). On NTFS the first file on the file system is assigned inode zero and is always the $MFT. As more files are created on the system, the next file on the system would get inode one, and so on. On EXT2/3/4 file systems, the root directory (/), is assigned inode two and things increase from there as additional files are stored on the system.

A typical file system has hundreds of thousands of files. Each file has its own inode. Because of the way operating systems are installed, it's normal to see entire directory structures written to disk with files having largely sequential inode values. For example, here's a partial directory listing from a Windows NTFS partition's system32 directory:

davehull@64n6:/cases/sample$ grep system32 sample.timeline.csv | awk -F, '{print $1, $7, $3, $NF}' | sort -g | egrep "\.\.\.b" | tail -n 26
2003 03 07 Fri 10:38:56 20702-128-4 ...b C:/WINDOWS/system32/drivers/ati2mtag.sys
2003 03 07 Fri 10:38:56 20703-128-4 ...b C:/WINDOWS/system32/ati2dvag.dll
2003 03 07 Fri 10:38:56 20704-128-4 ...b C:/WINDOWS/system32/Ati2mdxx.exe
2003 03 07 Fri 10:38:56 20705-128-4 ...b C:/WINDOWS/system32/ati3d1ag.dll
2003 03 07 Fri 10:38:56 20706-128-4 ...b C:/WINDOWS/system32/ati3d2ag.dll
2003 03 07 Fri 10:38:56 20707-128-4 ...b C:/WINDOWS/system32/ati3duag.dll
2003 03 07 Fri 10:38:56 20709-128-4 ...b C:/WINDOWS/system32/atioglxx.dll
2003 03 07 Fri 10:38:56 20710-128-4 ...b C:/WINDOWS/system32/atiicdxx.dll
2003 03 07 Fri 10:38:56 20711-128-4 ...b C:/WINDOWS/system32/atiicdxx.vxd
2003 03 07 Fri 10:38:56 20713-128-4 ...b C:/WINDOWS/system32/atiiiexx.dll
2003 03 07 Fri 10:38:57 20708-128-4 ...b C:/WINDOWS/system32/atitvo32.dll
2003 03 07 Fri 10:38:57 20712-128-4 ...b C:/WINDOWS/system32/atricdxx.enu
2003 03 07 Fri 10:38:58 20725-128-4 ...b C:/WINDOWS/system32/drivers/DP83815.sys
2003 03 07 Fri 10:39:00 20745-128-4 ...b C:/WINDOWS/system32/drivers/HSFHWALI.sys
2003 03 07 Fri 10:39:00 20749-128-4 ...b C:/WINDOWS/system32/drivers/hpm0850.cty
2003 03 07 Fri 10:39:00 20754-128-4 ...b C:/WINDOWS/system32/carpserv.exe
2003 03 07 Fri 10:39:00 20755-128-4 ...b C:/WINDOWS/system32/carpdll.dll
2003 03 07 Fri 10:39:01 20744-128-4 ...b C:/WINDOWS/system32/drivers/HSF_CNXT.sys
2003 03 07 Fri 10:39:01 20746-128-4 ...b C:/WINDOWS/system32/drivers/HSF_DP.sys
2003 03 07 Fri 10:39:01 20756-128-4 ...b C:/WINDOWS/system32/hsfinst.dll
2003 03 07 Fri 10:39:02 20747-128-4 ...b C:/WINDOWS/system32/drivers/mdmxsdk.sys
2003 03 07 Fri 10:39:02 20748-128-4 ...b C:/WINDOWS/system32/drivers/strmdisp.sys
2003 03 07 Fri 10:39:02 20750-128-4 ...b C:/WINDOWS/system32/mdmxsdk.dll
2003 03 07 Fri 10:39:03 20719-128-4 ...b C:/WINDOWS/system32/drivers/atisgkaf.SYS
2003 03 07 Fri 10:39:05 20686-128-4 ...b C:/WINDOWS/system32/drivers/aliirda.sys
2003 03 07 Fri 10:39:06 20764-128-4 ...b C:/WINDOWS/system32/drivers/Express.sys

This partial listing is sorted by date. Note that the inode values that follow the date stamps are largely sequential. There are some exceptions, but for the most part the order of the inode numbers aligns with the files' creation times. As file systems are used over the years and new patches are applied, causing files to be backed up and replaced, the ordering of these files by inode number breaks down. Surprisingly, though, this ordering remains intact enough on many systems, even after years of use, that we can use it to spot files that may be of interest.
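One rough way to put a number on "largely sequential" is to sort entries by creation time and count how often the inode also rises from one entry to the next. The sketch below is only an illustration, not part of body-outliers: the function name order_agreement is hypothetical, and the rows are a handful of timestamp and inode pairs copied from the listing above (weekday dropped).

```python
from datetime import datetime

# (creation time, inode) pairs copied from a few rows of the listing above.
rows = [
    ("2003 03 07 10:38:56", 20702),
    ("2003 03 07 10:38:56", 20703),
    ("2003 03 07 10:38:56", 20704),
    ("2003 03 07 10:38:57", 20708),
    ("2003 03 07 10:38:58", 20725),
    ("2003 03 07 10:39:00", 20745),
    ("2003 03 07 10:39:01", 20744),
    ("2003 03 07 10:39:03", 20719),
    ("2003 03 07 10:39:05", 20686),
    ("2003 03 07 10:39:06", 20764),
]


def order_agreement(pairs):
    """Fraction of consecutive entries, sorted by time, whose inode also increases."""
    # Ties on the timestamp are broken by inode, which flatters the score a little.
    parsed = sorted(
        (datetime.strptime(ts, "%Y %m %d %H:%M:%S"), inode) for ts, inode in pairs
    )
    steps = list(zip(parsed, parsed[1:]))
    rising = sum(1 for (_, a), (_, b) in steps if b > a)
    return rising / len(steps)


print("inode order agrees with time order on %.0f%% of steps" % (100 * order_agreement(rows)))
```

A score near 1.0 suggests the directory was laid down in one pass, as during installation; a low score suggests heavy churn, where inode outliers carry less meaning.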

Outliers in digital forensics -- photo from http://www.flickr.com/photos/malcolmtredinnick/

Going back to the case from my previous post, let's take a look at the inode values of a few of the files that were malicious:

davehull@64n6:/cases/F10K/OS/mnt/etc/cron.daily$ ls -lint | sort -g
total 80
754007 -rwxr-xr-x. 1 0 0 104 Aug 2 2007 rpm
754483 -rwxr-xr-x. 1 0 0 2133 Dec 1 2004 prelink
756140 -rwxr-xr-x. 1 0 0 276 Sep 28 2004 0anacron
756311 -rwxr-xr-x. 1 0 0 180 May 2 2006 logrotate
756441 -rwxr-xr-x. 1 0 0 286 Aug 13 2004 tmpwatch
756896 lrwxrwxrwx. 1 0 0 28 Mar 26 2008 00-logwatch -> ../log.d/scripts/logwatch.pl
756897 -rwxr-xr-x. 1 0 0 418 Jun 6 2007 00-makewhatis.cron
756917 -rwxr-xr-x. 1 0 0 121 Aug 22 2007 slocate.cron
756974 -rwxr-xr-x. 1 0 0 1042 Jan 9 2007 certwatch
756984 -rwxr-xr-x. 1 0 0 135 Aug 18 2004 00webalizer
4572392 -rwxr-xr-x 1 1000 100 351 Apr 15 2011 dnsquery

Recall that dnsquery was the malicious cron script on the system. Note how its inode value is a significant outlier in relation to the inodes of other files in the directory. (Also worth noting are the file's uid and gid values.)

Let's look at another of the malicious files in context:

davehull@64n6:/cases/F10K/OS/mnt/usr/lib$ ls -lint | sort -gr | head -n 10
4572390 -rwxr-xr-x 1 1000 100 388262 Apr 15 2011 popauth
2179232 -rwxr-xr-x 1 0 0 32712 Oct 12 2007 libutil1.2.1.2.so
1247326 drwxr-xr-x. 3 0 0 4096 Jan 8 2008 wireshark
1247315 drwxr-xr-x. 2 0 0 4096 Mar 26 2008 sa
1247196 drwxr-xr-x. 2 0 0 4096 Mar 26 2008 mc
1246563 drwxr-xr-x. 3 0 0 4096 Mar 26 2008 zsh
1245298 drwxr-xr-x. 2 0 0 4096 Mar 26 2008 octave-2.1.57
1230465 drwxr-xr-x. 3 0 0 4096 Mar 26 2008 lam
1230431 drwxr-xr-x. 3 0 0 4096 Mar 26 2008 im
1230254 drwxr-xr-x. 2 0 65 4096 Jan 3 2008 cmpi

In this listing, popauth, a malicious file, is a significant outlier where inodes are concerned. Note again that its uid and gid are outliers as well.

But relying on inodes to find malicious files isn't 100% reliable. Take a look at httpd.log in context:

davehull@64n6:/cases/F10K/OS/mnt/usr/lib$ ls -lint | sort -gr | grep -C5 httpd.log
706562 drwxr-xr-x. 10 0 0 4096 Mar 26 2008 perl5
704967 drwxr-xr-x. 2 0 0 4096 Mar 26 2008 tc
704676 drwxr-xr-x. 2 0 0 4096 Mar 26 2008 sse2
704665 drwxr-xr-x. 2 0 0 4096 Mar 27 2008 pkgconfig
688206 drwxr-xr-x. 2 0 0 12288 Mar 26 2008 gconv
670494 -rw------- 1 0 0 8192 Mar 18 2011 httpd.log
670445 -rwxr-xr-x 1 0 0 37068 Sep 22 2004 libieee1284.so.3.2.0
670438 -rwxr-xr-x 1 0 0 230676 Sep 22 2004 libcroco-0.6.so.3.0.0
670437 -rwxr-xr-x 1 0 0 27820 Aug 6 2004 libIIOP.so.0.5.17
670432 -rwxr-xr-x 1 0 0 732484 Oct 16 2004 libmcop.so.1.0.0
670427 -rwxr-xr-x 1 0 0 357876 Sep 28 2004 libbonobo-2.so.0.0.0

Here we see that its inode address does not stand out; however, its permissions and ctime date stamp are both outliers.
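Permissions, uids and gids are categories rather than quantities, so averages don't apply to them. A simple stand-in is to flag any value that is rare among its neighbors. The following is a minimal sketch under that assumption, not something body-outliers does: the function name rare_values and the 10% cutoff are hypothetical, and the mode strings are copied from the listing above without the trailing "." (httpd.log's mode written out as -rw-------).

```python
from collections import Counter

# (mode string, name) pairs copied from the httpd.log listing above.
entries = [
    ("drwxr-xr-x", "perl5"),
    ("drwxr-xr-x", "tc"),
    ("drwxr-xr-x", "sse2"),
    ("drwxr-xr-x", "pkgconfig"),
    ("drwxr-xr-x", "gconv"),
    ("-rw-------", "httpd.log"),
    ("-rwxr-xr-x", "libieee1284.so.3.2.0"),
    ("-rwxr-xr-x", "libcroco-0.6.so.3.0.0"),
    ("-rwxr-xr-x", "libIIOP.so.0.5.17"),
    ("-rwxr-xr-x", "libmcop.so.1.0.0"),
    ("-rwxr-xr-x", "libbonobo-2.so.0.0.0"),
]


def rare_values(pairs, max_share=0.1):
    """Return names whose attribute value is shared by at most max_share of the entries."""
    counts = Counter(value for value, _ in pairs)
    total = len(pairs)
    return [name for value, name in pairs if counts[value] / total <= max_share]


print(rare_values(entries))
```

The same function works for uid or gid values by swapping the first element of each pair. Expect false positives: any legitimately unusual file in a small directory will be flagged too.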

Based on this concept of outlier analysis, I wrote a Python script called body-outliers that parses the contents of an fls bodyfile, calculates the average and standard deviation for pairs of metadata elements (currently just macb times and inodes) within each path, and returns a list of files from the bodyfile that exceed a threshold value provided by the user.
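To make the idea concrete, here is a minimal sketch of the approach, written for Python 3. It is not the body-outliers code. It assumes the TSK 3.x bodyfile layout (MD5|name|inode|mode_as_string|UID|GID|size|atime|mtime|ctime|crtime), uses the population standard deviation, and looks only at the metadata address and ctime. The function names and the "and" mode are assumptions made for illustration.

```python
import posixpath
import statistics
from collections import defaultdict


def parse_bodyfile(lines):
    """Yield (directory, filename, meta_addr, ctime) from bodyfile lines."""
    for line in lines:
        parts = line.rstrip("\n").split("|")
        if len(parts) < 11:
            continue  # malformed line
        name = "|".join(parts[1:-9])  # tolerate '|' inside the file name
        try:
            meta_addr = int(parts[-9].split("-")[0])  # NTFS looks like 20702-128-4
            ctime = int(parts[-2])
        except ValueError:
            continue
        directory, filename = posixpath.split(name)
        # Skip entries with no usable metadata, plus the . and .. entries.
        if meta_addr == 0 or ctime == 0 or filename in (".", ".."):
            continue
        yield directory, filename, meta_addr, ctime


def devs(value, mean, sd):
    """Signed distance from the mean, in standard deviations."""
    return 0.0 if sd == 0 else (value - mean) / sd


def outliers(lines, threshold, mode="or"):
    """Return (directory, filename) pairs that exceed threshold within their directory."""
    by_dir = defaultdict(list)
    for directory, filename, meta_addr, ctime in parse_bodyfile(lines):
        by_dir[directory].append((filename, meta_addr, ctime))
    hits = []
    for directory, files in by_dir.items():
        addrs = [f[1] for f in files]
        ctimes = [f[2] for f in files]
        a_mean, a_sd = statistics.mean(addrs), statistics.pstdev(addrs)
        c_mean, c_sd = statistics.mean(ctimes), statistics.pstdev(ctimes)
        for filename, meta_addr, ctime in files:
            a_out = abs(devs(meta_addr, a_mean, a_sd)) > threshold
            c_out = abs(devs(ctime, c_mean, c_sd)) > threshold
            flagged = (a_out or c_out) if mode == "or" else (a_out and c_out)
            if flagged:
                hits.append((directory, filename))
    return hits
```

One property of this kind of test is worth keeping in mind. With a population standard deviation, no single value in a group of n can sit more than sqrt(n - 1) deviations from the mean, so a threshold of 10 can only fire in directories holding roughly a hundred or more entries. Small directories such as cron.daily need a much lower threshold.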

Here is sample output from body-outliers.py being run against the bodyfile for the case in question:

davehull@64n6:/cases/F10K/OS$ ~/code/python/body-outliers/body-outliers.py --file slash.bodyfile --devs 10 --mode or | head -n 40
[+] Checking command line arguments.
[+] Outlier threshold is 10.0
[+] slash.bodyfile may be a bodyfile.
[+] Discarded 5775 files with 0 for meta_addr.
[+] Discarded 249 files with 0 for ctime.
[+] Discarded 0 files named .. or .
[+] Discarded 1 bad lines from slash.bodyfile.
Metadata meta_addr or ctime outliers that are more than 10.00 standard deviations from average values for their respective paths.
=====================================================================================================================================

Path avg meta_addr: 2176494 std dev: 55113.19 avg ctime: 2010 11 07 04:54:16 std dev: 23338643.10 path: /sbin
file meta_addr: 1051052 devs: -20.42 ctime: 2008 03 26 09:01:57 devs: -3.54 file: P^^ (deleted-realloc)

Path avg meta_addr: 853640 std dev: 145.00 avg ctime: 2008 03 26 08:58:53 std dev: 3.30 path: /usr/X11R6/lib/X11/fonts/100dpi
file meta_addr: 852235 devs: -9.69 ctime: 2008 03 26 08:57:49 devs: -19.39 file: fonts.alias

Path avg meta_addr: 854044 std dev: 163.00 avg ctime: 2008 03 26 08:58:56 std dev: 3.35 path: /usr/X11R6/lib/X11/fonts/75dpi
file meta_addr: 852237 devs: -11.09 ctime: 2008 03 26 08:57:49 devs: -20.00 file: fonts.alias

Path avg meta_addr: 1050657 std dev: 185.34 avg ctime: 2008 03 27 00:00:30 std dev: 36233.93 path: /usr/X11R6/lib/xscreensaver
file meta_addr: 1052864 devs: 11.91 ctime: 2008 03 27 06:46:25 devs: 0.67 file: anemone

Path avg meta_addr: 667578 std dev: 123965.62 avg ctime: 2010 09 28 03:46:36 std dev: 28230281.96 path: /usr/bin
file meta_addr: 6258822 devs: 45.10 ctime: 2011 01 22 11:36:33 devs: 0.36 file: checkmirror

Path avg meta_addr: 795282 std dev: 216888.35 avg ctime: 2008 04 03 15:17:52 std dev: 8001447.12 path: /usr/include
file meta_addr: 670496 devs: -0.58 ctime: 2011 03 18 17:46:00 devs: 11.65 file: glob2.h
file meta_addr: 670495 devs: -0.58 ctime: 2011 01 22 11:37:22 devs: 11.06 file: shup.h

Path avg meta_addr: 681805 std dev: 116862.43 avg ctime: 2008 04 27 13:07:42 std dev: 15502625.69 path: /usr/lib
file meta_addr: 2179232 devs: 12.81 ctime: 2011 01 22 11:37:22 devs: 5.57 file: libutil1.2.1.2.so
file meta_addr: 4572390 devs: 33.29 ctime: 2011 01 22 11:37:22 devs: 5.57 file: popauth

Notice that shup.h, glob2.h and popauth all appear as outliers in the output (you may need to scroll to the right to see the file names). I've cut off the line at the end of the output that tells how many files met the provided outlier threshold (--devs 10): 69 out of more than 225K. Of those 69 files, three were malicious, so we've reduced the size of the haystack quite a bit. Of course, in the process we've also filtered out some needles (not all attacker files were outliers), but once we've identified one of these three malicious files, pivoting to find more evil is trivial, as described previously.

body-outliers.py requires Python 2.7, or at least the argparse module. You can install argparse in SIFT by first installing Python's pip via sudo apt-get install python-pip. You may also need to run sudo apt-get install python-setuptools; it's been a few months since I went through this, so I'm working from faulty memory. Once these packages are installed, run pip install argparse, and then you should be able to run body-outliers.py.
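For readers who haven't used argparse, this is roughly how the three options from the sample run (--file, --devs, --mode) could be declared. It is a sketch only, not the script's actual argument handling: the defaults, the help text and the "and" choice are assumptions.

```python
import argparse


def build_parser():
    """Command-line options modelled on the sample invocation shown earlier."""
    parser = argparse.ArgumentParser(
        description="Flag metadata outliers in an fls bodyfile (illustrative sketch)."
    )
    parser.add_argument("--file", required=True, help="bodyfile to analyze")
    # The default threshold is an assumption made for this sketch.
    parser.add_argument(
        "--devs", type=float, default=3.0,
        help="standard deviations a value must exceed to be reported",
    )
    # "or" appears in the sample run; "and" and the default are assumptions.
    parser.add_argument(
        "--mode", choices=["and", "or"], default="or",
        help="report when either metric (or) or both metrics (and) exceed --devs",
    )
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(
    ["--file", "slash.bodyfile", "--devs", "10", "--mode", "or"]
)
print("[+] Outlier threshold is %s" % args.devs)
```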

Of course this method won't be 100% effective. There's plenty of malicious code on systems that, by coincidence, will not be an outlier. In my testing, though, it has helped to locate some malicious files and has greatly reduced the overall data set. Once I've found one piece of malicious code, I can pivot to find others using the techniques described in my previous post.

I'm working on a new version of body-outliers that I hope will yield better results. The new version will look at file permissions, uids and gids and will perform its calculations in a non-directory context. If the file system being analyzed is available, it will check to see whether outliers are packed. On Windows systems it will also check against the csv output from autoruns, provided that file is available. I will only commit working code to the git repository, so feel free to grab the new versions as they become available.

I have to mention that there's nothing new about this technique. Keying off of out-of-sequence inodes is something I picked up from Rob Lee years ago. I've only tried to automate the process. Of course, after I submitted my SECTor talk on this topic, I decided it would be worth checking to see if anyone else had looked into this before. I should have known: Brian Carrier and Gene Spafford had gone down this path before. My method is not quite the same as the one Carrier's paper describes, but the concept is similar enough that I'd be remiss if I didn't mention his prior art.

If you read the metadata from Dave Hull's inode, you'll find he's a significant outlier in some ways and average in others. He's unique, just like everyone else. Hull is an active member of a CIRT and is the forensics lead in a Fortune 500 corporation. In addition, he is the principal consultant for Trusted Signal, a boutique information security consulting firm specializing in IR and forensics.