SANS Digital Forensics and Incident Response Blog

Extracting Known Bad Hash Set From NSRL

Hash filtering is a time-saving technique for a computer forensics examiner when working on a huge disk image. In a nutshell, this technique can filter out all those files in your image that belong to the operating system or well-known software packages. This will let the examiner focus on unknown files, reducing the scope of the investigation. After all, there's no point in spending time checking files we already know.

This filtering operation is based on hashes. Usually, we calculate the hash for every file in the image and check it against a list of hashes previously calculated over known good files. We call this list the known good hash set. All files with hashes matching the list are filtered out.

On the other hand, we would like to know if there are malicious files in our computer forensics case image. Again, the technique works by calculating the hash for every file in the image, looking for matches in a list containing pre-calculated hashes for known malicious files, viruses, cracker's tools, or anything you judge to be a malicious file. We call this list the known bad hash set and we want to be alerted when matches occur.
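As a minimal sketch of the two lookups described above, the following Perl fragment hashes each file named on the command line and classifies it. The sub names md5_of_file and classify and the tiny in-memory sets are illustrative assumptions, not part of any forensic tool; real hash sets would be loaded from files.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5;

# Return the uppercase MD5 hex digest of a file's contents.
sub md5_of_file {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!\n";
    binmode($fh);
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close($fh);
    return uc $digest;
}

# Classify one digest: 'alert' for known bad, 'ignore' for known good,
# 'review' for everything else. Known bad is checked first so that a hash
# present in both sets is never silently filtered out.
sub classify {
    my ($digest, $good, $bad) = @_;
    return 'alert'  if $bad->{uc $digest};
    return 'ignore' if $good->{uc $digest};
    return 'review';
}

# Hypothetical in-memory hash sets, for illustration only.
my %known_good = ('D41D8CD98F00B204E9800998ECF8427E' => 1);   # MD5 of empty input
my %known_bad  = ();

foreach my $path (@ARGV) {
    my $digest = md5_of_file($path);
    printf "%-6s %s %s\n",
        classify($digest, \%known_good, \%known_bad), $digest, $path;
}
```

Checking the known bad set before the known good set is a design choice made for this sketch: a hash that ends up in both lists raises an alert instead of being filtered out.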

It's not an easy task to maintain such hash sets, and they need to be huge in order to be effective. Thankfully, others are collecting files and calculating hashes for us. The National Institute of Standards and Technology (NIST) maintains the National Software Reference Library, or NSRL, which is one of the best hash set libraries available. It's public and free!

Unfortunately, life is not a bed of roses.

Practically all tools that use hash sets for filtering have a way to say "this is my known good hash set, ignore everything found here" and "this is my known bad hash set, ring all bells when something matches here". The Sleuth Kit's sorter tool does that using -x (for known good) and -a (for known bad). However, the NSRL hash set contains both good and bad files. If we use it as known good, there's a risk of ignoring malicious files in the image. If we use it as known bad, we will have thousands of false positives. What to do?

Doug White, from NIST, has given me good advice.

The NSRL file that correlates hashes and file names is NSRLFile.txt, while NSRLProd.txt lists the products those files belong to, along with each product's classification. The known bad files belong to products classified as "Hacker Tool", so we can separate them by product code. You can use MS Log Parser, AWK or any programming language. I prefer Perl and here is the code:

#!/usr/bin/perl -w
# Extracts known good and known bad hashsets from NSRL
# usage: nsrlext.pl -n <nsrl files comma separated> -p <nsrl prod files comma separated> [-g <known good txt>] [-b <known bad txt>] [-h]
#
# -n :nsrl files comma separated. Ex: -n c:\nsrl\RDA_225_A\NSRLFile.txt,c:\nsrl\RDA_225_B\NSRLFile.txt
# -p :nsrl prod files comma separated. Ex: -p c:\nsrl\RDA_225_A\NSRLProd.txt,c:\nsrl\RDA_225_B\NSRLProd.txt
# -g :known good txt filename. Ex: -g good.txt
# -b :known bad txt filename. Ex: -b bad.txt
# -h :help
#
#
use strict;
use Getopt::Std;

my $ver = "0.1";

$| = 1;    # unbuffered stdout, so the progress marks appear while the script runs

#options
my %args = ();
getopts("hn:p:g:b:", \%args);

#help
if ($args{h}) {
    &cabecalho;
    # single-quoted heredoc: keeps the backslashes in the example paths literal
    print <<'DETALHE';
usage: nsrlext.pl -n nsrl_files_comma_separated -p nsrl_prod_files_comma_separated [-g known_good_txt] [-b known_bad_txt] [-h]

-n :nsrl files comma separated. Ex: -n c:\nsrl\RDA_225_A\NSRLFile.txt,c:\nsrl\RDA_225_B\NSRLFile.txt
-p :nsrl prod files comma separated. Ex: -p c:\nsrl\RDA_225_A\NSRLProd.txt,c:\nsrl\RDA_225_B\NSRLProd.txt
-g :known good txt filename. Ex: -g good.txt
-b :known bad txt filename. Ex: -b bad.txt
-h :help

DETALHE
    exit;
}

die "Enter the NSRL hashset file list (comma delimited)\n" unless ($args{n});
die "Enter the NSRL product file list (comma delimited)\n" unless ($args{p});

die "Enter known good and/or known bad output filenames\n" unless (($args{g}) || ($args{b}));

my %hack;

&cabecalho;

#Prod files
my @prod = split(/,/, $args{p});

foreach my $item (@prod) {
    open(my $product, '<', $item) or die "Cannot open $item: $!\n";

    while (<$product>) {
        chomp;
        # naive split: a comma inside a quoted field shifts the columns
        my @line = split(/,/, $_);

        #create a hash of hacker tool codes
        $hack{$line[0]} = $item if (defined $line[6] && $line[6] =~ /Hacker Tool/);
    }

    close($product);
}

#hashset files
my @hset = split(/,/, $args{n});

my ($bad, $good);
if ($args{b}) { open($bad,  '>', $args{b}) or die "Cannot create $args{b}: $!\n"; }
if ($args{g}) { open($good, '>', $args{g}) or die "Cannot create $args{g}: $!\n"; }

my $i = 0;

foreach my $item (@hset) {
    open(my $nsrl, '<', $item) or die "Cannot open $item: $!\n";

    while (<$nsrl>) {

        #stdout feedback
        print ">" if (($i % 10000) == 0);

        my @line = split(/,/, $_);

        if (defined $line[5] && $hack{$line[5]}) {
            #is a hacker tool
            print {$bad} $_ if ($bad);
        }
        else {
            print {$good} $_ if ($good);
        }

        $i++;
    }

    close($nsrl);
}

print "\nDone !\n";

close($bad)  if ($bad);
close($good) if ($good);

### Subroutines ####

sub cabecalho {
    # the CABEC terminator below must start in column 0
    print <<CABEC;

nsrlext.pl v$ver
Extracts known good and known bad hashsets from NSRL
Tony Rodrigues
dartagnham at gmail dot com
--------------------------------------------------------------------------

CABEC

}

#-----EOF-------

Usage: nsrlext.pl -n c:\nsrl\RDA_225_A\NSRLFile.txt,c:\nsrl\RDA_225_B\NSRLFile.txt -p c:\nsrl\RDA_225_A\NSRLProd.txt,c:\nsrl\RDA_225_B\NSRLProd.txt -b NSRLBad.txt -g NSRLGood.txt

This script runs on both Windows and Linux; it just requires Perl.

After this, we can use both hash sets in Autopsy, TSK Sorter or even with md5deep/sha1deep.
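One caveat worth illustrating: the script splits each record with split(/,/, ...), so if a quoted field ever contains a comma, the column positions shift. The sketch below is one possible comma-aware splitter, not part of the original script. The sub name split_nsrl_line and the sample record (product "Some Tool, Pro Edition" by "ACME") are hypothetical, and treating a doubled quote inside a quoted field as an escaped quote is an assumption about the file format.

```perl
use strict;
use warnings;

# Split one NSRL-style CSV record into fields, honoring double-quoted
# fields that may contain commas.
sub split_nsrl_line {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;    # tolerate both Unix and Windows line endings
    my @fields;
    # Each field is either a quoted string or a run of non-comma characters,
    # followed by a comma or the end of the line.
    while ($line =~ /\G(?:"((?:[^"]|"")*)"|([^,]*))(,|\z)/g) {
        my ($quoted, $bare, $delim) = ($1, $2, $3);
        if (defined $quoted) {
            $quoted =~ s/""/"/g;    # assumption: "" is an escaped quote
            push @fields, $quoted;
        }
        else {
            push @fields, $bare;
        }
        last if $delim eq '';       # reached the end of the line
    }
    return @fields;
}

# Hypothetical NSRLProd.txt-style record with a comma inside the product name.
my $record = qq{1234,"Some Tool, Pro Edition","1.0","WIN","ACME","English","Hacker Tool"\n};
print join('|', split_nsrl_line($record)), "\n";
# prints: 1234|Some Tool, Pro Edition|1.0|WIN|ACME|English|Hacker Tool
```

With this record a plain split(/,/) yields eight pieces, so index 6 would hold "English" instead of the classification.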

Tony Rodrigues has over 20 years of IT experience and 7 years in Information Security management. He currently holds CISSP, CFCP and Security+ certifications and has been in charge of several corporate digital investigations in Brazil. He loves CAINE Live CD and writes about Computer Forensics/Incident Response for forcomp.blogspot.com.

10 Comments

Posted February 22, 2010 at 7:17 PM

raffael

Congratulations, Tony.

Posted February 23, 2010 at 8:09 AM

Nanni Bassetti

Good job! We'll put this script in the bash script tools directory in Caine live :-)

Posted February 23, 2010 at 9:01 AM

Anders Thulin

I can't make out from the script if it does the right thing or not, so here is a warning.
The media classed as Hacker Stuff in NSRL contain much that is not properly hacker tools, and which will cause false positive alarms. I remember a number of Microsoft redistributables, and the GNU GPL, for instance, but there is much else.
To avoid many false positives, I think it is advisable to drop all hashes that appear both as Hacker Tools and elsewhere: these are more likely to be safe files than unsafe. (Note the many duplicates with Microsoft and MSDN CDs.) Even so, there are some NSRL hashes that appear only on Hacker Tools but that are still 'good': they just happen to be unusual.
Me, I see so many false positives when I use even the reduced hacker tools extracts that I won't use it anymore, except in very special circumstances. A session with a reliable antivirus/antimalware scanner is almost always more productive.
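The de-duplication suggested in this comment could be sketched in Perl roughly as follows. It assumes NSRLFile.txt-style lines whose first field is a 40-character SHA-1, and that the two inputs are the NSRLBad.txt and NSRLGood.txt files produced by the script. The sub names sha1_of_line and reduce_bad are hypothetical.

```perl
use strict;
use warnings;

# Extract the SHA-1 from the first field of an NSRLFile.txt-style line.
# Returns undef for lines that do not start with a hash (e.g. the header).
sub sha1_of_line {
    my ($line) = @_;
    return $line =~ /^"?([0-9A-Fa-f]{40})"?,/ ? uc $1 : undef;
}

# Keep only the "bad" lines whose SHA-1 never occurs in the "good" file.
sub reduce_bad {
    my ($bad_lines, $good_fh) = @_;

    # Index the comparatively small bad set by SHA-1.
    my %only_bad;
    foreach my $line (@$bad_lines) {
        my $sha1 = sha1_of_line($line);
        $only_bad{$sha1} = 1 if defined $sha1;
    }

    # Stream the large good set and drop every hash that also appears there.
    while (my $line = <$good_fh>) {
        my $sha1 = sha1_of_line($line);
        delete $only_bad{$sha1} if defined $sha1;
    }

    return grep {
        my $sha1 = sha1_of_line($_);
        defined $sha1 && $only_bad{$sha1};
    } @$bad_lines;
}

# Usage: perl reduce_bad.pl NSRLBad.txt NSRLGood.txt > NSRLBadReduced.txt
if (@ARGV == 2) {
    my ($bad_file, $good_file) = @ARGV;
    open(my $bad_fh, '<', $bad_file) or die "Cannot open $bad_file: $!\n";
    my @bad_lines = <$bad_fh>;
    close($bad_fh);
    open(my $good_fh, '<', $good_file) or die "Cannot open $good_file: $!\n";
    print reduce_bad(\@bad_lines, $good_fh);
    close($good_fh);
}
```

Only the bad set is held in memory; the good set is read line by line. As the comment notes, the reduced list can still contain harmless files that simply never appear on other media.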

Posted February 23, 2010 at 1:09 PM

Tony Rodrigues

Thanks, Nanni !

Posted February 23, 2010 at 1:40 PM

Tony Rodrigues

Hi, Anders. Thanks for your comments.
I have seen the same behavior with AV. They detect "hacker tools" and alert on or delete even some Sysinternals/NirSoft ones that are very useful in IR.
Maybe we will always have to deal with false positives. We must consider, case by case, whether a technique is useful or not. A good way to make the NSRL hashset produce more accurate results could be to split the files even further, applying your own sub-classification, and use only the subsets related to the case. Loading them into a database would make this faster.

Posted February 24, 2010 at 10:53 PM

Andrew Hoog

If you are copying the code from the website and get the error:
Can't find string terminator "CABEC" anywhere before EOF at /home/ahoog/nsrlext.pl line 103
remove the whitespace in front of "CABEC" on line 111; it then works. I tested this on Ubuntu 8.

Posted February 24, 2010 at 11:47 PM

Tony Rodrigues

Yes, Andrew, this problem sometimes happens when we copy the source code from a Perl editor to the blog editor. I published this code and others at https://sourceforge.net/projects/byteinvestigato/
Maybe it is easier to download it from this SourceForge project.

Posted February 25, 2010 at 5:07 AM

Dave Hull

Thank you Andrew. I have removed the leading spaces.

Posted February 26, 2010 at 2:43 PM

Julio Carvalho

Congratulations Tony.
Great text.

Posted June 9, 2013 at 5:54 PM

Mika

hmmpf.. I copied the perl script and ran it, but it generates two empty files? If I grep or search the files I can find "Hacker Tool" in NSRLProd.txt, but no extraction?