SANS Digital Forensics and Incident Response Blog

Forensic Mining

The creation of quantitative models and techniques in information systems security, let alone digital forensics, is a field in its infancy. The prediction of threats is often dismissed as too difficult because data is scarce and because collecting and analyzing data for a site is costly. What we are missing is the capability to create associative rules that will enable this field to be correctly perceived as a science and not an art.

A key feature in the acceptance and uptake of digital forensics is the ability to replicate the results of an engagement.

Three main problems have been identified within the analytical process used in information systems security (Valentino, 2003):

(1) A failure to utilize all available information sources,

(2) An inability to correctly verify the validity of a suspected computer system intrusion, and

(3) The lack of a standard process.

Statistical and algorithmic processes will feature more and more in systems security and risk, as well as in digital forensic examinations. Numerous types of algorithms and models can be used. Some of these include:

  • Descriptive Modeling
  • Predictive Modeling
  • Classification
  • Regression
  • Combinatory algorithms
  • Multi-layer perceptrons (MLP) and Neural networks

These information processing methods include data mining and neural networks. The sub-fields that will most directly influence digital forensics in the years to come will be text and image mining.

The growth of text mining as a sub-discipline has so far done little to affect digital forensics, but this will change as the number of professionals with these skills increases. One of the primary differences between the generalized field of data mining and text mining comes from the presumption that data sets used in data mining exercises will be stored in a structured format. Pre-processing operations in text mining are generally focused on transforming unstructured textual data into a format that is more readily interrogated. Additionally, text mining relies heavily on the field of computational linguistics (Fayyad, Haussler & Stolorz, 1996).
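As a minimal sketch of that pre-processing step, the Python below turns free text into a term-document matrix, one common structured form. The function names (tokenize, term_document_matrix) and the two sample snippets are hypothetical and chosen only for illustration; a real engagement would need proper handling of encodings, stop words and stemming.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def term_document_matrix(documents):
    """Return (vocabulary, rows); each row holds per-term counts for one document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows


# Hypothetical snippets standing in for text recovered from a system
documents = [
    "Transfer the funds tonight.",
    "The funds were moved offshore.",
]
vocabulary, rows = term_document_matrix(documents)
```

Once the text is in this tabular form, the structured techniques of general data mining (classification, regression and so on) can be applied to it.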

The introduction of statistical methods into digital forensics is a necessity. The growth of computer storage far outpaces the capacity of any manual technique, and this trend will only get worse as time progresses. At best, humans learn linearly. This is the failing of an art. As technology develops along an exponential path, we are going to have to introduce new methodologies that replace many of the manual tasks that are currently undertaken.

Image Mining / retrieval

A set of techniques related to text data mining is developing in support of image retrieval. These are not covered in any detail here, but some common techniques include:

  • Query by example
      o Use a query image
      o User sketch as a query
  • Query directly on features, such as:
      o Find images that are 50% red
      o Find images that are blue at the top

These methods generally produce a score function similar in structure to the one used for text retrieval. Though these techniques are in their infancy, the ability to narrow the images discovered on a system down to a small fraction of the total improves the efficiency and cost effectiveness of an examination. A human practitioner then needs to search through only that small fraction, which frees them for more productive tasks.
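The feature queries listed above can be sketched as score functions that rank images, much as a text retrieval engine ranks documents. This is an illustrative sketch only: images are represented as plain lists of rows of (r, g, b) tuples rather than decoded files, and the colour thresholds, the "top third" rule and every function name are assumptions made for the example.

```python
def is_red(pixel):
    """Assumed rule: strong red channel that dominates green and blue."""
    r, g, b = pixel
    return r > 150 and r > 2 * g and r > 2 * b


def is_blue(pixel):
    """Assumed rule: strong blue channel that dominates red and green."""
    r, g, b = pixel
    return b > 150 and b > 2 * r and b > 2 * g


def fraction_matching(rows, predicate):
    """Fraction of pixels in the given rows that satisfy the predicate."""
    pixels = [p for row in rows for p in row]
    if not pixels:
        return 0.0
    return sum(1 for p in pixels if predicate(p)) / len(pixels)


def red_score(image, target=0.5):
    """Score in [0, 1]; 1.0 when exactly `target` of the pixels are red."""
    return 1.0 - abs(fraction_matching(image, is_red) - target)


def blue_top_score(image):
    """Fraction of blue pixels in the top third of the image (at least one row)."""
    top = image[: max(1, len(image) // 3)]
    return fraction_matching(top, is_blue)


def rank(images, score):
    """Return image names ordered from best to worst score."""
    return sorted(images, key=lambda name: score(images[name]), reverse=True)


# Tiny hypothetical images: each is a list of rows of (r, g, b) pixels
RED, BLUE, GREY = (200, 20, 20), (20, 20, 200), (120, 120, 120)
images = {
    "half_red": [[RED, RED], [GREY, GREY]],
    "sky": [[BLUE, BLUE], [GREY, GREY], [GREY, GREY]],
    "all_red": [[RED, RED], [RED, RED]],
}
by_red = rank(images, red_score)
by_blue_top = rank(images, blue_top_score)
```

An examiner would review only the top of each ranking, which is where the reduction in manual effort comes from.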

Text is generally considered to be unstructured (Cherkassky & Mulier, 1998). However, nearly all documents demonstrate a rich amount of semantic and syntactical structure that may be used as a framework for structuring data. Typographical elements such as punctuation, capitalization, white space and carriage returns, for instance, can provide a rich source of information to the text miner (Berry & Linoff, 1997). Used together, these methods can enable the searching of far more source information in a reduced time-frame.
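To illustrate how typographical elements can be turned into data, the sketch below counts a few of them in a block of text. The feature set, the function name typographic_features and the sample message are all hypothetical; which features actually matter would depend on the investigation.

```python
import re


def typographic_features(text):
    """Count simple typographical cues in a block of text."""
    lines = text.splitlines()
    words = re.findall(r"[A-Za-z]+", text)
    return {
        "lines": len(lines),
        "blank_lines": sum(1 for line in lines if not line.strip()),
        "punctuation": sum(1 for ch in text if ch in ".,;:!?"),
        "capitalized_words": sum(1 for w in words if w[0].isupper()),
        "all_caps_words": sum(1 for w in words if len(w) > 1 and w.isupper()),
        "carriage_returns": text.count("\r"),
    }


# Hypothetical message with CRLF line endings
sample = "URGENT: wire it now.\r\n\r\nRegards,\r\nBob"
features = typographic_features(sample)
```

Counts like these can be appended to a term-based representation of the same text, so that layout and style are searchable alongside vocabulary.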

Data Mining answers the question, "How can we extract useful information from all this data?" When used in digital forensics, it means that we can create a scientific framework that will enable us to move the discipline from an art to a science.

  1. Berry, Michael & Linoff, Gordon (1997) "Data Mining Techniques (For Marketing, Sales, and Customer Support)", John Wiley & Sons.
  2. Cherkassky, V. & Mulier, F. (1998) "Learning From Data", John Wiley & Sons.
  3. Fayyad, Usama; Haussler, David & Stolorz, Paul (1996) "Mining Scientific Data", Communications of the ACM, vol. 39, no. 11, pp. 51-57, November 1996.
  4. Valentino, Christopher C. (2003) "Smarter computer intrusion detection utilizing decision modelling", Department of Information Systems, The University of Maryland, Baltimore County, Baltimore, MD, USA.

Craig Wright, GFCA Gold #0265, is an author, auditor and forensic analyst. He has nearly 30 GIAC certifications, several post-graduate degrees and is one of a very small number of people who have successfully completed the GSE exam.