Blog: SANS Digital Forensics and Incident Response Blog

Office 2007 Metadata

Document metadata can be a valuable source of information for investigators, and its value has been discussed often. Documents created with Microsoft Office frequently come up during investigations. Several scripts and tools already read the proprietary binary format used by Office 2003 and earlier versions, so there is little to add there. Far fewer tools can list the metadata from OpenXML, the new format that Office 2007 uses, so I decided to examine it a bit further.

Microsoft has published a document describing the structure of OpenXML [1]. Essentially, a document in the OpenXML format is a compressed archive in the well-known ZIP format, so it can be opened with any ZIP tool (for instance, by renaming document.docx to document.zip and opening it with a standard ZIP tool).
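As a quick illustration of the point above, here is a minimal Python sketch (not the Perl script discussed in this post) that opens a package with the standard zipfile module and lists the parts inside it. The helper names list_package_parts and build_sample_package are made up for this example, and the sample archive is only a stand-in, not a valid Word document.

```python
import io
import zipfile


def list_package_parts(source):
    """Return the names of all parts stored in an OpenXML package.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as package:
        return package.namelist()


def build_sample_package():
    """Build a tiny in-memory stand-in for a .docx (assumption: illustrative only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("_rels/.rels", "<Relationships/>")
        package.writestr("docProps/core.xml", "<coreProperties/>")
        package.writestr("word/document.xml", "<document/>")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for name in list_package_parts(build_sample_package()):
        print(name)
```

With a real file, list_package_parts("test.docx") should work the same way, since no renaming to .zip is needed when the archive is opened programmatically.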

Inside the ZIP file is a predefined structure of files, mostly XML files that describe the document and its content. These can be read easily using the standard libraries available in scripting languages such as Perl.

According to Microsoft, the ZIP archive contains a folder called "_rels". This folder contains a file named ".rels", which defines the root relationships within the package. This is the first place to look when parsing the content of the document. Within the .rels file you find tags that define the relationships of the document:

<Relationship Id="someID" Type="relationshipType" Target="targetPart"/>
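To make the role of these tags more concrete, the following Python sketch (again, not the author's Perl code) walks the Relationship elements and keeps those whose Type ends in "properties". The function name find_property_parts and the hand-written SAMPLE_RELS content are assumptions for illustration; a real .rels file may list more relationships than shown here.

```python
import xml.etree.ElementTree as ET

# Namespace used by relationship parts in OpenXML packages.
RELS_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"

# Hand-written sample of a root .rels file (assumption: illustrative only).
SAMPLE_RELS = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>
  <Relationship Id="rId2" Type="http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties" Target="docProps/core.xml"/>
  <Relationship Id="rId3" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/extended-properties" Target="docProps/app.xml"/>
</Relationships>"""


def find_property_parts(rels_xml):
    """Map the short relationship type (e.g. 'core-properties') to its target part."""
    root = ET.fromstring(rels_xml)
    parts = {}
    for rel in root.iter(RELS_NS + "Relationship"):
        rel_type = rel.get("Type", "")
        if rel_type.endswith("properties"):
            # Keep only the last path segment of the type URI as the key.
            parts[rel_type.rsplit("/", 1)[-1]] = rel.get("Target")
    return parts


if __name__ == "__main__":
    for kind, target in find_property_parts(SAMPLE_RELS).items():
        print(kind, "->", target)
```

Looking the targets up this way, rather than hard-coding paths, means the metadata parts can still be found if a producer stores them somewhere other than the usual location.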

Metadata is stored in files whose relationship type matches "*properties", most notably "core-properties" and "extended-properties". These files are usually stored in the following locations:
  • docProps/core.xml
  • docProps/app.xml
These files contain the actual metadata, such as the document creator and who last saved the document. To display the metadata, it is necessary to extract and parse these files.
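The extract-and-parse step can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not a port of read_open_xml.pl: the function name read_properties is hypothetical, SAMPLE_CORE is a hand-written fragment, and only the direct children of the root element are read.

```python
import xml.etree.ElementTree as ET

# Hand-written sample of docProps/core.xml (assumption: illustrative only).
SAMPLE_CORE = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cp:coreProperties xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <dc:title>my company template</dc:title>
  <dc:creator>Kristinn Gudjonsson</dc:creator>
  <cp:revision>3</cp:revision>
  <dcterms:created xsi:type="dcterms:W3CDTF">2008-08-15T10:14:00Z</dcterms:created>
</cp:coreProperties>"""


def read_properties(xml_bytes):
    """Return (name, value) pairs for each direct child of a properties part."""
    root = ET.fromstring(xml_bytes)
    properties = []
    for child in root:
        # ElementTree tags look like '{namespace}name'; keep only the local name.
        name = child.tag.rsplit("}", 1)[-1]
        value = (child.text or "").strip()
        properties.append((name, value))
    return properties


if __name__ == "__main__":
    for name, value in read_properties(SAMPLE_CORE):
        print(name, "=", value)
```

For a real document the bytes would come from the archive, for example zipfile.ZipFile("test.docx").read("docProps/core.xml"). Properties that hold nested elements rather than plain text would come back with an empty value in this sketch and would need extra handling.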

To do this I wrote the script read_open_xml.pl, which parses the .rels file to locate the metadata parts of the document, then extracts the metadata and prints it to the screen. Example usage:

./read_open_xml.pl test.docx

==========================================================================
cmd line: ./read_open_xml.pl test.docx
==========================================================================

Document name: test.docx
Date: Tue Jun 9 16:51:23 GMT 2009

--------------------------------------------------------------------------
File Metadata
--------------------------------------------------------------------------
title = my company template
subject = Document template
creator = Kristinn Gudjonsson
keywords = template, word
description =
lastModifiedBy = Kristinn Gudjonsson
revision = 3
lastPrinted = 2008-08-15T10:14:00Z
created = 2008-08-15T10:14:00Z
modified = 2008-08-15T10:14:00Z
category = template
--------------------------------------------------------------------------
Application Metadata
--------------------------------------------------------------------------
Template = my_template.dot
TotalTime = 0
Pages = 2
Words = 159
Characters = 908
Application = Microsoft Word 12.1.2
DocSecurity = 0
Lines = 7
Paragraphs = 1
ScaleCrop = false
Manager = Some dude
Company = My Company
LinksUpToDate = false
CharactersWithSpaces = 1115
SharedDoc = false
HyperlinksChanged = false
AppVersion = 12.0258

copyright, Kristinn Gudjonsson, 2009


The script also reads the character encoding of the XML documents and encodes the output accordingly.

The script was written for Linux, but I modified it slightly so that it should also work on Windows (tested on Windows XP SP3 using ActivePerl 5.10). You can get the Windows version here. The Windows version has not been tested as thoroughly as the Linux one, so it may be a little less stable (installation instructions are included in the script itself).

Kristinn Gudjonsson, GCFA #5028, works as a forensic analyst and incident handler as well as being a local mentor for SANS.

1 Comment

Posted August 29, 2010 at 4:06 PM

Kala

I tried running this .pl file at the command prompt, giving it the name of the docx file. It returned the error message: "./read_open_xml.pl is not a internal or external command, operable program or batch file."

How do I actually execute this so that I can read the metadata of the document? Kindly help me with this.
Kala
