5 April 2020

PDF Document Structure & Analysis

by Coleman Kane

PDF documents are one of the oldest and most prolific attack vectors for user-targeted exploits. At one time PDF was a broader targeting vector than Microsoft Office, because the premium cost of Office limited its adoption on PCs. PDF documents offer an interesting case study in exploits, as the underlying file structure is built from a markup language that was derived from the PostScript Language. The specification contains many elements which are similar to programming languages, while other features (like flow control) are absent. Additionally, though PDF itself is an ASCII-based syntax, it has features built into it that allow for embedding of binary data within an otherwise-ASCII encoded document. This flexible & complex syntax, the support for embedding binary elements, and an architecture in which PDF code is executed in order to render a page for viewing, all combine to build the foundation for a powerful piece of software (no doubt aiding in its adoption) that has also long been ripe territory for exploitable software flaws.

This is not meant to call Adobe’s PDF out as a unique problem. In fact, as JavaScript was added to Web Browsers and Adobe Flash, and Visual Basic for Applications (VBA macros) was added to MS Office, these powerful yet complex elements have often been the source of exploitable software vulnerabilities. We will get into some of these examples later on this week, as well.

PDF is registered as ISO Standards 32000-1 (PDF 1.7) and 32000-2 (PDF 2.0). It is recommended that you keep a bookmark to these two specifications for future reference. Unfortunately, a copy of the 2.0 specification costs money, but the 1.7 version (which covers over 90% of PDF documents out there) is freely available and more than covers what we will go over in class:

Additionally, I have some prior years’ content at the following PDF links:

Also, the following lecture on file types as object containers is important background information

2018: Container Model of Data Files (PDF)

PDF Documents

By and large, PDF documents often serve as a final “presentation ready” document format. For most users, the intended audience is a read-only consumer of the content in the PDF. Often, a document which has been authored using various editing tools is finally produced for electronic shipment via “saving as a PDF”. Due to this, many users can easily misunderstand the threat vectors possible via PDF - which often stems from believing the document to be akin to another image format, rather than a description language that their computer must execute in order to present its content to the user.

Some interesting features within the PDF specification:

JavaScript (PDFjs, ECMA) interpreter
Forms UI support (XFA, FDF, XFDF)
U3D/PRC 3d-model embedded support
Embedded Flash
Inline HTML
Numerous embedded image formats using external or embedded 3rd-party libraries
PDF-within-PDF
Multi-layer Encoded/encrypted stream data

A great place to begin is the following two posts by Brendan Zagaeski:

Additionally, the following post walks through some common examples you’re likely to see in the wild:

PDF File Format Basic Structure

If you are a novice to the PDF language syntax, review both of the above, and then come back here.

PDF Structure

From reading the above, you will recognize that a PDF document is a sequence of arbitrary-length sections, most of which define the objects that comprise the document. There is an index that is curiously placed at the end of the file. The objects may reference one another, and form a tree-like structure depending upon which objects contain other objects, or which objects simply define metadata or supporting data for other objects.

A rough diagram of this structure is below. It is an ordered structure, meaning that all of the document components have a specific ordering PDF expects them to adhere to.

entity identifier	contents	closure	description
%PDF-N.N	header data	N/A	Identifies the document type. “%PDF” can be preceded by up to 1024 bytes of non-document comment data
X Y obj	object data	endobj	Defines an object via a unique id (two numbers: object number and generation number). The contents of the object follow
W Z obj	object data	endobj	Defines another object in the same way
# # obj	.repeats.	endobj	Continue repeating `# # obj`/`endobj` sequences until all document objects are defined
xref	xref table	N/A	Cross-reference table: the index listing the byte offset of each object in the file
trailer	trailer data	startxref XXXNNN	Trailer dictionary, followed by `startxref` and the byte offset of the xref table
%%EOF			Marks the end of the PDF file. Anything past this will be ignored

Another key point to consider is that, though the xref table needs to exist and have an entry for every object, it doesn’t need to be correct. It also needs to have a unique offset defined for each object for most readers to load the document. When the PDF reader determines that an offset or object size in the xref table is incorrect, it fixes this at run-time, at the cost of a small per-object delay. Likewise, the startxref needs to exist as well.
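To make the role of startxref concrete, here is a minimal sketch that checks whether the offset recorded after the last startxref keyword really points at an xref table. The function name check_startxref and the tiny hand-built sample bytes are my own illustration, not part of any tool. It also assumes a classic xref table; files that use cross-reference streams (PDF 1.5 and later) point startxref at an object instead, and would be reported as a mismatch by this check.

```python
import re

def check_startxref(pdf_bytes):
    """Return (offset, ok): the last startxref offset and whether b'xref' sits there."""
    matches = list(re.finditer(rb'startxref\s+(\d+)', pdf_bytes))
    if not matches:
        return None, False
    # The last startxref in the file is the one a reader consults first
    offset = int(matches[-1].group(1))
    return offset, pdf_bytes[offset:offset + 4] == b'xref'

# Hypothetical, hand-built fragment used only to exercise the function
body = b'%PDF-1.7\n1 0 obj\n<< /Type /Catalog >>\nendobj\n'
xref_at = len(body)
tail = (b'xref\n0 2\n'
        b'0000000000 65535 f \n'
        b'0000000009 00000 n \n'
        b'trailer\n<< /Size 2 /Root 1 0 R >>\n')
good = body + tail + b'startxref\n%d\n%%%%EOF\n' % xref_at
bad = body + tail + b'startxref\n%d\n%%%%EOF\n' % (xref_at + 7)

print(check_startxref(good))
print(check_startxref(bad))
```

A mismatch like the second case is the situation where a reader has to fall back to scanning the file for objects on its own.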

Finally, each object can contain an optional stream using the stream and endstream keywords. The stream keyword tells the PDF reader to interpret all of the following data (up to the next endstream keyword) as raw data content that the obj encapsulates. This is how things like fonts, images, etc. are embedded within documents. The stream keyword must be followed by an end-of-line marker (CRLF or LF) before the data begins, and endstream must also have whitespace around it.

Thus, an object definition may look something like this:

4 0 obj
<</BBox[ 0 0 117.181 14.338]/Filter/FlateDecode/Length 75/Matrix[ 1 0 0 1 0 0]/Resources<</Font 8 0 R >>/Subtype/Form/Type/XObject>>stream
...
endstream
endobj

The above defines an object with id “4 0”, and then defines a stream within the object that contains some data. /Filter/FlateDecode tells the PDF reader that the embedded stream needs to be passed through the FlateDecode filter (one of the compression algorithms supported by PDF) before it can be interpreted properly. The /Length 75 tells how long the embedded data stream is (in bytes).
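As a rough sketch of how /Length and /Filter/FlateDecode fit together, the following pulls the raw stream bytes out of a single object and inflates them with Python's zlib module. The helper name extract_stream and the sample object are hypothetical. The sketch assumes /Length is a direct integer (in real files it can be an indirect reference such as 5 0 R) and that only one FlateDecode filter is applied.

```python
import re
import zlib

def extract_stream(obj_bytes):
    """Return the raw (still encoded) stream bytes of one object."""
    # Assumption: /Length is a direct integer, not an indirect reference
    length = int(re.search(rb'/Length\s+(\d+)', obj_bytes).group(1))
    # The stream keyword is followed by CRLF or LF, then the raw data
    m = re.search(rb'stream\r?\n', obj_bytes)
    return obj_bytes[m.end():m.end() + length]

# Hypothetical object built only for this example
payload = zlib.compress(b'BT /F1 12 Tf (Hello) Tj ET')
obj = (b'4 0 obj\n<</Filter/FlateDecode/Length %d>>stream\n' % len(payload)
       + payload + b'\nendstream\nendobj\n')

raw = extract_stream(obj)
print(len(raw), zlib.decompress(raw))
```

Trusting /Length keeps the sketch short, but a malformed or hostile file can lie about it, so a more careful tool would also look for the endstream keyword.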

Analyzing the PDF

The PDF structure really outlines for us that the core components of the document structure are obj/endobj pairs and the optional stream/endstream that occur within them.

Below is a simple script that can walk through the object entities (ignoring the xref table) by giving it a PDF name:

list_objects.py [Python 3.x Script]

#!/usr/bin/env python3
import argparse
import re
#
# Short script to discover and list the objects from a PDF
#

# Set up the cmdline args
ap = argparse.ArgumentParser(description="Lists objects from PDF")
ap.add_argument('-f','--filename', required=True, action='store', help='PDF File name')
args = ap.parse_args()

# Define a global regex to look for an object definition
objdef_re = re.compile(br'(\d+) (\d+) obj')

def get_next_object(fh):
    buf = b''
    while True:
        # Read 4096 bytes at a time
        tmpbuf = fh.read(4096)

        # Add to the leftover content from prior read
        buf += tmpbuf

        # Find all matches within this buffer, and provide an iterator allowing us to step
        # through them
        matches = objdef_re.finditer(buf)
        next_end = 0
        for m in matches:
            cut_tmp = False

            if m.start() < next_end:
                continue

            s = b''

            # Build a data structure keeping track of the parsed object ids
            # as well as the raw content from the PDF
            obj_item = {'raw': buf[m.start():m.end()],
                        'id_0': int(m.group(1)),
                        'id_1': int(m.group(2))}

            # Find the next endobj
            next_end = buf.find(b'endobj', m.start())
            if next_end == -1:
                # There is no more additional object content in what we read, need to pull
                # in more data
                while True:
                    cut_tmp = True
                    nbuf = fh.read(4096)
                    buf += nbuf
                    if not nbuf:
                        next_end = len(buf)
                        break
                    next_end = buf.find(b'endobj', m.start())

                    if next_end == -1:
                        next_end = len(buf)
                    else:
                        break

            # Grab the actual object data contents
            objdata = buf[m.start() + len(m.group(0)):next_end]
            obj_item['size'] = len(objdata)

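            # But if we pulled in more data, we need to discard the older data, but keep
            # the newer data after the "endobj", for the next match search
            if cut_tmp:
                buf = buf[next_end + 6:]
                next_end = 0

            # Return the parsed object contents
            yield obj_item

            # The match iterator still refers to the old buffer, so once the buffer
            # has been trimmed we must stop and rescan the new one
            if cut_tmp:
                break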

        # Clear the already-processed objs out of the buf to return memory
        if next_end > 0:
            buf = buf[next_end+6:]

        # If we get an empty read, it means we reached EOF
        if not tmpbuf:
            return

def process_object(o):
    print("Object {id0}:{id1}, len={size} >>> {content}".format(id0=o['id_0'], id1=o['id_1'],
                                                                size=o['size'],
                                                                content=o['raw']))

try:
    with open(args.filename, 'rb') as pdf_fh:
        for obj in get_next_object(pdf_fh):
            # Called once for each object
            process_object(obj)
except EOFError:
    pass # Silently exit if we hit EOFError
except IOError:
    print("There was an IO Error")

The above code steps through the document one object at a time, and collects the identifier numbers as well as the size of the object contents for each. Additionally, the object contents are available for analysis in the objdata variable within the get_next_object function.

Authoring tools like this to analyze new file types and malware samples is a common exercise in malware analysis, so it is worth studying the code to understand the mechanics that are employed above.

Security researcher Didier Stevens is well known for his work on PDF malware and format analysis. He has published a large number of tools which can be employed to perform even more in-depth PDF malware analysis:

Didier Stevens PDF Tools page

PDF Tools

Stevens is known for putting together a couple of tools that are focused on analysis of suspect PDF files. One of these, pdf-parser.py is a much more complex implementation of a PDF parser than what I demonstrated above. Additionally, Stevens offers a tool named make-pdf which can be used to create a PDF that auto-executes some JavaScript (remember, a full JavaScript engine lives in most PDF readers) upon opening the document. The toolset further contains an mPDF library that can be used for constructing a PDF by calling functions and classes that create the individual objects. Finally, a utility named pdfid is available which doesn’t really do parsing, but walks through the file and generates a histogram-like report of a long list of features that PDF implements. It can be possible, in some situations, to use classification strategies with this data to help identify PDFs that may have a higher likelihood of being malicious.
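To give a feel for the kind of report described above, here is a small sketch that counts a handful of PDF names in the raw bytes of a file. This is not how pdfid is implemented; the name list, the function count_names, and the sample bytes are my own choices for illustration. It also ignores name obfuscation (such as #xx hex escapes inside names), which a real triage tool has to handle.

```python
import re
from collections import Counter

# A few names that are commonly of interest during triage (illustrative list)
NAMES = [b'/JS', b'/JavaScript', b'/OpenAction', b'/AA',
         b'/AcroForm', b'/XFA', b'/Launch', b'/EmbeddedFile']

def count_names(pdf_bytes):
    """Count whole-name occurrences of each entry in NAMES."""
    counts = Counter()
    for name in NAMES:
        # The lookahead stops /JS from also matching a longer name
        pattern = re.escape(name) + rb'(?![A-Za-z0-9])'
        counts[name.decode()] = len(re.findall(pattern, pdf_bytes))
    return counts

# Hypothetical sample document fragment
sample = (b'1 0 obj\n<< /AcroForm 2 0 R /OpenAction 3 0 R >>\nendobj\n'
          b'2 0 obj\n<< /XFA 4 0 R >>\nendobj\n'
          b'3 0 obj\n<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\n')

for name, count in count_names(sample).items():
    print('%-14s %d' % (name, count))
```

Counts like these could then be used as a feature vector for the classification strategies mentioned above.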

Stevens offers a number of PDF videos on his YouTube Channel:

PDF Parser

The pdf-parser tool is provided as part of the modified Kali Linux VM that I put together for class. Below is the help output from running the tool with -h:

Usage: pdf-parser.py [options] pdf-file|zip-file|url
pdf-parser, use it to parse a PDF document

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -m, --man             Print manual
  -s SEARCH, --search=SEARCH
                        string to search in indirect objects (except streams)
  -f, --filter          pass stream object through filters (FlateDecode,
                        ASCIIHexDecode, ASCII85Decode, LZWDecode and
                        RunLengthDecode only)
  -o OBJECT, --object=OBJECT
                        id(s) of indirect object(s) to select, use comma (,)
                        to separate ids (version independent)
  -r REFERENCE, --reference=REFERENCE
                        id of indirect object being referenced (version
                        independent)
  -e ELEMENTS, --elements=ELEMENTS
                        type of elements to select (cxtsi)
  -w, --raw             raw output for data and filters
  -a, --stats           display stats for pdf document
  -t TYPE, --type=TYPE  type of indirect object to select
  -O, --objstm          parse stream of /ObjStm objects
  -v, --verbose         display malformed PDF elements
  -x EXTRACT, --extract=EXTRACT
                        filename to extract malformed content to
  -H, --hash            display hash of objects
  -n, --nocanonicalizedoutput
                        do not canonicalize the output
  -d DUMP, --dump=DUMP  filename to dump stream content to
  -D, --debug           display debug info
  -c, --content         display the content for objects without streams or
                        with streams without filters
  --searchstream=SEARCHSTREAM
                        string to search in streams
  --unfiltered          search in unfiltered streams
  --casesensitive       case sensitive search in streams
  --regex               use regex to search in streams
  --overridingfilters=OVERRIDINGFILTERS
                        override filters with given filters (use raw for the
                        raw stream content)
  -g, --generate        generate a Python program that creates the parsed PDF
                        file
  --generateembedded=GENERATEEMBEDDED
                        generate a Python program that embeds the selected
                        indirect object as a file
  -y YARA, --yara=YARA  YARA rule (or directory or @file) to check streams
                        (can be used with option --unfiltered)
  --yarastrings         Print YARA strings
  --decoders=DECODERS   decoders to load (separate decoders with a comma , ;
                        @file supported)
  --decoderoptions=DECODEROPTIONS
                        options for the decoder
  -k KEY, --key=KEY     key to search in dictionaries

PDF parser has the ability to do the basic logic that I performed with the earlier tool. Additionally, it has many other features, including:

It can decode encoded streams (ones that have one or more /Filter properties)
You can search within streams for content matching strings, regular expressions, yara signatures
You can select to extract the content of individual objects - navigate the document
Perform a statistical analysis of PDF keywords and output a summary view
Present a summary of each of the objects, similar to mine, but including an MD5 checksum of each object’s content

Note that pdf-parser.py only supports being run by Python 2.7.

So we can begin with some simple examples, using the malware.pdf from Lab 8. You may want to set the permissions of the pdf-parser.py script so that it is executable by your user. This has already been done in the Kali VM.

Viewing Stats

To view the statistics of the PDF file, run pdf-parser.py -a malware.pdf:

Comment: 3
XREF: 1
Trailer: 1
StartXref: 1
Indirect object: 6
  3: 1, 2, 6
 /Catalog 1: 3
 /Page 1: 5
 /Pages 1: 4
Search keywords:
 /AcroForm 1: 3
 /XFA 1: 2

The above gives us counts of various attributes of the file. We can see that there are 6 indirect objects and 3 comments, and within the Indirect object section you can see that there is a list of various PDF object types. Each of the rows consists of a well-known indirect object type, followed by a count of how often it appears, followed by a colon, and finally the list of objects it appears in. For instance, we have learned from looking at this that there’s a /Catalog type object in the object titled 3 0 obj. The object titled 4 0 obj contains data of type /Pages.

There are also some keyword searches that are performed to look for objects that might have malicious intent in them. In this case, good places to begin are listed as /AcroForm and /XFA, in 3 0 obj and 2 0 obj, respectively.
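If you want to consume a report like this from a script, a short parser is enough. The sketch below is written against the exact output shown above; the row layout may differ between pdf-parser.py versions, and the names ROW and parse_rows are my own.

```python
import re

# The -a output from above, pasted in as-is
report = """Comment: 3
XREF: 1
Trailer: 1
StartXref: 1
Indirect object: 6
  3: 1, 2, 6
 /Catalog 1: 3
 /Page 1: 5
 /Pages 1: 4
Search keywords:
 /AcroForm 1: 3
 /XFA 1: 2
"""

# Indented rows: optional /Name, a count, a colon, then a list of object numbers
ROW = re.compile(r'^\s+(/\S+)?\s*(\d+): ([\d, ]+)$')

def parse_rows(text):
    """Map each /Name (or '(untyped)') to the list of object numbers it appears in."""
    rows = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            name = m.group(1) or '(untyped)'
            rows[name] = [int(x) for x in m.group(3).split(',')]
    return rows

for name, objs in parse_rows(report).items():
    print(name, objs)
```

The unindented summary lines (Comment, XREF, and so on) are deliberately skipped, since they carry totals rather than object lists.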

Selecting objects

Selecting objects from the document can be performed with the -o flag. From the above, the /XFA object was particularly interesting, and also was the first numbered object identified by the tool, so it’s a good place to begin. I can run pdf-parser.py -o 2 malware.pdf and get the following output:

obj 2 0
 Type:
 Referencing: 1 0 R

  <<
    /XFA 1 0 R
  >>

The tool attempts to report a type here, but it was not able to derive a known type for this indirect object. If you recall in the list of indirect objects from the earlier output, there was one curious row that I didn’t discuss:

  3: 1, 2, 6

Turns out this is just reporting the count of the “NULL” or “untyped” objects from the file, and saying that there are 3 such objects (of the 6 total), and these are 1 0 obj, 2 0 obj, and 6 0 obj.

This is reflected in the object we are looking at:

obj 2 0
 Type:
...

Another thing to notice is:

 Referencing: 1 0 R

This tells us that this object contains a reference to another object. In this case, the other object contains a stream of XFA data, which this object is encapsulating. Syntactically, this is apparent in the object’s code:

  <<
    /XFA 1 0 R
  >>

However, this is often not that obvious, so pdf-parser.py provides helpful discovery of references. What this is telling us also is that the data we really want to look at is present in object 1 0 obj. So, we run pdf-parser.py -o 1 malware.pdf and look at its output:

obj 1 0
 Type:
 Referencing:
 Contains stream

  <<
    /Filter [/Fl /Fl]
    /Length 8792
  >>

Simple enough, we can see that this contains no references, but a new note tells us that a stream is embedded. By default, the stream’s contents are not displayed automatically, because often streams contain binary data that would not be easily printable.

 Contains stream

The stream has been filtered and the dictionary of the object shows us that the filter is two /FlateDecode operations (/Fl for short):

/Filter [/Fl /Fl]

The other key/value in the dictionary tells us that the length of the stream data is 8792:

/Length 8792
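A filter array is applied in order, so [/Fl /Fl] means the data has to be inflated twice. A minimal sketch of that chain, using zlib and handling only Flate filters, might look like the following. The function name apply_flate_filters and the sample payload are hypothetical stand-ins, not content from malware.pdf.

```python
import zlib

def apply_flate_filters(data, filters):
    """Apply each filter in order; only /Fl and /FlateDecode are handled here."""
    for name in filters:
        if name not in ('/Fl', '/FlateDecode'):
            raise ValueError('unsupported filter: ' + name)
        data = zlib.decompress(data)
    return data

# Hypothetical payload, compressed twice to mimic /Filter [/Fl /Fl]
original = b'<xfa>hypothetical payload</xfa>'
encoded = zlib.compress(zlib.compress(original))

print(apply_flate_filters(encoded, ['/Fl', '/Fl']))
```

Stacking filters like this costs the attacker nothing, but it can defeat scanners that only decode a single layer.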

Extracting Stream Data

We can also extract stream data, so long as we give the object number to pdf-parser.py, and the object we provide contains stream data. So, let’s run pdf-parser.py -d stream1.dat -o 1 malware.pdf to dump the stream to disk. You’ll note that the object content (minus the stream) is displayed on the console, still.

Once done, we can ls -l stream1.dat:

-rw-rw-r--  1 root  root  8793 Apr  6 20:12 stream1.dat

You’ll notice it is one byte larger than the /Length said it would be. If you look at the data with hexdump -C, at the end you’ll notice that a newline (0x0a) was added on by Stevens’ tool. For this example that should be harmless, but it is worth keeping in mind that you may want to validate the sizes of the extracted data if you do run into problems. You should at least get the stream data, but there might be an additional whitespace character added on like this.
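One way to do that validation is to compare the dumped size against the declared /Length and strip a single trailing newline only when that accounts for the whole difference. This is a sketch under that assumption; trim_to_length is a hypothetical helper, and the sample bytes are synthetic rather than the real stream.

```python
def trim_to_length(dumped, declared_length):
    """Return the dumped bytes cut back to the declared /Length, if that is safe."""
    extra = len(dumped) - declared_length
    if extra == 0:
        return dumped
    # Only strip when the surplus is exactly one trailing newline
    if extra == 1 and dumped.endswith(b'\n'):
        return dumped[:-1]
    raise ValueError('size mismatch of %d bytes, inspect manually' % extra)

# Synthetic stand-in for stream1.dat: 8792 bytes of data plus one newline
dumped = b'\x78\x9c' + b'\x00' * 8790 + b'\n'
print(len(dumped), len(trim_to_length(dumped, 8792)))
```

Raising on any other mismatch is a deliberate choice: a larger difference usually means the /Length value or the extraction itself deserves a closer look.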

If we look at the data, it’s all binary and looks pretty random.

00000000  78 9c ed dc 79 34 d5 f9  03 ff f1 66 ab 66 90 34  |x...y4.....f.f.4|
00000010  45 08 53 99 54 44 ab 7d  69 19 24 c5 68 b1 bb 86  |E.S.TD.}i.$.h...|
00000020  42 f6 25 fb 72 55 8a c9  cd 32 24 ca 36 75 c5 d4  |B.%.rU...2$.6u..|
00000030  b5 94 3d 5c 77 8a 6b bb  21 74 71 09 21 24 2e d9  |..=\w.k.!tq.!$..|
00000040  b7 eb de df f4 fd fb f7  e7 ef fc fa a3 d7 e7 9c  |................|
...

This is because our command merely extracted the content that was in the PDF. This content was compressed (with FlateDecode!), so it isn’t much use to us unless we find another tool to decode it. However, pdf-parser has a -f option that supports many of the common filters, and can extract this for us if we run the command like so: pdf-parser.py -f -d stream1_filt.dat -o 1 malware.pdf. Running ls -l stream*.dat we can see both outputs now:

-rw-rw-r-- 1 root  root      8793 Apr  6 20:12 stream1.dat
-rw-rw-r-- 1 root  root  91034818 Apr  6 20:20 stream1_filt.dat

Now we have the compressed stream, which is about 8.7kB, while the decompressed stream is roughly 90MB in size. Looking at the code, it is XFA that contains JavaScript, and even has the following interesting comments within the JS:

...
 //ROP0
 //7201E63D    XCHG EAX,ESP
 //7201E63E    RETN
 //ROP1
 //7200100A    JMP DWORD PTR DS:[KERNEL32.GetModuleHandle]
 //ROP2
 //7238EF5C    PUSH EAX
 //7238EF5D    CALL DWORD PTR DS:[KERNEL32.GetProcAddress]
 //7238EF63    TEST EAX,EAX
 //7238EF65    JNE SHORT 7238EF84
 //7238EF84    POP EBP
...

Integrating Yara

From the above, I can build a really simple yara signature to look for the presence of the ROP# comments. For example:

rule find_rop_comments {
 meta:
  author = "Coleman Kane <kaneca@mail.uc.edu>"
  url = "https://class.malware.re/2020/04/05/pdf-document-analysis.html"
  version = 1

 strings:
  $rop0 = "ROP0"
  $rop1 = "ROP1"
  $rop2 = "ROP2"

 condition:
  all of them
}

I can run pdf-parser.py with the above signature provided to the -y option and it will scan all streams and display the object summary for all objects whose streams match my signature. By default, it scans any object data after applying filters, so you don’t need to specify the -f option. If you do, pdf-parser.py will output the stream contents to the screen as well as the object summary.

Running pdf-parser.py -y find_rop_comments.yar malware.pdf:

YARA rule: find_rop_comments (find_rop_comments.yar)
obj 1 0
 Type:
 Referencing:
 Contains stream

  <<
    /Filter [/Fl /Fl]
    /Length 8792
  >>

The -y feature also has the capability of accepting a folder instead of a file name, in which case it will consume all of the rules files present in the folder. The report will publish a line like the following for every matching object:

YARA rule: find_rop_comments (find_rop_comments.yar)

The above states that it was a YARA match, what rule within the signature(s) matched, and also which filename the matching rule was present in.

If you want more details to report the matching content, you can add the --yarastrings argument, e.g.: pdf-parser.py -y find_rop_comments.yar --yarastrings malware.pdf

The output then looks like this, including the offsets and hexadecimal as well as ASCII content of the strings:

YARA rule: find_rop_comments (find_rop_comments.yar)
0007f2 $rop0:
 524f5030
 'ROP0'
000827 $rop1:
 524f5031
 'ROP1'
000868 $rop2:
 524f5032
 'ROP2'
obj 1 0
 Type:
 Referencing:
 Contains stream

  <<
    /Filter [/Fl /Fl]
    /Length 8792
  >>
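To show what the rule is doing conceptually, here is a plain-Python sketch that mirrors the three strings and the "all of them" condition, and prints hits in a layout similar to the --yarastrings report. It does not use the YARA engine at all; the names STRINGS, find_strings and all_of_them, and the short decoded sample, are made up for the example.

```python
# Plain-Python stand-in for the strings section of find_rop_comments
STRINGS = {'$rop0': b'ROP0', '$rop1': b'ROP1', '$rop2': b'ROP2'}

def find_strings(data, strings):
    """Return sorted (offset, identifier, literal) tuples for every occurrence."""
    hits = []
    for ident, literal in strings.items():
        start = 0
        while True:
            pos = data.find(literal, start)
            if pos == -1:
                break
            hits.append((pos, ident, literal))
            start = pos + 1
    return sorted(hits)

def all_of_them(data, strings):
    """Mirror the rule's condition: every string must occur at least once."""
    return all(literal in data for literal in strings.values())

# Hypothetical decoded stream content
decoded = b'var x;\n //ROP0\n //7201E63D    XCHG EAX,ESP\n //ROP1\n //ROP2\n'

if all_of_them(decoded, STRINGS):
    for pos, ident, literal in find_strings(decoded, STRINGS):
        print('%06x %s:\n %s\n %r' % (pos, ident, literal.hex(), literal.decode()))
```

A real YARA rule is preferable in practice, since it supports wildcards, modifiers and richer conditions; the sketch is only meant to demystify the report format.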

If you try matching against unfiltered streams, you’ll get no output: pdf-parser.py -y find_rop_comments.yar --unfiltered malware.pdf. However, this feature is very useful for cases where an exploit attempts to attack the filter itself, or other times where the exploit is elsewhere, the stream contains data that can be accessed in memory upon successful exploit, but the author chose to lie about which filter(s) to apply - possibly to confuse tools such as pdf-parser.py. Thus, it is advisable to try both approaches and use the sum of both sets of data to guide your investigations.

Other searching

Similarly, you can run simple string searches using the --searchstream=XXX argument. For example, on malware.pdf the following will also match 1 0 obj: pdf-parser.py --searchstream=ROP0 malware.pdf

Likewise, you can use a regular expression if you also provide the --regex argument. E.g.: pdf-parser.py --regex --searchstream="ROP(0|1|2)" malware.pdf

The --unfiltered argument works in the same fashion here: the search is run against the raw stream content rather than the filtered (decoded) content.
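The difference between searching filtered and unfiltered content can be sketched in a few lines. The function search_stream and its unfiltered parameter are hypothetical and only loosely modeled on the command-line options; the sketch assumes a single FlateDecode filter and a synthetic stream.

```python
import re
import zlib

def search_stream(raw, pattern, unfiltered=False):
    """Search either the raw stream bytes or the inflated bytes for a regex."""
    data = raw if unfiltered else zlib.decompress(raw)
    return [m.group(0) for m in re.finditer(pattern, data)]

# Synthetic stream standing in for a compressed object stream
raw = zlib.compress(b'//ROP0\n//ROP1\n//ROP2\n')

print('filtered:  ', search_stream(raw, rb'ROP(0|1|2)'))
print('unfiltered:', search_stream(raw, rb'ROP(0|1|2)', unfiltered=True))
```

Running both searches and comparing the results follows the same advice given above for YARA: the two views of the stream can tell you different things.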

mPDF Python Script Generation

Finally, if you use the -g option, the script will change its behavior so that it will process the PDF and then output Python source code that uses the mPDF module to generate the PDF programmatically. This would enable you, for instance, to add additional content to the PDF for output. For instance, you could change some of the JavaScript in the XFA stream, or possibly replace an embedded EXE with an EXE of your own choosing.

Similar to the -o option, pdf-parser.py -g malware.pdf will generate Python code where any streams are embedded in their encoded form. If the command is called as pdf-parser.py -f -g malware.pdf, the Python code will contain the filtered streams, enabling the viewing and editing of the underlying data (such as the XFA-embedded JavaScript in the earlier example).

Remember that the mPDF module doesn’t come with pdf-parser.py, it is actually in the separate make-pdf package.

Analysis Summary

This lecture discussed using both summary analysis of the PDF, as well as deep analysis of individual components of the PDF. It’s important to keep in mind that both of these approaches should coexist as part of document analysis, as they help paint different parts of the analytical picture. Additionally, this example provided the instructions for re-construction of the document. Often another challenging aspect of malware analysis can be trying to reconstruct the attack, and in doing so, modifying arbitrary variables of the attack to see how it changes things.

tags: malware yara pdf lecture

CS6038/CS5138 Malware Analysis, UC

Course content for UC Malware Analysis