PDF Document Structure & Analysis
by Coleman Kane
PDF documents have long been one of the oldest and most prolific attack vectors for user-targeted exploits. At one time PDF was a broader targeting vector than Microsoft Office, because the premium cost of Office limited its adoption on PCs. PDF documents offer an interesting case study in exploits, as the underlying file structure is built from a markup language that was derived from the PostScript language. The specification contains many elements that are similar to programming languages, while other features (like flow control) are absent. Additionally, though PDF itself is an ASCII-based syntax, it has features built in that allow binary data to be embedded within an otherwise ASCII-encoded document. This flexible & complex syntax, the support for embedding binary elements, and an architecture in which PDF code is executed in order to render a page for viewing, all combine to build the foundation for a powerful piece of software (no doubt aiding in its adoption) that has also long been ripe territory for exploitable software flaws.
This is not meant to call Adobe’s PDF out as a unique problem. In fact, as JavaScript and Adobe Flash were added to web browsers, and Visual Basic for Applications (VBA macros) was added to MS Office, these powerful yet complex elements have often been the source of exploitable software vulnerabilities. We will get into some of these examples later on this week, as well.
PDF is registered as ISO Standards 32000-1 (PDF 1.7) and 32000-2 (PDF 2.0). It is recommended that you keep a bookmark to these two specifications for future reference. Unfortunately, the 2.0 specification must be purchased, but the 1.7 version (which accounts for over 90% of PDF documents out there) is freely available and more than covers what we will go over in class:
- ISO 32000-1:2008 PDF Standard 1.7
- PDF Standard 1.7 errata
- ISO 32000-2:2017 PDF Standard 2.0 (portal, no free download)
Additionally, I have some prior years’ content at the following PDF links:
- 2017 lecture on PDF documents (slides & video)
- 2018 lecture on PDF and other documents (PDF slides only)
Also, the following lecture on file types as object containers is important background information:
PDF Documents
By and large, PDF documents often serve as a final “presentation ready” document format. For most users, the intended audience is a read-only consumer of the content in the PDF. Often, a document which has been authored using various editing tools is finally produced for electronic shipment via “saving as a PDF”. Because of this, many users misunderstand the threat vectors possible via PDF. The misunderstanding often stems from believing the document to be akin to an image format, rather than a description language that their computer must execute in order to present its content to the user.
Some interesting features within the PDF specification:
- JavaScript (PDFjs, ECMA) interpreter
- Forms UI support (XFA, FDF, XFDF)
- U3D/PRC 3d-model embedded support
- Embedded Flash
- Inline HTML
- Numerous embedded image formats using external or embedded 3rd-party libraries
- PDF-within-PDF
- Multi-layer Encoded/encrypted stream data
A great place to begin is the following two posts by Brendan Zagaeski:
Additionally, the following post walks through some common examples you’re likely to see in the wild:
If you are a novice to the PDF language syntax, review both of the above, and then come back here.
PDF Structure
From reading the above, you will recognize that a PDF document is a sequence of arbitrary-length sections, most of which define the objects that comprise the document. There is an index that is curiously placed at the end of the file. The objects may reference one another, and form a tree-like structure depending upon which objects contain other objects, or which objects simply define metadata or supporting data for other objects.
A rough diagram of this structure is below. It is an ordered structure, meaning that all of the document components have a specific ordering PDF expects them to adhere to.
entity identifier | contents | closure | description |
---|---|---|---|
%PDF-N.N | header data | N/A | Identifies the document type. “%PDF” can be preceded with up to 1024 bytes of non-document comment data |
X Y obj | object data | endobj | Define an object via unique id (two numbers). The contents of the object follow |
W Z obj | object data | endobj | Define an object via unique id (two numbers). The contents of the object follow |
# # obj | .repeats. | endobj | Continue repeating # # obj / endobj sequences until all document objects are defined |
xref | xref table | N/A | The index: lists the byte offset of each object in the file |
trailer | trailer data | startxref XXXNNN | Trailer dictionary (for example, the entry count and the root object), followed by startxref and the byte offset of the xref table |
%%EOF | N/A | N/A | Marks end of PDF file. Anything past this will be ignored |
Another key point to consider is that, though the xref table needs to exist and have an entry for every object, it doesn’t need to be correct. It also needs to have a unique offset defined for each object for most readers to load the document. When the PDF reader determines that an offset in the xref table is incorrect, it fixes this at run-time, at the cost of a small per-object delay. Likewise, the startxref keyword needs to exist as well.
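To make the role of these markers concrete, here is a minimal sketch (my own illustration, not part of any existing tool) that locates the header, the last startxref, and the end-of-file marker in a byte string, and compares the offset that startxref declares against where the xref keyword really sits. The function name locate_markers and the tiny sample document are assumptions made up for this example, and the sketch only handles classic xref tables, not xref streams.

```python
import re

def locate_markers(data: bytes) -> dict:
    """Report where the main structural markers of a PDF sit (sketch only)."""
    header = re.search(rb'%PDF-(\d\.\d)', data)
    # Keep the LAST startxref: incremental updates append newer ones at the end
    found = list(re.finditer(rb'startxref\s+(\d+)', data))
    sx = found[-1] if found else None
    return {
        'version': header.group(1).decode() if header else None,
        'declared_xref_offset': int(sx.group(1)) if sx else None,
        # Last "xref" keyword that appears before the startxref keyword itself
        'actual_xref_offset': data.rfind(b'xref', 0, sx.start()) if sx else None,
        'eof_offset': data.rfind(b'%%EOF'),
    }

# Hypothetical one-object document whose startxref value is deliberately wrong
sample = (b'%PDF-1.7\n'
          b'1 0 obj\n<< /Type /Catalog >>\nendobj\n'
          b'xref\n0 2\n'
          b'0000000000 65535 f \n'
          b'0000000009 00000 n \n'
          b'trailer\n<< /Size 2 /Root 1 0 R >>\n'
          b'startxref\n9999\n%%EOF\n')

info = locate_markers(sample)
print(info)
```

A mismatch between declared_xref_offset and actual_xref_offset is the kind of inconsistency a tolerant reader has to repair at load time.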
Finally, each object can contain an optional stream using the stream and endstream keywords. The stream keyword tells the PDF reader to interpret all of the following data (up to the next endstream keyword) as raw data content that the obj encapsulates. This is how things like fonts, images, etc. are embedded within documents. The stream keyword must be followed by an end-of-line marker (CRLF or a lone LF), and endstream must also have whitespace around it.
Thus, an object definition may look something like this:
4 0 obj
<</BBox[ 0 0 117.181 14.338]/Filter/FlateDecode/Length 75/Matrix[ 1 0 0 1 0 0]/Resources<</Font 8 0 R >>/Subtype/Form/Type/XObject>>stream
...
endstream
endobj
The above defines an object with id “4 0”, and then defines a stream within the object that contains some data. /Filter/FlateDecode tells the PDF reader that the embedded stream needs to be passed through the FlateDecode filter (zlib/deflate, one of the compression algorithms supported by PDF) before it can be interpreted properly. The /Length 75 entry tells how long the embedded data stream is (in bytes) as stored in the file, that is, while still compressed.
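As a rough illustration of how /Length and /Filter/FlateDecode work together, the sketch below builds a small object in memory and then recovers the original stream content from it. The helper name decode_flate_stream and the sample payload are made up for this example, and it assumes /Length is a direct integer rather than an indirect reference to another object.

```python
import re
import zlib

def decode_flate_stream(obj_body: bytes) -> bytes:
    """Slice a stream out of one object using /Length, then inflate it."""
    # Assumption: /Length is a direct integer, not an indirect "N 0 R" reference
    length = int(re.search(rb'/Length\s+(\d+)', obj_body).group(1))
    # The data begins right after the end-of-line that follows "stream"
    start = re.search(rb'stream\r?\n', obj_body).end()
    raw = obj_body[start:start + length]
    # FlateDecode is zlib/deflate, so the standard library can decode it
    return zlib.decompress(raw)

# Build a hypothetical object shaped like the "4 0 obj" example above
payload = b'BT /F1 12 Tf (Hello) Tj ET'
compressed = zlib.compress(payload)
obj = (b'4 0 obj\n<</Filter/FlateDecode/Length %d>>stream\n' % len(compressed)
       + compressed + b'\nendstream\nendobj\n')

decoded = decode_flate_stream(obj)
print(decoded)
```

Slicing by /Length instead of searching for endstream matters because compressed binary data may, by chance, contain that keyword.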
Analyzing the PDF
The PDF structure really outlines for us that the core components of the document structure are obj/endobj pairs and the optional stream/endstream pairs that occur within them.
Below is a simple script that can walk through the object entities (ignoring the xref table) when given a PDF file name:
list_objects.py [Python 3.x Script]
#!/usr/bin/env python3
import argparse
import re

#
# Short script to discover and list the objects from a PDF
#

# Set up the cmdline args
ap = argparse.ArgumentParser(description="Lists objects from PDF")
ap.add_argument('-f','--filename', required=True, action='store', help='PDF File name')
args = ap.parse_args()

# Define a global regex to look for an object definition
objdef_re = re.compile(br'(\d+) (\d+) obj')

def get_next_object(fh):
    buf = b''
    while True:
        # Read 4096 bytes at a time
        tmpbuf = fh.read(4096)

        # Add to the leftover content from prior read
        buf += tmpbuf

        # Find all matches within this buffer, and provide an iterator allowing us to step
        # through them
        matches = objdef_re.finditer(buf)
        next_end = 0
        for m in matches:
            cut_tmp = False
            # Skip "N N obj" text that falls inside an object we already reported
            if m.start() < next_end:
                continue

            # Build a data structure keeping track of the parsed object ids
            # as well as the raw content from the PDF
            obj_item = {'raw': buf[m.start():m.end()],
                        'id_0': int(m.group(1)),
                        'id_1': int(m.group(2))}

            # Find the next endobj
            next_end = buf.find(b'endobj', m.start())
            if next_end == -1:
                # There is no more additional object content in what we read, need to pull
                # in more data
                while True:
                    cut_tmp = True
                    nbuf = fh.read(4096)
                    buf += nbuf
                    if not nbuf:
                        next_end = len(buf)
                        break
                    next_end = buf.find(b'endobj', m.start())
                    if next_end == -1:
                        next_end = len(buf)
                    else:
                        break

            # Grab the actual object data contents
            objdata = buf[m.start() + len(m.group(0)):next_end]
            obj_item['size'] = len(objdata)

            # But if we pulled in more data, we need to discard the older data, but keep
            # the newer data after the "endobj", for the next match search
            if cut_tmp:
                buf = buf[next_end + 6:]
                next_end = 0

            # Return the parsed object contents
            yield obj_item

            # The remaining matches were computed against the old buffer, so their
            # offsets are stale once we re-slice; restart the search on the new one
            if cut_tmp:
                break

        # Clear the already-processed objs out of the buf to return memory
        if next_end > 0:
            buf = buf[next_end+6:]

        # If we get an empty read, it means we reached EOF
        if not tmpbuf:
            return

def process_object(o):
    print("Object {id0}:{id1}, len={size} >>> {content}".format(id0=o['id_0'], id1=o['id_1'],
                                                                size=o['size'],
                                                                content=o['raw']))

try:
    with open(args.filename, 'rb') as pdf_fh:
        for obj in get_next_object(pdf_fh):
            # Called once for each object
            process_object(obj)
except EOFError:
    pass # Silently exit if we hit EOFError
except IOError:
    print("There was an IO Error")
The above code steps through the document one object at a time, and collects the identifier numbers as well as the object contents size for each. Additionally, the object contents are available for analysis in the objdata variable within the get_next_object function.
Authoring tools like this to analyze new file types and malware samples is a common exercise in malware analysis, so it is worth studying the code to understand the mechanics that are employed above.
Security researcher Didier Stevens is well known for his work on PDF malware and format analysis. He has published a large number of tools that can be employed to perform even more in-depth PDF malware analysis:
PDF Tools
Stevens is known for putting together a couple of tools that are focused on analysis of suspect PDF files. One of these, pdf-parser.py, is a much more complex implementation of a PDF parser than what I demonstrated above. Additionally, Stevens offers a tool named make-pdf which can be used to create a PDF that auto-executes some JavaScript (remember, a full JavaScript engine lives in most PDF readers) upon opening the document. The toolset further contains an mPDF library that can be used for constructing a PDF by calling functions and classes that create the individual objects.
Finally, a utility named pdfid is available which doesn’t really do parsing, but walks through the file and generates a histogram-like report of a long list of features that PDF implements. In some situations, it is possible to use classification strategies with this data to help identify PDFs that may have a higher likelihood of being malicious.
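To give a feel for what a histogram-like report involves, here is a small sketch of the general idea. It is not pdfid’s implementation: the keyword list is a hand-picked assumption, the function name count_keywords is made up, and it ignores name obfuscation (such as hex-escaped characters inside names) that a real tool has to cope with.

```python
import re
from collections import Counter

# Hand-picked names that are often worth a second look (an assumption for this sketch)
KEYWORDS = [b'/JS', b'/JavaScript', b'/OpenAction', b'/AA', b'/AcroForm',
            b'/XFA', b'/EmbeddedFile', b'/ObjStm', b'/Launch']

def count_keywords(data: bytes) -> Counter:
    """Count occurrences of each keyword anywhere in the raw file bytes."""
    counts = Counter()
    for kw in KEYWORDS:
        # The lookahead stops /JS from also matching a longer name such as /JSFoo
        pattern = re.escape(kw) + rb'(?![A-Za-z0-9])'
        counts[kw.decode()] = len(re.findall(pattern, data))
    return counts

# Hypothetical fragment of a document with a form and an auto-run script
sample = (b'1 0 obj<</AcroForm 2 0 R/OpenAction 3 0 R>>endobj '
          b'2 0 obj<</XFA 4 0 R>>endobj '
          b'3 0 obj<</S/JavaScript/JS(app.alert(1))>>endobj')

counts = count_keywords(sample)
for name, n in counts.items():
    print('%-14s %d' % (name, n))
```

The resulting counts could serve as a feature vector for the kind of classification mentioned above.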
Stevens offers a number of PDF videos on his YouTube Channel:
- Analysis of PDFs Created with OpenOffice/LibreOffice
- Analyzing a Phishing PDF with /ObjStm
- PDF: Stream Objects (/ObjStm)
- PDF: April 1st 2018
PDF Parser
The pdf-parser tool is provided as part of the modified Kali Linux VM that I put together for class. Below is the help output from running the tool with -h:
Usage: pdf-parser.py [options] pdf-file|zip-file|url
pdf-parser, use it to parse a PDF document
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-m, --man Print manual
-s SEARCH, --search=SEARCH
string to search in indirect objects (except streams)
-f, --filter pass stream object through filters (FlateDecode,
ASCIIHexDecode, ASCII85Decode, LZWDecode and
RunLengthDecode only)
-o OBJECT, --object=OBJECT
id(s) of indirect object(s) to select, use comma (,)
to separate ids (version independent)
-r REFERENCE, --reference=REFERENCE
id of indirect object being referenced (version
independent)
-e ELEMENTS, --elements=ELEMENTS
type of elements to select (cxtsi)
-w, --raw raw output for data and filters
-a, --stats display stats for pdf document
-t TYPE, --type=TYPE type of indirect object to select
-O, --objstm parse stream of /ObjStm objects
-v, --verbose display malformed PDF elements
-x EXTRACT, --extract=EXTRACT
filename to extract malformed content to
-H, --hash display hash of objects
-n, --nocanonicalizedoutput
do not canonicalize the output
-d DUMP, --dump=DUMP filename to dump stream content to
-D, --debug display debug info
-c, --content display the content for objects without streams or
with streams without filters
--searchstream=SEARCHSTREAM
string to search in streams
--unfiltered search in unfiltered streams
--casesensitive case sensitive search in streams
--regex use regex to search in streams
--overridingfilters=OVERRIDINGFILTERS
override filters with given filters (use raw for the
raw stream content)
-g, --generate generate a Python program that creates the parsed PDF
file
--generateembedded=GENERATEEMBEDDED
generate a Python program that embeds the selected
indirect object as a file
-y YARA, --yara=YARA YARA rule (or directory or @file) to check streams
(can be used with option --unfiltered)
--yarastrings Print YARA strings
--decoders=DECODERS decoders to load (separate decoders with a comma , ;
@file supported)
--decoderoptions=DECODEROPTIONS
options for the decoder
-k KEY, --key=KEY key to search in dictionaries
PDF parser has the ability to do the basic logic that I performed with the earlier tool. Additionally, it has many other features, including:
- It can decode encoded streams (ones that have one or more /Filter properties)
- You can search within streams for content matching strings, regular expressions, or yara signatures
- You can select and extract the content of individual objects
- Navigate the document
- Perform a statistical analysis of PDF keywords and output a summary view
- Present a summary of each of the objects, similar to mine, but including an MD5 checksum of each object’s content
Of note is that pdf-parser.py only supports being run by Python 2.7.
So we can begin with some simple examples, using the malware.pdf from Lab 8. You may want to set the permissions of the pdf-parser.py script so that it is executable by your user. This has already been done in the Kali VM.
Viewing Stats
To view the statistics of the PDF file, run pdf-parser.py -a malware.pdf:
Comment: 3
XREF: 1
Trailer: 1
StartXref: 1
Indirect object: 6
3: 1, 2, 6
/Catalog 1: 3
/Page 1: 5
/Pages 1: 4
Search keywords:
/AcroForm 1: 3
/XFA 1: 2
The above gives us numerous counts of various attributes of the file. We can see that there are 6 indirect objects and 3 comments, and within the Indirect object section there is a list of various PDF object types. Each of the rows consists of a well-known indirect object type, followed by a count of how often it appears, followed by a colon, and finally the list of objects the type appears in. For instance, we have learned from looking at this that there’s a /Catalog type object in the object titled 3 0 obj. The object titled 4 0 obj contains data of type /Pages.
There are also some keyword searches that are performed to look for objects that might have malicious intent in them. In this case, good places to begin are listed as /AcroForm and /XFA, in 3 0 obj and 2 0 obj, respectively.
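The per-type grouping in that report can be approximated in a few lines. The sketch below is my own simplification, not how pdf-parser.py does it: it groups object numbers by their /Type value, using the empty string for untyped objects. The miniature document is invented to mirror the layout reported above and is not the Lab 8 sample, and the regex would be fooled by stream data that happens to contain endobj.

```python
import re
from collections import defaultdict

obj_re = re.compile(rb'(\d+)\s+(\d+)\s+obj(.*?)endobj', re.DOTALL)
type_re = re.compile(rb'/Type\s*/(\w+)')

def objects_by_type(data: bytes) -> dict:
    """Map each /Type value ('' when absent) to the list of object numbers."""
    groups = defaultdict(list)
    for m in obj_re.finditer(data):
        t = type_re.search(m.group(3))
        key = '/' + t.group(1).decode() if t else ''
        groups[key].append(int(m.group(1)))
    return dict(groups)

# Invented six-object document with the same shape as the stats output above
sample = (
    b'%PDF-1.7\n'
    b'1 0 obj\n<< /Length 3 >>\nstream\nabc\nendstream\nendobj\n'
    b'2 0 obj\n<< /XFA 1 0 R >>\nendobj\n'
    b'3 0 obj\n<< /Type /Catalog /AcroForm 2 0 R /Pages 4 0 R >>\nendobj\n'
    b'4 0 obj\n<< /Type /Pages /Kids [5 0 R] /Count 1 >>\nendobj\n'
    b'5 0 obj\n<< /Type /Page /Parent 4 0 R >>\nendobj\n'
    b'6 0 obj\n<< >>\nendobj\n'
)

groups = objects_by_type(sample)
for name, ids in groups.items():
    print('%s %d: %s' % (name, len(ids), ', '.join(map(str, ids))))
```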
Selecting objects
Selecting objects from the document can be performed with the -o flag. From the above, the /XFA object was particularly interesting, and it was also the lowest-numbered object flagged by the keyword search, so it’s a good place to begin. I can run pdf-parser.py -o 2 malware.pdf and get the following output:
obj 2 0
Type:
Referencing: 1 0 R
<<
/XFA 1 0 R
>>
The tool attempts to extract a type, but it was not able to derive a known type for this indirect object. If you recall the list of indirect objects from the earlier output, there was one curious row that I didn’t discuss:
3: 1, 2, 6
It turns out this is just reporting the count of the “NULL” or “untyped” objects from the file, saying that there are 3 such objects (of the 6 total), and that these are 1 0 obj, 2 0 obj, and 6 0 obj.
This is reflected in the object we are looking at:
obj 2 0
Type:
...
Another thing to notice is:
Referencing: 1 0 R
This tells us that this object contains a reference to another object. In this case, the other object contains a stream of XFA data, which this object is encapsulating. Syntactically, this is apparent in the object’s code:
<<
/XFA 1 0 R
>>
However, this is often not that obvious, so pdf-parser.py provides helpful discovery of references. What this is also telling us is that the data we really want to look at is present in object 1 0 obj. So, we run pdf-parser.py -o 1 malware.pdf and look at its output:
obj 1 0
Type:
Referencing:
Contains stream
<<
/Filter [/Fl /Fl]
/Length 8792
>>
Simple enough, we can see that this contains no references, but a new note tells us that a stream is embedded. By default, the stream’s contents are not displayed automatically, because often streams contain binary data that would not be easily printable.
Contains stream
The stream has been filtered, and the dictionary of the object shows us that the filter is two /FlateDecode operations (/Fl for short), applied one after the other:
/Filter [/Fl /Fl]
The other key/value in the dictionary tells us that the length of the stream data is 8792 bytes:
/Length 8792
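A filter array such as [/Fl /Fl] is applied in order, with each filter consuming the output of the one before it. The following is a minimal sketch of that chaining, under my own assumptions: the names filter_names, apply_filters, and DECODERS are invented for illustration, and only FlateDecode and ASCIIHexDecode (with their short forms) are handled.

```python
import binascii
import re
import zlib

def ascii_hex_decode(data: bytes) -> bytes:
    """Decode ASCIIHexDecode data: hex digits, optional whitespace, '>' ends it."""
    digits = re.sub(rb'\s+', b'', data.split(b'>')[0])
    if len(digits) % 2:
        digits += b'0'  # an odd final digit is treated as if followed by 0
    return binascii.unhexlify(digits)

# Filter name -> decoder; only a small subset is covered in this sketch
DECODERS = {
    b'FlateDecode': zlib.decompress, b'Fl': zlib.decompress,
    b'ASCIIHexDecode': ascii_hex_decode, b'AHx': ascii_hex_decode,
}

def filter_names(dictionary: bytes) -> list:
    """Return the filter names from '/Filter /X' or '/Filter [/X /Y]'."""
    m = re.search(rb'/Filter\s*(\[[^\]]*\]|/\w+)', dictionary)
    return re.findall(rb'/(\w+)', m.group(1)) if m else []

def apply_filters(dictionary: bytes, raw: bytes) -> bytes:
    """Run the raw stream bytes through each declared filter, left to right."""
    data = raw
    for name in filter_names(dictionary):
        data = DECODERS[name](data)
    return data

# Hypothetical stream that was deflated twice, like the /Filter [/Fl /Fl] case
payload = b'<xdp:xdp>...</xdp:xdp>'
raw = zlib.compress(zlib.compress(payload))
dictionary = b'<<\n/Filter [/Fl /Fl]\n/Length %d\n>>' % len(raw)

print(filter_names(dictionary))
print(apply_filters(dictionary, raw))
```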
Extracting Stream Data
We can also extract stream data, so long as we give the object number to pdf-parser.py, and the object we provide contains stream data. So, let’s run pdf-parser.py -d stream1.dat -o 1 malware.pdf to dump the stream to disk. You’ll note that the object content (minus the stream) is still displayed on the console.
Once done, we can ls -l stream1.dat:
-rw-rw-r-- 1 root root 8793 Apr 6 20:12 stream1.dat
You’ll notice it is about the same size as the /Length said it would be (8793 bytes versus 8792). If you look at the data with hexdump -C, at the end you’ll notice that a newline (0x0a) was added on by Stevens’ tool. For this example that should be harmless, but it is worth keeping in mind that you may want to validate the sizes of the extracted data if you do run into problems. You should at least get the stream data, but there might be an additional whitespace character added on like this.
If we look at the data, it’s all binary and looks pretty random.
00000000 78 9c ed dc 79 34 d5 f9 03 ff f1 66 ab 66 90 34 |x...y4.....f.f.4|
00000010 45 08 53 99 54 44 ab 7d 69 19 24 c5 68 b1 bb 86 |E.S.TD.}i.$.h...|
00000020 42 f6 25 fb 72 55 8a c9 cd 32 24 ca 36 75 c5 d4 |B.%.rU...2$.6u..|
00000030 b5 94 3d 5c 77 8a 6b bb 21 74 71 09 21 24 2e d9 |..=\w.k.!tq.!$..|
00000040 b7 eb de df f4 fd fb f7 e7 ef fc fa a3 d7 e7 9c |................|
...
This is because our command merely extracted the content that was in the PDF. This content was compressed (with FlateDecode! The leading bytes 78 9c are a typical zlib header), so it isn’t much use to us unless we find another tool to decode it. However, pdf-parser has a -f option that supports many of the common filters, and can extract this for us if we run the command like so: pdf-parser.py -f -d stream1_filt.dat -o 1 malware.pdf. Running ls -l stream*.dat we can see both outputs now:
-rw-rw-r-- 1 root root 8793 Apr 6 20:12 stream1.dat
-rw-rw-r-- 1 root root 91034818 Apr 6 20:20 stream1_filt.dat
Now we have the compressed stream, which is about 8.7kB, while the decompressed stream is roughly 90MB in size. Looking at the code, it is XFA that contains JavaScript, and even has the following interesting comments within the JS:
...
//ROP0
//7201E63D XCHG EAX,ESP
//7201E63E RETN
//ROP1
//7200100A JMP DWORD PTR DS:[KERNEL32.GetModuleHandle]
//ROP2
//7238EF5C PUSH EAX
//7238EF5D CALL DWORD PTR DS:[KERNEL32.GetProcAddress]
//7238EF63 TEST EAX,EAX
//7238EF65 JNE SHORT 7238EF84
//7238EF84 POP EBP
...
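An expansion from a few kilobytes to roughly 90MB is a reminder that decoding an untrusted stream can consume far more memory and disk than its size on disk suggests. One way to stay in control, sketched below with names I made up for this example (bounded_inflate, LIMIT), is to inflate with a cap on the output size using zlib’s streaming decompressor. This is my own precaution for ad-hoc scripts, not a description of what pdf-parser.py does.

```python
import zlib

LIMIT = 1 << 20  # cap each decode step at 1 MiB of output (arbitrary choice)

def bounded_inflate(data: bytes, limit: int = LIMIT):
    """Inflate at most `limit` bytes; also report whether output was cut short."""
    d = zlib.decompressobj()
    out = d.decompress(data, limit)
    # d.eof only becomes True once the end of the compressed stream was reached
    return out, not d.eof

# Hypothetical doubly-deflated payload that is tiny on disk but large when decoded
inner = zlib.compress(b'A' * 5_000_000)
bomb = zlib.compress(inner)

layer1, cut1 = bounded_inflate(bomb)
layer2, cut2 = bounded_inflate(layer1)
print(len(bomb), len(layer1), cut1, len(layer2), cut2)
```

Here the first layer decodes completely, while the second is stopped at the cap, which tells you up front that the full output is large before you commit to writing it out.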
Integrating Yara
From the above, I can build a really simple yara signature to look for the presence of the ROP# comments. For example:
rule find_rop_comments {
meta:
author = "Coleman Kane <kaneca@mail.uc.edu>"
url = "https://class.malware.re/2020/04/05/pdf-document-analysis.html"
version = 1
strings:
$rop0 = "ROP0"
$rop1 = "ROP1"
$rop2 = "ROP2"
condition:
all of them
}
I can run pdf-parser.py with the above signature provided to the -y option and it will scan all streams and display the object summary for all objects whose streams match my signature. By default, it scans any object data after applying filters, so you don’t need to specify the -f option. If you do, pdf-parser.py will output the stream contents to the screen as well as the object summary.
Running pdf-parser.py -y find_rop_comments.yar malware.pdf:
YARA rule: find_rop_comments (find_rop_comments.yar)
obj 1 0
Type:
Referencing:
Contains stream
<<
/Filter [/Fl /Fl]
/Length 8792
>>
The -y feature also has the capability of accepting a folder instead of a file name, in which case it will consume all of the rules files present in the folder. The report will publish a line like the following for every matching object:
YARA rule: find_rop_comments (find_rop_comments.yar)
The above states that it was a YARA match, which rule within the signature(s) matched, and also which file the matching rule was present in.
If you want more details to report the matching content, you can add the --yarastrings argument, e.g.:
pdf-parser.py -y find_rop_comments.yar --yarastrings malware.pdf
The output then looks like this, including the offsets and hexadecimal as well as ASCII content of the strings:
YARA rule: find_rop_comments (find_rop_comments.yar)
0007f2 $rop0:
524f5030
'ROP0'
000827 $rop1:
524f5031
'ROP1'
000868 $rop2:
524f5032
'ROP2'
obj 1 0
Type:
Referencing:
Contains stream
<<
/Filter [/Fl /Fl]
/Length 8792
>>
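If it helps to see what the rule’s condition amounts to, the following plain-Python stand-in checks the same three strings against a decoded stream and formats its hits in a similar offset/hex/ASCII layout. It is only an approximation written for illustration: it does not use YARA at all, it reports just the first occurrence of each string, and the names STRINGS, match_all, and report are invented here.

```python
# The three strings from the find_rop_comments rule
STRINGS = {'$rop0': b'ROP0', '$rop1': b'ROP1', '$rop2': b'ROP2'}

def match_all(data: bytes):
    """Emulate 'all of them': return first offsets, or None if any string is missing."""
    hits = {}
    for ident, needle in STRINGS.items():
        offset = data.find(needle)
        if offset == -1:
            return None
        hits[ident] = offset
    return hits

def report(data: bytes) -> list:
    """Format the hits as offset / identifier, hex bytes, then ASCII text."""
    hits = match_all(data)
    lines = []
    for ident, offset in sorted((hits or {}).items(), key=lambda kv: kv[1]):
        needle = STRINGS[ident]
        lines.append('%06x %s:' % (offset, ident))
        lines.append(' ' + needle.hex())
        lines.append(' ' + repr(needle.decode()))
    return lines

# Hypothetical decoded stream content containing the three comments
decoded = b'//ROP0\n//7201E63D XCHG EAX,ESP\n//ROP1\n//ROP2\n'
print('\n'.join(report(decoded)))
```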
If you try matching against unfiltered streams, you’ll get no output: pdf-parser.py -y find_rop_comments.yar --unfiltered malware.pdf.
However, this feature is very useful for cases where an exploit attempts to attack the filter itself, or where the exploit is elsewhere and the stream contains data that can be accessed in memory upon successful exploit, but the author chose to lie about which filter(s) to apply, possibly to confuse tools such as pdf-parser.py. Thus, it is advisable to try both approaches and use the sum of both sets of data to guide your investigations.
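The difference between the two views is easy to demonstrate. In this sketch (the function name search_both and both sample streams are assumptions for illustration), the same needle is looked up in the raw bytes and in the inflated bytes, and a stream whose declared filter does not actually apply is reported separately instead of raising an error.

```python
import zlib

def search_both(raw: bytes, needle: bytes) -> dict:
    """Look for needle in the raw stream and in its inflated form."""
    result = {'unfiltered': needle in raw}
    try:
        result['filtered'] = needle in zlib.decompress(raw)
    except zlib.error:
        # The declared FlateDecode filter does not match the actual data
        result['filtered'] = None
    return result

# An honest stream: the text is only visible after decoding
honest = zlib.compress(b'//ROP0\n' * 50)
# A "lying" stream: it claims FlateDecode but really holds plain bytes
lying = b'//ROP0 plain bytes stored under a FlateDecode label'

print(search_both(honest, b'ROP0'))
print(search_both(lying, b'ROP0'))
```

Only the combination of both results catches both cases, which is the reason for trying the filtered and unfiltered searches together.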
Other searching
Similarly, you can run simple string searches using the --searchstream=XXX argument. For example, on malware.pdf the following will also match 1 0 obj: pdf-parser.py --searchstream=ROP0 malware.pdf
Likewise, you can use a regular expression if you also provide the --regex argument. E.g.:
pdf-parser.py --regex --searchstream="ROP(0|1|2)" malware.pdf
The --unfiltered argument works in the same fashion here: the search is run against the raw stream content rather than the filtered (decoded) content.
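For your own scripts, the same kind of search is straightforward with Python’s re module on bytes. The sketch below uses invented names (search_stream and its parameters) that loosely echo the command-line flags. Treating the search as case-insensitive unless told otherwise is my assumption, based only on the existence of a --casesensitive flag.

```python
import re

def search_stream(data: bytes, pattern: bytes, regex=False, casesensitive=False):
    """Return (offset, matched bytes) for every hit in a decoded stream."""
    flags = 0 if casesensitive else re.IGNORECASE
    if not regex:
        # Plain string search: neutralize any regex metacharacters
        pattern = re.escape(pattern)
    return [(m.start(), m.group(0)) for m in re.finditer(pattern, data, flags)]

# Hypothetical decoded stream content
decoded = b'//ROP0\n//rop1\n//ROP2\n//ROP9\n'

print(search_stream(decoded, b'ROP0'))
print(search_stream(decoded, b'ROP(0|1|2)', regex=True))
print(search_stream(decoded, b'ROP(0|1|2)', regex=True, casesensitive=True))
```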
mPDF Python Script Generation
Finally, if you use the -g option, the script will change its behavior so that it will process the PDF and then output Python source code that can use the mPDF module to generate the PDF programmatically. This would enable you, for instance, to add additional content to the PDF for output. For instance, you could change some of the JavaScript in the XFA stream, or possibly replace an embedded EXE with an EXE of your own choosing.
Similar to the -o option, pdf-parser.py -g malware.pdf will generate Python code where any streams are embedded in their encoded form. If the command is called as pdf-parser.py -f -g malware.pdf, the Python code will contain the filtered streams, enabling the viewing and editing of the underlying data (such as the XFA-embedded JavaScript in the earlier example).
Remember that the mPDF module doesn’t come with pdf-parser.py; it is actually in the separate make-pdf package.
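To show what generating a PDF programmatically boils down to, here is a small standard-library sketch that assembles a document from a list of object bodies and computes the xref offsets itself. It deliberately does not use mPDF (I am not reproducing its API here), and the function name build_pdf and the three sample objects are assumptions for illustration.

```python
def build_pdf(bodies, root=1) -> bytes:
    """Assemble header, numbered objects, xref table, trailer, and %%EOF."""
    out = bytearray(b'%PDF-1.7\n')
    offsets = []
    for num, body in enumerate(bodies, start=1):
        offsets.append(len(out))  # byte offset where "N 0 obj" begins
        out += b'%d 0 obj\n' % num + body + b'\nendobj\n'
    xref_pos = len(out)
    out += b'xref\n0 %d\n' % (len(bodies) + 1)
    out += b'0000000000 65535 f \n'  # entry 0 is always the free-list head
    for off in offsets:
        out += b'%010d 00000 n \n' % off  # each entry is exactly 20 bytes
    out += b'trailer\n<< /Size %d /Root %d 0 R >>\n' % (len(bodies) + 1, root)
    out += b'startxref\n%d\n%%%%EOF\n' % xref_pos
    return bytes(out)

# Hypothetical three-object document: catalog, page tree, one empty page
bodies = [
    b'<< /Type /Catalog /Pages 2 0 R >>',
    b'<< /Type /Pages /Kids [3 0 R] /Count 1 >>',
    b'<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>',
]
pdf = build_pdf(bodies)
print(pdf.decode('ascii'))
```

Editing one of the bodies and rebuilding is the same idea as modifying the generated Python code and re-running it: the offsets are recomputed, so the index stays consistent with the changed content.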
Analysis Summary
This lecture discussed using both summary analysis of the PDF, as well as deep analysis of individual components of the PDF. It’s important to keep in mind that both of these approaches should coexist as part of document analysis, as they help paint different parts of the analytical picture. Additionally, this example provided the instructions for re-construction of the document. Often another challenging aspect of malware analysis can be trying to reconstruct the attack, and in doing so, modifying arbitrary variables of the attack to see how it changes things.
tags: malware yara pdf lecture