Basic Static Analysis of Malware
by Coleman Kane
Introduction
As discussed earlier in the course, static analysis is the process of analyzing malware “at rest”, to extract identifying features and other characteristics from the tool without actually executing it. The benefit is that it lets you identify capabilities and identifiers that you might not be able to get from attempting to interact with the malware. You can also identify detection features that would be useful for discovering the sample, even in situations where an intrusion has not yet occurred, has only partially succeeded, or has failed.
Container Model of Files
When discussing malware, it is important to recognize that we’re ultimately talking about files in a computer. Most of the time, these are files on disk; sometimes they’re either files or file fragments that are memory-resident. Sticking with the files-on-disk scenario, your hard drive itself can be considered a container of files. The files themselves can also be viewed as containers, and I find this to be a very useful model when approaching malware analysis. At the most basic level, most of you are familiar with the concept of a file being a container of bytes.
with open('file.bin', 'rb') as input_file:
    file_bytes = input_file.read()
The above snippet of code opens the file file.bin, as if it were a box, and proceeds to read all of the bytes, from
beginning to end, out of the file, copying them into a bytes object named file_bytes. You can then pick and choose
individual bytes, or byte ranges, using the index operator ([]) in Python. This would not have been as easy or
straightforward using just the file alone: the file first had to be opened, so even Python is treating
the file as a container, which must be opened to retrieve its contained bytes.
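As a small sketch of that indexing, the snippet below uses a hard-coded bytes value as a hypothetical stand-in for the contents of file.bin, so it runs without any file on disk:

```python
# Hypothetical stand-in for the bytes read from file.bin (assumed values).
file_bytes = bytes([0x4D, 0x5A, 0x90, 0x00, 0x03, 0x00, 0x00, 0x00])

first_byte = file_bytes[0]    # indexing a bytes object yields an int
first_two = file_bytes[0:2]   # slicing yields a new bytes object
last_four = file_bytes[-4:]   # negative indices count back from the end

print(first_byte, first_two, last_four.hex())
```

Note that a single index gives you a number, while a slice gives you another container of bytes.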
However, most files on a computer aren’t merely containers of bytes. Most files have defined structure to them, and
thus are containers of higher level data structures, even complete other files. One very obvious example would be the
ZIP File. You’re likely familiar with these, and you even had to use one if you pulled the Windows7 VM from modern.IE.
The basic concept is that the ZIP file contains a bunch of other files, with the added wrinkle that these files are
passed through a compression algorithm, to create a new container of bytes representing the file in its compressed
form. So the ZIP file contains the information to tell your computer how to recreate the original file (as long as you
have a compatible unzip utility). This information is effectively stored as a (typically) smaller collection of bytes
within the ZIP archive.
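To make the container idea concrete, here is a minimal sketch using the zipfile module from the Python standard library. The member name notes.txt and its contents are made up for illustration, and the archive is built in memory rather than on disk:

```python
import io
import zipfile

# Made-up, highly repetitive content, so it compresses well.
original = b"static analysis " * 64

# Build a ZIP archive in memory holding one DEFLATE-compressed member.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("notes.txt", original)

# Reopen the archive and pull the member back out of the container.
with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
    info = archive.getinfo("notes.txt")
    recovered = archive.read("notes.txt")

print("original size:", info.file_size)
print("stored size:  ", info.compress_size)
print("round trip ok:", recovered == original)
```

The stored size is smaller than the original here only because the sample data is repetitive; already-compressed or random data may not shrink at all.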
Example Structured File Formats
Some other structured file examples:
- PNG Image files
- Microsoft Portable Executables (EXE/DLL) - this one will be particularly important for malware analysis, as many malware tools come as PE files, for Windows.
- MS Office CFB - file format used for old “legacy” MSOffice documents, and still used today for embedding components into current versions
- PDF File format
- ELF Binary file - the executable file format used by almost all Linux distributions, and most other UNIX variants.
Oftentimes a particular file format ends up simply being a special case of another common file format. One perfect
example of this is the latest file formats saved from MS Office 365 (docx, pptx, xlsx, etc.). Another great
example is the APK file format, which is derived from a special version of a JAR file, the Java archive format.
All of these are actually just special variations of the ZIP File Format described earlier in this section, each
with some metadata constraints as well as archive type definitions that require the presence of certain named files
within the archive for it to be considered valid.
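The sketch below illustrates the “ZIP with required member names” idea. It builds a heavily simplified, hypothetical stand-in for a docx in memory (it is not a document Word would open), then checks it the way an analyst might: first as a generic ZIP, then for the named members that the more specific format expects. The helper name looks_like_docx is my own, and the check is deliberately loose.

```python
import io
import zipfile

# Hypothetical, minimal stand-in for a docx container (not a valid document).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("word/document.xml", "<document/>")
raw = buffer.getvalue()

def looks_like_docx(data: bytes) -> bool:
    """Loose check: a ZIP archive that carries the member names a docx uses."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return False
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = archive.namelist()
    return "[Content_Types].xml" in names and any(
        name.startswith("word/") for name in names
    )

print(raw[:4])                # the local file header signature of a ZIP
print(looks_like_docx(raw))
```

The same approach applies to an APK or JAR, with a different set of expected member names.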
EXE Files
In 2020, I put together a lecture that covers some basic analysis of an EXE using common tools. I also introduced the Ghidra application for performing some initial static analysis, by mapping the output of these Linux utilities to the data panels of Ghidra.
This lecture walked through a number of purpose-built utilities common to most Linux systems, and installed on the Kali VM image I provided to all of you students in class. We use these as building blocks toward the main view of Ghidra, which can be overwhelming to the initial observer. I often find it very helpful to have these repeatable examples of single-use or limited-use tools to demonstrate recipes for getting specific data, and then to follow these up by introducing a multi-purpose environment such as Ghidra, which demonstrates how much power you get from stitching these data sets together.
Hexdump
Initially, I begin with the hexdump utility. Some great examples and documentation are available here.
My favorite invocation of hexdump is using the -C option. This gives a 16-byte-wide
hexadecimal dump output, as well as a preview of the raw text (with unprintable
characters replaced by dots) on the right. This gives you the ability to see the numeric representation,
as well as view the raw data for human-readable content or other patterns that are helped
by a denser viewport.
hexdump -C filename.exe
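If it helps to see what that layout consists of, the following is a rough Python approximation of the -C output format. The function name hexdump_c is my own, and the sketch does not reproduce hexdump’s habit of collapsing repeated lines into a single asterisk.

```python
def hexdump_c(data: bytes) -> str:
    """Format data roughly the way hexdump -C does (no '*' line squeezing)."""
    lines = []
    for offset in range(0, len(data), 16):
        chunk = data[offset:offset + 16]
        # Two groups of eight hex bytes, each padded to a fixed width of 23.
        left = " ".join(f"{byte:02x}" for byte in chunk[:8])
        right = " ".join(f"{byte:02x}" for byte in chunk[8:])
        # Printable ASCII is shown as-is; everything else becomes a dot.
        text = "".join(chr(byte) if 0x20 <= byte <= 0x7E else "." for byte in chunk)
        lines.append(f"{offset:08x}  {left:<23}  {right:<23}  |{text}|")
    # The final line holds the total length, written as an offset.
    lines.append(f"{len(data):08x}")
    return "\n".join(lines)

# Made-up sample: an MZ signature followed by the familiar DOS stub message.
sample = b"MZ\x90\x00" + b"This program cannot be run in DOS mode."
print(hexdump_c(sample))
```

The left column is the byte offset in hexadecimal, which is what lets you jump between this view and the addresses reported by other tools.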
Less
The GNU less utility is used a lot in the beginning
of this lecture to control the output of the other commands, allowing me to page through
the data. The tool is similar to the better-known more command, which has a variant
on Windows systems, too. I happen to favor the less tool. A common slogan is “less is
more”, which communicates both that less is a rewrite of more and that less
has additional features beyond what more offers.
During the lecture, I use the spacebar to page through the file, and PageUp/Down are supported
as well. Additionally, if I want to search through the entire buffer, I can type the / character,
followed by the text I want to search for. I use this feature in the lecture, and it is what enables
me to skip around the file using numeric addresses.
Some more documentation here: Unix Less Command: 10 Tips for Effective Navigation
File
The file command is built in to pretty much every Linux and
BSD variant. It is built around libmagic, which is a library that can perform metadata analysis
based upon arbitrary file structure information stored in a “magic database”.
In the lecture, I use the following to dump out a brief list of file type and intended platform:
file filename.exe
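The core idea behind a magic database can be sketched in a few lines of Python. This is a toy: the real libmagic database holds far more rules and inspects much more than the first few bytes. The table and the function name guess_type are my own, though the signatures listed are the well-known leading bytes of each format.

```python
# Leading-byte signatures for the structured formats mentioned in these notes.
MAGIC_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"MZ", "DOS/Windows executable (possibly PE)"),
    (b"\x7fELF", "ELF binary"),
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "ZIP archive (or a ZIP-based format)"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "MS Office CFB document"),
]

def guess_type(data: bytes) -> str:
    """Return a label for the first signature that matches the start of data."""
    for signature, label in MAGIC_SIGNATURES:
        if data.startswith(signature):
            return label
    return "unknown"

print(guess_type(b"%PDF-1.7\n"))
print(guess_type(b"\x7fELF\x02\x01\x01"))
print(guess_type(b"hello world"))
```

A match on the leading bytes is only a hint about the intended format; nothing stops a file from carrying a misleading signature.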
ExifTool
The ExifTool utility is a Perl framework written by Phil Harvey. It was originally designed to extract the EXIF content that image files embed (camera, location, and other metadata), but the author decided that this concept was broadly applicable to many more file formats than images.
This tool unfortunately did not get installed on the Kali VM, so you may wish to install it now with the following command:
apt install -y exiftool
In the lecture, I use this to dump out even more metadata about the EXE than file does, such as the
compilation timestamp, the version of the linker that I used (linkers organize compiled objects
into OS-native executable files), and the minimum compatible version of the Windows OS. It is
important to note that this last item is not going to guarantee that the EXE is compatible with
all DLLs from that version of Windows, but mainly that the Windows kernel will understand the
file layout.
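These values are plain fields at fixed positions in the PE headers, so they can be read directly. The sketch below is my own illustration and not how ExifTool is implemented; the field offsets follow the published PE/COFF layout. The helper build_sample_header is hypothetical and produces a synthetic, mostly zero-filled header (not a runnable executable) so the example is self-contained.

```python
import struct
from datetime import datetime, timezone

def read_pe_metadata(data: bytes) -> dict:
    """Read the timestamp, linker version, and minimum OS version from a PE."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # The DOS header stores the offset of the PE signature at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    coff = pe_offset + 4      # 20-byte COFF file header
    optional = coff + 20      # optional header follows immediately
    (timestamp,) = struct.unpack_from("<I", data, coff + 4)
    linker_major, linker_minor = struct.unpack_from("<BB", data, optional + 2)
    os_major, os_minor = struct.unpack_from("<HH", data, optional + 40)
    return {
        "compiled": datetime.fromtimestamp(timestamp, tz=timezone.utc),
        "linker_version": f"{linker_major}.{linker_minor}",
        "os_version": f"{os_major}.{os_minor}",
    }

def build_sample_header() -> bytes:
    """Hypothetical synthetic header with made-up values, for demonstration."""
    header = bytearray(0x40 + 4 + 20 + 44)
    header[0:2] = b"MZ"
    struct.pack_into("<I", header, 0x3C, 0x40)           # PE signature offset
    header[0x40:0x44] = b"PE\0\0"
    struct.pack_into("<I", header, 0x44 + 4, 1577836800)  # TimeDateStamp
    struct.pack_into("<BB", header, 0x58 + 2, 14, 0)      # linker version
    struct.pack_into("<HH", header, 0x58 + 40, 6, 0)      # minimum OS version
    return bytes(header)

metadata = read_pe_metadata(build_sample_header())
print(metadata["compiled"].isoformat(), metadata["linker_version"], metadata["os_version"])
```

Keep in mind that the timestamp is just a number written by the build tooling, so it can be altered or zeroed and should be treated as a lead rather than proof.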
Objdump
The objdump utility is part
of the binutils package, which is a bundle of tools used in Linux/UNIX systems for working
with many core binary file types. The objdump utility is designed to be a full metadata analysis
and reporting tool for executable files. On most systems, the objdump utility can be extended
to support analysis of multiple executable file types through the installation of additional packages
or modules. This is the case on our Kali VM, which gives us the ability to use objdump to analyze
native Linux binaries, as well as native Windows binaries.
In the lecture, I demonstrate using it to perform the following analyses.
Use -f to dump the terse basic file header metadata:
objdump -f filename.exe
Use -x to dump a verbose listing of the executable’s file structure. This reports all of the sections of
the executable and, where objdump can interpret the section metadata, it reports that out too.
Using this, we navigate to the Sections: part of the output, and we deep dive on how this
explains to the Windows kernel how to organize the sections of the file from disk into system RAM.
It is important to note that this grants the flexibility for the in-memory organization of any
file to deviate from the on-disk layout.
objdump -x filename.exe
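To illustrate how the section metadata separates the on-disk layout from the in-memory layout, here is a sketch that decodes PE section headers, which are 40-byte records. The function name parse_sections and the two sample entries are my own inventions with made-up values; a real section table would be read from the file just after the optional header.

```python
import struct

# One PE section header: name, virtual size, virtual address, raw size,
# raw pointer, three legacy pointers/counters, and the characteristics flags.
SECTION_HEADER = struct.Struct("<8sIIIIIIHHI")
IMAGE_SCN_MEM_EXECUTE = 0x20000000

def parse_sections(table: bytes, count: int) -> list:
    """Decode count section headers from a raw section table."""
    sections = []
    for index in range(count):
        fields = SECTION_HEADER.unpack_from(table, index * SECTION_HEADER.size)
        name, virtual_size, virtual_address, raw_size, raw_pointer = fields[:5]
        flags = fields[9]
        sections.append({
            "name": name.rstrip(b"\0").decode("ascii", "replace"),
            "file_offset": raw_pointer,          # where the bytes sit on disk
            "file_size": raw_size,
            "memory_address": virtual_address,   # where they are mapped (RVA)
            "memory_size": virtual_size,
            "executable": bool(flags & IMAGE_SCN_MEM_EXECUTE),
        })
    return sections

# Made-up table with two sections, for demonstration only.
sample_table = (
    SECTION_HEADER.pack(b".text", 0x1A00, 0x1000, 0x1C00, 0x400, 0, 0, 0, 0, 0x60000020)
    + SECTION_HEADER.pack(b".data", 0x3000, 0x3000, 0x200, 0x2000, 0, 0, 0, 0, 0xC0000040)
)

sections = parse_sections(sample_table, 2)
for section in sections:
    print(section)
```

In this sample, .data occupies 0x200 bytes on disk but 0x3000 bytes once mapped, which is one way the in-memory organization can differ from the file.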
Disassembly and Source Listing
Using the -d/-D and -S arguments, objdump can be told to disassemble the file and, in
the case of -S, can also display C or C++ source code if it is available (which requires
debugging information). The -d option limits disassembly to sections that are marked executable
in the file’s headers, while the -D option disassembles all sections of the file. The latter can be
helpful if the file attempts to conceal some of the code on disk inside data marked non-executable
(to be changed at run-time).
objdump -d filename.exe
Strings
The strings tool is also part of the binutils package. This utility scans the file from beginning to end and
attempts to discover strings that would be encoded using standard conventions, such as a
sequence of human-readable characters followed by the \0 (NUL) byte (\x00). The strings
utility can be told to change its behavior to filter only to longer strings, and
also can identify a number of different string encodings, such as the UTF-16 that is popular
on Windows.
To show only strings of 6 characters or more (from lecture):
strings -n 6 filename.exe
To show any UTF-16 “Little Endian” strings, again with minimum length 6 (this is very handy for Windows binaries, as many of them have UTF-16 string contents):
strings -n 6 -e l filename.exe
If you want strings to also report the offset (within the file on disk) of each string,
you may use the -t x option, which will report this offset from the beginning of the file
in hexadecimal.
strings -n 6 -e l -t x filename.exe
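The scan that strings performs can be approximated with a regular expression. The sketch below is my own simplification, not the binutils implementation: it only treats bytes 0x20 through 0x7E as printable, and in UTF-16 mode it only finds characters whose high byte is zero. The function name find_strings and the sample buffer are made up for illustration.

```python
import re

def find_strings(data: bytes, min_length: int = 4, utf16le: bool = False) -> list:
    """Return (offset, text) pairs for runs of printable characters."""
    # One character is a printable byte, followed by a zero byte in UTF-16LE.
    unit = rb"[\x20-\x7e]" + (rb"\x00" if utf16le else b"")
    pattern = re.compile(b"(?:" + unit + b"){%d,}" % min_length)
    encoding = "utf-16-le" if utf16le else "ascii"
    return [(match.start(), match.group().decode(encoding))
            for match in pattern.finditer(data)]

# Made-up buffer holding one ASCII string and one UTF-16LE string.
sample = (
    b"\x00\x01kernel32.dll\x00\xff"
    + "CreateProcessW".encode("utf-16-le")
    + b"\x00\x00"
)

# Comparable in spirit to: strings -n 6 -t x
for offset, text in find_strings(sample, 6):
    print(f"{offset:x} {text}")

# Comparable in spirit to: strings -n 6 -e l -t x
for offset, text in find_strings(sample, 6, utf16le=True):
    print(f"{offset:x} {text}")
```

Running both passes matters on Windows binaries, since a string stored as UTF-16 is invisible to the single-byte scan.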
PDF File Format
The following link takes you to content I’ve assembled discussing PDF files and the PDF file format. One interesting characteristic of PDF files is that they are, largely, a plain text source-code-like file structure. PDF even has its own “programming” language that is based on PostScript.
Yes, this is the same link as under the Example Structured File Formats list at the top of this page.