1 February 2021

Basic Static Analysis of Malware

by Coleman Kane

Introduction

As discussed earlier in the course, static analysis is the process of analyzing malware “at rest”, extracting identifying features and other characteristics from the tool without actually executing it. The benefit is that you can identify capabilities and identifiers that you might not observe by interacting with the running malware. You can also identify detection features that are useful for discovering the sample, even in situations where an intrusion has not yet occurred, has only partially succeeded, or has failed.

Container Model of Files

When discussing malware, it is important to recognize that we’re ultimately talking about files in a computer. Most of the time, these are files on disk; sometimes they’re files or file fragments that are memory-resident. Sticking with the files-on-disk scenario, your hard drive itself can be considered a container of files. The files themselves can also be viewed as containers, and I find this to be a very useful model when approaching malware analysis. At the most basic level, most of you are familiar with the concept of a file being a container of bytes.

with open('file.bin', 'rb') as input_file:
    file_bytes = input_file.read()

The above snippet of code opens the file file.bin, as if it were a box, and reads all of its bytes, from beginning to end, copying them into a bytes object named file_bytes (input_file is only the handle to the opened file). You can then pick and choose individual bytes, or byte ranges, using the index operator ([]) in Python. This would not have been possible using the file name alone - the file first had to be opened, so even Python treats the file as a container that must be opened to retrieve the bytes it contains.
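
As a small illustration of picking bytes out of that container, the sketch below indexes and slices a bytes object. The sample bytes are made up for the example and stand in for whatever file.bin would actually contain.

```python
# Stand-in for the bytes read from a file; these sample values are made up.
file_bytes = b"MZ\x90\x00\x03\x00\x00\x00hello"

first_byte = file_bytes[0]   # indexing a bytes object yields an int
magic = file_bytes[0:2]      # slicing yields a new bytes object
tail = file_bytes[-5:]       # negative indexes count from the end

print(first_byte, magic, tail)  # prints: 77 b'MZ' b'hello'
```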

However, most files on a computer aren’t merely containers of bytes. Most files have defined structure to them, and thus are containers of higher level data structures - even complete other files. One very obvious example would be the ZIP File. You’re likely familiar with these, and you even had to use one if you pulled the Windows7 VM from modern.IE. The basic concept is that the ZIP file contains a bunch of other files, with the added wrinkle that these files are passed through a compression algorithm, to create a new container of bytes representing the file in its compressed form. So the ZIP file contains the information to tell your computer how to recreate the original file (as long as you have a compatible unzip utility). This information is effectively stored as a (typically) smaller collection of bytes within the ZIP archive.
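
The container idea can be sketched with Python's standard zipfile module. The member name notes.txt and its contents are invented for illustration, and the archive is built in memory so nothing on disk is needed.

```python
import io
import zipfile

# Build a small ZIP archive in memory (the member name is hypothetical).
payload = b"A" * 1000  # highly repetitive data, so it compresses well
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("notes.txt", payload)

# Reopen the archive and look inside the container.
with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
    info = archive.getinfo("notes.txt")
    restored = archive.read("notes.txt")  # decompresses back to the original

print(info.filename, info.file_size, info.compress_size)
```

The compressed size reported for the member should be much smaller than its original 1000 bytes, while reading the member returns the original bytes unchanged.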

Example Structured File Formats

Some other structured file examples:

PNG Image files
Microsoft Portable Executables (EXE/DLL) - this one will be particularly important for malware analysis, as many malware tools come as PE files, for Windows.
MS Office CFB - file format used for old “legacy” MSOffice documents, and still used today for embedding components into current versions
PDF File format
ELF Binary file - the executable file format used by almost all Linux distributions, and most other UNIX variants.

Oftentimes a particular file format ends up simply being a special case of another common file format. One perfect example of this is the current family of file formats saved by MS Office (docx, pptx, xlsx, etc.). Another great example is the APK file format, which is derived from the JAR file format used to package Java applications. All of these are actually just special variations of the ZIP file format described earlier in this section, each with some metadata constraints as well as archive type definitions that require the presence of certain named files within the archive for it to be considered valid.
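
A rough sketch of that idea: open the file as a ZIP and look for member names that the derived formats conventionally carry. The function names and the particular checks are assumptions chosen for illustration, and this is only a heuristic, not a validator for any of these formats.

```python
import io
import zipfile

def classify_zip(data: bytes) -> str:
    """Guess a ZIP-derived format from member names (heuristic only)."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return "not a ZIP archive"
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = set(archive.namelist())
    if "[Content_Types].xml" in names:
        if any(name.startswith("word/") for name in names):
            return "Office Open XML document (docx-like)"
        return "Office Open XML package"
    if "AndroidManifest.xml" in names:
        return "Android APK"
    if "META-INF/MANIFEST.MF" in names:
        return "Java JAR"
    return "plain ZIP archive"

def make_zip(member_names) -> bytes:
    """Helper for the demo: build an in-memory ZIP with placeholder members."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in member_names:
            archive.writestr(name, b"placeholder")
    return buffer.getvalue()

print(classify_zip(make_zip(["[Content_Types].xml", "word/document.xml"])))
print(classify_zip(make_zip(["AndroidManifest.xml", "classes.dex"])))
print(classify_zip(make_zip(["readme.txt"])))
```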

EXE Files

In 2020, I put together a lecture that covers some basic analysis of an EXE using common tools. I also introduced the Ghidra application for performing some initial static analysis, by mapping the output of these Linux utilities to the data panels of Ghidra.

This lecture walked through a number of purpose-built utilities common to most Linux systems, and installed on the Kali VM image I provided to all of you students in class. We use these as building blocks toward the main view of Ghidra, which can be overwhelming to the initial observer. Often I find it is very helpful to have these repeatable examples of the single or limited use tools to demonstrate recipes for getting specific data, and then follow these up with introducing a multi-purpose environment such as Ghidra that can demonstrate how stitching these data sets together can provide a lot of power.

Hexdump

Initially, I begin with the hexdump utility. Some great examples and documentation are available here.

My favorite invocation of hexdump uses the -C option. This gives a 16-byte-wide hexadecimal dump, as well as a preview of the raw text (with unprintable characters replaced) on the right. This lets you see the numeric representation while also scanning the raw data for human-readable content or other patterns that are easier to spot in a denser view.

hexdump -C filename.exe
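
To make the layout of that output concrete, here is a minimal Python sketch that approximates it: an 8-digit hexadecimal offset, two groups of eight bytes, and a text preview with unprintable bytes shown as dots. The function name hexdump_c is made up, and details such as the real tool's collapsing of repeated lines are not reproduced.

```python
def hexdump_c(data: bytes, width: int = 16) -> str:
    """Approximate the canonical hex+ASCII layout of `hexdump -C`."""
    half = width // 2
    pad = half * 3 - 1  # printed width of one full group of hex bytes
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        left = " ".join(f"{b:02x}" for b in chunk[:half])
        right = " ".join(f"{b:02x}" for b in chunk[half:])
        # Replace anything outside printable ASCII with a dot.
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {left:<{pad}}  {right:<{pad}}  |{text}|")
    lines.append(f"{len(data):08x}")  # final line: total length as an offset
    return "\n".join(lines)

sample = b"MZ\x90\x00" + b"\x00" * 12  # made-up bytes resembling the start of an EXE
print(hexdump_c(sample))
```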

Less

The GNU less utility is used a lot in the beginning of this lecture to control the output of the other commands, allowing me to page through the data. The tool is similar to the better-known more command, which has a variant on Windows systems, too. I happen to favor the less tool, and a common slogan is “less is more”, a play on the fact that less was written as a successor to more and has additional features beyond what’s offered in more.

During the lecture, I use the spacebar to page through the file, and PageUp/Down are supported as well. Additionally, if I want to search through the entire buffer, I can type the / character, followed by the text I want to search. I use this feature in the lecture, and it is what enables me to skip around the file using numeric addresses.

Some more documentation here: Unix Less Command: 10 Tips for Effective Navigation

File

The file command is built in to pretty much every Linux and BSD variant. It is built around libmagic, which is a library that can perform metadata analysis based upon arbitrary file structure information stored in a “magic database”.

In the lecture, I use the following to dump out a brief list of file type and intended platform:

file filename.exe
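
The core idea behind a magic database can be sketched as a table of leading byte signatures. The table below covers only the formats listed earlier on this page and checks only the first bytes of the data; the real libmagic database is far richer, so treat this as a toy illustration with a made-up function name.

```python
# (signature, description) pairs; only leading bytes are checked here.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"MZ", "DOS/PE executable (EXE/DLL)"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "MS Office CFB"),
    (b"%PDF-", "PDF document"),
    (b"\x7fELF", "ELF binary"),
    (b"PK\x03\x04", "ZIP archive (also docx/pptx/xlsx, JAR, APK)"),
]

def guess_format(data: bytes) -> str:
    """Return a description for the first signature that matches."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"

print(guess_format(b"MZ\x90\x00"))
print(guess_format(b"%PDF-1.7\n"))
print(guess_format(b"\x00\x01\x02"))
```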

ExifTool

The ExifTool utility is a Perl framework written by Phil Harvey. It was originally designed to extract the EXIF content that image files embed, such as camera, location, and other metadata, but the author decided that this concept was broadly applicable to many more file formats than images.

This tool unfortunately did not get installed on the Kali VM, so you may wish to install it now with the following command:

apt install -y exiftool

In the lecture, I use this to dump out even more metadata about the EXE than file, such as the compilation timestamp, the version of the linker that I used (linkers organize compiled objects into OS-native executable files), and the minimum compatible versions of the Windows OS. It is important to note that this last item is not going to guarantee that the EXE is compatible with all DLLs from that version of Windows, but mainly that the Windows kernel will understand the file layout.
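
The fields mentioned above live at fixed positions in the PE headers, so they can also be read by hand. The sketch below follows the published PE/COFF layout (the PE header offset at 0x3C, then the COFF header, then the optional header); the function names are made up, and the header it parses is synthetic, built in the code with invented values, rather than taken from a real EXE.

```python
import struct
from datetime import datetime, timezone

def read_pe_metadata(data: bytes) -> dict:
    """Read a few header fields from the start of a PE file."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)  # e_lfanew
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    coff = pe_offset + 4
    _machine, _n_sections, timestamp = struct.unpack_from("<HHI", data, coff)
    optional = coff + 20  # the COFF file header is 20 bytes long
    _magic, link_major, link_minor = struct.unpack_from("<HBB", data, optional)
    os_major, os_minor = struct.unpack_from("<HH", data, optional + 40)
    return {
        "compiled": datetime.fromtimestamp(timestamp, tz=timezone.utc),
        "linker_version": f"{link_major}.{link_minor}",
        "min_os_version": f"{os_major}.{os_minor}",
    }

def build_sample_header() -> bytes:
    """Synthetic header with invented values, just enough for the parser."""
    dos = bytearray(64)
    dos[0:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 64)  # PE header follows the DOS header
    coff = struct.pack("<HHIIIHH", 0x014C, 0, 1612137600, 0, 0, 96, 0x0102)
    optional = bytearray(96)
    struct.pack_into("<HBB", optional, 0, 0x10B, 14, 0)  # PE32, linker 14.0
    struct.pack_into("<HH", optional, 40, 6, 1)          # OS version 6.1
    return bytes(dos) + b"PE\x00\x00" + coff + bytes(optional)

metadata = read_pe_metadata(build_sample_header())
for key, value in metadata.items():
    print(key, value)
```

Keep in mind that all of these fields are written by the build toolchain and can be altered afterwards, so they are leads to follow up on rather than proof.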

Objdump

The objdump utility is part of the binutils package, which is a bundle of tools used in Linux/UNIX systems for working with many core binary file types. The objdump utility is designed to be a full metadata analysis and reporting tool for executable files. On most systems, the objdump utility can be extended to support analysis of multiple executable file types through the installation of additional packages or modules. This is the case on our Kali VM, which gives us the ability to use objdump to analyze native Linux binaries, as well as native Windows binaries.

In the lecture, I demonstrate using it to perform the following analyses.

Use -f to dump the terse basic file header metadata:

objdump -f filename.exe

Use -x to dump a verbose list of executable file structure. This reports all of the sections of the executable, and where objdump can interpret the section metadata, it reports that out too. Using this, we navigate to the Sections: part of the output, and we deep dive on how this explains to the Windows kernel how to organize the sections of the file from disk into system RAM. It is important to note that this grants the flexibility for the in-memory organization of any file to deviate from the on-disk layout.

objdump -x filename.exe
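
The disk-versus-memory distinction in the section table can be sketched by parsing the 40-byte section headers directly, following the published PE/COFF layout. The image below is synthetic: the section names, sizes, and addresses are invented, and the optional header is omitted for brevity, which a real EXE would not do.

```python
import struct

IMAGE_SCN_MEM_EXECUTE = 0x20000000  # section flag: contents may be executed

def parse_sections(data: bytes) -> list:
    """Return on-disk and in-memory placement for each PE section."""
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    coff = pe_offset + 4
    (n_sections,) = struct.unpack_from("<H", data, coff + 2)
    (opt_size,) = struct.unpack_from("<H", data, coff + 16)
    table = coff + 20 + opt_size  # section table follows the optional header
    sections = []
    for index in range(n_sections):
        entry = table + index * 40
        name, vsize, vaddr, raw_size, raw_ptr = struct.unpack_from(
            "<8sIIII", data, entry)
        (flags,) = struct.unpack_from("<I", data, entry + 36)
        sections.append({
            "name": name.rstrip(b"\x00").decode("ascii", "replace"),
            "virtual_address": vaddr,  # where it is placed in memory (RVA)
            "virtual_size": vsize,     # how much memory it occupies
            "raw_offset": raw_ptr,     # where its bytes start in the file
            "raw_size": raw_size,      # how many bytes are stored on disk
            "executable": bool(flags & IMAGE_SCN_MEM_EXECUTE),
        })
    return sections

def build_sample_image() -> bytes:
    """Synthetic headers with two invented sections."""
    dos = bytearray(64)
    dos[0:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 64)
    coff = struct.pack("<HHIIIHH", 0x014C, 2, 0, 0, 0, 0, 0x0102)
    text = struct.pack("<8sIIIIIIHHI", b".text", 0x1234, 0x1000,
                       0x1400, 0x400, 0, 0, 0, 0, 0x60000020)
    data = struct.pack("<8sIIIIIIHHI", b".data", 0x3000, 0x3000,
                       0x200, 0x1800, 0, 0, 0, 0, 0xC0000040)
    return bytes(dos) + b"PE\x00\x00" + coff + text + data

sections = parse_sections(build_sample_image())
for section in sections:
    print(section)
```

In the sample, .data stores 0x200 bytes on disk but occupies 0x3000 bytes in memory, which is the kind of difference between the two layouts that the Sections: output describes.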

Disassembly and Source Listing

Using the -d/-D and -S arguments, objdump can be told to disassemble the file and, in the case of -S, can also display C or C++ source code intermixed with the disassembly if it is available. The -d option limits disassembly to sections that are expected to contain code, while the -D option disassembles all sections of the file. The -D option can be helpful if the file attempts to conceal some of its code on disk inside data marked non-executable (to be changed at run-time).

objdump -d filename.exe

Strings

The strings tool is also part of the binutils package. This utility scans the file from beginning to end and attempts to discover strings that are encoded using standard conventions, such as a sequence of human-readable characters followed by the NUL byte (\0, or \x00). The strings utility can be told to report only strings of at least a given length, and it can also identify a number of different string encodings, such as the UTF-16 that is popular on Windows.

To show only strings that are 6 characters or longer (from lecture):

strings -n 6 filename.exe

To show any UTF-16 “Little Endian” strings, again with minimum length 6, add the -e l option. This is very handy for Windows binaries, as many of them have UTF-16 string contents:

strings -n 6 -e l filename.exe

If you want strings to also report the offset (within the file on disk) of each string, you may use the -t x option, which will report this offset from the beginning of the file in hexadecimal.

strings -n 6 -e l -t x filename.exe
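
What these invocations do can be approximated in a few lines of Python with regular expressions. This is a sketch of the technique rather than a reimplementation of strings: the function name find_strings is made up, only the printable ASCII range is treated as text, and the sample bytes are invented for the demo.

```python
import re

def find_strings(data: bytes, min_len: int = 6, utf16le: bool = False) -> list:
    """Return (offset, text) pairs for printable runs of at least min_len."""
    if utf16le:
        # Printable ASCII characters, each followed by a zero high byte.
        pattern = rb"(?:[\x20-\x7e]\x00){%d,}" % min_len
    else:
        pattern = rb"[\x20-\x7e]{%d,}" % min_len
    results = []
    for match in re.finditer(pattern, data):
        raw = match.group()
        text = raw.decode("utf-16-le") if utf16le else raw.decode("ascii")
        results.append((match.start(), text))
    return results

# Made-up sample: one ASCII string, one short run, one UTF-16LE string.
sample = (b"\x00\x01kernel32.dll\x00abc\x01"
          + "C:\\Windows".encode("utf-16-le") + b"\x00\x00")

for offset, text in find_strings(sample):
    print(f"{offset:x} {text}")   # similar in spirit to: strings -n 6 -t x
for offset, text in find_strings(sample, utf16le=True):
    print(f"{offset:x} {text}")   # similar in spirit to: strings -n 6 -e l -t x
```

The short run abc is dropped by the minimum length, and the UTF-16 string is only found in the second pass, which is why it is worth running both encodings against Windows binaries.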

PDF File Format

The following link takes you to content I’ve assembled discussing PDF files and the PDF file format. One interesting characteristic of PDF files is that they are, largely, a plain text source-code-like file structure. PDF even has its own “programming” language that is based on PostScript.

PDF File Analysis

Yes, this is the same link as under the Example Structured File examples list at the top of this page.

CS6038/CS5138 Malware Analysis, UC