CS6038/CS5138 Malware Analysis, UC

Course content for UC Malware Analysis

6 April 2021

Ghidra Scripting

by Coleman Kane

LAB04: Ghidra Scripting

The Ghidra scripting lectures should have familiarized you with using Ghidra to script analysis work. I have collected the lecture examples, with some additional information, into the following GitHub repository:

Feel free to reuse any of the examples above, as well as others linked in the lectures or found online, when building your solution.

You will have two objectives with this assignment:

  1. Build a Ghidra script (Python or Java) that walks through the functions in a file and reports the frequency counts of the bi-grams and tri-grams of the P-Code instructions in the provided artifact
  2. Write a script or wrapper program that goes through a directory, identifies which files are executables or DLLs, imports them into Ghidra, makes sure they get analyzed, and then executes the script built in #1
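As a starting point for objective #1, the following is a minimal sketch of the function walk, written for Ghidra's Python (Jython) scripting environment. It assumes the standard `currentProgram` variable that Ghidra provides to scripts; the helper name `function_pcode_mnemonics` is made up for this illustration, and the sketch only gathers the P-Code mnemonics per function, leaving the n-gram counting and JSON output to you.

```python
# Sketch of a Ghidra script (Jython). The function only needs an object that
# behaves like a Ghidra Program, so it can also be exercised outside Ghidra.

def function_pcode_mnemonics(program):
    """Return a list of (function name, [P-Code mnemonic, ...]) pairs."""
    listing = program.getListing()
    results = []
    # getFunctions(True) iterates over the functions in address order.
    for func in program.getFunctionManager().getFunctions(True):
        mnemonics = []
        # Visit each machine instruction in the function body, in order.
        for instr in listing.getInstructions(func.getBody(), True):
            # One machine instruction can expand to several P-Code ops.
            for op in instr.getPcode():
                mnemonics.append(op.getMnemonic())
        results.append((func.getName(), mnemonics))
    return results


# Ghidra supplies currentProgram to scripts; elsewhere this block is skipped.
if globals().get("currentProgram") is not None:
    for name, ops in function_pcode_mnemonics(globals()["currentProgram"]):
        print("%s: %d P-Code ops" % (name, len(ops)))
```

Returning a list of pairs rather than a dictionary keeps functions that happen to share a name from overwriting each other.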

Your program/script in #2 should be able to walk through the directory, find the executable files, and ensure that the JSON results for each one are stored in a file named after the related binary, with a .json extension added. For instance, if your input artifact is VirusShare_e1b6940985a23e5639450f8391820655, then you’ll want to name the result VirusShare_e1b6940985a23e5639450f8391820655.json. If the input already has an extension, such as file.exe, it is entirely fine to save the output as file.exe.json.
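One way the wrapper for objective #2 could be organized is sketched below. Everything named here is an assumption made for illustration: the function names, the temporary project name `lab04_tmp`, and the choice to pass the output path to the Ghidra script as a script argument (which the script would read with `getScriptArgs()`). The `MZ` check is only a heuristic for spotting EXE and DLL files, and on Windows the launcher is `analyzeHeadless.bat` rather than `analyzeHeadless`.

```python
import os
import subprocess


def looks_like_pe(path):
    """Heuristic: EXE and DLL files start with the two-byte 'MZ' magic."""
    with open(path, "rb") as handle:
        return handle.read(2) == b"MZ"


def json_output_path(artifact_path, out_dir):
    """Name the result after the binary, with '.json' appended."""
    return os.path.join(out_dir, os.path.basename(artifact_path) + ".json")


def headless_command(ghidra_dir, project_dir, artifact_path,
                     script_dir, script_name, out_path):
    """Build an analyzeHeadless command line for one artifact."""
    return [
        os.path.join(ghidra_dir, "support", "analyzeHeadless"),
        project_dir, "lab04_tmp",          # project location and name
        "-import", artifact_path,          # import is followed by auto-analysis
        "-scriptPath", script_dir,
        "-postScript", script_name, out_path,  # runs after analysis completes
        "-deleteProject",                  # discard the temporary project
    ]


def analyze_directory(ghidra_dir, project_dir, sample_dir,
                      script_dir, script_name, out_dir):
    """Run the Ghidra script on every PE-looking file in sample_dir."""
    for name in sorted(os.listdir(sample_dir)):
        path = os.path.join(sample_dir, name)
        if os.path.isfile(path) and looks_like_pe(path):
            out_path = json_output_path(path, out_dir)
            subprocess.check_call(headless_command(
                ghidra_dir, project_dir, path,
                script_dir, script_name, out_path))
```

Nothing runs until `analyze_directory` is called with the paths for your own Ghidra installation, sample folder, and script location.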

Submit a ZIP file that contains:

  1. Your Ghidra script
  2. Your wrapper program or shell script that executes Ghidra analyzeHeadless
  3. Instructions on running the #2 program in a folder
  4. A directory containing the JSON results

Bi/Tri-grams

In data analysis, we often operate on data extracted as “n-grams”, which represent fixed-length sequences of adjacent entities. Bi-grams represent adjacent pairs, while tri-grams represent adjacent triples. It is important to note how these overlap. Suppose you have 4 adjacent operations such as those below:

COPY
LOAD
INT_ADD
CALL

The above will evaluate to 3 bi-grams and 2 tri-grams:

{
 "bi-grams": [
  ["COPY", "LOAD"],
  ["LOAD", "INT_ADD"],
  ["INT_ADD", "CALL"]
 ],
 "tri-grams": [
  ["COPY", "LOAD", "INT_ADD"],
  ["LOAD", "INT_ADD", "CALL"]
 ]
}
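The sliding-window idea behind that result can be sketched in a few lines of Python. The helper name `ngrams` is just a label chosen for this example, not something from the lectures.

```python
def ngrams(items, n):
    """Return every run of n adjacent items, in order, as a list of lists."""
    # A sequence of length L has L - n + 1 windows of size n (none if L < n).
    return [list(items[i:i + n]) for i in range(len(items) - n + 1)]


ops = ["COPY", "LOAD", "INT_ADD", "CALL"]
result = {"bi-grams": ngrams(ops, 2), "tri-grams": ngrams(ops, 3)}
```

Here `result` holds the same 3 pairs and 2 triples listed above, because each window starts one position after the previous one and so overlaps it.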

For our case, consider the following P-Code:

CF = INT_LESS ESP, 28:4
OF = INT_SBORROW ESP, 28:4
ESP = INT_SUB ESP, 28:4
SF = INT_SLESS ESP, 0:4
ZF = INT_EQUAL ESP, 0:4

$U4a0:4 = INT_ADD 32:4, ESP
$U1770:4 = LOAD ram($U4a0)
EAX = COPY $U1770

$U1770:4 = COPY EAX
STORE ram(ESP), $U1770

ESP = INT_SUB ESP, 4:4
STORE ram(ESP), 0x4014df:4
CALL *[ram]0x406f30:4

CF = COPY 0:1
OF = COPY 0:1
$U8cd0:4 = INT_AND EAX, EAX
SF = INT_SLESS $U8cd0, 0:4
ZF = INT_EQUAL $U8cd0, 0:4

The above contains 18 P-Code operations. Treated as one continuous sequence, N operations yield N-1 bi-grams and N-2 tri-grams, so a bi-gram analysis would yield 17 pairs of adjacent P-Code instructions, and a tri-gram analysis would yield 16 adjacent triples.
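To check those numbers and see what frequency counts look like, here is a small sketch that counts n-grams over the opcode sequence from the listing above, read top to bottom as one sequence. The JSON layout (n-gram names joined with spaces as keys) is only an assumption for illustration, since the assignment does not prescribe a format.

```python
import json
from collections import Counter

# Opcode mnemonics from the P-Code listing above, in order.
OPS = [
    "INT_LESS", "INT_SBORROW", "INT_SUB", "INT_SLESS", "INT_EQUAL",
    "INT_ADD", "LOAD", "COPY",
    "COPY", "STORE",
    "INT_SUB", "STORE", "CALL",
    "COPY", "COPY", "INT_AND", "INT_SLESS", "INT_EQUAL",
]


def ngram_counts(ops, n):
    """Count how often each run of n adjacent mnemonics occurs."""
    return Counter(tuple(ops[i:i + n]) for i in range(len(ops) - n + 1))


bigrams = ngram_counts(OPS, 2)
trigrams = ngram_counts(OPS, 3)

# JSON object keys must be strings, so join each n-gram with spaces.
report = json.dumps({
    "bi-grams": {" ".join(k): v for k, v in bigrams.items()},
    "tri-grams": {" ".join(k): v for k, v in trigrams.items()},
}, indent=1, sort_keys=True)
```

The counts sum to 17 bi-grams and 16 tri-grams. Two bi-grams occur twice, `COPY COPY` and `INT_SLESS INT_EQUAL`, which is exactly the kind of repetition the frequency counts are meant to surface.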

Set of Sample Files

Below is a ZIP file containing the malware samples I will be testing against. Use this as your data set, but feel free to pull in other data sets as well.

Malware samples ZIP file:


tags: malware assignment