Ghidra Scripting
by Coleman Kane
LAB04: Ghidra Scripting
The Ghidra scripting lectures should have familiarized you with using Ghidra to script analysis work. I have collected these, along with some additional information, into the following GitHub repository:
Feel free to reuse any of the examples above, others linked in the lectures, or anything else you find online in building this.
You will have two objectives with this assignment:
- Build a Ghidra script (Python or Java) that walks through the functions in a file and reports the frequency counts of the bi-grams and tri-grams of the P-Code instructions in the provided artifact
- Build a script or wrapper program that goes through a directory, identifies which files are executables or DLLs, imports those into Ghidra, makes sure they get analyzed, and then executes the script built in #1
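As a starting point for the first objective, here is a minimal sketch of a Ghidra Python script. It is written under several assumptions that the assignment does not fix: n-grams do not span function boundaries, counts are aggregated across the whole program, each n-gram is stored as a space-joined string key, and the output path arrives as the first script argument. The helper names (count_ngrams, function_mnemonics, analyze) are made up for illustration. Instruction.getPcode() returns raw P-Code, which is the form shown in the listing later in this document.

```python
# Sketch of a Ghidra script: count P-Code bi-grams and tri-grams.
# Ghidra injects `currentProgram` and `getScriptArgs` into a script's globals.
import json
from collections import Counter


def count_ngrams(mnemonics, n, counter):
    """Add every run of n adjacent mnemonics to counter."""
    for i in range(len(mnemonics) - n + 1):
        counter[" ".join(mnemonics[i:i + n])] += 1


def function_mnemonics(program, func):
    """Return the raw P-Code mnemonics of one function, in address order."""
    listing = program.getListing()
    names = []
    for instr in listing.getInstructions(func.getBody(), True):
        for op in instr.getPcode():
            names.append(op.getMnemonic())
    return names


def analyze(program):
    """Aggregate n-gram counts over all functions (assumed output schema)."""
    bigrams, trigrams = Counter(), Counter()
    for func in program.getFunctionManager().getFunctions(True):
        names = function_mnemonics(program, func)
        count_ngrams(names, 2, bigrams)
        count_ngrams(names, 3, trigrams)
    return {"bi-grams": dict(bigrams), "tri-grams": dict(trigrams)}


if "currentProgram" in globals():  # true only when run inside Ghidra
    args = getScriptArgs()
    # Assumption: first script argument is the output path, if given.
    out_path = args[0] if args else currentProgram.getName() + ".json"
    with open(out_path, "w") as handle:
        json.dump(analyze(currentProgram), handle, indent=2, sort_keys=True)
```

Keeping the Ghidra-specific calls inside function_mnemonics means the counting logic can be exercised outside Ghidra with stand-in objects.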
Your program/script in #2 should be able to walk through the directory, find the EXE files, and ensure that the JSON results are stored in a file named after the related binary, with a .json extension added. For instance, if your input artifact is VirusShare_e1b6940985a23e5639450f8391820655, then you’ll want to name the result VirusShare_e1b6940985a23e5639450f8391820655.json. If the input already has an extension, such as file.exe, it is entirely fine to save the output as file.exe.json.
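One possible shape for the wrapper is sketched below in Python 3. It treats a file as an executable or DLL when it has an MZ header whose e_lfanew field points at a PE signature, then calls analyzeHeadless once per file. The options -import, -scriptPath, -postScript, and -deleteProject are analyzeHeadless options. The project name lab04_tmp, the function names, and the convention that the Ghidra script receives the output path as its first script argument are assumptions made for this sketch.

```python
import os
import subprocess
import sys
import tempfile


def is_pe_file(path):
    """Return True if the file has an MZ header pointing at a PE signature."""
    try:
        with open(path, "rb") as handle:
            header = handle.read(64)
            if len(header) < 64 or header[:2] != b"MZ":
                return False
            # e_lfanew (offset 0x3C) holds the file offset of the PE header.
            pe_offset = int.from_bytes(header[60:64], "little")
            handle.seek(pe_offset)
            return handle.read(4) == b"PE\0\0"
    except OSError:
        return False


def result_path(binary_path, out_dir):
    """Name the result after the binary, with .json appended."""
    return os.path.join(out_dir, os.path.basename(binary_path) + ".json")


def headless_command(analyze_headless, project_dir, binary_path,
                     script_path, out_path):
    """Build the analyzeHeadless command line for one binary."""
    return [
        analyze_headless, project_dir, "lab04_tmp",  # hypothetical project name
        "-import", binary_path,                      # auto-analysis runs by default
        "-scriptPath", os.path.dirname(os.path.abspath(script_path)),
        "-postScript", os.path.basename(script_path), out_path,
        "-deleteProject",                            # discard the project afterwards
    ]


def main(args):
    analyze_headless, script_path, sample_dir, out_dir = args
    os.makedirs(out_dir, exist_ok=True)
    for name in sorted(os.listdir(sample_dir)):
        path = os.path.abspath(os.path.join(sample_dir, name))
        if not (os.path.isfile(path) and is_pe_file(path)):
            continue
        out_path = os.path.abspath(result_path(path, out_dir))
        project_dir = tempfile.mkdtemp(prefix="ghidra_proj_")
        cmd = headless_command(analyze_headless, project_dir, path,
                               script_path, out_path)
        status = subprocess.run(cmd, check=False).returncode
        print("%s -> %s (exit %d)" % (name, out_path, status))


if __name__ == "__main__":
    if len(sys.argv) == 5:
        main(sys.argv[1:])
    else:
        print("usage: wrapper.py ANALYZE_HEADLESS SCRIPT SAMPLE_DIR OUT_DIR")
```

Checking file contents instead of the file name matters for this data set, because samples such as VirusShare_e1b6940985a23e5639450f8391820655 carry no .exe or .dll extension.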
Submit a ZIP file that contains:
- Your Ghidra script
- Your wrapper program or shell script that executes Ghidra analyzeHeadless
- Instructions for running the program from #2 against a folder
- A directory containing the JSON results
Bi/Tri-grams
In data analysis, we often operate on data extracted as “n-grams”, which are fixed-length sequences of adjacent entities. Bi-grams are adjacent pairs, while tri-grams are adjacent triples. It is important to note that n-grams overlap, so a sequence of k items yields k - n + 1 n-grams. For example, consider 4 adjacent operations such as below:
COPY
LOAD
INT_ADD
CALL
The above yields 3 bi-grams and 2 tri-grams:
{
  "bi-grams": [
    ["COPY", "LOAD"],
    ["LOAD", "INT_ADD"],
    ["INT_ADD", "CALL"]
  ],
  "tri-grams": [
    ["COPY", "LOAD", "INT_ADD"],
    ["LOAD", "INT_ADD", "CALL"]
  ]
}
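The sliding-window extraction above can be sketched in a few lines of Python. The function name ngrams is chosen here for illustration. It returns every run of n adjacent items as a tuple, and a Counter then turns the list into frequency counts.

```python
from collections import Counter


def ngrams(items, n):
    """Return all runs of n adjacent items, as tuples, in order."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


ops = ["COPY", "LOAD", "INT_ADD", "CALL"]
bigram_counts = Counter(ngrams(ops, 2))
trigram_counts = Counter(ngrams(ops, 3))
print(len(ngrams(ops, 2)), len(ngrams(ops, 3)))  # prints: 3 2
```

A sequence shorter than n simply produces an empty list, so very small functions need no special handling.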
For our case, consider the following P-Code:
CF = INT_LESS ESP, 28:4
OF = INT_SBORROW ESP, 28:4
ESP = INT_SUB ESP, 28:4
SF = INT_SLESS ESP, 0:4
ZF = INT_EQUAL ESP, 0:4
$U4a0:4 = INT_ADD 32:4, ESP
$U1770:4 = LOAD ram($U4a0)
EAX = COPY $U1770
$U1770:4 = COPY EAX
STORE ram(ESP), $U1770
ESP = INT_SUB ESP, 4:4
STORE ram(ESP), 0x4014df:4
CALL *[ram]0x406f30:4
CF = COPY 0:1
OF = COPY 0:1
$U8cd0:4 = INT_AND EAX, EAX
SF = INT_SLESS $U8cd0, 0:4
ZF = INT_EQUAL $U8cd0, 0:4
The above contains 18 P-Code operations. A bi-gram analysis would yield 17 pairs of adjacent P-Code instructions, and a tri-gram analysis yields 16 sets of adjacent triples.
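The counts above can be checked with a short Python sketch that works directly on the textual listing. The mnemonic helper is a simple assumption about the text format: it drops an optional "output =" prefix and takes the first remaining token. Note that the 17 bi-grams include two pairs that occur twice, (COPY, COPY) and (INT_SLESS, INT_EQUAL), so there are only 15 distinct bi-grams. This is why the assignment asks for frequency counts and not just a list.

```python
from collections import Counter

PCODE = """\
CF = INT_LESS ESP, 28:4
OF = INT_SBORROW ESP, 28:4
ESP = INT_SUB ESP, 28:4
SF = INT_SLESS ESP, 0:4
ZF = INT_EQUAL ESP, 0:4
$U4a0:4 = INT_ADD 32:4, ESP
$U1770:4 = LOAD ram($U4a0)
EAX = COPY $U1770
$U1770:4 = COPY EAX
STORE ram(ESP), $U1770
ESP = INT_SUB ESP, 4:4
STORE ram(ESP), 0x4014df:4
CALL *[ram]0x406f30:4
CF = COPY 0:1
OF = COPY 0:1
$U8cd0:4 = INT_AND EAX, EAX
SF = INT_SLESS $U8cd0, 0:4
ZF = INT_EQUAL $U8cd0, 0:4
"""


def mnemonic(line):
    """Extract the opcode name from one line of textual P-Code."""
    # Drop an optional "output = " prefix, then take the first token.
    return line.split("=", 1)[-1].split()[0]


ops = [mnemonic(line) for line in PCODE.splitlines() if line.strip()]
bigrams = list(zip(ops, ops[1:]))
trigrams = list(zip(ops, ops[1:], ops[2:]))
bigram_counts = Counter(bigrams)

print(len(ops), len(bigrams), len(trigrams))  # prints: 18 17 16
print(bigram_counts[("COPY", "COPY")])        # prints: 2
```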
Set of Sample Files
Below is a ZIP file containing the malware samples I will be testing against. Use this as your data set, but feel free to pull in other data sets as well.
Malware samples ZIP file: