CS6038/CS5138 Malware Analysis, UC

Course content for UC Malware Analysis

5 February 2020

Ghidra Intro

by Coleman Kane

The Ghidra reverse engineering suite consolidates a lot of functionality into a single tool. The second half of the lecture below goes on to explain very quickly some of the components in Ghidra, by relating them to the tools used in the previous post. If you’re using the Kali image I provided, you can open up the menu, navigate to section “07 - Reverse Engineering”, and select Ghidra from the list of utilities.

The following is the same embedded video that was posted for the 2020-02-04 entry, duplicated here for convenience.

The Ghidra Analysis UI

After following the steps discussed Tuesday, and recorded in the video above, you should be able to load an EXE into the project, and open it up to work with in Ghidra’s analysis view, and finally wait for the Ghidra automated analysis work to complete. While the analysis is running, you’ll see an animated hourglass in the lower-right-hand corner. There is a screenshot below which I have annotated to identify 4 of the key UI widgets that we worked with in class, using large bold red letters:

Annotated Ghidra Screenshot

A) EXE file section layout

If you recall from the video, the EXE file is actually broken into sections, and a data structure within the EXE file contains the information to tell Windows how to load these sections into memory. This is the listing of those sections, and can be used for jumping around the different sections of the file in the disassembly/listing view.

This data is the same data listed in the Sections: portion of the objdump -x filename.exe output.

B) Symbol/function explorer

The symbol/function explorer is one of the most powerful features in Ghidra. It provides a tree-like alphabetic listing of all of the symbols that were identified in the executable binary, once analyzed. These include variable names, function names, class implementations, and more. This includes both the names of functions & variables external to the program, as well as those which are included within this program.

The Import Table maintains a list of the symbols that this file references from external DLLs. On the command line, the output of objdump -x filename.exe includes the Import Table, listing every imported symbol along with the DLL (or other shared library type) in which it can be found.

The command objdump -t filename.exe will list all of the symbols referencing code or data within the program. Because it reads the symbol table that strip removes, it will only work with the original, unstripped version of revolution_backdoor_windows.exe.

One thing to keep in mind is that, if functions don’t have a symbol attached to them (such as if they’ve been removed using strip), then Ghidra will auto-assign a slightly-descriptive FUN_xxxxxxxx name, where xxxxxxxx will be the hexadecimal memory address of the beginning of the function.
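The naming scheme described above is simple enough to sketch in a few lines of Python. The helper below is hypothetical (it is not part of Ghidra’s API) and assumes a 32-bit program, where the address is zero-padded to 8 hexadecimal digits; the address used is just an example value:

```python
def default_function_name(entry_address: int) -> str:
    """Build a Ghidra-style placeholder name from a function's entry address.

    Hypothetical helper for illustration only; assumes a 32-bit address
    that is zero-padded to 8 lowercase hexadecimal digits.
    """
    return f"FUN_{entry_address:08x}"


print(default_function_name(0x401582))  # FUN_00401582
```

Because the name is derived purely from the address, it says nothing about what the function does, which is why renaming functions as you identify them is worthwhile.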

The example that I show in class uses a backdoor compiled from the following source code:

I have published a ZIPped copy of this binary to this link. The password for this archive is cs6038. You may want to open the source code for revolution_backdoor_windows.cpp in your browser, and then extract the EXE file into your Kali VM. Once there, you can remove the symbols using the following command (despite removing this data, the program will work just the same):

strip revolution_backdoor_windows.exe

Next, open the above file up in Ghidra, after stripping it, and you will find that you can no longer find the functions named main, isDirectoryExists, or StartsWith. Instead, you’ll find some branches of FUN_xxxxxxxx-named functions.

FUN_ names

Ghidra automatically assigns these unnamed functions new names, so you can more easily work with them in the tool. What’s more, Ghidra gives you the ability to change these names to anything you wish. In our case, let’s change the name of FUN_00401510 to the correct StartsWith function name.

If you right-click on the function name in the symbol navigator, you can choose the Edit Function… option (pressing the “F” key with the function selected is a shortcut). You’ll be presented with a dialog that allows you to modify the function’s prototype. Select the name of the function by double-clicking on FUN_00401510 in this dialog, and then simply type StartsWith over it, just as you would when editing an MS Word document. Next, click “OK” and Ghidra will propagate your changes throughout the project, updating all references to FUN_00401510 so that they now read StartsWith.

C) Disassembly view

The disassembly view is also called the “Listing” view, because it displays more than just disassembly. In parts of the file that store non-executable data, this Listing view attempts to depict the data and underlying data structures, where possible. As I demonstrated in the lecture, when data is encountered by the Listing view, it may provide you with duplicated presentations of the data so that you may view it in the different forms it may take within a program. Often this is helpful for analysis, as Ghidra may not be able to easily guess the best representation of the data.

Another thing that’s notable is that the Listing and the Decompiler views are locked to one another. So, if you place the cursor at a position within the listing that contains executable code, the corresponding function will be displayed in the decompiler view, and the line of code containing the selected instruction will be highlighted.

One important thing to note is that the assembly code displayed here uses the “Intel Syntax”, while the assembly code displayed with objdump -d or objdump -S uses the “AT&T Syntax”. In addition to slight differences in instruction and register names, the operand order (the order of the arguments the instruction acts upon) is reversed: Intel syntax puts the destination first, while AT&T syntax puts it last.

D) Decompiled view (reconstructed C/C++ code)

Ghidra has a unique capability in its high-end decompilation engine. By and large it does a great job decompiling your disassembly into readable C code, and often into code that can be recompiled. The secret to this is that Ghidra has a machine code translator built in that will convert any supported machine code into an intermediate pseudo-instruction set known as P-code. Some documentation of this is here:

For the example StartsWith function discussed above, the author’s code is here:

bool StartsWith(const char *a, const char *b)
{
   if(strncmp(a, b, strlen(b)) == 0) return 1;
   return 0;
}

The decompiled version extrapolated from the compiled machine code is here (including my small change from earlier):

uint __cdecl StartsWith(char *param_1,char *param_2)

{
  size_t _MaxCount;
  int iVar1;
  
  _MaxCount = strlen(param_2);
  iVar1 = strncmp(param_1,param_2,_MaxCount);
  return (uint)(iVar1 == 0);
}

Meanwhile, the 32-bit x86 assembly language looks like this:

    PUSH       EBP
    MOV        EBP,ESP
    SUB        ESP,0x18
    MOV        EAX,dword ptr [EBP + param_2]
    MOV        dword ptr [ESP]=>local_1c,EAX
    CALL       strlen

    MOV        dword ptr [ESP + local_14],EAX
    MOV        EAX,dword ptr [EBP + param_2]
    MOV        dword ptr [ESP + local_18],EAX
    MOV        EAX,dword ptr [EBP + param_1]
    MOV        dword ptr [ESP]=>local_1c,EAX
    CALL       strncmp

    TEST       EAX,EAX
    JNZ        LAB_00401542
    MOV        EAX,0x1

    JMP        LAB_00401547
LAB_00401542:
    MOV        EAX,0x0

LAB_00401547:
    LEAVE
    RET

The “Defined Strings” navigator

Defined Strings

I demonstrated using the “Defined Strings” view, which in the earlier Ghidra diagram is a tab hidden under the disassembly view marked with “D”. This view gives you a list of the strings identified in the file, using discovery logic similar to the strings command-line tool.
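To make the idea of strings-style discovery concrete, here is a minimal Python sketch. It is a simplification under stated assumptions, not Ghidra’s actual algorithm: it only looks for runs of at least 4 printable ASCII bytes (0x20-0x7e), and the function name and sample bytes are made up for illustration.

```python
import re


def find_ascii_strings(data: bytes, min_length: int = 4):
    """Yield (offset, text) for each run of printable ASCII bytes.

    Simplified model of what a strings-like scanner does; real tools
    also handle other encodings and extra whitespace characters.
    """
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    for match in pattern.finditer(data):
        yield match.start(), match.group().decode("ascii")


# Made-up sample bytes: two strings surrounded by non-printable data.
sample = b"\x00\x01get_hostname\n\x00\xff\xfeab\x00cmd.exe\x00"
for offset, text in find_ascii_strings(sample):
    print(hex(offset), text)
```

Note that the two-byte run "ab" is skipped because it is shorter than the minimum length, which is how such scanners avoid reporting random byte pairs as strings.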

Similar to the Decompiler view, the “Defined Strings” navigator is linked to the Listing view: selecting a string makes the Listing view jump to the location where that string is stored within the file. Unlike the Decompiler view, the link only works in one direction: selecting a line containing another nearby string in the Listing view doesn’t automatically move the selection in the Defined Strings view.

In the listing view showing the string data, there’s an “XREF” label to the right of each string’s starting address (which also doubles as its variable name, auto-assigned with logic similar to the auto-function-naming we explored and modified earlier). For instance, selecting the unnamed string will jump to its definition, which includes the following XREF information:

XREF[2]:      FUN_00401582:00403a9c(*),
              FUN_00401582:00403f04(*)

The above tells you the name of the function that uses this data, and the exact position (absolute within the program, not an offset from the beginning of the function) where the data is used. If you double-click on either of these, you’ll be transported to the disassembly that references that data. The [2] next to XREF tells you there are 2 references to the data within the program. In this case, both happen to occur within the same function, but this may not always be the case.
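The structure of one XREF entry can be spelled out with a small parser. This is only a sketch: the pattern and function name are hypothetical, and it assumes entries of the form name:address(kind) as they appear in the listing text, not any official Ghidra export format.

```python
import re

# Assumed shape of one entry: <function name>:<8 hex digits>(<kind marker>)
XREF_PATTERN = re.compile(
    r"(?P<function>\w+):(?P<address>[0-9a-fA-F]{8})\((?P<kind>[^)]+)\)"
)


def parse_xref(entry: str):
    """Split an XREF entry into (function name, absolute address, kind)."""
    match = XREF_PATTERN.search(entry)
    if match is None:
        raise ValueError(f"not an XREF entry: {entry!r}")
    return match["function"], int(match["address"], 16), match["kind"]


name, address, kind = parse_xref("FUN_00401582:00403a9c(*)")
print(name, hex(address), kind)  # FUN_00401582 0x403a9c *
```

The address is kept as an absolute value, matching the point made above that it is not an offset from the start of the function.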

Analysis Exercise

So, the next thing we are going to do is perform a short analysis to discover some information about a feature in the program. Let’s start by opening up the “Defined Strings” view, then navigate down to and select the string get_hostname from the list. This should position the listing cursor to highlight the following data:

                     s_get_hostname_004090d0           XREF[1]:     FUN_00401582:004036c5(*)  
004090d0 67 65 74        ds         "get_hostname\n"
         5f 68 6f 
         73 74 6e 

Double-clicking on the text FUN_00401582:004036c5 will change the listing’s cursor position to the location within the code where get_hostname is used by the program:

                     LAB_004036c5                                    XREF[1]:     00403599(j)  
004036c5 c7 44 24        MOV      dword ptr [ESP + 0x4],s_get_hostname_004090d0    = "get_hostname\n"
         04 d0 90 
         40 00

The above instruction copies the address of the first character, g, of get_hostname into the stack slot 4 bytes above the top of the stack, that is, at the address ESP + 0x4 (each slot is 32 bits wide, or one integer variable).

Looking a bit further, we will see the following lines follow:

004036cd 8d 85 1c        LEA        EAX,[EBP + 0xfffe8b1c]
         8b fe ff
004036d3 89 04 24        MOV        dword ptr [ESP],EAX
004036d6 e8 95 39        CALL       strcmp               int strcmp(char * _Str1, char * 
         00 00

These three instructions compute a pointer to some local data on the stack, then write that pointer into the memory at the top of the stack (the address held in ESP), adjacent to the slot written by the instruction we just looked at. A common approach to managing a function’s local memory is to reserve some space on the stack, and to address this space relative to the EBP (32-bit) or RBP (64-bit) register. Here, 0xfffe8b1c is a negative offset when read as a signed 32-bit value, so the data sits below EBP within the current stack frame. Finally, the function strcmp is called, and reading the documentation, we can tell that this function compares two strings and returns 0 when they are equal. In this situation, the arguments are passed to the function on the stack, at [ESP] and [ESP + 0x4]. How arguments are passed is determined by the calling convention. The variations available in Windows are documented here. Generally speaking, these won’t be noted within the program, but Ghidra can often deduce them through auto-analysis. Barring that, it will be up to you to determine this by deciphering how argument handling occurs in context.

So, in a nutshell the 4 lines of code above compare some locally-stored temporary data to the string get_hostname, and then likely will use that result to make a decision further down.

Next, let’s look a little bit further down at what else this code will proceed to do. The next two lines read:

004036db 85 c0           TEST       EAX,EAX
004036dd 0f 85 2a        JNZ        LAB_0040380d
         01 00 00

This tests the result against itself. The TEST instruction performs a bit-wise AND operation, discarding the result but updating the CPU flags to reflect it. In this regard, it is similar to the CMP operation we saw in class, which relates to the SUB instruction in the same way.

If the EAX register (which the common 32-bit x86 calling conventions use to hold the return value of a function) is zero prior to the TEST, then, and only then, will the zero flag be set in the CPU state by the TEST. Since strcmp returns 0 when the two strings are equal, execution will continue with the instruction after the JNZ in that case. Otherwise, the JNZ sends execution to a part of the code that handles the case where the strings don’t match. In a situation like this, that code will often test against another string, effectively cascading down a long list of strings to compare against until it exhausts the list.
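The TEST/JNZ pattern and the cascade of comparisons can be modeled in Python. This is a sketch under assumptions, not a reconstruction of the backdoor’s code: the function names are invented, strcmp is approximated with a bytes comparison, and the second command in the list is a hypothetical placeholder (only get_hostname comes from the sample).

```python
def strcmp_model(a: bytes, b: bytes) -> int:
    """Approximate strcmp: 0 when equal, otherwise a non-zero value."""
    if a == b:
        return 0
    return -1 if a < b else 1


def zero_flag_after_test(eax: int) -> bool:
    """Model TEST EAX,EAX: AND the value with itself, keep only the zero flag."""
    return ((eax & eax) & 0xFFFFFFFF) == 0


# b"other_command\n" is a hypothetical entry, included to show the cascade.
COMMANDS = [b"get_hostname\n", b"other_command\n"]


def dispatch(received: bytes) -> str:
    """Compare the input against each command until one matches."""
    for command in COMMANDS:
        eax = strcmp_model(received, command)
        if zero_flag_after_test(eax):
            # JNZ is not taken: fall through into this command's handler.
            return command.decode("ascii").strip()
        # JNZ is taken: jump ahead to the next comparison.
    return "no match"


print(dispatch(b"get_hostname\n"))  # get_hostname
```

Each loop iteration corresponds to one CALL strcmp / TEST / JNZ group in the listing, with the JNZ target being the start of the next group.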

Getting back to our example, let’s follow what would occur if the string matches get_hostname:

004036e3 c7 85 6c        MOV        dword ptr [EBP + local_139c],0x2e646d63
         ec ff ff 
         63 6d 64 2e
004036ed c7 85 70        MOV        dword ptr [EBP + local_1398],0x20657865
         ec ff ff 
         65 78 65 20
004036f7 c7 85 74        MOV        dword ptr [EBP + local_1394],0x6820632f
         ec ff ff 
         2f 63 20 68
00403701 c7 85 78        MOV        dword ptr [EBP + local_1390],0x6e74736f
         ec ff ff 
         6f 73 74 6e
0040370b c7 85 7c        MOV        dword ptr [EBP + local_138c],0x656d61
         ec ff ff 
         61 6d 65 00
00403715 c7 44 24        MOV        dword ptr [ESP + 0x8],0x44
         08 44 00 
         00 00
0040371d c7 44 24        MOV        dword ptr [ESP + 0x4],0x0
         04 00 00 
         00 00

The above is very interesting. If we look at it closely, we will notice Ghidra auto-declared 5 local variables, each of which is exactly 4 bytes from the next (as denoted by the numbers in the local_#### variable names). Laid out in memory, these are right next to each other. The numbers in the names count down, which can look like the writes happen in reverse order, but the names reflect offsets below the top of the stack frame: the displacement bytes (6c ec ff ff, 70 ec ff ff, and so on) show the writes proceeding from the lowest address to the highest. One good thing to keep in mind when looking at things like this is that every 32-bit integer represents 4 bytes, stored least-significant byte first on x86 (little-endian), and that some byte values have a human-readable character equivalent. The byte values that represent printable ASCII characters are 0x20-0x7e. So, whenever you see a long sequence of writes, and you’re able to discern (as is the case above) that the bytes in the data move operations are within this range, apart from a terminating 0x00 at the end, there’s a strong likelihood that a string is being written into memory, four bytes at a time.
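The printable-bytes check described above can be written down as a small heuristic. This is a sketch with an invented function name; it treats 0x20-0x7e as printable (0x7f is the DEL control character, so it is excluded here) and tolerates trailing 0x00 bytes so that the final, NUL-terminated chunk still qualifies.

```python
import struct


def looks_like_text(dword: int) -> bool:
    """Heuristic: do the 4 little-endian bytes of this value look like ASCII text?"""
    raw = struct.pack("<I", dword)      # bytes in the order they land in memory
    chunk = raw.rstrip(b"\x00")         # allow a string terminator at the end
    return len(chunk) > 0 and all(0x20 <= byte <= 0x7E for byte in chunk)


for value in (0x2e646d63, 0x656d61, 0xfffe8b1c, 0x0):
    print(hex(value), looks_like_text(value))
```

Being a heuristic, it also accepts small constants such as 0x44 (a single "D" followed by zero bytes), so a lone match means little; a run of several consecutive matches written to adjacent addresses is the signal to look for.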

A useful list of the ASCII characters to their numeric byte mapping: asciitable.com

Ghidra has a helpful tool for examining data like this, called “data convertor”. You can choose any data displayed in the listing view, and explore the different possible representations, and even change how the listing view displays the data to the user. Right click on the data 0x2e646d63 in the first line and a menu pops up.

Data Convertor

Choose Convert from that menu, which leads to a sub-menu reporting all possible variations. You can choose any of the variations, and the value displayed will change to match. Perform this operation for each of the EBP-referencing MOV lines, and you’ll get:

004036e3 c7 85 6c        MOV        dword ptr [EBP + local_139c],"cmd."
         ec ff ff 
         63 6d 64 2e
004036ed c7 85 70        MOV        dword ptr [EBP + local_1398],"exe "
         ec ff ff 
         65 78 65 20
004036f7 c7 85 74        MOV        dword ptr [EBP + local_1394],"/c h"
         ec ff ff 
         2f 63 20 68
00403701 c7 85 78        MOV        dword ptr [EBP + local_1390],"ostn"
         ec ff ff 
         6f 73 74 6e
0040370b c7 85 7c        MOV        dword ptr [EBP + local_138c],"ame\x00"
         ec ff ff 
         61 6d 65 00

That is much easier to work with, and allows you to quickly identify that the code is building the following string 4 bytes at a time:

cmd.exe /c hostname
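The same conversion can be reproduced outside of Ghidra. The following Python sketch packs the five immediate values from the MOV instructions as little-endian 32-bit integers, in the order they are written, and reads the result up to the first 0x00 byte. The function and variable names are my own, chosen for illustration.

```python
import struct

# Immediate operands of the five EBP-relative MOV instructions, in order.
IMMEDIATES = [0x2e646d63, 0x20657865, 0x6820632f, 0x6e74736f, 0x656d61]


def decode_stack_string(dwords) -> str:
    """Rebuild a string that was written to memory 4 bytes at a time."""
    raw = b"".join(struct.pack("<I", value) for value in dwords)
    # Stop at the first NUL, as C string functions would.
    return raw.split(b"\x00", 1)[0].decode("ascii")


print(decode_stack_string(IMMEDIATES))  # cmd.exe /c hostname
```

The "<I" format is what accounts for the byte order: 0x2e646d63 is stored as 63 6d 64 2e, which reads as "cmd." rather than ".dmc".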

Finally, if we look back at the source at https://github.com/Unam3dd/RevolutionShellV0.1/blob/master/revolution_backdoor_windows.cpp , we can see that line #452 implements this behavior. Scrolling back up, we can determine that the really long function this lives within is the main function. As I’ve been walking around this function, the Decompiler view has been following along, and it reports that this function is FUN_00401582. So, I can use the Symbol Tree viewer, just as we did earlier, to find this function and name it main.

Often it is the case that you want to track down the main function first, and then it can become easier to determine names for all of the other functions once you are able to view them in the context in which they’re used by calling functions.


tags: malware lecture c x86 x86-64 asm cfg ghidra