7 March 2020

Malware Identification with Yara

by Coleman Kane

In this lecture, we will begin to move away from system forensics and into malware detection. Whether you continue your research career further, go to work for a security team at a major company, or decide to work for a cybersecurity firm, the Yara project will be key to malware detection engineering. We will introduce you to Yara using a sequence of iterations that demonstrate various features and methods for building pattern recognition into Yara. To generalize the example away from EXE files, we will be using the text for a popular story, “Alice’s Adventures in Wonderland” by Lewis Carroll.

I retrieved a copy of this text from Project Gutenberg, here:

Text of Alice in Wonderland

Yara Signatures Format

Yara signatures are written in a text format whose syntax is inspired by languages like C and Java, and by structured markup like JSON. This should make it relatively understandable to those of you who are already familiar with these syntaxes. Yara provides a syntax for describing features to look for within an artifact, and an expressive language for describing the conditions under which the tool should register a match when it finds these features within a file.

As a demonstration, we will start with the following Yara signature:

rule rule1 {
 meta:
  /*
   * Simple key/value list of metadata pairs that can be useful for documenting signatures
   * and categorizing output. These can be reported back to the console after signature
   * matching, which can also be helpful to group signature matches from an output log.
   */
  author = "Coleman Kane"
  description = "Look for the words Alice, Queen, or Rabbit"

 strings:
  /*
   * List the features that we will be looking for within the file in this section of the
   * signature. Any features used in match conditions will be listed here.
   */
  $queen = "queen"
  $rabbit = "rabbit"
  $alice = "Alice"

 condition:
  /* The match condition. The signature will only match if this condition evaluates to
   * true. In this case, the condition is telling Yara that the file has to contain at
   * least one instance of any of the strings listed earlier.
   */
  any of them
}

The above file describes 1 signature, named rule1, that is broken into 3 sections.

The meta section serves primarily to document your Yara signatures. In many cases, a team working over multiple years is likely to build an increasingly diverse and context-specific array of Yara signatures. Often, these are very specific to identifying a threat that may have been defined during one investigation some time ago. Because of this, it helps to have several ways to document and categorize Yara signatures. A good example of a large library of Yara signatures that makes heavy use of this feature is Yara-Rules on GitHub, a community-cultivated library of Yara signatures that has been provided to the public.

The strings section is where you define the content features that you want to use for performing matching logic. In the example above, I simply listed the strings I wanted to search for in files. However, this section doesn’t describe how Yara should be matching against these strings. You can think of this as analogous to the difference, in many programming languages, between defining/assigning values to variables, and performing operations or comparisons on those variables. The strings section allows you to define the string and pattern constants that the next section, condition, can then use for matching. Strings don’t have to be plain text constants, as I have used above. Yara also supports a regular expression syntax that is similar (but not identical) to PCRE, as well as a fast binary pattern matching syntax that expresses byte content using hexadecimal values. Additionally, string matching can be defined to search for strings that happen to be encoded using many different encoding methods. More on that later.

Documentation for strings is available here:

Strings Documentation for Yara

The condition section is the only mandatory section in Yara, and is where the matching logic is performed. During analysis, the file artifact is searched for the strings and patterns defined in the strings section, and then the results of this are used to populate the variables which can then be referenced in the condition section. In the above example, I gave the condition of any of them, which uses them as a wildcard variable representing all strings.

I could have easily made the following condition to get the same effect, of matching if any of the strings are present in the file:

  $queen or $rabbit or $alice

Alternately, as Yara supports logic that matches content in lists (similar in behavior to the in operator in Python, or IN in SQL), the condition any of them could have been written like this:

  any of ($queen,$alice,$rabbit)

The downside of either form above is that if I add or remove strings in the strings section, I also have to edit the condition section to match. To address this, Yara offers the them keyword, which is equivalent to the global string wildcard ($*). The asterisk (*) can be used to build a set of one or more strings to incorporate into a comparison. For example, the wildcard $al* expands to every identifier defined in the strings section that starts with al: it would select $alfred, $albert, and $alice, but neither $arthur nor $bob.
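To make the wildcard expansion concrete, here is a small Python sketch that mimics how a pattern such as $al* selects identifiers. It only illustrates the naming logic using the standard fnmatch module; it is not how Yara is implemented, and the identifier list is the hypothetical one from the paragraph above.

```python
from fnmatch import fnmatchcase

# Hypothetical identifiers, as if declared in a rule's strings section.
declared = ["$alfred", "$albert", "$alice", "$arthur", "$bob"]


def expand(pattern, identifiers):
    """Return the identifiers that a Yara-style wildcard would select."""
    return [name for name in identifiers if fnmatchcase(name, pattern)]


print(expand("$al*", declared))  # ['$alfred', '$albert', '$alice']
print(expand("$*", declared))    # every identifier, like the 'them' keyword
```

In a real rule, the wildcard form would be written inside the parentheses of a set, such as any of ($al*).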

Full documentation about conditions is located here, in the Yara documentation:

Yara Conditions Documentation

Running Yara

The VM I provided for everyone has yara version 3.11.0 pre-installed on it. In the previous section, I provided the text of a yara signature named rule1, and also referenced a link to download the text of Alice’s Adventures in Wonderland.

Download alice.txt and then copy the signature content into a file named rule1.yar.

To run the signature match against alice.txt, run the following command:

yara -m rule1.yar alice.txt

rule1 [author="Coleman Kane",description="Look for the words Alice, Queen, or Rabbit"] alice.txt

The -m option above causes yara to report each matching rule's metadata tags on output, in addition to the name of the signature.

Often it will prove useful to also see the string variable names and the content that matched. If you would like to also see this (for this example, it will be a lot of matched strings) you can use the -s modifier:

yara -s -m rule1.yar alice.txt

rule1 [author="Coleman Kane",description="Look for the words Alice, Queen, or Rabbit"] alice.txt
0x9b4:$rabbit: rabbit
0xa77:$rabbit: rabbit
0xb0d:$rabbit: rabbit
0x8e65:$rabbit: rabbit
0x97fb:$rabbit: rabbit
0x982c:$rabbit: rabbit
0x1f:$alice: Alice
0x15f:$alice: Alice
0x2c0:$alice: Alice
...
0x25376:$alice: Alice
0x25464:$alice: Alice
0x254f6:$alice: Alice
0x25e35:$alice: Alice

The above output is very long, as it matches a number of strings common across the entire story. We can use some UNIX command-line tools to summarize the match strings for us:

yara -s -m rule1.yar alice.txt |
  grep '^0x' |    # Just the rows listing string positions
  cut -d: -f 2 |  # Choose the 2nd column of output, separated by :
  sort |          # Sort the output
  uniq -c         # Display counts of unique values

Output:

    402 $alice
      6 $rabbit

The above shows that yara has matched 6 instances of the string referenced by the $rabbit variable, and 402 instances of the string referenced by $alice. Using -f 3 instead in the command line for cut would have yielded summary counts for the literal text matches, rather than the variable names.
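The same summary can be sketched in Python, which may be handier once the output needs further processing. This is a minimal sketch that assumes the yara -s output has already been captured as text; the summarize helper and the abbreviated sample lines are my own illustration, and the field handling only roughly mirrors what cut does.

```python
from collections import Counter


def summarize(yara_output, field=1):
    """Count values in one colon-separated field of `yara -s` match lines."""
    counts = Counter()
    for line in yara_output.splitlines():
        if not line.startswith("0x"):
            continue  # skip the rule-name line, keep only match positions
        parts = line.split(":", 2)  # offset, identifier, matched text
        counts[parts[field].strip()] += 1
    return counts


# A few abbreviated lines shaped like the output shown earlier.
sample = """rule1 [author="Coleman Kane"] alice.txt
0x9b4:$rabbit: rabbit
0xa77:$rabbit: rabbit
0x1f:$alice: Alice
"""

print(summarize(sample))           # counts per identifier, like cut -f 2
print(summarize(sample, field=2))  # counts per matched text, like cut -f 3
```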

One thing to notice is that our signature also is supposed to look for "queen", but we did not get any matches for this, even though anyone familiar with the story knows there to be a queen in the story. More on this in the next section.

String Modifiers

As mentioned earlier, the strings section allows you to specify one or more encoding “modifiers” to be used in searching for strings. A few common variations of this are the following:

nocase: Case-insensitive string matching
wide: Match the UTF-16 encoding of the string. This is a popular encoding used in some Windows files.
ascii: Match the ASCII encoding of the string
fullword: Match the string only if it isn’t a sub-component of a longer alpha-numeric string
base64: Discover the string encoded within Base64
xor: Match against single-byte XOR-encoded variations
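To give a feel for what some of these modifiers are actually looking for, the following Python sketch prints the byte sequences that correspond to the ascii, wide, and single-byte xor forms of one string. It is only an illustration of the encodings, not of Yara's matching engine, and the XOR key 0x41 is an arbitrary value I picked for the example.

```python
text = "Hello"

# ascii: the plain one-byte-per-character form.
ascii_bytes = text.encode("ascii")

# wide: each character followed by a zero byte, which for ASCII text
# is the same as UTF-16 little-endian without a byte-order mark.
wide_bytes = text.encode("utf-16-le")

# xor: every byte combined with the same one-byte key (0x41 is arbitrary).
key = 0x41
xor_bytes = bytes(b ^ key for b in ascii_bytes)

print(ascii_bytes)      # b'Hello'
print(wide_bytes)       # b'H\x00e\x00l\x00l\x00o\x00'
print(xor_bytes.hex())  # 09242d2d2e
```

Applying the same key a second time recovers the original bytes, which is why a single-byte XOR is a weak but common way to hide strings.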

Case Insensitivity

An example use of this is demonstrated with the example from the last section displaying the strings with the -s modifier. As was mentioned, the string "queen" yielded no matches. If we look in the story at a passage referencing the queen, we can identify why:

The Fish-Footman began by producing from under his arm a great letter, nearly as large as himself, and this he handed over to the other, saying, in a solemn tone, “For the Duchess. An invitation from the Queen to play croquet.” The Frog-Footman repeated, in the same solemn tone, only changing the order of the words a little, “From the Queen. An invitation for the Duchess to play croquet.”

The word “Queen” (upper case) appears in the file, and from our yara signature we determined that the lower cased version of this word does not appear within the story. To solve this, we can add the nocase modifier to the $queen string in the signature. We can add it arbitrarily to any strings we wish within the signature without impacting the other match strings:

rule rule1_queen_nocase {
 meta:
  /*
   * Simple key/value list of metadata pairs that can be useful for documenting signatures
   * and categorizing output. These can be reported back to the console after signature
   * matching, which can also be helpful to group signature matches from an output log.
   */
  author = "Coleman Kane"
  description = "Look for the words Alice, Queen, or Rabbit"

 strings:
  /*
   * List the features that we will be looking for within the file in this section of the
   * signature. Any features used in match conditions will be listed here.
   */
  $queen = "queen" nocase
  $rabbit = "rabbit"
  $alice = "Alice"

 condition:
  /* The match condition. The signature will only match if this condition evaluates to
   * true. In this case, the condition is telling Yara that the file has to contain at
   * least one instance of any of the strings listed earlier.
   */
  any of them
}
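The effect of nocase can be sketched outside of Yara as well. The Python snippet below runs a case-sensitive and a case-insensitive search over part of the passage quoted above; it is only an analogy for what the modifier changes, using Python's re module rather than Yara itself.

```python
import re

# Part of the Fish-Footman passage quoted earlier.
passage = (
    "An invitation from the Queen to play croquet. "
    "From the Queen. An invitation for the Duchess to play croquet."
)

# Case-sensitive search, like the original $queen string.
print("queen" in passage)  # False

# Case-insensitive search, like $queen with the nocase modifier.
print(len(re.findall("queen", passage, re.IGNORECASE)))  # 2
```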

Writing the signature to disk as rule1_queen_nocase.yar and running a command similar to the one above to get the summary output shows that $queen is now matched:

yara -s -m rule1_queen_nocase.yar alice.txt |
  grep '^0x' |    # Just the rows listing string positions
  cut -d: -f 2 |  # Choose the 2nd column of output, separated by :
  sort |          # Sort the output
  uniq -c         # Display counts of unique values

$alice
$queen
$rabbit

Adding nocase to $rabbit will also allow us to match the uses of rabbit that are capitalized, such as at the beginning of sentences and when used as a proper noun:

$alice
$queen
$rabbit

Wide (UTF16) Strings

A common encoding to look for when analyzing Windows malware is UTF-16. This encoding is selected via the wide modifier. It is commonly used to expand the number of printable characters within binary files, to incorporate many of the additional characters common in non-US languages.

As alice.txt doesn’t contain any of this encoding of data, because it was intended to be read by a human with a UTF-8 viewer, we have to make a quick example file. The following command will create a new file that contains a UTF-16 “Hello”, followed by a UTF-8 encoded ASCII-compatible “hello”:

echo -e 'H\0e\0l\0l\0o\0\n\0hello' > hello.txt
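If you want to double-check what ends up in that file, the bytes can be reproduced in Python. This sketch assumes the echo command behaves like bash's builtin (each \0 becomes a NUL byte and a trailing newline is appended); the write_sample helper is my own addition and is only defined here, not called.

```python
# Build the same bytes the echo command above is expected to write:
# a UTF-16 little-endian "Hello" plus newline, then an ASCII "hello" line.
data = "Hello\n".encode("utf-16-le") + b"hello\n"


def write_sample(path="hello.txt"):
    """Write the sample bytes to disk (hypothetical helper, call if needed)."""
    with open(path, "wb") as handle:
        handle.write(data)


# Offsets where each form begins in the data.
wide_offset = data.find("Hello".encode("utf-16-le"))
ascii_offset = data.find(b"hello")
print(hex(wide_offset), hex(ascii_offset))  # 0x0 0xc
```

Knowing these offsets ahead of time makes it easier to read the positions that yara -s reports for this file.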

The following signature will match the presence of either string within the file:

rule rule2 {
 meta:
  author = "Coleman Kane"
  description = "Detect hello"
 strings:
  $hello_utf8 = "hello"
  $hello_utf16 = "Hello" wide
 condition:
  $hello_utf16 or $hello_utf8
}

However, the following signature will match only on the presence of both strings within the file:

rule rule2 {
 meta:
  author = "Coleman Kane"
  description = "Detect hello"
 strings:
  $hello_utf8 = "hello"
  $hello_utf16 = "Hello" wide
 condition:
  $hello_utf16 and $hello_utf8
}

The output for both looks the same, since both strings are in the file:

yara -m -s rule2.yar hello.txt

rule2 [author="Coleman Kane",description="Detect hello"] hello.txt
0xc:$hello_utf8: hello
0x0:$hello_utf16: H\x00e\x00l\x00l\x00o\x00

For every string, if no encoding is specified, then it defaults to ascii. However, if any encoding (other than nocase) is specified, then ascii is not assumed and must be explicitly added if you also want to match the ascii variant. Multiple encodings on a single string are supported and will work in an additive fashion after the first one specified.

Here’s an example, rule2_simpler, which combines the logic from the first signature above (matches either UTF-16 or UTF-8) into a single string and condition. Note that I needed to specify both wide and ascii in order to do a nocase match on both encodings:

rule rule2_simpler {
 meta:
  author = "Coleman Kane"
  description = "Detect hello"
 strings:
  $hello = "hello" wide ascii nocase
 condition:
  $hello
}

The result:

yara -m -s rule2_simpler.yar hello.txt

rule2_simpler [author="Coleman Kane",description="Detect hello"] hello.txt
0x0:$hello: H\x00e\x00l\x00l\x00o\x00
0xc:$hello: hello

The base64 and xor modifiers will be helpful in matching some of the encodings that are present within the RevolutionShell malware samples we’ve been building during class.

As can be seen in the output above, the NULL bytes are represented by the \x00. In fact, you can use this C-like syntax to include non-printable byte values inside your match strings as well.

Regular Expression Matching

Yara also supports regular expression pattern matching, which can enable you to be more flexible with shorter signatures, often at the expense of computational complexity and time. As mentioned earlier, Yara implements its own regular expression syntax that borrows heavily from PCRE. The documentation on what Yara does and does not implement can be found at the link below, and should be referenced when trying to build pattern signatures, so that an unimplemented feature is not used:

Yara Regular Expression Language

For this section, we will return to using the alice.txt file. Also, it is important to review the above syntax documentation, or at least keep it open in a new tab for reference, as I won’t be drilling into every one of the syntax elements here.

Say we want to match a file if there exists a sentence within the file that discusses both Rabbit and Alice. We might use the following signature, and if we were to run the signature across a large library of texts from, say, Project Gutenberg, we might have a high-fidelity identification of copies of Alice’s Adventures in Wonderland:

rule rule3 {
 meta:
  author = "Coleman Kane"
  description = "Find any sentence that discusses rabbit and alice"
 strings:
  $rabbit_sentence = /[^.]*(rabbit[^.]*alice|alice[^.]*rabbit)[^.]*\./ nocase
 condition:
  $rabbit_sentence
}

Then, we can run:

yara -m rule3.yar alice.txt

rule3 [author="Coleman Kane",description="Find any sentence that discusses rabbit and alice"] alice.txt

The above pattern basically matches if all of the following are true:

1. “alice” and “rabbit” (case insensitive) are in either order, and separated by zero or more characters that are not the period.
2. The above sequence in #1 also has zero or more non-period characters before or after it.
3. Finally, the entire match above must end in a period.

We can use similar logic to earlier to count the number of matches - however, it is important to recognize that yara is not presently intended to perform the job of match data extraction. The way the logic works, it is intended to report every location of a possible match, by iterating through each byte in the file. This results in cases where, for each sentence matched above, every possible sentence fragment also registers a match. This has to be taken into consideration, as well, when writing signatures.

yara -m -s rule3.yar alice.txt |
  grep '^0x' |  # Only match lines reporting match positions
  wc -l         # Count the total number of matches
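The overlap effect described above can be reproduced on a toy input. The Python sketch below tries the rule3 pattern at every starting offset of a short, made-up text, and compares that with an ordinary non-overlapping search. It illustrates why fragments inflate the count; it is not a model of how Yara's engine is implemented.

```python
import re

# The same pattern as rule3, written for Python's re module.
pattern = re.compile(
    r"[^.]*(rabbit[^.]*alice|alice[^.]*rabbit)[^.]*\.", re.IGNORECASE
)

# Made-up text with exactly one sentence mentioning both words.
text = "Down it went. Then Alice saw the Rabbit. The end."

# Try the pattern at every offset, the way a scanner reporting all
# possible match positions would, instead of skipping past each match.
starts = [pos for pos in range(len(text)) if pattern.match(text, pos)]

print(len(starts))                         # 7 start offsets for one sentence
print(len(list(pattern.finditer(text))))   # 1 non-overlapping match
```

Every offset between the previous period and the word Alice is a valid starting point, because the leading [^.]* is free to match any amount of the text in front of it.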

However, we know that each sentence ends with a period, before the next sentence starts. So, if we include the period from the prior sentence in our match pattern, in addition to the terminal period of the current sentence, we can extract single complete sentences, without overlap:

rule rule3_extract_sentences {
 meta:
  author = "Coleman Kane"
  description = "Extract any sentence that discusses rabbit and alice, starting with the prior sentence period."
 strings:
  $rabbit_sentence = /\.[^.]*(rabbit[^.]*alice|alice[^.]*rabbit)[^.]*\./ nocase
 condition:
  $rabbit_sentence
}

Running the same command-line from earlier, we get the following sentence count, which seems more like what would be expected (as the rabbit is not frequent throughout the story):

Here is some of the example output. Note that the text incorporates some of the UTF-8 byte encodings for enhanced punctuation glyphs as well as CRLF line endings. These show up as \xXX bytes within the string, similar to the NULL bytes shown earlier:

...
0x22bee:$rabbit_sentence: . It quite makes my\x0D\x0Aforehead ache!\xE2\x80\x9D\x0D\x0A\x0D\x0AAlice watched the White Rabbit as he fumbled over the list, feeling\x0D\x0Avery curious to see what the next witness would be like, \xE2\x80\x9C\xE2\x80\x94for they\x0D\x0Ahaven\xE2\x80\x99t got much evidence _yet_,\xE2\x80\x9D she said to herself.
0x22cdf:$rabbit_sentence: . Imagine her\x0D\x0Asurprise, when the White Rabbit read out, at the top of his shrill\x0D\x0Alittle voice, the name \xE2\x80\x9CAlice!\xE2\x80\x9D\x0D\x0A\x0D\x0A\x0D\x0A\x0D\x0A\x0D\x0ACHAPTER XII.

As can be seen, there are some cases where we are spanning past the end of a sentence, because we only considered the period as the sentence ending. We can modify the pattern further to consider more than just the period, using character classes further:

$rabbit_sentence = /[.!?][^.!?]*(rabbit[^.!?]*alice|alice[^.!?]*rabbit)[^.!?]*[.!?]/ nocase

Using the earlier UNIX command line again, we can count the number of sentence matches and see it is now 19. We will have to use other code that we would write ourselves in order to normalize the output and strip the leading punctuation from each extracted sentence - as yara was not implemented for that function specifically.
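As a starting point for that kind of post-processing, here is a minimal Python sketch that turns the \xXX escapes back into bytes, decodes them as UTF-8, collapses the line breaks, and strips the leading punctuation left over from the previous sentence. The decode_match and normalize helpers are hypothetical names of my own, the sample is a shortened line shaped like the output above, and the sketch assumes the matched text contains no literal backslash-x sequences.

```python
import re


def decode_match(escaped):
    """Turn yara's \\xNN escapes back into bytes, then decode as UTF-8."""
    raw = re.sub(
        rb"\\x([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        escaped.encode("latin-1"),
    )
    return raw.decode("utf-8", errors="replace")


def normalize(escaped):
    """Collapse whitespace and strip the previous sentence's punctuation."""
    text = " ".join(decode_match(escaped).split())
    return text.lstrip(".!? ")


# Shortened, hypothetical line shaped like the matched text shown earlier.
sample = (
    r". Imagine her\x0D\x0Asurprise, when the White Rabbit "
    r"read out the name \xE2\x80\x9CAlice!"
)
print(normalize(sample))
```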

Match Counts

Say we want to match the texts even more narrowly. We might want to discover where the match count is exactly this value, or close to it. We might break up the confidence intervals into multiple rules:

rule rule3_exact {
 meta:
  author = "Coleman Kane"
  description = "Match 19 sentences discussing rabbit and alice"
  fidelity = "high"
 strings:
  $rabbit_sentence = /[.!?][^.!?]*(rabbit[^.!?]*alice|alice[^.!?]*rabbit)[^.!?]*[.!?]/ nocase
 condition:
  #rabbit_sentence == 19
}

rule rule3_maybe {
 meta:
  author = "Coleman Kane"
  description = "Match close to 19 sentences that discuss rabbit and alice"
  fidelity = "medium"
 strings:
  $rabbit_sentence = /[.!?][^.!?]*(rabbit[^.!?]*alice|alice[^.!?]*rabbit)[^.!?]*[.!?]/ nocase
 condition:
  ((#rabbit_sentence > 14) and (#rabbit_sentence < 19)) or
   ((#rabbit_sentence > 19) and (#rabbit_sentence < 24))
}

The above signature contains two rules:

The first matches only if exactly 19 sentences match our pattern
The second matches only if the match count is close to, but not exactly, 19 (15 through 18, or 20 through 23)

This is a great example of breaking logic up into multiple tiers of rules, and then incorporating that into the metadata so that we can extract the confidence in our matching from the yara output.
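The tiering logic in the two conditions can be restated as a small Python function, which is handy for checking the boundary values before committing them to a rule. This is only a sketch of the arithmetic; the fidelity function and its target and margin parameters are names I made up for the illustration.

```python
def fidelity(match_count, target=19, margin=5):
    """Mirror the two rule conditions: exact count, or near the target."""
    if match_count == target:
        return "high"    # corresponds to rule3_exact
    if target - margin < match_count < target + margin:
        return "medium"  # corresponds to rule3_maybe
    return None          # neither rule matches


for count in (19, 15, 23, 14, 24):
    print(count, fidelity(count))
```

Because both comparisons are strict, counts of 14 and 24 fall outside the medium tier, just as they do in the rule3_maybe condition.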


CS6038/CS5138 Malware Analysis, UC