Malware Identification with Yara
by Coleman Kane
In this lecture, we will begin to move away from system forensics and into malware detection. Whether you continue your research career further, go to work for a security team at a major company, or decide to work for a cybersecurity firm, the Yara project will be key to malware detection engineering. We will introduce you to Yara using a sequence of iterations that demonstrate various features and methods for building pattern recognition into Yara. To generalize the example away from EXE files, we will be using the text for a popular story, “Alice’s Adventures in Wonderland” by Lewis Carroll.
I retrieved a copy of this text from Project Gutenberg, here:
Yara Signatures Format
Yara signatures are written in a text format inspired by the syntax popularized by languages like C and Java, and by structured markup like JSON. This makes signatures relatively easy to read for anyone already familiar with those syntaxes. Yara gives you a syntax for describing the features to look for within an artifact, plus an expressive language for stating the conditions under which the tool should register a match when it finds those features in a file.
As a demonstration, we will start with the following Yara signature:
rule rule1 {
meta:
/*
* Simple key/value list of metadata pairs that can be useful for documenting signatures
* and categorizing output. These can be reported back to the console after signature
* matching, which can also be helpful to group signature matches from an output log.
*/
author = "Coleman Kane"
description = "Look for the words Alice, Queen, or Rabbit"
strings:
/*
* List the features that we will be looking for within the file in this section of the
* signature. Any features used in match conditions will be listed here.
*/
$queen = "queen"
$rabbit = "rabbit"
$alice = "Alice"
condition:
/* The match condition. The signature will only match if this condition evaluates to
* true. In this case, the condition is telling Yara that the file has to contain at
* least one instance of any of the strings listed earlier.
*/
any of them
}
The above file describes 1 signature, named rule1, that is broken into 3 sections.
The meta section serves primarily to document your Yara signatures. In many cases, a team working over multiple years is likely to build an increasingly diverse and context-specific array of Yara signatures. Often, these are very specific to identifying a threat that may have been defined during one investigation some time ago. Because of this, it helps to have plenty of ways to document and categorize Yara signatures. A good example of a large library of Yara signatures that makes heavy use of this feature is Yara-Rules on GitHub, a community-cultivated library of Yara signatures that has been provided to the public.
The strings section is where you define the content features that you want to use for performing matching logic. In the example above, I simply listed the strings I wanted to search for in files. However, this section doesn’t describe how Yara should match against these strings. You can think of this as analogous to the difference, in many programming languages, between defining or assigning values to variables and performing operations or comparisons on those variables. The strings section allows you to define the string and pattern constants that will be available for the next section, condition, to match against. Strings don’t have to be literal text constants like the ones I used above. Yara also supports a regular expression syntax that is similar (but not identical) to PCRE, as well as a fast binary pattern matching syntax that expresses byte content using hexadecimal values. Additionally, string matching can be defined to search for strings that happen to be encoded using many different encoding methods. More on that later.
Documentation for strings is available here:
The condition section is the only mandatory section in Yara, and is where the matching logic is performed. During analysis, the file artifact is searched for the strings and patterns defined in the strings section, and the results are used to populate the variables that can then be referenced in the condition section. In the above example, I gave the condition any of them, which uses the keyword them as a wildcard variable representing all strings.
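To make the two phases concrete, here is a minimal Python sketch of the idea. It is only a conceptual model, not how Yara is implemented: it first records which strings occur in the data, then evaluates the condition against those results. The function name scan_strings and the sample text are made up for illustration.

```python
# Conceptual model of a Yara rule: scan for the strings first, then evaluate
# the condition against the scan results. This is NOT Yara's implementation.

def scan_strings(data: str, strings: dict) -> dict:
    """Return {identifier: True/False} using a case-sensitive substring search."""
    return {name: (literal in data) for name, literal in strings.items()}

# The same three strings as rule1 (case-sensitive, like Yara's default).
strings = {"$queen": "queen", "$rabbit": "rabbit", "$alice": "Alice"}

# Hypothetical sample text, not taken from alice.txt.
sample = "Alice followed the rabbit to see the Queen."

found = scan_strings(sample, strings)
matched = any(found.values())  # models the condition: any of them

print(found)
print("rule1 matches:", matched)
```

Note that the lower-case "queen" is not found in this sample even though the rule as a whole still matches, because only one of the strings has to be present.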
I could have easily made the following condition to get the same effect, of matching if any of the strings are present in the file:
$queen or $rabbit or $alice
Alternatively, as Yara supports logic that matches content in lists (similar in behavior to the in operator in Python, or IN in SQL), the condition any of them could have been written like this:
any of ($queen,$alice,$rabbit)
The negative consequence of using either of the above would be that if I add or remove strings from the strings section, I also have to edit the condition section to match the list. To address this, Yara offers the them keyword, which literally evaluates to the global string wildcard ($*). The asterisk (*) is supported for building lists of one or more strings to incorporate into a comparison. For example, the wildcard $al* would expand to any variables defined in the strings section whose names start with al: it would select $alfred, $albert, and $alice, but neither $arthur nor $bob.
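The wildcard expansion can be sketched in a few lines of Python. This is an illustration of the selection logic only: the helper name expand_wildcard is hypothetical, and it borrows Python's fnmatch module as a stand-in for Yara's own identifier matching.

```python
from fnmatch import fnmatchcase

def expand_wildcard(pattern: str, identifiers: list) -> list:
    """Return the string identifiers selected by a pattern such as '$al*'."""
    return [name for name in identifiers if fnmatchcase(name, pattern)]

# Identifiers as they might appear in a strings section.
defined = ["$alfred", "$albert", "$alice", "$arthur", "$bob"]

print(expand_wildcard("$al*", defined))  # only the names starting with "al"
print(expand_wildcard("$*", defined))    # every string, which is what "them" means
```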
Full documentation about conditions is located here, in the Yara documentation:
Running Yara
The VM I provided for everyone has yara version 3.11.0 pre-installed on it. In the previous section, I provided the text of a yara signature named rule1, and also referenced a link to download the text of Alice’s Adventures in Wonderland.
Download alice.txt and then copy the signature content into a file named rule1.yar.
To run the signature match against alice.txt, run the following command:
yara -m rule1.yar alice.txt
rule1 [author="Coleman Kane",description="Look for the words Alice, Queen, or Rabbit"] alice.txt
The -m option above causes yara to report the list of metadata tags on output, in addition to the name of the signature. The above command simply displays the matching rule names with their metadata. Often it will prove useful to also see the string variable names and the content that matched. If you would like to see this as well (for this example, it will be a lot of matched strings), you can use the -s option:
yara -s -m rule1.yar alice.txt
rule1 [author="Coleman Kane",description="Look for the words Alice, Queen, or Rabbit"] alice.txt
0x9b4:$rabbit: rabbit
0xa77:$rabbit: rabbit
0xb0d:$rabbit: rabbit
0x8e65:$rabbit: rabbit
0x97fb:$rabbit: rabbit
0x982c:$rabbit: rabbit
0x1f:$alice: Alice
0x15f:$alice: Alice
0x2c0:$alice: Alice
...
0x25376:$alice: Alice
0x25464:$alice: Alice
0x254f6:$alice: Alice
0x25e35:$alice: Alice
The above output is very long, as it matches a number of strings that are common across the entire story. We can use some UNIX command-line tools to summarize the matched strings for us:
yara -s -m rule1.yar alice.txt | \
grep ^0x | # Just the rows listing string positions
cut -d: -f 2 | # Choose the 2nd column of output, separated by :
sort | # Sort the output
uniq -c # Display counts of unique values
Output:
402 $alice
6 $rabbit
The above shows that yara matched 6 instances of the string referenced by the $rabbit variable, and 402 instances of the string referenced by $alice. Putting -f 3 in the command line for cut instead would have yielded summary counts for the literal text matches, rather than the variable names.
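For anyone more comfortable in Python than in shell pipelines, the same summary can be sketched as below. The function name summarize is my own, and the sample is a handful of lines copied from the output shown earlier rather than the full run, so the counts here are much smaller than 402 and 6.

```python
from collections import Counter

def summarize(yara_output: str, field: int = 1) -> Counter:
    """Count values in one colon-separated column of `yara -s` output.

    field=1 corresponds to `cut -d: -f 2` (the string identifier),
    field=2 corresponds to `cut -d: -f 3` (the matched text).
    """
    counts = Counter()
    for line in yara_output.splitlines():
        if not line.startswith("0x"):  # same filter as: grep ^0x
            continue
        parts = line.split(":")
        if len(parts) > field:
            counts[parts[field]] += 1
    return counts

# A few lines copied from the output above.
sample_output = """\
rule1 [author="Coleman Kane",description="Look for the words Alice, Queen, or Rabbit"] alice.txt
0x9b4:$rabbit: rabbit
0xa77:$rabbit: rabbit
0x1f:$alice: Alice
0x15f:$alice: Alice
0x2c0:$alice: Alice
"""

for name, count in sorted(summarize(sample_output).items()):
    print(count, name)
```

Like cut, this splits on every colon, so the matched-text column keeps its leading space.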
One thing to notice is that our signature is also supposed to look for "queen", but we did not get any matches for it, even though anyone familiar with the story knows there is a queen in it. More on this in the next section.
String Modifiers
As mentioned earlier, the strings section allows you to specify one or more encoding “modifiers” to be used in searching for strings. A few common variations of this are the following:
- nocase: Case-insensitive string matching
- wide: Match the UTF-16 encoding of the string. This is a popular encoding used in some Windows files.
- ascii: Match the ASCII encoding of the string
- fullword: Match the string only if it isn’t a sub-component of a longer alphanumeric string
- base64: Discover the string encoded within Base64
- xor: Match against single-byte XOR-encoded variations
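To get a feel for what these modifiers mean at the byte level, the following Python sketch builds the byte sequences that a few of them would be looking for. The helper names are hypothetical and this is a simplified model rather than Yara's implementation. In particular, the Base64 line only covers the simplest case, where the string starts exactly on a Base64 block boundary.

```python
import base64

def wide(s: str) -> bytes:
    """Model of wide: every ASCII character followed by a zero byte."""
    return b"".join(bytes([c, 0]) for c in s.encode("ascii"))

def xor_variant(s: str, key: int) -> bytes:
    """Model of xor: the string with one single-byte key applied to each byte."""
    return bytes(c ^ key for c in s.encode("ascii"))

def is_fullword(data: bytes, start: int, length: int) -> bool:
    """Model of fullword: the match must not touch alphanumeric neighbours."""
    before = data[start - 1:start] if start > 0 else b""
    after = data[start + length:start + length + 1]
    return not before.isalnum() and not after.isalnum()

print(wide("Hello"))                 # same bytes as "Hello".encode("utf-16-le")
print(xor_variant("rabbit", 0x20))   # key 0x20 happens to flip ASCII letter case
print(base64.b64encode(b"rabbit"))   # aligned Base64 form only

print(is_fullword(b"the rabbit-hole", 4, 6))  # delimited by ' ' and '-'
print(is_fullword(b"jackrabbits", 4, 6))      # embedded in a longer word
```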
Case Insensitivity
An example use of nocase comes from the last section, where we displayed the matched strings with the -s option. As was mentioned, the string "queen" yielded no matches. If we look in the story at a passage referencing the queen, we can identify why:
The Fish-Footman began by producing from under his arm a great letter, nearly as large as himself, and this he handed over to the other, saying, in a solemn tone, “For the Duchess. An invitation from the Queen to play croquet.” The Frog-Footman repeated, in the same solemn tone, only changing the order of the words a little, “From the Queen. An invitation for the Duchess to play croquet.”
The word “Queen” (capitalized) appears in the file, and our yara signature showed that the lower-cased version of this word does not appear anywhere in the story. To solve this, we can add the nocase modifier to the $queen string in the signature. We can add it to any individual strings we wish within the signature without impacting the other match strings:
rule rule1_queen_nocase {
meta:
/*
* Simple key/value list of metadata pairs that can be useful for documenting signatures
* and categorizing output. These can be reported back to the console after signature
* matching, which can also be helpful to group signature matches from an output log.
*/
author = "Coleman Kane"
description = "Look for the words Alice, Queen, or Rabbit"
strings:
/*
* List the features that we will be looking for within the file in this section of the
* signature. Any features used in match conditions will be listed here.
*/
$queen = "queen" nocase
$rabbit = "rabbit"
$alice = "Alice"
condition:
/* The match condition. The signature will only match if this condition evaluates to
* true. In this case, the condition is telling Yara that the file has to contain at
* least one instance of any of the strings listed earlier.
*/
any of them
}
Writing the signature to disk as rule1_queen_nocase.yar and running a command similar to the one above to get the summary output shows that $queen is now matched:
yara -s -m rule1_queen_nocase.yar alice.txt | \
grep ^0x | # Just the rows listing string positions
cut -d: -f 2 | # Choose the 2nd column of output, separated by :
sort | # Sort the output
uniq -c # Display counts of unique values
402 $alice
77 $queen
6 $rabbit
Adding nocase to $rabbit as well allows us to match the uses of rabbit that are capitalized, such as at the beginning of sentences and when used as a proper noun:
402 $alice
77 $queen
54 $rabbit
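The effect of nocase can be reproduced outside of yara with a short Python sketch. The passage below is abbreviated from the Fish-Footman excerpt quoted above, and Python's re.IGNORECASE flag stands in for the nocase modifier, so treat it as an analogy rather than as Yara's matching code.

```python
import re

# Abbreviated from the passage quoted earlier in this section.
passage = (
    "For the Duchess. An invitation from the Queen to play croquet. "
    "From the Queen. An invitation for the Duchess to play croquet."
)

needle = re.escape("queen")

case_sensitive = len(re.findall(needle, passage))
case_insensitive = len(re.findall(needle, passage, flags=re.IGNORECASE))

print("without nocase:", case_sensitive)
print("with nocase:   ", case_insensitive)
```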
Wide (UTF16) Strings
A common encoding to look for when analyzing Windows malware is UTF-16. This encoding is selected via the wide modifier. It is commonly used to expand the number of printable characters within binary files, to incorporate many of the additional characters common in non-US languages.
As alice.txt doesn’t contain any data in this encoding, because it was intended to be read by a human with a UTF-8 viewer, we have to make a quick example file. The following command will create a new file that contains a UTF-16 “Hello”, followed by a UTF-8 encoded, ASCII-compatible “hello”:
echo -e 'H\0e\0l\0l\0o\0\n\0hello' > hello.txt
The following signature will match the presence of either string within the file:
rule rule2 {
meta:
author = "Coleman Kane"
description = "Detect hello"
strings:
$hello_utf8 = "hello"
$hello_utf16 = "Hello" wide
condition:
$hello_utf16 or $hello_utf8
}
However, the following signature will match only on the presence of both strings within the file:
rule rule2 {
meta:
author = "Coleman Kane"
description = "Detect hello"
strings:
$hello_utf8 = "hello"
$hello_utf16 = "Hello" wide
condition:
$hello_utf16 and $hello_utf8
}
The output for both looks the same, since both strings are in the file:
yara -m -s rule2.yar hello.txt
rule2 [author="Coleman Kane",description="Detect hello"] hello.txt
0xc:$hello_utf8: hello
0x0:$hello_utf16: H\x00e\x00l\x00l\x00o\x00
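The offsets reported above (0x0 and 0xc) can be checked by rebuilding the contents of hello.txt in Python. This sketch assumes the echo command behaved as it does in bash, which includes appending a trailing newline, and it treats the UTF-16 text as little-endian.

```python
# Rebuild the bytes that the echo command wrote to hello.txt.
utf16_hello = "Hello".encode("utf-16-le")  # H, 0, e, 0, l, 0, l, 0, o, 0
ascii_hello = b"hello"

# Between the two strings the echo command wrote a newline and a NUL byte,
# and echo itself appended the final newline.
content = utf16_hello + b"\n\x00" + ascii_hello + b"\n"

print(content)
print("$hello_utf16 at", hex(content.find(utf16_hello)))
print("$hello_utf8  at", hex(content.find(ascii_hello)))
```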
For every string, if no encoding is specified, it defaults to ascii. However, once wide is specified, ascii is no longer assumed and must be explicitly added if you also want to match the ascii variant. Adding nocase does not change this default. Multiple modifiers on a single string are supported and work in an additive fashion.
Here’s an example, rule2_simpler, which combines the logic from the first signature above (matches either UTF-16 or UTF-8) into a single string and condition. Note that I needed to specify both wide and ascii in order to do a nocase match on both encodings:
rule rule2_simpler {
meta:
author = "Coleman Kane"
description = "Detect hello"
strings:
$hello = "hello" wide ascii nocase
condition:
$hello
}
The result:
yara -m -s rule2_simpler.yar hello.txt
rule2_simpler [author="Coleman Kane",description="Detect hello"] hello.txt
0x0:$hello: H\x00e\x00l\x00l\x00o\x00
0xc:$hello: hello
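As a rough model of how the three modifiers combine, the Python sketch below searches the same bytes for either encoding of the string while ignoring case. The function name find_all is hypothetical, and a Python bytes regex is only an approximation of what Yara does internally.

```python
import re

def find_all(data: bytes, literal: str) -> list:
    """Return (offset, matched bytes) for ascii or wide forms, ignoring case."""
    ascii_form = re.escape(literal.encode("ascii"))
    wide_form = re.escape(literal.encode("utf-16-le"))
    # Try the longer wide form first, then the plain ascii form.
    pattern = re.compile(b"(?:" + wide_form + b")|(?:" + ascii_form + b")",
                         re.IGNORECASE)
    return [(m.start(), m.group()) for m in pattern.finditer(data)]

# The same bytes that the earlier echo command wrote to hello.txt.
data = b"H\x00e\x00l\x00l\x00o\x00\n\x00hello\n"

for offset, matched in find_all(data, "hello"):
    print(hex(offset), matched)
```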
The base64 and xor modifiers will be helpful in matching some of the encodings that are present within the RevolutionShell malware samples we’ve been building during class.
As can be seen in the output above, the NULL bytes are represented by \x00. In fact, you can use this C-like syntax to include non-printable byte values inside your match strings as well.
Regular Expression Matching
Yara also supports regular expression pattern matching, which lets you be more flexible with shorter signatures, often at the expense of computational complexity and time. As mentioned earlier, Yara implements its own regular expression syntax that borrows heavily from PCRE. The documentation on what Yara does and does not implement can be found at the link below, and should be referenced when building pattern signatures, so that an unimplemented feature is not used:
For this section, we will return to using the alice.txt file. Also, it is important to review the above syntax documentation, or at least keep it open in a new tab for reference, as I won’t be drilling into every one of the syntax elements here.
Say we want to match a file if it contains a sentence that mentions both Rabbit and Alice. We might use the following signature. If we were to run it across a large library of texts from, say, Project Gutenberg, we might get a high-fidelity identification of copies of Alice’s Adventures in Wonderland:
rule rule3 {
meta:
author = "Coleman Kane"
description = "Find any sentence that discusses rabbit and alice"
strings:
$rabbit_sentence = /[^.]*(rabbit[^.]*alice|alice[^.]*rabbit)[^.]*\./ nocase
condition:
$rabbit_sentence
}
Then, we can run:
yara -m rule3.yar alice.txt
rule3 [author="Coleman Kane",description="Find any sentence that discusses rabbit and alice"] alice.txt
The above pattern basically matches if all of the following are true:
- “alice” and “rabbit” (case insensitive) both appear, in either order, separated by zero or more characters that are not a period.
- That sequence also has zero or more non-period characters before and after it.
- Finally, the entire match must end in a period.
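The three conditions in the list above can be tried out quickly in Python before committing them to a signature. The sample sentences are made up for illustration (the first is adapted from the story), and Python's re engine is not Yara's engine, so this is only a convenient approximation for simple patterns like this one.

```python
import re

# Same pattern as $rabbit_sentence; re.IGNORECASE plays the role of nocase.
pattern = re.compile(
    r"[^.]*(rabbit[^.]*alice|alice[^.]*rabbit)[^.]*\.",
    re.IGNORECASE,
)

samples = [
    # Both names inside one sentence.
    "Alice watched the White Rabbit as he fumbled over the list.",
    # Both names present, but separated by a period.
    "The Rabbit hurried on. Alice sat down.",
    # Only one of the two names.
    "Alice was beginning to get very tired.",
]

results = [bool(pattern.search(s)) for s in samples]
for matched, sentence in zip(results, samples):
    print(matched, "-", sentence)
```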
We can use logic similar to earlier to count the number of matches. However, it is important to recognize that yara is not presently intended to perform the job of match data extraction. The way the logic works, it is intended to report every location of a possible match, by iterating through each byte in the file. As a result, for each sentence matched above, every possible sentence fragment that still satisfies the pattern also registers a match. This has to be taken into consideration when writing signatures.
yara -m -s rule3.yar alice.txt | \
grep ^0x | # Only match lines reporting match positions
wc -l # Count the total number of matches
3129
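The fragment effect is easy to reproduce on a tiny input. The Python sketch below tries the pattern at every offset, which is a simplified stand-in for the reporting behaviour described above, using a made-up two-sentence text. For comparison it also tries a variant of the pattern that must start on the previous sentence's period.

```python
import re

# Hypothetical two-sentence input.
text = "It was late. Then Alice saw the Rabbit."

unanchored = re.compile(
    r"[^.]*(rabbit[^.]*alice|alice[^.]*rabbit)[^.]*\.", re.IGNORECASE)
# Variant that must begin with the period ending the previous sentence.
anchored = re.compile(
    r"\.[^.]*(rabbit[^.]*alice|alice[^.]*rabbit)[^.]*\.", re.IGNORECASE)

def match_starts(pattern, data):
    """Offsets at which the pattern matches when tried at every position."""
    return [i for i in range(len(data)) if pattern.match(data, i)]

print("unanchored:", match_starts(unanchored, text))
print("anchored:  ", match_starts(anchored, text))
```

Every offset from the space after the first period up to the "A" of "Alice" is a valid starting point for the unanchored pattern, so one sentence produces several matches. The anchored variant has only one place to start.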
However, we know that each sentence ends with a period before the next sentence starts. So, if we include the period from the prior sentence in our match pattern, in addition to the terminal period of the current sentence, we can extract single complete sentences without overlap:
rule rule3_extract_sentences {
meta:
author = "Coleman Kane"
description = "Extract any sentence that discusses rabbit and alice, starting with the prior sentence period."
strings:
$rabbit_sentence = /\.[^.]*(rabbit[^.]*alice|alice[^.]*rabbit)[^.]*\./ nocase
condition:
$rabbit_sentence
}
Running the same command line from earlier, we get the following sentence count, which seems more like what would be expected (as the rabbit does not appear frequently throughout the story):
23
Here is some of the example output. Note that the text incorporates some of the UTF-8 byte encodings for enhanced punctuation glyphs, as well as CRLF line endings. These show up as \xXX bytes within the string, similar to the NULL bytes shown earlier:
...
0x22bee:$rabbit_sentence: . It quite makes my\x0D\x0Aforehead ache!\xE2\x80\x9D\x0D\x0A\x0D\x0AAlice watched the White Rabbit as he fumbled over the list, feeling\x0D\x0Avery curious to see what the next witness would be like, \xE2\x80\x9C\xE2\x80\x94for they\x0D\x0Ahaven\xE2\x80\x99t got much evidence _yet_,\xE2\x80\x9D she said to herself.
0x22cdf:$rabbit_sentence: . Imagine her\x0D\x0Asurprise, when the White Rabbit read out, at the top of his shrill\x0D\x0Alittle voice, the name \xE2\x80\x9CAlice!\xE2\x80\x9D\x0D\x0A\x0D\x0A\x0D\x0A\x0D\x0A\x0D\x0ACHAPTER XII.
As can be seen, there are some cases where a match spans past the end of a sentence, because we only considered the period as a sentence ending. We can modify the pattern to consider more than just the period by extending the character classes:
$rabbit_sentence = /[.!?][^.!?]*(rabbit[^.!?]*alice|alice[^.!?]*rabbit)[^.!?]*[.!?]/ nocase
Using the earlier UNIX command line again, we can count the number of sentence matches and see that it is now 19. We would have to write our own code to normalize the output and strip the leading punctuation from each extracted sentence, as yara was not implemented for that function specifically.
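As a starting point for that kind of post-processing, here is a small Python sketch. The function names are my own, and the input is an abbreviated line modeled on the -s output shown earlier. It assumes that every \xNN sequence in the output is an escaped byte, which would be wrong for a file that contains a literal backslash followed by "x".

```python
import re

def unescape_yara(text: str) -> str:
    r"""Turn \xNN escapes from yara -s output back into bytes, then decode UTF-8."""
    raw = re.sub(
        rb"\\x([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        text.encode("utf-8"),
    )
    return raw.decode("utf-8", errors="replace")

def clean_sentence(text: str) -> str:
    """Strip the leading sentence terminator and collapse line breaks."""
    decoded = unescape_yara(text)
    stripped = decoded.lstrip(".!? \r\n\t")
    return " ".join(stripped.split())

# Abbreviated example modeled on the matched text shown above.
line = (r". Imagine her\x0D\x0Asurprise, when the White Rabbit read out "
        r"the name \xE2\x80\x9CAlice!\xE2\x80\x9D")

print(clean_sentence(line))
```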
Match Counts
Say we want to match the texts even more narrowly. We might want to find files where the match count is exactly this value, or close to it. We can break these confidence levels up into multiple rules:
rule rule3_exact {
meta:
author = "Coleman Kane"
description = "Match 19 sentences discussing rabbit and alice"
fidelity = "high"
strings:
$rabbit_sentence = /[.!?][^.!?]*(rabbit[^.!?]*alice|alice[^.!?]*rabbit)[^.!?]*[.!?]/ nocase
condition:
#rabbit_sentence == 19
}
rule rule3_maybe {
meta:
author = "Coleman Kane"
description = "Match close to 19 sentences that discuss rabbit and alice"
fidelity = "medium"
strings:
$rabbit_sentence = /[.!?][^.!?]*(rabbit[^.!?]*alice|alice[^.!?]*rabbit)[^.!?]*[.!?]/ nocase
condition:
((#rabbit_sentence > 14) and (#rabbit_sentence < 19)) or
((#rabbit_sentence > 19) and (#rabbit_sentence < 24))
}
The above signature file contains two rules:
- The first matches only if exactly 19 sentences match our pattern
- The second matches only if the match count is close to, but not exactly, 19: between 15 and 18, or between 20 and 23 (inclusive)
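The boundaries of the two conditions are easy to get wrong by one, so it can help to restate them in ordinary code. The function below is a hypothetical Python mirror of the two conditions, returning the fidelity value of whichever rule would fire.

```python
def fidelity_for(count: int):
    """Mirror the conditions of rule3_exact and rule3_maybe for a match count."""
    if count == 19:                          # rule3_exact
        return "high"
    if 14 < count < 19 or 19 < count < 24:   # rule3_maybe
        return "medium"
    return None                              # neither rule matches

for n in (14, 15, 18, 19, 20, 23, 24):
    print(n, fidelity_for(n))
```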
This is a great example of breaking logic up into multiple tiers of rules, and then incorporating that into the metadata so that we can extract the confidence in our matching from the yara output.
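Pulling the fidelity value back out of the yara -m output could look something like the Python sketch below. The sample line is hypothetical, modeled on the -m output format shown earlier in this lecture rather than copied from a real run. The parser assumes string-valued metadata and a target path that does not contain "] ".

```python
import re

LINE_RE = re.compile(r'^(?P<rule>\S+) \[(?P<meta>.*)\] (?P<target>.+)$')
PAIR_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

def parse_match_line(line: str):
    """Split one `yara -m` output line into (rule name, metadata dict, file)."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    meta = dict(PAIR_RE.findall(m.group("meta")))
    return m.group("rule"), meta, m.group("target")

# Hypothetical output line, following the format of the earlier examples.
sample = ('rule3_exact [author="Coleman Kane",'
          'description="Match 19 sentences discussing rabbit and alice",'
          'fidelity="high"] alice.txt')

rule, meta, target = parse_match_line(sample)
print(rule, target, meta["fidelity"])
```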