Semantics-Aware Malware Detection - 王朝网络宽屏版

Notes from: "Semantics-Aware Malware Detection"

[Christodorescu, Jha, Seshia, Song, Bryant]

--------

Problem:

--------

- old trend in malware: create a supervirus, infect millions with it

- new trend in malware: create a virus, create several derivatives of

that virus (each of which would require a distinct signature for AV

scanners) which differ from the base virus only slightly, have each

derivative infect smaller numbers of (possibly just local) hosts

--> so, how to make virus signatures for the main base virus that also

"catch" these derivatives, rather than having to have a sample of each

derivative and having to have even more bloated signature databases

------------------

malware detector : system which attempts to detect whether a program has

------------------ malicious intent

- evade detection via obfuscating the malware

- malware detectors use pattern matching are susceptible to obfuscation

-- a purely syntactic approach

- match some sequence of instructions (perhaps using regular expressions)

- but not matching based on what code DOES, matching based on

particular instruction sequences in code

----------------

Their approach :

----------------

- malware detection algo incorporates instruction semantics : seeking to

match based on malicious program traits

- so theoretically would catch derivative programs which have the same

functionality as the original

========

I. Intro

========

malware instance : program that has malicious intent

- e.g. viruses, trojans, worms

- can classify malware w.r.t. its propagation method and goal

malware detector : system that attempts to identify malware

- e.g. a virus scanner : uses sigs and other heuristics to identify malware

goal of malware writer : modify or morph malware so that it evades detection

- ideally do this derivative/variant-generation automatically

- e.g. evade detection via program obfuscation

-- e.g. virus encrypts its malicious payload and decrypts it during exec

-- e.g. polymorphism and metamorphism

-- polymorphic virus : obfuscates decryption loop using transformations

-- sample transformations :

-- nop insertion

-- code transposition (change order of instructions and put jumps

in in order to retain original semantics)

-- register reassignment (permute register allocation)

-- metamorphic virus : obfuscate entire virus

-- on replication, change code in variety of ways

-- e.g. code transposition, substitution of equivalent

instruction sequences, change of conditional jumps,

- another (obfuscation) technique : add new behaviors to existing malware

-- by adding new behavior to base code, able to evade detection even though

base code would have been detected

-- cf. Sobig.A through Sobig.F (worms), Netsky, Beagle (from 2004) to present

- commercial malware detectors : use simple pattern matching approach

-- match : program contains a sequence of instructions that is matched

by a regular expression

-- basic deficiency : ignore SEMANTICs of these instructions

-- not resilient to slight variations

-- so must use different patterns for detecting two malware instances

that are slight variations of each other

- suppose program P transformed by a compiler phase to produce program P^t

(maybe that compiler phase consisted of register allocation)

- translation-validation problem :

-- determine whether P^t simulates P

-- used to prove that various compiler phases preserve semantics of P

- view hacker obfuscations as compiler phases

-- then malware detection problem seems similar to translation-validation

problem

- after hacker obfuscation produces P' out of P (where P is a

known bad program), does P' have same semantics as P?

-- but hacker obfuscations will produce P' such that P' has *different*

semantics than P (added functionality, e.g.)

- so P and P' are not equivalent

-- plus, translation-validation makes assumptions a/b transformed program :

-- e.g. all branches in P' must match branches in P

- this does not necessarily hold for hacker transformations

-- all the same, some useful ideas from translation-validation :

(1) modeling semantics of instructions, (2) use decision procedures

- observe that certain malicious behaviors appear in all variants of a

certain malware (e.g. the decryption loop in a polymorphic virus)

-- but decryption loop, e.g., might appear in different form in variants

- algo for discovering specified malicious behaviors in a given program

- formal semantics : define problem of determining whether a program

exhibits a specified malicious behavior

-- in general, undecidable

- semantics-aware malware detection algo :

-- handle limited set of transformations

-- detects multiple variants with single template

-- more resilient than commercial AV scanners [claim]

==================================

II. Semantics of Malware Detection

==================================

(1) Specify malicious behavior : via templates

Template T : instruction sequence (I_T) which uses variables (V_T) and constants (C_T)

----------------

Sample template:

----------------

(1)A = const_addr1 // encrypted memory starts here

(2)B = const_addr2 // location to put decrypted memory

(3)condition(A)?TRUE : goto (7), FALSE : goto (4)

(4)mem[B] = f(mem[A])

(5)A = A + c

(6)B = B + d

goto (3)

(7)jump const_addr2

variables, V_T : { A, B }

symbolic constants, C_T : { const_addr1, const_addr2, c, d, f(x), p(x) }

-- symbolic constants : either n-ary function, F(n), or n-ary predicate, P(n)

-- actual constants (e.g. c, d) are 0-ary functions

-- n-ary function : takes n arguments, returns a value

-- n-ary predicate : takes n arguments, returns a boolean

-- c : TAU <==> the type of symbolic constant c is TAU

==> resilient to obfuscations :

-- register renaming (since use vars and constants)

-- changing starting addy of a memory block (since use vars)

==> When does an instruction sequence contain a malicious behavior?

-- if template and instruction sequence have the same effect on memory

-- IOW, the malicious behavior specified by the template is

demonstrated by the instruction sequence

==> also resilient to insertion of irrelevant instruction sequences

-----------------------------------------------------------------

Sample malware instance (instruction sequence) of above template:

-----------------------------------------------------------------

(1) eax = 0x403000

(2) ebx = 0x400000

(3) edx = eax + 3

(4) eax >= 406000 ?TRUE : goto (9), FALSE : goto (5)

(5) mem[ebx] = mem[edx - 3] << 2 + 1

(6) eax = eax + 4

(7) edx = edx + 4

(8) ebx = ebx + 4

goto (4)

(9) jump 0x400000

- edx above is totally unnecessary, could have written :

(1) eax = 0x403000

(2) ebx = 0x400000

(3) eax >= 0x406000 ?TRUE : goto (7), FALSE : goto (4)

(4) mem[ebx] = mem[eax] << 2 + 1

(5) eax = eax + 4

(6) ebx = ebx + 4

goto (3)

(7) jump 0x400000

- but this second instance is a more obvious match to the given template

- so provided first instance to illustrate that even when additional

registers are included, that the first instance matches would be caught

--> some resilience based on semantic matching to hacker obfuscation

- execution context : an assignment of values (of appropriate types) to

the set of symbolic constants, C_T

-- e.g. const_addr1 <-- 0x403000

const_addr2 <-- 0x400000

condition(X) <-- X >= 0x406000

f(X) <-- X << 2 + 1

c <-- 4

d<-- 4

-- EC_T : execution context for a template T

EC_T( c ) : where c is a constant, returns the value of that constant

-- EC_T : is a function with domain C_T such that

for all c in C_T, the type of c and the type of EC_T(c) are the same

- EC_T( T ) : the template obtained by replacing every constant c in C_T by EC_T(c)

- state s^T for the template T is a 3-tuple denoted : ( val^T, pc^T, mem^T )

-- val^T : V_T --> Values, is an assignment of values to the variables in V_T

-- pc^T : value for the program counter

-- mem^T : Addr --> Values, gives the memory content

- a state for the instruction sequence I is a 3-tuple (val, pc, mem)

-- val : Reg --> Values, is an assignment of values to the set of registers

-- pc : a value for the PC

-- mem : Addr --> Values, gives memory contents

- S^T is state space for template; S is state space for instruction sequence

- given EC_T for a template T and template is in state s^T

- if execute instruction i from the template EC_T(T) in state s^T, we

transition to a new state s_1^T and generate a system event e (event

may be sys call or kernel trap, e.g.); denote this via :

s^T ---[e]---> s_1^T

- if an instruction i does NOT generate a system event, e(i) = null

[it generates the null event]

- for every initial template state, s_0^T, executing T in an execution

context EC_T generates a sequence :

SIGMA( T, EC_T, s_0^T ) = s_0^T ---[e_1^T]---> s_1^T ---[e_2^T]---> ...

- for i >= 1,

s_i^T == state after executing i'th instruction from the template EC_T(T) and

e_i^T == the event generated by the i'th instruction

--> s_i^T ---> execute (i+1)-th instruction ---> generate event ---> s_(i+1)^T

- if template doesn't terminate, sequence can be infinite

SIGMA( I, s_0 ) == sequence when the instruction sequence I is executed

from the initial state s_0

==========================================================================

- Definition 1 : "instruction sequence I contains a behavior specified by T"

==========================================================================

an instruction sequence I contains a behavior specified by the template

T if there exists a program state s_0, an execution context EC_T and a

template state s_0^T such that mem( s_0^T ) == mem( s_0 ) and the two

sequences SIGMA( T, EC_T, s_0^T ) and SIGMA( I, s_0 ) are finite and :

-- note that the initial PCs don't have to be the same

------------------------------------

requirements for the two sequences :

------------------------------------

(1) SIGMA( T, EC_T, s_0^T )

= s_0^T ---[e_1^T]---> s_1^T ---[e_2^T]---> ... ---[e_k^T]---> s_k^T

SIGMA( I, s_0 )

= s_0 ---[e_1]---> s_1 ---[e_2]---> ... ---[e_r]---> s_r

Let affected( SIGMA( T, EC_T, s_0^T ) ) be the set of addresses such that :

mem(s_0^T)[a] != mem(s_k^T)[a]

(addresses whose contents before applying the template from the initial

state are different than AFTER applying the template to completion)

REQUIRED : mem( s_k^T )[a] == mem( s_r )[a] for all a in affected(...)

for any memory address whose contents change when applying the template,

the value of those contents are the same as the value of those contents

when applying the instruction sequence from the init state to completion

RECALL: the memory in the two starting states is identical

THIS SAYS: call M the set of memory addresses whose content changes

during the execution of the template. Then for every m in M,

mem[m] at the end of execution of the template must equal mem[m] at

the end of execution of the instruction sequence

--> NOTE that this is NOT the same thing as saying that the contents

of memory at the end of both must be identical

we only care about the memory addresses whose content changed

during execution of the template

this means that there could be some memory address A and its

contents could be changed during execution of the instruction

sequence BUT NOT during execution of the template

--> for example, edx in the given example

--> edx could have value 0 at the beginning of both

--> then edx could have value 0 at the end of the template but

have value 0x406000 at the end of the instruction sequence

==> this is OK; edx as such is "extraneous"

(2) ignoring null events, the event sequence < e_0^T, e_1^T, ..., e_k^T >

is a subsequence of < e_1, e_2, ..., e_r >

two system events e_i^T and e_j MATCH if the arguments to each are the

same and the return values from each are the same

(3) if pc( s_k^T ) is in affected( SIGMA( T, EC_T, s_0^T ) )

then pc( s_r ) is in affected( SIGMA( T, EC_T, s_0^T ) )

- if the program counter at the end of applying the template is an

address whose contents have changed (due to applying the template)

then the program counter after issuing the instruction sequence

must also have changed due to that instruction sequence

[A problem is that affected(...) contains ADDRESSES whereas pc( s_i^T )

refers to a VALUE or the *contents* of some address]

- do we assume that the *address* of the program counter stays static?

< Tangent re: condition 3; see cond_3_tangent.txt for details >

==============================================

- Definition 2 : Template Matching Problem (TMP)

==============================================

a program P satisfies a template T iff P contains an instruction

sequence I such that I contains a behavior specified by T

NOTATION : P |= T (P satisfies template T)

e.g. P is malware; T specifies decryption loop; if P contains *any*

instruction sequence I which contains the behavior specified by T

then P satisfies T (e.g. PGP should satisfy that T)

----------------

Variant Family :

----------------

- Let T be a set of templates (e.g. T = { T_1, T_2, ..., T_l } and

T_1 specifies decryption loop behavior, T_2 specifies email addy

search behavior, ... )

- The set T defines a variant family F :

{ P | for all T_i in T, P satisfies T_i }

-- note that it is possible for some family members, f in F,

to have *additional* behavior than what is specified by T

-- so adding functionality to a family member will not cause that

family member to be thrown out of the family

-- removing some functionality may result in booting the family

member though

------------------------------

- Theorem 1 : TMP is undecidable

------------------------------

Let M be a Turing machine and P_M be a program that uses instructions

and simulates M

Assume there is some address sp_addr which P_M doesn't alter the contents of

Before starting to simulate M, P_M sets mem[sp_addr] = 0

After simulating M, P_M sets mem[sp_addr] = 1

Consider template T_1 : mem[sp_addr] = 0

mem[sp_addr] = 1

So P_M satisfies template T_1 if and only if M halts which implies a

solution to the halting problem which is undecidable therefore...

------------------

A weaker semantics

------------------

Semantics from above may be too strict; imagine a template uses certain

memory locations to store temporary values then such mem locs should not

be checked for equality in comparing the executions of an instruction

sequence and a template.

core memory locations: core( SIGMA( T, EC_T, s_0^T ) ) is a subset of the

affected memory addresses

Then modified condition 1 :

- for all a in core( SIGMA( T, EC_T, s_0^T ) ),

mem( s_k^T )[a] = mem( s_r )[a]

- so we only check "core" memory locations for equality

How to define "core" memory locations?

- label each instruction in I_T as "temp" or "persistent"

- core memory locations = any address in affected( ... ) that is the

target of a PERSISTENT load or store

=======================================

III. Semantics-aware matching algorithm :

=======================================

- Sound but not complete

Sound : no false positives (anything that matches definitely matches)

--> every solution produced is indeed a solution

Complete : no false negatives (catch everything that matches)

--> produces all possible solutions

- tool takes as input : template and a binary (for MSFT Windows on Intel

IA-32 (x86) platform)

- tool outputs whether any fragment of the program contains the behavior

specified by the template

-- describes the matched fragment of the program plus relates template

vars to program registers and memory locations and lists irrelevant

instruction sequences that are part of the matched program fragment

- architecture : tool built on top of IDA Pro disassembler

- binary comes in and is disassembled, create control flow graphs from

it (one per program function), produce intermediate representation (IR)

-- IR generated using library of x86 instruction transformers

-- IR is architecture-independent

-- IR can still contain library and sys calls that are platform specific

--> remove this dependency by using summary functions expressed in IR

- not much info on this (how works exactly, examples, ...)

==> point of all this, though, is for IR to be platform- and

architecture-independent, so could conceivably use one template

to detect some virus with semantically identical variants

produced for Win32, Linux, ...

- so malware detector algo takes Template and program IR, interacts with :

- decision procedures, which include :

-- nop library, random execution, simplify, UCLID

and outputs YES or NO

nop library : predefined patterns of irrelevant instructions

randomized execution : randomized x86 interpreter

simplify : theorem prover

UCLID : decision procedure with bit-vector arithmetic

===============================================

- algo description, malware detection algo : A_MD

===============================================

- finds, for each template node, a matching node in the program

- a template node is an expression; e.g. "A = const_addr1", "condition(A)?",

- a program node is an expression; e.g. "eax = 0x403000", "eax >= 0x406000?"

- two nodes, n_T (from the template) and n_P (from the program) match if

there exists an assignment to vars from the n_T expression that unifies

it with n_P

- once two matching nodes are found, check whether def-use relationships

true b/n template nodes also hold true in the corresponding program

-- def-use relationship :

two nodes n_1 and n_2 are related by a def-use relationship IF

n_1's DEFinition of a var reaches n_2's USE of the same var

e.g. n_1 = statement S_i and n_2 = statement S_j and

n_1 defines some variable X and n_2 *uses* that value of X

(as set during statement S_i)

[see def_use_rp.txt in this folder for slighly more in-depth

though still somewhat elementary explanation]

- if all the nodes in the template have matching counterparts in the

program that satisfy these conditions, then the algo has found a

program fragment that satisfies the template and produces a proof of

this relationship

- formally, given a malicious code template T and a program P,

A_MD : is a predicate on pairs of programs and templates

A_MD : Programs x Templates ---> { "yes", "don't know" }

"don't know" == algo can't determine that prog satisfies the template

such that A_MD( P, T ) returns "yes" when P satisfies T under the

conditions below

----------------------

Simplified template T; X^key == X (bit-vector XOR) key

----------------------

(1) A = const_addr1// b_1

(2) B = const_addr2// b_2

(3) A > const_addr3 ? TRUE: goto 7, FALSE: goto 4// b_3

(4) mem[B] = mem[A]^key// b_4

(5) A = A + c// b_5

(6) B = B + d// b_6

goto 3 // they have another typo here

(7) jump const_addr2// b_7

---------------------

Simplified program P:

---------------------

(1) eax = 0x403000// a_1

(2) ebx = 0x400000// a_2

(3) nop// a_3

(4) ecx = eax + 1// a_4

(5) ecx - 1 > 0x406000 ? TRUE: goto 10, FALSE: goto 6// a_5

(6) eax = ecx - 1// a_6

(7) mem[ebx] = mem[eax]^5// a_7

(8) eax = eax + 1// a_8

(9) ebx = ebx + 2// a_9

goto 4

(10) jump 0x400000// a_10

===========================================================

For A_MD(P, T) = "yes", the following conditions must hold:

===========================================================

-----------------------------------------------

(1) matching of template nodes to program nodes

-----------------------------------------------

- each node n in T has to unify with a node m in P

- call n_T : the set of template nodes and n_P the set of program nodes

- IOW, there is an onto partial function f from the nodes of P to the

nodes of T (there may exist nodes p in n_P such that f(p) = not defined;

however for every node t in n_T, there exists some p in n_P such that

f(p) = t)

then f : { nodes in P } ---> { nodes in T }

{ n_P } ---> { n_T }

{ a_1, a_2, ..., a_m } ---> { b_1, b_2, ..., b_n }

m >= n

then: f( a_i ) = b_j for some i,j

and a_i and b_j are unifiable

B( X, n, m ), where n = f( m )

== the binding of variable X (where X is referred to in the template

instruction at node n) TO an expression referred to in the

program instruction at node m;

==> So is it okay if more than one program node maps to the SAME

template node? By the definition of f, this is allowed...

e.g. B(A,1,1) = eax alsoB(B,2,1) = eax

B(A,2,2) = ebxB(A,1,2) = ebx

B(A,3,5) = (ecx - 1)

B(A,5,8) = eax

B(B,6,9) = ebx

----------------------------------------------------------------------------------

sample definition of f [there is more than one possible that satisfies this cond]:

----------------------------------------------------------------------------------

a_1 --> b_1

a_2 -->b_2

a_3 -->undefined

a_4 -->undefined

a_5 -->b_3

a_6 -->undefined

a_7 -->b_4

a_8 -->b_5

a_9 -->b_6

a_10 --> b_7

---------------------------------

(2) preservation of def-use paths

---------------------------------

- for each node n in T, define def(n) as the set of vars defined in n

- for each node n in T, define use(n) as the set of vars USED in node n

---------- ------ ------

nodes in T def(n) use(n)

---------- ------ ------

b_1 A none

b_2 B none

b_3 none A

b_4 none A, B

b_5 A A

b_6 B B

b_7 none none

- nodes n_pre : predecessor to the template entry node

- nodes n_post : successor to all template exit nodes

- def( n_pre ) = use( n_post ) = V_T

- so def( n_pre ) = set of all nodes defined in the template?

- and use( n_post ) = set of all nodes used in the template?

- a def-use path is a sequence of nodes < n_1, n_2, ..., n_k > such that

+ for 1 <= i < k, there is an edge from n_i to n_(i+1) in T

[i.e. < n_1, n_2, ..., n_k > is a valid path through template T ]

+ I( n_1 ) == the instruction at node n_1

I( n_1 ) DEFines a variable that I( n_k ) USEs

[i.e. def( n_1 ) INTERSECT use( n_k ) != {} ]

+ none of the instructions at the intermediate nodes redefine the

value of the variables in def(n_1)

[i.e. def(n_1) INTERSECT def(n_i) = {} for 1 < i < k ]

- so from our template graph,

paths : n_1 --> n_2 --> n_3

n_3 --> n_4

n_3 --> n_7

n_4 --> n_5 --> n_6 --> n_3

def( A ) : { n_1, n_5 } use( A ) : { n_3, n_4, n_5 }

def( B ) : { n_2, n_6 } use( B ) : { n_4, n_6 }

---------------------------------------------

def-use relationships, denoted duse(n_1,n_2):

---------------------------------------------

A : duse( n_1, n_3 ) path : n_1 --> n_2 --> n_3

A : duse( n_1, n_4 ) path : n_1 --> n_2 --> n_3 --> n_4

A : duse( n_1, n_5 ) path : n_1 --> n_2 --> n_3 --> n_4 --> n_5

A : duse( n_5, n_3 ) path : n_5 --> n_6 --> n_3

A : duse( n_5, n_4 ) path : n_5 --> n_6 --> n_3 --> n_4

A : duse( n_5, n_5 ) path : n_5 --> n_6 --> n_3 --> n_4 --> n_5

B : duse( n_2, n_4 ) path : n_2 --> n_3 --> n_4

B : duse( n_2, n_6 ) path : n_2 --> n_3 --> n_4 --> n_5 --> n_6

B : duse( n_6, n_4 ) path : n_6 --> n_3 --> n_4

B : duse( n_6, n_6 ) path : n_6 --> n_3 --> n_4 --> n_5 --> n_6

- any def-use relationship in the template, e.g. between nodes n and n',

must hold between nodes m and m' WHERE n = f(m) and n' = f(m')

- consider two nodes n and n' in T that are def-use related

- let nodes n and n' in T be matched to nodes m and m' in P

- suppose X is a var that is defined in n and used in n'

B( X, n, m ) = r "X is bound to expression r in node m"

B( X, n', m' ) = r' "X is bound to expression r' in node m'"

- for example, n = n_1, n' = n_3, X = A

m = m_1, m' = m_5, X = A

then B(X, n_1, m_1 ) = eax

B(X, n_3, m_5 ) = (ecx - 1)

so r = eax and r' = (ecx - 1)

then need to make sure that value of (eax) after "eax = 0x403000" (m_1)

is equal to value of (ecx - 1) before "ecx - 1 > 0x406000" (m_5)

*for every program path corresponding to the template path from n_1 to n_3*

- in this case there is just one such program path

- need to make sure that value of r AFTER executing I(m) is same as

value of r' BEFORE executing I(m')

- let's try another example: n_6 and n_4 are related by def-use r/p in T

and n_6 corresponds to m_9 and n_4 corresponds to m_7

B( B, n_6, m_9 ) = ebx

B( B, n_4, m_7 ) = ebx

condition is trivially satisfied

===========================================================

- Theorem 2 : A_MD is sound with respect to the TMP semantics;

===========================================================

i.e. A_MD( P, T ) = "yes" ===> P |= T

- sound but not complete

- can show there exists a program P and a template T such that

P satisfies T but A_MD( P, T ) returns "don't know"

template T program P

---------- ---------

for ( i = 0; i < 10; i++ ) for ( i = 0; i < 10; i += 2 )

a[i] = 0; a[i] = 0;

for ( i = 1; i < 10; i += 2 )

a[i] = 0;

- A_MD won't catch this matching as the ordering of updates in the

template loop is different than that of the program loops

=================

- Local unification

=================

-- produces an assignment to vars in an instruction at a template node

such that this matches a program node instruction

-- program node expressions contain ground terms only

-- instantiate template vars with program expression

-- template node 1 ( A = const_addr1 ) can be unified with

program node 1 ( eax = 0x403000 ) OR

program node 2 ( ebx = 0x400000 )

but not with program node 8 ( eax = eax + 1 ) or 9 ( ebx = ebx + 2 )

++ so note that unification assigns to vars BUT ALSO to constants

-- [ t_1, p_1 ] { A <-- eax, const_addr1 <-- 0x403000 }

[ t_2, p_2 ] { B <-- ebx, const_addr2 <-- 0x400000 }

[ t_3, p_5 ] { A <-- ecx - 1, const_addr3 <-- 0x406000 }

[ t_4, p_7 ] { B <-- ebx, A <-- eax, key <-- 5 }

[ t_5, p_8 ] { A <-- eax, c <-- 1 }

[ t_6, p_9 ] { B <-- ebx, d <-- 2 }

[ t_7, p_10 ] { const_addr2 <-- 0x400000 }

--> B is only assigned ebx so we don't have to worry about B

--> const_addr1 is only assigned 0x403000

--> const_addr2 is only assigned 0x400000

--> key is only assigned 5

--> const_addr3 is only assigned 0x406000

--> c is only assigned 1

--> d is only assigned 2

--> A is assigned both: { eax, ecx - 1 }

RULES about producing an assignment to vars in an instruction at a

template node such that the resulting expr matches a prog node instr:

(1) a var in the template can be unified with any prog expr except for

assignment expressions

(2) a constant in the template can only be unified with a prog constant

(3) the memory function (in the template) can only be unified with the

program's memory function

(4) a predefined operator in the template (e.g. OR, AND, XOR, NOT,

BITAND, LEQ, PLUS, MINUS, ...) can only be unified with THE SAME

operator in the program

(5) an external function call in the template can only be unified with

the same external function call in the program

==> result of this is a binding relating template variables to program expressions

- note that we don't require that bindings be consistent (monomorphic) since

as above we have that A is bound to eax at one point yet bound to ecx - 1

at another

-- such a requirement would be overly restrictive (not resilient to

obfuscation attacks)

- instead we want to check program expressions bound to the same template

variable (e.g. eax is bound to A and so is ecx - 1 )

--> verify that such expressions change value in the same way that the

template variable changes value

---------------------=====================================

ACCOMPLISH THIS via : value preservation on def-use chains

---------------------=====================================

- for each def-use chain in the template, check :

whether the value of the program expression corresponding to the

template-variable use is the same as the value of the program

expression corresponding to the template-variable definition

- e.g. t_1 and t_3 are def-use related for variable A

p_1 maps to t_1 and p_5 maps to t_3

R : the program fragment between (exclusive between) p_1 and p_5

R : ebx = 0x400000

nop

ecx = eax + 1

so the value for A after p_1 must be the same as the value for A

before p_5

-- the value for A at p_1 is : eax

-- the value for A at p_5 is : ecx - 1

so, confirm that from < p_2, p_3, p_4 >, eax == (ecx - 1)

"the expressions corresponding to template variable A after p_1

must be equal in value to the expressions corresponding to

template variable A before p_5 (since p_1 and p_5 are related by

a def-use relationship for variable A)"

express this via a predicate :

for each path I in program fragment R

val_pre( I )(eax) = val_post( I )(ecx - 1)

[e.g. I = <p_2, p_3, p_4>]

val_pre( I )( x ) : the value of x before path I

val_post( I )( x ) : the value of x after path I

- for each match between a program node p_i and a template node t_j, the

algo checks whether the def-use chain is preserved for each program

expression corresponding to a template var used at the matched node

- does R maintain the value predicate PHI :

PHI = { for all I in R,

( val_pre<I>(eax) = val_post<I>( ecx - 1 ) )

}

- treat decision procedures as oracles that answer the question,

"Does a program fragment R maintain an invariant PHI?"

- another example, def-use chain t_5, t_3 maps to p_8, p_5

R : < p_9, p_4 >

: ebx = ebx + 2

ecx = eax + 1

definition of A in t_5 : eax

definition of A in t_3 : ecx - 1

val_pre( I )( eax ) =? val_post( ecx - 1 ) Yes.

- another example, def-use chain t_1, t_4 maps to p_1, p_7

R : < p_2, p_3, p_4, p_5, p_6 >

: ebx = 0x400000

nop

ecx = eax + 1

ecx - 1 > 0x406000 ?

eax = ecx - 1

definition of A in t_1 : eax

definition of A in t_5 : eax

val_pre( I )( eax ) =? val_post( eax ) Yes.

=====================================================

- Using decision procedures to check value-preservation

=====================================================

- want to figure out whether a program fragment P is a semantic nop

with respect to some conditions, e_1 and e_2

if -- for every path I in that program fragment -- the value of e_1

*before* that path is the same as the value of e_2 *after* that path

then that PHI( I, e_1, e_2 ) holds

if ( PHI( I, e_1, e_2 ) for every I in R holds )

then PHI( P, e_1, e_2 ) holds and D returns TRUE

if on the other hand, there exists some I' in R for which

PHI( I', e_1, e_2 ) == "don't know" or "false"

then D( P, e_1, e_2 ) = "don't know" or "false"

we can also determine whether: !PHI( P, e_1, e_2 ) holds...

if ( PHI( I, e_1, e_2 ) for every I in R *does not hold* )

then !PHI( P, e_1, e_2 ) does not hold

D^+ : a decision procedure for PHI( P, e_1, e_2 )

D^- : a decision procedure for !PHI( P, e_1, e_2 )

=======================================================================

- "semantic nops" : some program fragment I is a semantic nop with

respect to two conditions e_1 and e_2 if the value of e_1 before that

fragment is the same as the value of e_2 after that fragment

=======================================================================

- value-preservation oracle : decision procedure, determines whether a

given program fragment is a semantic nop w.r.t. certain prog vars

D determines whether the value predicate,

PHI( P, e_1, e_2 )

holds for all paths in P; i.e.

for all paths I in P,

( val_pre<I>(e_1) = val_post<I>(e_2) )

-------------

- NOP library : identifies sequences of nop instructions

-------------

-- annotate basic blocks as nop sequences when applicable

-- use actual nop instructions plus some sequences which are known to be nops

-----------------------------

- randomized symbolic execution

-----------------------------

-- program frag executed using a random initial state

-- at completion, check whether !PHI(P,e_1,e_2) is true

-- if so, at least one path in the program fragment is not a

semantic nop; and so whole P is NOT a semantic nop

-- else, whole P may be a semantic nop

--------------------------

- theorem proving : SIMPLIFY

--------------------------

-- value preservation problem formulated as a theorem to be proved

(as long as program frag has NO loops)

-- program frag represented as state transformer DELTA

-- then use SIMPLIFY to prove the formula

DELTA ==> PHI( P, e_1, e_2 )

- if SIMPLIFY confirms that the formula is valid,

then the program fragment is a semantic nop

-- can't handle programs which include bit-vector operations

2: ebx = 0x400000

3: nop

4: ecx = eax + 1

Simplify formula :

(IMPLIES (AND (EQ ebx_1 4194304)

(EQ ecx_4 (+ eax_pre 1) )

(EQ ecx_post ecx_4)

)

(EQ eax_pre (- eax_post 1))

)

same as:

(ebx_1 == 4194304) && (ecx_4 == (eax_pre + 1)) && (ecx_post == ecx_4)

==> ( eax_pre == (eax_post - 1) )

- theorem proving : UCLID

-- based on UCLID infinite-state bounded model checker

-- logic supported by UCLID is a superset of that supported by SIMPLIFY

-- UCLID precisely models int and bit-vector arith

-- model program fragment instructions as state transformers for each

-- UCLID simulates program frag for a given # of steps and determines

whether PHI( P, e_1, e_2 ) holds at the end of the simulation

=========================

Strengths and Limitations

=========================

----------------------------------------------------------------

4 classes of obfuscations, A_MD can handle (to varying degrees):

----------------------------------------------------------------

- code reordering, equivalent instruction replacement, register renaming,

garbage insertion

(1) code reordering : change order of instructions; maintain execution

order using jumps

- handled via use of control flow graphs

(2) register renaming : reassigns registers in selected program fragments

- so any tool searching for signature which is based on specific register

names will miss the obfuscated variant

- use templates; vars assigned registers only during unification

- so can detect any program with the same behavior as the template

irrespective of register allocation

(3) garbage insertion : insert instruction sequences that don't change

behavior of program

- uses decision procedures to reason about value predicates on def-use chains

(4) equivalent instruction replacement

- uses rich instruction set to replace groups of consecutive instructions

with other short instruction sequences that have the same semantics

- e.g. to add 1 to register X, can use: < add X, 1 > or < inc X >

- handle limited kind of this via normalize code into IR which supports only

semantically-disjoint operations

-----------

LIMITATIONS

-----------

- doesn't recognize that: "x = x + 2" is equivalent to "eax = eax << 1"

- can only detect programs that exhibit same ordering of memory updates

-- recall the problem where we populated a[i] with zero via pattern:

<1,2,3,...,20> versus <2,4,6,8,...,20,1,3,5,...,19>

-- comes from use of def-use chains for value preservation checking

but NB : automatically generating functionally-equivalent variants is hard

Used : (1) decryption-loop template and (2) mass email template