Binary Synthesis Translation (BST)
Barton Clark, (Naval Surface Warfare Center Q21), firstname.lastname@example.org, and
Jeffrey Solka, (Naval Surface Warfare Center Q20), email@example.com
Binary Synthesis Translation (BST) is a system that allows us to safely extract a portion(s) of a legacy/COTS executable. Once this code has been extracted we can then apply advanced optimization/translation operations. This approach can provide for execution of the extracted code upon a secure platform/environment.
The outline of this abstract is as follows I will first discuss the work that is currently being conducted at NSWCDD with regard to binary synthesis. I will then discuss a pilot program that is currently under way utilizing our technology to mitigate critical anti tamper technology gaps that exists among several naval weapon systems that are currently being sold through foreign military sales.
Decompilers have been around for a long time indeed the first known code decompiler was developed by the "Father of decompilation" Maury Halstead. In 1960, Maury directed a project on decompilation mainly to show the usefulness of the Neliac language as a Universal Computer Oriented Language, as well as a problem-oriented language. Joel Donnelly and Herman Englander implemented the D-Neliac decompiler for the Univac M-460 Countess Computer while on Maury's staff. ,Each time Joel would bring an 'impossibility' to Maury, they would sit down and figure a way around it., As Herm noted, ,the difficult we do immediately -- the impossible takes a little longer., The D-Neliac decompiler was an operational decompiler that decompiled Univac M-460 binary code into Neliac source code.
System platform dependency and application programming interface (API) routines native to a given operating system are bound into each binary image by the complier. API routines are either written in the language the compiler was written in or in lower assembler. The result of this operation is a binary program that contains not only the routines written by the programmer, but a great number of other routines linked in by the linker. A typical binary program written in C to display ``hello world'' and compiled on a COTS platform has over 22 different API subroutines in the binary program. The same Program written in Pascal and compiled on a COTS platform might generate more than 37 subroutines in the executable program. Conceptually, a decompiler works very much the same way a compiler works. It takes instructions from one format and translates them to another. The decompiler is broken up into several steps or phases.
We employ and combine several different modern decompilation techniques in pursuit of a secure synthesis solution. These techniques are based on compiler and optimization theory, and are applied to the process of de-compilation in a novel way. Our work utilizes several different components of existing Open-Source decompilers. They include dcc a UNIX based command line decompiler written by Cristina Cifuentes. Boomerang is a UNIX/Windows based GUI system that is probably the most mature of the major decompilers in use today. Boomerang is primarily developed by QuantumG and Mike Van Emmerik. Andromeda is a GUI version for Windows developed by Andrey Shulga. Our BST can be characterized as a system composed of several phases which are grouped into modules dependent on language or machine features. The front-end is a machine dependent module that parses the binary program, analyzes the semantics of the instructions in the program, and generates an intermediate representation of the program. We generate a control flow graph of each subroutine. BST can operate with language and machine independent modules. The system analyzes the low-level intermediate code and transforms it into a high-level representation available in any high-level language, and analyzes the structure of the control flow graph(s) and transforms them into graphs that make use of high-level control structures. Finally, the backend is a target language dependent module that generates code for the target language.
BST benefits from compiler and Application Programming Interface (API) signatures resident within a binary image. In the former compiler signatures for any start-up code is ignored and not decompiled. In the latter any API references are used for variable type information and propagated thru out the function analysis process. The BCS system is comprised of the three following modules each with a set of corresponding sub-modules:
Front-end: (Machine Code Dependent)
Intermediate Code Generation
Control Flow Graph Generation
Analysis (machine code Independent)
Global Data Flow Analysis
Code Generation, Optimization, Translation
The syntax parser analyzer groups bytes of the source program into grammatical phrases (or sentences) of the source machine language. These phrases or Idioms are stored in a hierarchical tree. One limitation with machine code is that the hierarchy will always have a maximum of two levels. The main problem encountered while parsing machine code is determining what data is and what an instruction is. For example, a case table which usually resides after the function that invokes it will be located in the code segment and is unknown to the decompiler weather the table is data or instruction. This is a common problem with COTS memory architectures that utilize the von Neumann architecture (Data and Code reside in same memory). One can not simply make the assumption that instructions can be parsed sequentially assuming that the next byte will always hold an instruction. Many machine dependent heuristics are needed in order to determine the correct set of instructions Semantic Analysis This phase involves checking the source program for the semantic meaning of groups of instructions. We gather the type information, and propagate this across the subroutine. If we safely assume that binary programs were produced by a compiler, the semantics of the machine language is correct in order for the program to execute (assuming the program executed properly). We make the assumption that semantic errors will not be present in the source program unless the syntax parser has performed an error such as data has been parsed instead of instructions Intermediate Code Generation We need to generate an intermediate representation of the source program so that the decompiler can analyze low-level structures within the module. During the generation process a road map is followed that will allow easier migration to the target language desired.. During this phase we utilize the three-address code instruction mapping.
Control Flow Graph Generation We generate a control flow graph of each subroutine in the source. This approach can be very useful for removing dead code, obfuscations, determining def-use relationships across procedures etc. We can additionally determine highlevel control structures used in the program. This graph will also assist in removal of compiler generated intermediate jumps Data Flow Analysis During this phase we attempt to refine the intermediate code. Here we begin to identify high-level language expressions. We try to remove any temporary registers or condition flags as these constructs are not used in high-level languages. This process involves determining data-dependencies and defined usages within a basic block of code. We use API signature type information to assist in identification of variable types. This information is then used thru out the procedural data scoping process. Once the procedure data scope has been established, then a higher order inter-procedural data scope is obtained. During this process complex global data flow analysis equations are solved for each procedure. This includes any parameters that are referenced by the procedure and any return values, and any global data variables are modified inside the procedures.
Code Generation During the final phase or back-end of the process, we begin the process of producing target source-code. The target language needs to be specified along with any optimizing or specific in-lining options. Traversal of the control graph for each sub-routine is implemented to handle such issues as variable naming, local stacks arguments, and register variable identifiers . Additionally the control-structures and intermediate instructions that were created in earlier steps are now translated to high-level language statements.
The discussion above is primarily concerned with converting machine language to a higher-level language source. The particular weapon system that we are protecting uses machine language that was originally derived from assembly language source. This system never used a high-level language like ,C, or FORTRAN.
Our approach with the SPY 1-D radar combat system is to identify CPI code to extract. We then disassemble the machine binary into an internal representation. We then generate control flow graphs, conduct semantic analysis ,data flow analysis this includes single static assignment (SSA), data propagation, register variable identification, data type propagation, data type analysis, primitive data types, complex data types, control flow analysis. We used these techniques to successfully extract a function or functions void of any data dependencies and API dependencies. This renders a sequence of code that will execute outside its normal environment. It is important to note that after CPI extraction, we must patch the host executable where CPI code once resided. If for example the host executable contains one CPI function that needs to be extracted, then we must patch the function entry point of that CPI function. We need to be able to invoke the external CPI function from the host machine. We handle this problem by replacing the next instruction after the function entry point with an interrupt service routine (ISR) call. This ISR routine will reside within the host ISR Table. The ISR routine can examine the contents of the stack owned by the process that invoked the ISR. We examine the stack for a return address. This return address will point back to the location directly following the ISR invocation. This information will allow us to determine which CPI function is invoking the ISR routine. One issue remains for the CPI to execute remotely. If for example the CPI function calls an operating system API call that in turn issues an interrupt service routine (ISR) request, then it is necessary to provide a mechanism that allows for the CPI function to execute the ISR instruction (on the host) remotely.
We mitigate this problem by providing on the host (target) machine and on the external secure board a remote procedure call interface (RPC) module that includes both client and server capability. This for example allows a CPI function to issue an ISR call while executing remotely.. In the case of the CPI function issuing an ISR call back to the host machine the CPI call would be issued within the context of a remote procedure call client. The host machine would be functioning within the context of a remote procedure call server. Once we have completed transforming and extracting the CPI code from the original executable we then package a remote relocatable library. This library module can be initially tested with in the original host executable process environment. We perform this operation to ensure that all dependency (memory, system call) operations have been resolved correctly. This process is very similar to executing a program and then at run time issuing a system load library call to load a given dynamic library into a process space. We conduct a validation and verification operation to insure proper operation of the CPI. The CPI code is then written out to a secure Field Programmable Gate Array (FPGA) chip residing on an external board. The FPGA employs a 128 bit Non-Volatile encryption strategy. This key can not be extracted externally. At this point we can determine what type of core execution image will reside within the (FPGA). If we decide to use a soft core execution environment (EE) the system will run at a greatly reduced speed but with greater flexibility. If we use a hard-core EE image the system can execute up to 800 MHz. If we decide to change the EE of the FPGA it will be necessary to translate the CPI code to match the EE of the FPGA.
Naturally this approach introduces a degree of operational latency by virtue of passing thru the ISR and RPC code. This is of course not a problem with the 1750 CPU found in the SPY-1D which currently runs at 33 MHz. The only real timing problem in this case is not to return too early back to the host process. In the case of newer COTS systems however while typical CPU's are executing internally at 3.0 GHz. range they are still only able to fetch instructions and memory at local bus speeds which are typically in the 800MHz. range.
The Altera FPGA we are using operates in the 800 MHz. ranges. Since we are using BST to rebuild the extracted function(s) we an employ such techniques as Function-Level Working-Set tuning. We can profile functions that are executed with in a target executable. The functions that are executed more frequently than others can be moved closer to the top of the module. This way the operating system can keep the popular code in memory and only load the rest of the module as needed (and then page it out again when it's no longer needed). This approach can provide for a significant increase in speed as it reduces on demand memory paging. We can further increase the speed of the extracted function by implementing strategies like reciprocal multiplication. The idea with reciprocal multiplication is to use multiplication instead of division in order to implement the division operation. Typically multiplication is four to six times faster than native division operations. The idea is to multiply a dividend by a fraction that is the reciprocal of the divisor. For example, if you wanted to divide 30 by 3, you would simply compute the reciprocal of 3, which is one divided by 3. The results of such an operation is approximately 0.3333333, so if you simply multiplying 30 by 0.3333333, you'd end up with it the correct result, which is 10.
We can further optimize by deconstructing an instruction and implementing its micro-code underpinnings. If we examine a typical floating point instruction found in many CPI algorithms. We can see that much of these instructions are implemented as lower-level multiply and divide operations. We can apply symmetric parallel optimization during the construction process and we can fabricate a custom micro-instruction that will utilize symmetric parallel optimization to run much faster than the original instruction.
The FPGA includes additional I/O pin tamper detection safety circuitry that is crucial to prevent any type of black-box attack. If an individual were to gain possession of the FPGA and probe with arbitrary input signals, the Altera provides a tamper detection capability. It works as follows: Learning Phase, the Input Monitoring Model is trained from real or simulated inputs and outputs to the device to be protected. These inputs and outputs could be measured, obtained from the device specification, or extrapolated from piecemeal knowledge of the device's characteristics. Once this database of known input-output combinations (both normal operational inputs and simulated tamper-style inputs) is created, it will be applied to a time-sensitive Input Monitoring Model.
Operational Phase, the component under protection is either considered to be in its normal operational environment or it is outside of this environment and being subjected to laboratory attacks. External inputs are received by the augmented anti-tamper package and fed directly to the protected component, the Input Monitoring Model and the Output Obfuscation Model. These inputs come from either the intended environment or from laboratory testing; the component is unaware of the source initially. In the former case, the Input Monitoring Model will respond with a No Tamper signal. This signal is used by the Output Obfuscation Model's Gating Mechanism to pass the normal operational outputs from the critical component through to the output of the augmented package. If, however, the recognizer outputs a Tampered Input signal, this signal initiates processing within the Output Obfuscation Model. Based upon the temporal inputs being received and the indicator from the recognizer, the generator will produce obfuscated outputs that are then passed through the gating mechanism to the output.
BST holds promise to solve many complex software vulnerability problems present in today's modern computer systems as well as older legacy systems. BST can help identify polymorphic viral code that falls under the radar screen of today's current signature based anti-virus products. BST can be used to mitigate buffer overflow vulnerabilities in legacy/COTS code by extracting code that is unsecured and running it within a secure framework. BST can be used to prevent Reverse Engineering (RE) efforts by protecting CPI code in a secure shielded environment particularly software vendor's protection mechanisms can be extracted and executed in a secure environment as well as military weapon systems CPI content. BST can be used to assist in software optimization and or upgrading of obsolete legacy code to modern software libraries.