Saturday, May 30, 2009

Dynamically generating and executing x86 code

Introduction

Computer programs can generate machine code in memory and then execute it. This is the case with Sun's Java HotSpot VM, which dynamically compiles Java bytecode to native code to increase the VM's performance. Several exploits rely on buffer overflows to remotely inject machine code into memory and then jump into it. Google released its Chrome web browser featuring the V8 JavaScript engine, which greatly improved JavaScript performance by compiling JavaScript to native code.

These applications rely on dynamically generating and executing native code. While this can be used to improve the performance of scripting languages, it also opens the door to code-injection attacks, which have been addressed with CPU features such as the NX bit. This article briefly shows how to generate a simple x86 function and invoke it in C, and then demonstrates how to make it work with recent CPUs and operating systems which make use of the NX bit.

A simple example of x86 machine code

The first program features a simple function, and we will use objdump to look at the generated x86 code:
#include <stdio.h>

int function()
{
    return 42;
}

int main(int argc, char *argv[])
{
    printf("return is %d\n", function());
    return 0;
}
This simple program can then be compiled with GCC and analysed with objdump:
gcc simple.c -o simple
objdump -S simple
The generated code for "function" can be found within the output of objdump:
080483c4 <function>:
80483c4: 55 push %ebp
80483c5: 89 e5 mov %esp,%ebp
80483c7: b8 2a 00 00 00 mov $0x2a,%eax
80483cc: 5d pop %ebp
80483cd: c3 ret
So the function saves the previous frame pointer, stores the new one in %ebp, moves 42 (0x2a) to %eax, restores the frame pointer and returns. Since this is a simple function that doesn't even use the stack, we can simplify it a bit by not saving and restoring the frame pointer:
gcc -fomit-frame-pointer simple.c -o simple
objdump -S simple
080483c4 <function>:
80483c4: b8 2a 00 00 00 mov $0x2a,%eax
80483c9: c3 ret
This is almost as simple as a function can get. The first value on each line represents the virtual memory address where that instruction will be located. The next byte values represent the x86 machine code, and the equivalent assembly code is shown next (in AT&T syntax; this page compares AT&T syntax to Intel's).

It's very instructive to read the Intel manuals to see what exactly each opcode does. For the MOV instruction, see Volume 2A: Instruction Set Reference, A-M, section 3.2, "MOV" instruction (around page 640). The MOV instruction has several variations; the one used by GCC in this example is:

Opcode    Instruction       64-Bit Mode   Compat/Leg Mode   Description
B8 + rd   MOV r32, imm32    Valid         Valid             Move imm32 to r32

The opcode is thus B8 + rd, where rd is a value from 0 to 7 that selects the destination register. The registers are numbered %eax, %ecx, %edx, %ebx, %esp, %ebp, %esi, %edi, so B8 moves to %eax, B9 to %ecx, and so on. The opcode is followed by the 32-bit constant value to move. Notice that the x86 architecture is little endian, so the least significant byte comes first. Thus 2a 00 00 00 is the 32-bit hexadecimal representation of 42.
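The encoding rule above can be sketched as a pair of small C helpers. This is only an illustration: the names emit_mov_r32_imm32, emit_ret and the reg32 enum are made up for this article and do not come from any library.

```c
#include <stddef.h>
#include <stdint.h>

/* Register numbers as they appear in the "rd" part of the opcode. */
enum reg32 { EAX = 0, ECX = 1, EDX = 2, EBX = 3, ESP = 4, EBP = 5, ESI = 6, EDI = 7 };

/* Hypothetical helper: writes "MOV r32, imm32" (opcode B8 + rd followed by
 * the constant in little-endian order) and returns the bytes written. */
size_t emit_mov_r32_imm32(uint8_t *out, enum reg32 reg, uint32_t imm)
{
    out[0] = (uint8_t)(0xB8 + reg);
    out[1] = (uint8_t)(imm & 0xFF);          /* least significant byte first */
    out[2] = (uint8_t)((imm >> 8) & 0xFF);
    out[3] = (uint8_t)((imm >> 16) & 0xFF);
    out[4] = (uint8_t)((imm >> 24) & 0xFF);  /* most significant byte last */
    return 5;
}

/* Hypothetical helper: writes the one-byte return instruction (opcode C3). */
size_t emit_ret(uint8_t *out)
{
    out[0] = 0xC3;
    return 1;
}
```

Calling emit_mov_r32_imm32(buf, EAX, 42) followed by emit_ret(buf + 5) fills a 6-byte buffer with b8 2a 00 00 00 c3, the same bytes that objdump showed for the function compiled with -fomit-frame-pointer.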

Finally, C3 is the opcode to return from a function. It is specified in Volume 2B: Instruction Set Reference, N-Z, section 4.1, RET instruction, around page 338.

Dynamically generating the function

The following program allocates a buffer in memory, writes the opcodes to return 42 into it and casts the buffer to a function pointer. Invoking this pointer will execute the code in the buffer, which immediately returns 42:

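#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char *argv[])
{
    /* Plain heap memory: writable, but not necessarily executable. */
    uint8_t *buf = malloc(1000);
    if (buf == NULL)
        return 1;

    buf[0] = 0xb8;              /* mov imm32, %eax */
    uint32_t u32 = 42;
    memcpy(buf + 1, &u32, 4);   /* the 32-bit constant, little endian on x86 */
    buf[5] = 0xc3;              /* ret */

    /* Treat the buffer as a function taking no arguments and returning int. */
    int (*ptr)(void) = (int (*)(void)) buf;

    printf("return is %d\n", ptr());

    return 0;
}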


A brief overview of memory protection on the x86 architecture

Modern CPUs provide several mechanisms for memory protection. Segmentation has been part of the x86 architecture since the 8086, and with the protected mode of the 80286 it gained the ability to split memory into segments with different access rights. A more common technology for memory protection nowadays is paging, which splits memory into finer-grained units called pages.

Let's start with segmentation. Each segment has an 8-byte segment descriptor (see Volume 3A: System Programming Guide, section 3.4.5), which has the following fields:


  • Base is the 32-bit start address of the segment
  • G and limit determine the segment size. Limit is 20 bits, so the maximum value is 1M. However, when G is 1, the granularity becomes 4K instead of 1 byte, and the maximum is thus 4K * 1M = 4G
  • D/B determines whether the default size is 16 or 32 bits
  • L marks a 64-bit code segment (only in IA-32e mode)
  • AVL is available for the system's programmer
  • P flags whether the segment is currently present in memory
  • DPL is the descriptor privilege level, which is used to protect kernel-mode memory from user-mode processes
  • S is the descriptor type, either system or code/data
  • Type depends on the S bit. For code/data segments, the first bit determines if it's a code or data segment.
    • For data segments, the other 3 bits determine if the segment expands up or down, is writable or read only, and whether it was accessed
    • For code segments, the other 3 bits determine whether the segment is conforming, execute only or executable/readable, and whether it was accessed
So it seems that code segments are never writable at all, and data segments are never executable. So what happened in the previous program?

The catch is that current operating systems use a flat memory model, where the memory isn't actually segmented at all. Segmentation is however always enabled in x86, so what Linux and other operating systems do is set up both code and data segments to cover the entire memory. Thus, when we write to our buffer, we are using the data segment; and when we execute it, the instruction pointer (%eip) of the CPU uses the code segment to execute at the same memory address.
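To make the flat model concrete, here is a rough sketch that decodes the descriptor fields listed above from an 8-byte value. The struct and function names are invented for this example, and the two descriptor values mentioned afterwards are the classic flat-model constants used here purely for illustration; they were not read from a running system. The sketch assumes S = 1 (a code/data descriptor) and ignores expand-down data segments.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical summary of the fields we care about. */
struct segment_info {
    uint32_t base;      /* 32-bit start address of the segment */
    uint64_t size;      /* size in bytes after applying the G bit */
    bool     is_code;   /* highest Type bit: 1 = code, 0 = data */
    bool     writable;  /* W bit, data segments only */
    bool     readable;  /* R bit for code; data is always readable */
};

/* Decode a descriptor given as a 64-bit little-endian value. */
struct segment_info decode_descriptor(uint64_t d)
{
    struct segment_info s;
    /* Limit is split: bits 0-15 and bits 48-51. */
    uint32_t limit = (uint32_t)(d & 0xFFFF) | ((uint32_t)((d >> 48) & 0xF) << 16);
    bool g = ((d >> 55) & 1) != 0;               /* granularity */
    unsigned type = (unsigned)((d >> 40) & 0xF); /* 4-bit Type field */

    /* Base is split too: bits 16-39 and bits 56-63. */
    s.base = (uint32_t)((d >> 16) & 0xFFFFFF) | ((uint32_t)((d >> 56) & 0xFF) << 24);
    s.size = ((uint64_t)limit + 1) * (g ? 4096u : 1u);
    s.is_code  = (type & 0x8) != 0;
    s.writable = !s.is_code && (type & 0x2) != 0;
    s.readable = !s.is_code || (type & 0x2) != 0;
    return s;
}
```

Decoding 0x00CF9A000000FFFF gives a readable code segment and 0x00CF92000000FFFF gives a writable data segment. Both have base 0 and a size of 4G, so the same address is reachable for writing through one and for execution through the other.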

What about protection at the page level? Again, the Intel manuals have a complete description of the mechanism. The interesting part is in Volume 3A: System Programming Guide, section 3.7.6, which describes the format of page directory and page table entries.


Memory is divided into pages, usually 4K each. Each 4K page has a page entry in some page table which describes the state of the page. The fields of the page entry mean:
  • Page base address is the physical address of the 4K page. Since it's 20 bits wide, the remaining 12 bits of each physical address are taken from the virtual memory address. So the high 20 bits determine the page, and the low 12 bits the offset within the page; notice that 12 bits can address exactly 4K. In a page directory entry, the same field holds the base address of a page table instead
  • Avail are available bits for the system's programmer
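  • G marks a global page, whose cached translation can be kept when the page tables are switched; it is ignored in page directory entries that point to a page table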
  • Page Table Attribute Index, introduced in the Pentium III
  • Dirty flags whether a page has been written to
  • Accessed flags whether a page has been accessed (written or read)
  • Cache Disabled controls caching of individual pages
  • Write-Through controls write through or write back caching policy
  • User/Supervisor flags whether this page has user or supervisor privilege
  • Read/Write specifies whether the page is read only or read/write
  • Present flags whether the page is currently present in memory. When it's not, a page fault is triggered when the page is accessed. This allows operating systems to swap memory between the RAM and the hard disk
So paging can determine whether a page is writeable or not, but not whether it's executable or not. Thus it's not possible to prevent data memory from being executed using only segmentation with a flat memory model and paging.
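The address split and the permission bits can be sketched in a few lines of C. This is a simplified model of classic 32-bit paging with 4K pages. The macro and function names are my own, the high 20 bits are further split into a 10-bit page directory index and a 10-bit page table index, and a real CPU also checks the flags of the page directory entry, which this sketch ignores.

```c
#include <stdbool.h>
#include <stdint.h>

#define PTE_PRESENT  (1u << 0)  /* Present */
#define PTE_WRITABLE (1u << 1)  /* Read/Write */
#define PTE_USER     (1u << 2)  /* User/Supervisor */

/* Hypothetical breakdown of a 32-bit virtual address. */
struct vaddr_parts {
    uint32_t dir_index;    /* bits 31-22: entry in the page directory */
    uint32_t table_index;  /* bits 21-12: entry in the page table */
    uint32_t offset;       /* bits 11-0: offset within the 4K page */
};

struct vaddr_parts split_vaddr(uint32_t vaddr)
{
    struct vaddr_parts p;
    p.dir_index   = vaddr >> 22;
    p.table_index = (vaddr >> 12) & 0x3FF;
    p.offset      = vaddr & 0xFFF;
    return p;
}

/* Physical address: high 20 bits from the entry, low 12 from the address. */
uint32_t translate(uint32_t pte, uint32_t vaddr)
{
    return (pte & 0xFFFFF000u) | (vaddr & 0xFFFu);
}

/* True when user-mode code may write to the page described by pte. */
bool user_can_write(uint32_t pte)
{
    uint32_t needed = PTE_PRESENT | PTE_WRITABLE | PTE_USER;
    return (pte & needed) == needed;
}
```

Note that there is a check for writing but nothing comparable for executing: the 32-bit entry simply has no bit for it.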

The NX bit

Newer x86 CPUs with 64-bit support have an extended page entry. This extended page entry is also available on CPUs with Physical Address Extension, which was introduced with the Pentium Pro and allows a 32-bit processor to address more than 4G of memory. The format of the new page entry is given in section 3.10.3 of Volume 3A: System Programming Guide.


The lower 32 bits are the same as before; the new bits are either available for the system's programmer, reserved, or used to extend the base address. Bit 63, however, is the execute-disable bit, which is Intel's name for the NX bit. So pages can now be marked as not executable.
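As a small illustration, testing and setting bit 63 of a 64-bit page entry might look like the following. The names are invented for this sketch, and it assumes the operating system has enabled the execute-disable feature on the CPU, since otherwise the bit has no protective effect.

```c
#include <stdbool.h>
#include <stdint.h>

#define PTE64_PRESENT (UINT64_C(1) << 0)   /* same Present bit as before */
#define PTE64_NX      (UINT64_C(1) << 63)  /* execute-disable (NX) bit */

/* True when the page is present and not marked as no-execute. */
bool page_is_executable(uint64_t pte)
{
    return (pte & PTE64_PRESENT) != 0 && (pte & PTE64_NX) == 0;
}

/* Returns the entry with the NX bit set; the other bits are unchanged. */
uint64_t mark_no_execute(uint64_t pte)
{
    return pte | PTE64_NX;
}
```

An operating system that wants to protect data would set this bit on every page that is not meant to hold code, such as the heap and the stack.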

So why doesn't 32-bit Linux use PAE to protect its data pages? Actually, the Linux kernel supports it, but some Linux distributions (including Ubuntu) disable it by default to support older processors. Support in other operating systems varies; see the Wikipedia article on the NX bit for more details.

Dynamically generating x86 machine code on systems with NX protection

If you executed the previous program in a system with NX protection, it likely failed with a segmentation fault, or just crashed. To execute data memory, the memory has to be allocated with different functions. In Linux, instead of malloc use mmap to allocate the buffer:

#include <sys/mman.h>

...

uint8_t *buf = mmap(NULL, 1000, PROT_EXEC | PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);


mmap has many other uses; these particular arguments allocate a 1000-byte region which is readable, writeable and executable. The memory is private to the calling process and anonymous (meaning it is not backed by a file). mmap is available on most UNIX systems including Linux; the equivalent Windows functions for the same purpose are VirtualAlloc and VirtualProtect.

The updated program now works on x86 systems both with and without NX protection.

Update

Of course, the buffer created with mmap is now both writeable and executable, so it's prone to buffer overflow exploits. Blake C. pointed out the handy mprotect function, which allows changing the protection of memory previously allocated with mmap. Thus, it is wise to make the memory non-writeable after writing the opcodes:

/* needs <errno.h> and <string.h> for errno and strerror */
if (mprotect(buf, 1000, PROT_EXEC | PROT_READ) < 0) {
    fprintf(stderr, "mprotect failed: %s\n", strerror(errno));
    return 1;
}

Wrapping up

A brief overview of memory protection on the x86 architecture was presented to help understand how to generate machine code and execute it. The example program is very simplistic, but it satisfied my curiosity on how JIT compilation works. For more serious work, some libraries are available:
Please let me know of any errors in the text or any suggestions you might have.