Writing a basic x86-64 JIT compiler from scratch in stock Python
By Christian Stigen Larsen
Posted 08 Nov 2017 — updated 11 Nov 2017
In this post I’ll show how to write a rudimentary, native x86-64 just-in-time
compiler (JIT) in CPython, using only the built-in modules.
Update: This post made the front page of HN, and I’ve incorporated some of the discussion feedback. I’ve also written a
follow-up post that JITs Python bytecode to x86-64.
The code here specifically targets the UNIX systems macOS and Linux, but should be easily translated to other systems such as Windows. The complete code is available on github.com/cslarsen/mi….
The goal is to generate new versions of the below assembly code at runtime and execute it.
1 | 复制代码48 b8 ed ef be ad de movabs $0xdeadbeefed, %rax |
We will mainly deal with the left hand side — the byte sequence 48 b8 ed ...
and so on. Those fifteen machine code bytes comprise an x86-64 function that multiplies its argument
with the constant 0xdeadbeefed
. The JIT step will create functions with different such constants. While being a contrived form of specialization, it illuminates
the basic mechanics of just-in-time compilation.
Our general strategy is to rely on the built-in ctypes
Python module to load the C standard library. From there, we can access system functions to interface with the virtual memory manager. We’ll use mmap
to fetch a page-aligned
block of memory. It needs to be aligned for it to become executable. That’s the reason why we can’t simply use the usual C function malloc
, because it may return memory that spans page boundaries.
The function mprotect
will be used to mark the memory block as read-only and executable. After that, we should be able to call into our freshly compiled block of code through ctypes.
The boiler-plate part
Before we can do anything, we need to load the standard C library.
1 | 复制代码import ctypes |
There are other ways to achieve this, for example
1 | 复制代码>>> import ctypes |
To find the page size, we’ll call sysconf(_SC_PAGESIZE)
. The _SC_PAGESIZE
constant is 29 on macOS but 30 on Linux. We’ll just hard-code those in our program. You can find them by digging into system header files or writing
a simple C program that print them out. A more robust and elegant solution would be to use the cffi
module instead of ctypes, because it can automatically parse header files. However, since I wanted to stick to the default CPython
distribution, we’ll continue using ctypes.
We need a few additional constants for mmap
and friends. They’re just written out below. You may have to look them up for other UNIX variants.
1 | 复制代码import ctypes |
Although not strictly required, it is very useful to tell ctypes the signature of the functions we’ll use. That way, we’ll get exceptions if we mix invalid types. For example
1 | 复制代码# Set up sysconf |
tells ctypes that sysconf
is a function that takes a single integer and produces a long integer. After this, we can get the current page size with
1 | 复制代码pagesize = sysconf(_SC_PAGESIZE) |
The machine code we are going to generate will be interpreted as unsigned 8-bit bytes, so we need to declare a new pointer type:
1 | 复制代码# 8-bit unsigned pointer type |
Below we just dish out the remaining signatures for the functions that we’ll use. For error reporting, it’s good to have the strerror
function available. We’ll use munmap
to destroy the machine code block after we’re done
with it. It lets the operating system reclaim that memory.
1 | 复制代码strerror = libc.strerror |
At this point, it’s hard to justify using Python rather than C. With C, we don’t need any of the above boiler-plate code. But down the line, Python will allow us to experiment much more easily.
Now we’re ready to write the mmap
wrapper.
1 | 复制代码def create_block(size): |
This function uses mmap
to allocate page-aligned memory for us. We mark the memory region as readable and writable with the PROT
flags, and we also mark it as private and anonymous. The latter means the memory will not be
visible from other processes and that it will not be file-backed. The Linux mmap
manual page covers the details (but be sure to view the man page for your system). If the mmap
call fails, we raise it as a Python error.
To mark memory as executable,
1 | 复制代码def make_executable(block, size): |
With this mprotect
call, we mark the region as readable and executable. If we wanted to, we could have made it writable as well, but some systems will refuse to execute writable memory. This is sometimes called the W^X security
feature.
To destroy the memory block, we’ll use
1 | 复制代码def destroy_block(block, size): |
I edited out a badly placed del
in that function after the HN submission.
The fun part
Now we’re finally ready to create an insanely simple piece of JIT code!
Recall the assembly listing at the top: It’s a small function — without a local stack frame — that multiplies an input number with a constant. In Python, we’d write that as
1 | 复制代码def create_multiplication_function(constant): |
This is indeed a contrived example, but qualifies as JIT. After all, we do create native code at runtime and execute it. It’s easy to imagine more advanced examples such as JIT-compiling Brainfuck to x86-64 machine code. Or using AVX instructions for blazing fast, vectorized math ops.
The disassembly at the top of this post was actually generated by compiling and disassembling the following C code:
1 | 复制代码#include <stdint.h> |
If you want to compile it yourself, use something like
1 | 复制代码$ gcc -Os -fPIC -shared -fomit-frame-pointer \ |
Here I optimized for space (-Os
) to generate as little machine code as possible, with position-independent code (-fPIC
) to prevent using absolute jumps, without any frame pointers (-fomit-frame-pointer
) to remove
superfluous stack setup code (but it may be required for more advanced functions) and using the current CPU’s native instruction set (-march=native
).
We could have passed -S
to produce a disassembly listing, but we’re actually interested in the machine code, so we’ll rather use a tool like objdump
:
1 | 复制代码$ objdump -d libmultiply.so |
In case you are not familiar with assembly, I’ll let you know how this function works. First, the movabs
function just puts an immediate number in the RAX register. Immediate is assembly-jargon for encoding something
right in the machine code. In other words, it’s an embedded argument for the movabs
instruction. So RAX now holds the constant 0xdeadbeefed
.
Also — by AMD64 convention — the first integer argument will be in RDI, and the return value in RAX. So RDI will hold the number to multiply with. That’s
what imul
does. It multiplies RAX and RDI and puts the result in RAX. Finally, we pop a 64-bit return address off the stack and jump to it with RETQ. At this level, it’s easy to imagine how one could implement continuation-passing style.
Note that the constant 0xdeadbeefed
is in little-endian format. We need to remember to do the same when we patch the code. (By the way, a good mnemonic for remembering the word order is that little endian means “little-end first”).
We are now ready to put everything in a Python function.
1 | 复制代码def make_multiplier(block, multiplier): |
At the bottom, we return the ctypes function signature to be used with this code. It’s somewhat arbitrarily placed, but I thought it was good to keep the signature close to the machine code.
The final part
Now that we have the basic parts we can weave everything together. The first part is to allocate one page of memory:
1 | 复制代码pagesize = sysconf(_SC_PAGESIZE) |
Next, we generate the machine code. Let’s pick the number 101 to use as a multiplier.
1 | 复制代码mul101_signature = make_multiplier(block, 101) |
We now mark the memory region as executable and read-only:
1 | 复制代码make_executable(block, pagesize) |
Take the address of the first byte in the memory block and cast it to a callable ctypes function with proper signature:
1 | 复制代码address = ctypes.cast(block, ctypes.c_void_p).value |
To get the memory address of the block, we use ctypes to cast it to a void pointer and extract its value. Finally, we instantiate an actual function from this address using the mul101_signature
constructor.
Voila! We now have a piece of native code that we can call from Python. If you’re in a REPL, you can try it directly:
1 | 复制代码>>> print(mul101(8)) |
Note that this small multiplication function will run slower than a native Python calculation. That’s mainly because ctypes, being a foreign-function library, has a lot of overhead: It needs to inspect what dynamic types you pass the function every
time you call it, then unbox them, convert them and then do the same with the return value. So the trick is to either use assembly because you have to access some new Intel instruction, or because you compile something like Brainfuck to native
code.
Finally, if you want to, you can let the system reclaim the memory holding the function. Beware that after this, you will probably crash the process if you try calling the code again. So probably best to delete all references in Python as well:
1 | 复制代码destroy_block(block, pagesize) |
If you run the code in its complete form from the GitHub repository, you can put the multiplication constant on the command line:
1 | 复制代码$ python mj.py 101 |
Debugging JIT-code
If you want to continue learning with this simple program, you’ll quickly want to disassemble the machine code you generate. One option is to simply use gdb or lldb, but you need to know where to break. One trick is to just print the hex value of
the block
address and then wait for a keystroke:
1 | 复制代码print("address: 0x%x" % address) |
Then you just run the program in the debugger, break into the debugger while the program is pausing, and disassemble the memory location. Of course you can also step-debug through the assembly code if you want to see what’s going on. Here’s an example
lldb session:
1 | 复制代码$ lldb python |
At this point, hit CTRL+C to break back into the debugger, then disassemble from the memory location:
1 | 复制代码(lldb) x/3i 0x1002fd000 |
Notice that 65 hex is 101 in decimal, which was the command line argument we passed above.
If you only want a disassembler inside Python, I recommend the Capstone module.
What’s next?
A good exercise would be to JIT-compile Brainfuck programs to native code. If you want to jump right in, I’ve made a GitHub repository at github.com/cslarsen/br….
I even have a Speaker Deck presentation to go with it. It performs JIT-ing and optimizations, but uses GNU Lightning to compile native code instead of this approach.
It should be extremely simple to boot out GNU Lightning in favor or some code generation of your own. An interesting note on the Brainfuck project is that if you just JIT-compile each Brainfuck instruction one-by-one, you won’t get much of a speed
boost, even if you run native code. The entire speed boost is done in the code optimization stage, where you can bulk up integer operations into one or a few x86 instructions. Another candidate for such compilation would be the Forth language.
Also, before you get serious about expanding this JIT-compiler, take a look at the PeachPy project. It goes way beyond this and includes a disassembler and supports seemingly the entire x86-64 instruction
set right up to AVX.
As mentioned, there is a good deal of overhad when using ctypes to call into functions. You can use the cffi
module to overcome some of this, but the fact remains that if you want to call very small JIT-ed functions a large number of
times, it’s usually faster to just use pure Python.
What other cool uses are there? I’ve seen some math libraries in Python that switch to vector operations for higher performance. But I can imagine other fun things as well. For example, tools to compress and decompress native code, access virtualization
primitives, sign code and so on. I do know that some BPF tools and regex modules JIT-compile queries for faster processing.
What I think is fun about this exercise is to get into deeper territory than pure assembly. One thing that comes to mind is how different instructions are disassembled to the same mnemonic. For example, the RETQ instruction has a different opcode
than an ordinary RET, because it operates on 64-bit values. This is something that may not be important when doing assembly programming, because it’s a detail that may not always matter, but it’s worth being aware of the difference. I saw that
gcc, lldb and objdump gave slightly different disassembly listings of the same code for RETQ and MOVABSQ.
There’s another takeaway. I’ve mentioned that the native Brainfuck compiler I made didn’t initially produce very fast code. I had to optimize to get it fast. So things won’t go fast just because you use AVX, Cuda or whatever. The cold truth is that
gcc contains a vast database of optimizations that you cannot possibly replicate by hand. Felix von Letiner has a classic talk about source
code optimization that I recommend for more on this.
What about actual compilation?
A few people commented that they had expected to see more about the actual compilation step. Fair point. As it stands, this is indeed a very restricted form of compilation, where
we barely do anything with the code at runtime — we just patch in a constant. I may write a follow-up post that focuses solely on the compilation stage. Stay tuned!
本文转载自: 掘金