A little while ago I had the pleasure of pressing the button to open source the Ruby JIT glue for the Ruby+OMR JIT compiler. This code, when combined with an appropriately modified copy of the CRuby VM and Eclipse OMR, shows how we added a simple JIT compiler to CRuby.

In this post, I'll cover some similar ground to my RubyKaigi talk from December 2015 (Video, Slides), but this time I will be able to link to and show source code.

This post is focused on JIT technology; however, the VM we have up on Github was also modified to take advantage of OMR's Garbage Collection technology. For more on that, see the talk that Craig Lehman and Robert Young gave at RubyKaigi, It's dangerous to GC alone. Take this!.

Philosophy

The OMR project starts from the premise that many pieces of a language runtime implementation are fundamentally very similar, even across disparate languages. As a result, OMR is structured as a large core of language-independent technology, with a language-dependent “glue” that connects it to an actual language implementation.

In the case of the Ruby JIT code, our “glue” comes in two pieces:

  1. Modifications to the Ruby VM, in order to do things like load the JIT compiler object and export certain routines so that the JIT can generate calls into the VM and use VM functionality.
  2. Adapter code that specializes the language-independent OMR compiler code (called Testarossa) to Ruby — this is the content of the rbjitglue project on GitHub.

We tried our best to minimize the first piece, as we wanted to show off a relatively low-touch addition of a JIT. We’ve always hoped the Ruby community would eventually take on the OMR JIT as its JIT technology, so it was important to keep the VM as unmodified as possible — no maintainer likes monster patches! This decision limits certain pieces of the compiler technology, but we can grow past that (and some of the things we need are starting to be developed independently!).

Modifying the CRuby VM

Before we are able to run any JIT code, we of course need to provide an interface between the VM and the JIT compiler.

In our implementation, we add a call to vm_jit_init inside Init_BareVM. That function, implemented in vm_jit.c, loads a JIT DLL and then fills in a couple of structs full of function pointers. It is through these function pointers that communication between the JIT and the VM is established: the VM can call into the JIT using its function pointers (init_f, terminate_f, compile_f and dispatch_f), and the JIT can call VM routines, or generate calls to them, by using the function pointers stored in the JIT callbacks struct. We'll return to that callbacks struct later.

Another important connection piece is the global variables the Ruby VM uses. We need to pass those addresses to the Ruby JIT as well; this is done using the globals struct, which is filled in and passed to vm_jit_init.
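
To make that wiring a little more concrete, here is a heavily simplified sketch of the shape of the interface. Only init_f, terminate_f, compile_f, dispatch_f, rb_id2name_f and jit_callbacks_struct are names taken from the real code; every other name and signature below is an assumption for illustration, and the real definitions live in the modified VM sources (vm_jit.c and friends).

/* Sketch only: assumes CRuby's internal headers; names and signatures other
 * than init_f, terminate_f, compile_f, dispatch_f, rb_id2name_f and
 * jit_callbacks_struct are hypothetical. */
#include "ruby.h"
#include "vm_core.h"

struct jit_globals_struct {
    /* Addresses of the VM global variables the JIT needs. */
};

struct jit_callbacks_struct {
    /* VM functions the JIT can call, or generate calls to. */
    const char *(*rb_id2name_f)(ID id);
    /* ... many more, a number of them generated at build time ... */
};

/* Entry points the JIT DLL provides for the VM to call. */
struct jit_entry_points_sketch {
    int   (*init_f)      (struct jit_globals_struct *g, struct jit_callbacks_struct *cb);
    void  (*terminate_f) (void);
    void *(*compile_f)   (rb_thread_t *th, const rb_iseq_t *iseq); /* produce a compiled body */
    VALUE (*dispatch_f)  (rb_thread_t *th);                        /* invoke a compiled body  */
};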

Compilation Control

The question of when to compile a method is a question of compilation control. Production JIT compilers have very complex heuristics to manage the tradeoffs between startup speed, ramp-up, and peak throughput. In the Ruby JIT, we've opted for a much simpler approach: counted compilation. Each method has a counter associated with it; the counter is decremented every time the method is run, and when it hits zero the method is compiled.
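
As a sketch of that policy (with hypothetical names; in the real VM the counter and compiled-body pointer live alongside the method's instruction sequence):

/* Counted compilation, sketched. All names here are illustrative stand-ins. */
struct method_entry_sketch {
    long  jit_count;     /* decremented on every call             */
    void *jit_body;      /* startPC of the compiled body, or NULL */
};

extern void *jit_compile(struct method_entry_sketch *m);  /* stands in for the JIT's compile entry point */

static void *
maybe_compile(struct method_entry_sketch *m)
{
    if (m->jit_body)                /* already compiled          */
        return m->jit_body;
    if (--m->jit_count > 0)         /* not hot enough yet        */
        return NULL;
    m->jit_body = jit_compile(m);   /* counter hit zero: compile */
    return m->jit_body;
}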

The injection of compilation control is tightly integrated into how the VM calls into JIT compiled methods.

JIT Compiled Method Dispatch

The OMR Ruby JIT adopts a very simple calling convention for JIT methods. The JIT generates methods that take only one parameter:

typedef VALUE (*jit_method_t)(rb_thread_t*);

All other information for the method is retrieved from that thread parameter in the compiled code.

The Ruby VM's core interpreter loop is mildly modified to call these compiled bodies when they exist. Normally, the interpreter starts a core-interpreter loop invocation by calling vm_exec_core. So that we can check whether a JIT-compiled body is available, we have replaced that call with one to a cleverly named function, vm_exec2.

vm_exec2 is very simple:

static inline VALUE
vm_exec2(rb_thread_t *th, VALUE initial)
{
    VALUE result;
    if (VM_FRAME_TYPE(th->cfp) != VM_FRAME_MAGIC_RESCUE &&
        VM_FRAME_TYPE_FINISH_P(th->cfp) &&
        vm_jitted_p(th, th->cfp->iseq) == Qtrue) {
        result = vm_exec_jitted(th);
    } else {
        result = vm_exec_core(th, initial);
    }
    return result;
}

Stripped of comments and #ifdefs, it boils down to a very simple decision:

  1. If the frame is of the right type (there are a few frame types that currently give JIT methods indigestion), and there is a JIT-compiled body, then invoke it and take the result.
  2. Otherwise, call back into vm_exec_core and interpret this frame.

Adapting the compiler to the Ruby VM

To add the OMR JIT compiler to a language, the most fundamental piece is how you convert the language's source into the Intermediate Representation of the JIT compiler.

Testarossa's intermediate representation is called Trees (although the structures are actually directed acyclic graphs — unfortunately, history is a powerful force on a project). The representation of a program in Trees consists of three major elements:

  1. A node represents a value. Nodes can be primitive operations that produce values (like lconst 10 or lload valueInMemory), or composite operations like add that consume their children and produce a result. Nodes have data types, and operations require and produce particular types; for example, lconst produces a "long" value — in Testarossa's case, a 64-bit integer. aladd produces an "address" type value from one address-typed child and one long-typed child. (A tiny hand-written example of the notation follows this list.)
  2. TreeTops root trees of computation. They are connected as a doubly linked list and are used to control program ordering: the order of evaluation within a tree of nodes doesn't matter, only the order between TreeTops does. Where evaluation order between the nodes of a tree does matter, we can force an ordering by ‘anchoring’ nodes under TreeTops.
  3. TreeTops are connected within BasicBlocks, which define the control flow of a program.
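
To make the notation concrete before we look at real compiler output, here is a tiny hand-written illustration (not generated by the JIT) of roughly what an expression like 2 + 3 on 64-bit integers looks like as Trees:

n4n  treetop
n3n    ladd
n1n      lconst 2
n2n      lconst 3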

So, in order to JIT-compile Ruby, we have to convert a Ruby method from this:

def foo(a, b)
    a + b 
end

into Trees. In Testarossa, we call this process IL Generation.

IL Generation in Ruby

Fortunately for us, we don't have to deal with the hard problem of parsing Ruby code: inside the CRuby interpreter this has already been taken care of for us. Since Ruby 1.9, CRuby has used a bytecode-based interpreter (YARV), so your Ruby code gets parsed and turned into bytecode that is then interpreted. You can see the bytecodes by using RubyVM::InstructionSequence#disasm:

$ cat test.rb
def foo(a, b)
    a + b
end
$ ruby -e "puts RubyVM::InstructionSequence.compile_file('test.rb').disasm"
# <snipped iseq for main>  
<RubyVM::InstructionSequence:foo@t.rb>=======================
local table (size: 3, argc: 2 [opts: 0, rest: -1, post: 0, block: -1] s1)
[ 3] a<Arg>     [ 2] b<Arg>     
0000 trace            8                                               (   1)
0002 trace            1                                               (   2)
0004 getlocal_OP__WC__0 3
0006 getlocal_OP__WC__0 2
0008 opt_plus         <callinfo!mid:+, argc:1, ARGS_SKIP>
0010 trace            16                                              (   3)
0012 leave                                                            (   2)

So, to generate Trees for Ruby, we can parse the bytecodes of a method, rather than the source directly. In the rbjitglue, this happens almost entirely within the file RubyIlGenerator.cpp.

The basic bytecode IL generation scheme from Testarossa that Ruby follows assumes that each bytecode corresponds to, at most, one basic block. So an array of basic blocks is created, and then the bytecodes are walked in an abstract-interpretation process that generates trees into those basic blocks. Abstract interpretation here means that the IL generator generates code corresponding to what the interpreter would do.

When a bytecode-level jump is detected, a tree-level jump is generated to the basic block corresponding to the destination bytecode.

The bulk of the work ends up being done inside RubyIlGenerator::indexedWalker. At its heart, it is a giant switch statement:

switch (insn)
   {
   case BIN(nop):                  /*nothing */ _bcIndex += len; break;
   // elided
   case BIN(getlocal):             push(getlocal(getOperand(1)/*idx*/, getOperand(2)/*level*/)); _bcIndex += len; break;
   // elided
   case BIN(leave):                _bcIndex = genReturn(pop()); break;

For each YARV bytecode, we reproduce in the abstract the operation that would occur at execution time. For example, the YARV opcode getlocal pushes a value from the local area onto the stack, so the abstract version generates a tree that loads the value from the local area and pushes that tree onto an abstract stack. Other opcodes, like leave in this example, consume values off this abstract stack.
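
The "abstract stack" here is just a stack of IL nodes maintained while generating IL. A minimal sketch of the idea, in C for illustration (the real code is C++ in RubyIlGenerator.cpp and pushes and pops Testarossa node objects; Node and the fixed-size stack below are stand-ins):

/* Sketch of an IL-generation-time operand stack; Node stands in for an IL node. */
#include <assert.h>

typedef struct Node Node;            /* opaque stand-in for a Testarossa node */

enum { MAX_STACK = 128 };
static Node *operand_stack[MAX_STACK];
static int   operand_sp;

static void push(Node *n)            /* e.g. getlocal: push the generated load tree */
{
    assert(operand_sp < MAX_STACK);
    operand_stack[operand_sp++] = n;
}

static Node *pop(void)               /* e.g. leave: consume the value to return */
{
    assert(operand_sp > 0);
    return operand_stack[--operand_sp];
}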

At the end of the procedure, we have trees! Let's look at the trees for the getlocal and opt_plus operations from foo above.

n23n      treetop                                                                    
n22n        lloadi  a[#612  a -24]                             
n21n          aloadi  ep[#599  ep +48]                        
n20n            aloadi  cfp[#600  cfp +48]                       
n19n              aload  <parm 0 Lrb_thread_t;>[#598  Parm]
n28n      treetop                                                                    
n27n        lloadi  b[#613  b -16]                               
n26n          aloadi  ep[#599  ep +48]                        
n25n            aloadi  cfp[#600  cfp +48]                 
n24n              aload  <parm 0 Lrb_thread_t;>[#598  Parm]  
n35n      astorei  pc[#603  pc]                                
n34n        aloadi  cfp[#600  cfp +48]                          
n33n          aload  <parm 0 Lrb_thread_t;>[#598  Parm]       
n32n        aconst 0x5606fbdad740                                                    
n36n      treetop                                                                    
n31n        lcall  vm_opt_plus[#290  helper Method]           
n30n          aload  <parm 0 Lrb_thread_t;>[#598  Parm]       
n29n          aconst 0x5606fbdad7a0                                                  
n22n          ==>lloadi
n27n          ==>lloadi

For education's sake, I've simplified this by stripping out some flags and node metadata. You can generate your own (full) log like this by using the option traceFull,log=LOGFILENAME.

There's quite a bit to see here, even though it's been simplified!

  1. In the left column are node numbers, identifiers that let us humans refer to a particular node unambiguously.
  2. n23n is a treetop node. It exists to hang children off of, to provide ordering. The relative ordering here says that the nodes underneath n23n must be evaluated before the nodes underneath n28n.
  3. The tree rooted at n22n is the load of the parameter a passed into foo. The shape of this tree is dictated by the Ruby code. The sequence is essentially thread->cfp->ep[indexOf(a)] in C.
  4. The tree rooted at n35n is a store to the YARV program counter. This tree is here because we ensure in the Ruby JIT that the interpreter can take over at any time, by restoring all interpreter state whenever it is possible that the JIT code may lose control (for example, through an exception occurring during a callback).
  5. The tree rooted at n31n corresponds to the call to a helper for the YARV instruction opt_plus. This is a generated callback.

Callback Generation

I said earlier we'd talk more about the jit_callbacks_struct. In its basic form, it looks like this:

struct jit_callbacks_struct {
   /* Include generated callbacks. */
   #include "vm_jit_callbacks.inc"
   /* ... */
   const char *         (*rb_id2name_f)                 (ID id);
   /* ... */
};

The #include is there because an appreciable fraction of our callbacks are actually generated at build time, rather than created by hand. We are able to do this for Ruby because of the very clever design of YARV.

YARV uses an “instruction definition file” called insns.def, which is used to generate the core interpreter loop. However, we can also take advantage of this definition file to create callbacks automatically. In our modified RubyVM build process, we use the same processing code that creates the interpreter loop, and have it emit callbacks and function pointer declarations automatically.

So, an instruction definition looks like this (gently annotated for this blog post):

/**
  @c put
  @e to_str
  @j Pushes the result of to_str onto the stack.
 */
DEFINE_INSN
tostring
()              /* Instruction operands -- these are in the instruction sequence */
(VALUE val)     /* Values popped from the operand stack                           */ 
(VALUE val)     /* Return value                                                   */ 
{
    val = rb_obj_as_string(val);
}

Our generated callback (put into vm_jit.inc) looks like this:

VALUE
vm_tostring_jit(VALUE val)
{
    val = rb_obj_as_string(val);
    return val;
}
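
Alongside the callback body, the build also emits a matching function pointer declaration that ends up in the jit_callbacks_struct via the #include shown earlier. Its exact spelling is generated, so the line below is only my guess at its shape:

VALUE                (*vm_tostring_jit_f)             (VALUE val);

With that pointer filled in by vm_jit_init, the JIT can generate a call through it whenever it sees a tostring instruction.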

Generating callbacks allows us to support a broad range of Ruby opcodes without having to write large amounts of ILGen code by hand. There's a tradeoff, though: as Evan Phoenix covers elegantly in his RubyKaigi 2015 keynote, this style of JIT has a limited optimization horizon. It fits quite nicely, however, into the philosophy of our Ruby JIT.

Code Generation

I won't cover code generation much here, as it's not especially Ruby-specific and really deserves coverage from others. The final product of compilation, however, is a startPC, which points to the beginning of the generated code for a method. That's what vm_exec_jitted invokes, as described above.
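
For completeness, vm_exec_jitted can be imagined as little more than a call through a pointer of the jit_method_t type shown earlier. The sketch below assumes CRuby's internal headers, and lookup_start_pc is a hypothetical stand-in for however the modified VM finds the startPC for an iseq:

/* Sketch only; lookup_start_pc is a hypothetical helper, not the real code. */
extern void *lookup_start_pc(const rb_iseq_t *iseq);

static VALUE
vm_exec_jitted(rb_thread_t *th)
{
    jit_method_t body = (jit_method_t)lookup_start_pc(th->cfp->iseq);
    return body(th);   /* compiled bodies take only the thread, per the convention above */
}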

Conclusion

Phew — you made it! Thanks so much for reading. That was a not-so-short summary of how the Ruby+OMR JIT compiler works. Right now it only works with Ruby 2.2; however, I'm slowly plugging away at moving it forward to Ruby 2.4.

There's a huge list of things that still need to be built for the Ruby JIT:

  1. Runtime assumptions: The ability to speculatively assume certain things about the state of the virtual machine, and then update the generated code when that state changes. An example would be assuming that the + operator hasn't been redefined, but then being able to cope when it does get redefined, either by discarding compiled methods or, more likely, by patching the compiled code.
  2. Asynchronous compilation: Today, when a method is compiled, the thread executing Ruby code is stopped until the compilation is done. This doesn't need to be the case, but for simplicity's sake we chose not to implement asynchronous compilation. If we were to, we could also add things like multiple compilation threads.
  3. More advanced optimization: Today, the Ruby JIT runs only a short list of optimizations. OMR supports many more — we need to enable them.
  4. Recompilation: We have preliminary recompilation support; however, recompilation currently uses the same strategy as the initial compilation — we need it to enable more optimization.

I’d love to hear what you think of our progress so far. Leave a comment below this post, send a note through our mailing list, or open an issue on GitHub.

4 Comments on "Introducing the Ruby+OMR JIT"

  1. Thank you for writing this! I’m sort of a serial language designer/implementer, so OMR definitely attracts my attention.

    Regarding your last points:

    1. Runtime assumptions: does this mean Ruby+OMR doesn’t do any deoptimization yet? JITting Ruby without speculatively optimizing and being able to deoptimize sounds tricky, since in Ruby nearly anything can change at runtime. Does Ruby+OMR instead just emit a lot of VM calls to handle this?

    3. Do the not-yet enabled optimizations require more cooperation from the VM?

    • Hi Peter,

      You are correct that we currently don’t do any deoptimization. This means that the code we generate is _very_ conservative, to handle all of the dynamism in Ruby. We’d like to be doing deoptimization soon, though we also need to do some thinking and work in OMR about how to expose this functionality in a way that is usable.

      Regarding the not-yet-enabled optimizations: Certainly some of them may need some VM interfacing: for example, teaching the simplifier how to fold FIXNUM computation (and inserting the redefinition guards required). But a good swath of them operate purely at the intermediate-language level, and we don’t have to teach those optimizations about Ruby.
