Trace-based Just-in-Time Type Specialization for Dynamic
Languages
Andreas Gal∗+, Brendan Eich∗, Mike Shaver∗, David Anderson∗, David Mandelin∗,
Mohammad R. Haghighat$, Blake Kaplan∗, Graydon Hoare∗, Boris Zbarsky∗, Jason Orendorff∗,
Jesse Ruderman∗, Edwin Smith#, Rick Reitmaier#, Michael Bebenita+, Mason Chang+#, Michael Franz+
Mozilla Corporation∗
{gal,brendan,shaver,danderson,dmandelin,mrbkap,graydon,bz,jorendorff,jruderman}@mozilla.com
Adobe Corporation#
{edwsmith,rreitmai}@adobe.com
Intel Corporation$
{mohammad.r.haghighat}@intel.com
University of California, Irvine+
{mbebenit,changm,franz}@uci.edu
Abstract
Dynamic languages such as JavaScript are more difﬁcult to com-
pile than statically typed ones. Since no concrete type information
is available, traditional compilers need to emit generic code that can
handle all possible type combinations at runtime. We present an al-
ternative compilation technique for dynamically-typed languages
that identiﬁes frequently executed loop traces at run-time and then
generates machine code on the ﬂy that is specialized for the ac-
tual dynamic types occurring on each path through the loop. Our
method provides cheap inter-procedural type specialization, and an
elegant and efﬁcient way of incrementally compiling lazily discov-
ered alternative paths through nested loops. We have implemented
a dynamic compiler for JavaScript based on our technique and we
have measured speedups of 10x and more for certain benchmark
programs.
Categories and Subject Descriptors D.3.4 [Programming Lan-
guages]: Processors — Incremental compilers, code generation.
General Terms Design, Experimentation, Measurement, Perfor-
mance.
Keywords JavaScript, just-in-time compilation, trace trees.
1. Introduction
Dynamic languages such as JavaScript, Python, and Ruby are popular
since they are expressive, accessible to non-experts, and make
deployment as easy as distributing a source file. They are used for
small scripts as well as for complex applications. JavaScript, for
example, is the de facto standard for client-side web programming
and is used for the application logic of browser-based productivity
applications such as Google Mail, Google Docs and Zimbra
Collaboration Suite. In this domain, in order to provide a fluid user
experience and enable a new generation of applications, virtual
machines must provide a low startup time and high performance.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. To copy otherwise, to republish, to post on servers or to redistribute
to lists, requires prior specific permission and/or a fee.
PLDI’09, June 15–20, 2009, Dublin, Ireland.
Copyright © 2009 ACM 978-1-60558-392-1/09/06...$5.00
Compilers for statically typed languages rely on type informa-
tion to generate efﬁcient machine code. In a dynamically typed pro-
gramming language such as JavaScript, the types of expressions
may vary at runtime. This means that the compiler can no longer
easily transform operations into machine instructions that operate
on one speciﬁc type. Without exact type information, the compiler
must emit slower generalized machine code that can deal with all
potential type combinations. While compile-time static type infer-
ence might be able to gather type information to generate opti-
mized machine code, traditional static analysis is very expensive
and hence not well suited for the highly interactive environment of
a web browser.
We present a trace-based compilation technique for dynamic
languages that reconciles speed of compilation with excellent per-
formance of the generated machine code. Our system uses a mixed-
mode execution approach: the system starts running JavaScript in a
fast-starting bytecode interpreter. As the program runs, the system
identiﬁes hot (frequently executed) bytecode sequences, records
them, and compiles them to fast native code. We call such a se-
quence of instructions a trace.
Unlike method-based dynamic compilers, our dynamic com-
piler operates at the granularity of individual loops. This design
choice is based on the expectation that programs spend most of
their time in hot loops. Even in dynamically typed languages, we
expect hot loops to be mostly type-stable, meaning that the types of
values are invariant. (12) For example, we would expect loop coun-
ters that start as integers to remain integers for all iterations. When
both of these expectations hold, a trace-based compiler can cover
the program execution with a small number of type-specialized, ef-
ﬁciently compiled traces.
Each compiled trace covers one path through the program with
one mapping of values to types. When the VM executes a compiled
trace, it cannot guarantee that the same path will be followed
or that the same types will occur in subsequent loop iterations.
Hence, recording and compiling a trace speculates that the path and
typing will be exactly as they were during recording for subsequent
iterations of the loop.
Every compiled trace contains all the guards (checks) required
to validate the speculation. If one of the guards fails (if control
ﬂow is different, or a value of a different type is generated), the
trace exits. If an exit becomes hot, the VM can record a branch
trace starting at the exit to cover the new path. In this way, the VM
records a trace tree covering all the hot paths through the loop.
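The interplay of speculation, guards, and side exits can be illustrated with a minimal sketch in JavaScript. This is not TraceMonkey's code: real traces are native machine code, and the names `runTrace`, `exit`, and the layout of the `state` object are hypothetical, chosen only for illustration. The sketch assumes a trace recorded for a summing loop in which every array element was a number.

```javascript
// Hypothetical stand-in for a compiled trace of:
//   for (; i < n; ++i) sum += a[i];
// specialized under the assumption that every a[i] is a number.
// All names here are illustrative and do not come from TraceMonkey.
function runTrace(state) {
  while (true) {
    // Guard on control flow: the recorded path stayed inside the loop.
    if (!(state.i < state.n)) {
      return { exit: "loop-done", state };
    }
    const v = state.a[state.i];
    // Guard on type: the recorded path observed a number here.
    if (typeof v !== "number") {
      return { exit: "type-mismatch", state };
    }
    state.sum += v; // specialized numeric add, no generic type dispatch
    state.i += 1;
  }
}

const ok = runTrace({ a: [1, 2, 3], i: 0, n: 3, sum: 0 });
const bad = runTrace({ a: [1, "x", 3], i: 0, n: 3, sum: 0 });
console.log(ok.exit, ok.state.sum);                // loop-done 6
console.log(bad.exit, bad.state.i, bad.state.sum); // type-mismatch 1 1
```

In the second call the type guard fails at index 1 and the sketch hands back the intermediate state, which is the point at which a VM could resume interpreting or, if the exit became hot, start recording a branch trace.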
Nested loops can be difﬁcult to optimize for tracing VMs. In
a naïve implementation, inner loops would become hot first, and
the VM would start tracing there. When the inner loop exits, the
VM would detect that a different branch was taken. The VM would
try to record a branch trace, and ﬁnd that the trace reaches not the
inner loop header, but the outer loop header. At this point, the VM
could continue tracing until it reaches the inner loop header again,
thus tracing the outer loop inside a trace tree for the inner loop.
But this requires tracing a copy of the outer loop for every side exit
and type combination in the inner loop. In essence, this is a form
of unintended tail duplication, which can easily overﬂow the code
cache. Alternatively, the VM could simply stop tracing, and give up
on ever tracing outer loops.
We solve the nested loop problem by recording nested trace
trees. Our system traces the inner loop exactly as the naïve version does.
The system stops extending the inner tree when it reaches an outer
loop, but then it starts a new trace at the outer loop header. When
the outer loop reaches the inner loop header, the system tries to call
the trace tree for the inner loop. If the call succeeds, the VM records
the call to the inner tree as part of the outer trace and ﬁnishes
the outer trace as normal. In this way, our system can trace any
number of loops nested to any depth without causing excessive tail
duplication.
These techniques allow a VM to dynamically translate a pro-
gram to nested, type-specialized trace trees. Because traces can
cross function call boundaries, our techniques also achieve the ef-
fects of inlining. Because traces have no internal control-ﬂow joins,
they can be optimized in linear time by a simple compiler (10).
Thus, our tracing VM efﬁciently performs the same kind of op-
timizations that would require interprocedural analysis in a static
optimization setting. This makes tracing an attractive and effective
tool to type specialize even complex function call-rich code.
We implemented these techniques for an existing JavaScript in-
terpreter, SpiderMonkey. We call the resulting tracing VM Trace-
Monkey. TraceMonkey supports all the JavaScript features of Spi-
derMonkey, with a 2x-20x speedup for traceable programs.
This paper makes the following contributions:
• We explain an algorithm for dynamically forming trace trees to
cover a program, representing nested loops as nested trace trees.
• We explain how to speculatively generate efﬁcient type-specialized
code for traces from dynamic language programs.
• We validate our tracing techniques in an implementation based
on the SpiderMonkey JavaScript interpreter, achieving 2x-20x
speedups on many programs.
The remainder of this paper is organized as follows. Section 3
gives a general overview of the trace-tree-based compilation we use to
capture and compile frequently executed code regions. Section 4
describes our approach to covering nested loops using a number of
individual trace trees. Section 5 describes the speculative type
specialization, based on trace compilation, that we use to generate
efficient machine code from recorded bytecode traces. Our
implementation of a dynamic type-specializing compiler for
JavaScript is described in Section 6. Related work is discussed in
Section 8. In Section 7 we evaluate our dynamic compiler based on
1 for (var i = 2; i < 100; ++i) {
2 if (!primes[i])
3 continue;
4 for (var k = i + i; k < 100; k += i)
5 primes[k] = false;
6 }
Figure 1. Sample program: sieve of Eratosthenes. primes is
initialized to an array of 100 true values on entry to this code
snippet.
[Figure 2 diagram: state machine. States: Interpret Bytecodes, Monitor, Record LIR Trace, Compile LIR Trace, Enter Compiled Trace, Execute Compiled Trace, Leave Compiled Trace. Transition labels: loop edge; hot loop/exit; abort recording; finish at loop header; cold/blacklisted loop/exit; compiled trace ready; loop edge with same types; side exit to existing trace; side exit, no existing trace. Symbol key: Overhead, Interpreting, Native.]
Figure 2. State machine describing the major activities of
TraceMonkey (TM) and the conditions that cause transitions to a new
activity. In the dark box, TM executes JS as compiled traces. In the
light gray boxes, TM executes JS in the standard interpreter. White
boxes are overhead. Thus, to maximize performance, we need to
maximize time spent in the dark box and minimize time spent in
the white boxes. The best case is a loop where the types at the loop
edge are the same as the types on entry; then TM can stay in native
code until the loop is done.
a set of industry benchmarks. The paper ends with conclusions in
Section 9 and an outlook on future work is presented in Section 10.
2. Overview: Example Tracing Run
This section provides an overview of our system by describing
how TraceMonkey executes an example program. The example
program, shown in Figure 1, computes the prime numbers below 100
with nested loops. The narrative should be read along with Figure 2,
which describes the activities TraceMonkey performs and when it
transitions between them.
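For readers who want to run the example, the following is a self-contained sketch of the Figure 1 program. Two assumptions are made here that go beyond the figure as printed: `primes` is initialized to all `true` values (otherwise line 2 would skip every iteration), and the inner loop tests `k < 100` so that it terminates. The `found` array and the final loop are additions for inspecting the result.

```javascript
// Assumption: primes starts as 100 true values, so that unmarked
// entries are treated as prime until the sieve clears them.
const primes = new Array(100).fill(true);

for (var i = 2; i < 100; ++i) {
  if (!primes[i])
    continue;                            // i was already marked composite
  for (var k = i + i; k < 100; k += i)   // assumption: bound is on k
    primes[k] = false;                   // every multiple of i is composite
}

// Collect the surviving indices (illustrative addition, not in Figure 1).
const found = [];
for (let n = 2; n < 100; ++n) {
  if (primes[n]) found.push(n);
}
console.log(found.length);      // 25
console.log(found.slice(0, 5)); // [ 2, 3, 5, 7, 11 ]
```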
TraceMonkey always begins executing a program in the byte-
code interpreter. Every loop back edge is a potential trace point.
When the interpreter crosses a loop edge, TraceMonkey invokes
the trace monitor, which may decide to record or execute a native
trace. At the start of execution, there are no compiled traces yet, so
the trace monitor counts the number of times each loop back edge is
executed until a loop becomes hot, currently after 2 crossings. Note
that, because of the way our loops are compiled, the loop edge is
crossed before entering the loop, so the second crossing occurs
immediately after the first iteration.
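The hotness counting described above can be sketched as follows. This is a simplified illustration rather than TraceMonkey's monitor: the names `makeMonitor` and `onLoopEdge` and the string loop identifiers are hypothetical, and the sketch omits the checks for existing compiled traces and for blacklisted loops. Only the threshold of 2 crossings comes from the text.

```javascript
// Threshold taken from the text: a loop becomes hot after 2 crossings.
const HOT_THRESHOLD = 2;

// Hypothetical monitor: counts back-edge crossings per loop header.
function makeMonitor() {
  const crossings = new Map(); // loop id -> number of crossings so far
  return function onLoopEdge(loopId) {
    const count = (crossings.get(loopId) || 0) + 1;
    crossings.set(loopId, count);
    // Simplification: a real monitor would first look for a compiled trace.
    return count >= HOT_THRESHOLD ? "record" : "interpret";
  };
}

const onLoopEdge = makeMonitor();
const decisions = [
  onLoopEdge("line4"), // first crossing of the inner loop edge
  onLoopEdge("line4"), // second crossing: the inner loop is now hot
  onLoopEdge("line1"), // first crossing of the outer loop edge
];
console.log(decisions); // [ 'interpret', 'record', 'interpret' ]
```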
Here is the sequence of events broken down by outer loop
iteration:

v0 := ld state[748] // load primes from the trace activation record
st sp[0], v0 // store primes to interpreter stack
v1 := ld state[764] // load k from the trace activation record
v2 := i2f(v1) // convert k from int to double
st sp[8], v1 // store k to interpreter stack
st sp[16], 0 // store false to interpreter stack
v3 := ld v0[4] // load class word for primes
v4 := and v3, -4 // mask out object class tag for primes
v5 := eq v4, Array // test whether primes is an array
xf v5 // side exit if v5 is false
v6 := js_Array_set(v0, v2, false) // call function to set array element
v7 := eq v6, 0 // test return value from call
xt v7 // side exit if js_Array_set returns false.
Figure 3. LIR snippet for sample program. This is the LIR recorded for line 5 of the sample program in Figure 1. The LIR encodes
the semantics in SSA form using temporary variables. The LIR also encodes all the stores that the interpreter would do to its data stack.
Sometimes these stores can be optimized away as the stack locations are live only on exits to the interpreter. Finally, the LIR records guards
and side exits to verify the assumptions made in this recording: that primes is an array and that the call to set its element succeeds.
mov edx, ebx(748) // load primes from the trace activation record
mov edi(0), edx // (*) store primes to interpreter stack
mov esi, ebx(764) // load k from the trace activation record
mov edi(8), esi // (*) store k to interpreter stack
mov edi(16), 0 // (*) store false to interpreter stack
mov eax, edx(4) // (*) load object class word for primes
and eax, -4 // (*) mask out object class tag for primes
cmp eax, Array // (*) test whether primes is an array
jne side_exit_1 // (*) side exit if primes is not an array
sub esp, 8 // bump stack for call alignment convention
push false // push last argument for call
push esi // push first argument for call
call js_Array_set // call function to set array element
add esp, 8 // clean up extra stack space
mov ecx, ebx // (*) created by register allocator
test eax, eax // (*) test return value of js_Array_set
je side_exit_2 // (*) side exit if call failed
...
side_exit_1:
mov ecx, ebp(-4) // restore ecx
mov esp, ebp // restore esp
jmp epilog // jump to ret statement
Figure 4. x86 snippet for sample program. This is the x86 code compiled from the LIR snippet in Figure 3. Most LIR instructions compile
to a single x86 instruction. Instructions marked with (*) would be omitted by an idealized compiler that knew that none of the side exits
would ever be taken. The 17 instructions generated by the compiler compare favorably with the 100+ instructions that the interpreter would
execute for the same code snippet, including 4 indirect jumps.
i=2. This is the ﬁrst iteration of the outer loop. The loop on
lines 4-5 becomes hot on its second iteration, so TraceMonkey en-
ters recording mode on line 4. In recording mode, TraceMonkey
records the code along the trace in a low-level compiler intermedi-
ate representation we call LIR. The LIR trace encodes all the oper-
ations performed and the types of all operands. The LIR trace also
encodes guards, which are checks that verify that the control ﬂow
and types are identical to those observed during trace recording.
Thus, on later executions, if and only if all guards are passed, the
trace has the required program semantics.
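As a rough illustration of what recording captures, the sketch below executes one iteration of the inner loop body while logging each operation together with the types it observed. The record format, the `recordIteration` function, and the `op` names are all hypothetical; they stand in for LIR and are not TraceMonkey's actual representation.

```javascript
// Hypothetical recorder for one iteration of:
//   primes[k] = false; k += i;  (then the loop test k < 100)
// It runs the operations and logs what a trace would need to remember.
function recordIteration(env) {
  const trace = [];
  const typeOf = (v) => (Array.isArray(v) ? "array" : typeof v);

  // Type guards: what must hold for this trace to be reused later.
  trace.push({ op: "guard-type", name: "primes", type: typeOf(env.primes) });
  trace.push({ op: "guard-type", name: "k", type: typeOf(env.k) });
  env.primes[env.k] = false;
  trace.push({ op: "set-element", object: "primes", index: "k", value: false });

  trace.push({ op: "guard-type", name: "i", type: typeOf(env.i) });
  env.k += env.i;
  trace.push({ op: "add", dest: "k", left: "k", right: "i" });

  // Control-flow guard: remember which way the loop test went.
  trace.push({ op: "guard-branch", cond: "k < 100", taken: env.k < 100 });
  return trace;
}

const env = { primes: new Array(100).fill(true), i: 2, k: 4 };
const trace = recordIteration(env);
console.log(trace.length, env.k, env.primes[4]); // 6 6 false
```

Each `guard-type` and `guard-branch` entry corresponds to a check that a compiled trace would perform, with a side exit taken when the check fails.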
TraceMonkey stops recording when execution returns to the
loop header or exits the loop. In this case, execution returns to the
loop header on line 4.
After recording is ﬁnished, TraceMonkey compiles the trace to
native code using the recorded type information for optimization.
The result is a native code fragment that can be entered if the
interpreter PC and the types of values match those observed when
trace recording was started. The ﬁrst trace in our example, T45,
covers lines 4 and 5. This trace can be entered if the PC is at line 4,
i and k are integers, and primes is an object. After compiling T45,
TraceMonkey returns to the interpreter and loops back to line 1.
i=3. Now the loop header at line 1 has become hot, so Trace-
Monkey starts recording. When recording reaches line 4, Trace-
Monkey observes that it has reached an inner loop header that al-
ready has a compiled trace, so TraceMonkey attempts to nest the
inner loop inside the current trace. The ﬁrst step is to call the inner
trace as a subroutine. This executes the loop on line 4 to completion
and then returns to the recorder. TraceMonkey veriﬁes that the call
was successful and then records the call to the inner trace as part of
the current trace. Recording continues until execution reaches line
1, at which point TraceMonkey finishes and compiles a trace
for the outer loop, T16.
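The nesting step can be pictured with the following sketch, in which an outer trace calls an inner trace as a subroutine and guards on how that call exits. The functions `innerTrace` and `outerTrace` and the exit labels are hypothetical stand-ins for the compiled traces, and for brevity the `if` in `outerTrace` folds together two paths that a real trace tree would hold as separate branch traces. The initial all-`true` array is an assumption.

```javascript
// Hypothetical stand-in for the inner trace (lines 4-5 of the sample).
function innerTrace(s) {
  while (s.k < 100) {                                  // stay on the loop path
    if (!Array.isArray(s.primes)) return "side-exit";  // guard: primes is an array
    s.primes[s.k] = false;
    s.k += s.i;
  }
  return "loop-exit";
}

// Hypothetical stand-in for the outer trace (lines 1-6). The call to
// innerTrace is recorded as part of this trace instead of being inlined.
function outerTrace(s) {
  while (s.i < 100) {
    if (s.primes[s.i]) {       // stands in for two branches of the trace tree
      s.k = s.i + s.i;
      // Guard on the nested call: it must leave through the expected exit.
      if (innerTrace(s) !== "loop-exit") return "side-exit";
    }
    s.i += 1;
  }
  return "loop-exit";
}

const s = { primes: new Array(100).fill(true), i: 2, k: 0 };
const result = outerTrace(s);
const remaining = s.primes.slice(2).filter(Boolean).length;
console.log(result, remaining); // loop-exit 25
```

Because the outer trace only contains a call, the inner loop's code exists once regardless of how many paths through the outer loop reach it, which is how nesting avoids tail duplication.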