develooper Front page | perl.perl5.porters | Postings from April 2009

LLVM as a compilation target for Perl

From:
Yuval Kogman
Date:
April 1, 2009 00:03
Subject:
LLVM as a compilation target for Perl
Message ID:
a891e1bd0903310856i1fdd7311p667e9504ab68dc1a@mail.gmail.com
Hi,
First off, apologies for posting out of thread. I'm unsubscribed from p5p
until I am back in Israel and have more time, and I don't know how to inject
in-reply-to using gmail. For this reason please remember to CC me when
replying.

This is a summary of my thoughts on LLVM as a target for Perl 5.

The reason I'm posting is that it seems to me like most of the facts in the
current discussion are just plain wrong.

I tried dabbling with LLVM JIT for Perl last year, and came up with somewhat
interesting results.

Anyway, here is a summary of my thoughts concerning LLVM and potentially
also other compilation targets, the numerous problems that will have to be
dealt with, and what I perceive to be the important benefits (IMHO
performance is not the selling point for this effort).




LLVM - what it is and what it isn't (from a p5 internals POV)
=============================================================

This section aims to clarify some misconceptions about what LLVM is and does.

LLVM is a compiler toolchain that can be used as a library. It has several
components:


LLVM assembly
-------------

LLVM assembly is a low-level language. It is the intermediate
representation LLVM uses.

It has two on-disk formats, one binary and one text, and an in-memory
structure that can be generated using the C++ libraries.

It is designed to be portable and easy to optimize.

The level of abstraction is very similar to C: memory management is
manual and pointer based, it uses the same C types and data structure
abstractions (structs, pointer arrays, etc.), and it follows the C calling
conventions (calling an LLVM function and a C function is the same).

The language itself is very different from C: the code structure is much
more similar to that of an assembly language; it's designed to be written
and read by machines, not humans.

The language also provides some control features C lacks, such as tail
calls and efficient exception handling.


C to LLVM compilers
-------------------

llvm-gcc and clang compile C to LLVM assembly.

They can be used to generate LLVM-optimizable versions of the Perl pp
functions and support code (e.g. sv.c and friends).

[Yuval: this was the main focus of my experimentation]


Running LLVM code
-----------------

LLVM provides an LLVM assembly interpreter (JIT based), but it also has
an LLVM assembly to static binary compiler.

The interpreter has some startup penalties, and requires the libraries to
be linked.


LLVM at runtime
----------------

The LLVM bytecode library can be used to generate code at runtime, which
can then be run by the JIT.

This code is linked into the host process, and can use existing C or LLVM
symbols, whether statically or dynamically loaded.




LLVM's major performance advantage over traditional compilers is its link
time optimization ability. It can analyze code interprocedurally, and
optimize aggressively based on this knowledge. For instance:

* code loaded from a library can be inlined at a callsite if appropriate

* pointer indirection can be eliminated by modifying the calling
  conventions at the call site and in the function

* loop invariants inside a function can be moved outside to the callsite

* all symbols can be made static after linkage, preventing further
  linkage but allowing more aggressive optimizations




Current Results
===============

Perl internals on LLVM
----------------------

By compiling perl to LLVM assembly with llvm-gcc (in the future clang
will be more appropriate), I was able to produce an LLVM assembly based
libperl. This means that LLVM could optimize across the various code
units.

This produced variable results, with perlbench running on average 10%
faster compared to the same version of GCC on my MacBook Pro, if I recall
correctly.

The results of this have been mailed to p5p.

On their own these results are uninteresting - it's just porting the build
toolchain to a rather esoteric compiler, which happens to be insignificantly
faster on my machine.

Perl code on LLVM
-----------------

The next logical step is compiling optrees to LLVM assembly.

I created a C function that is akin to an unrolled runops for a simple
arithmetic perl subroutine.

Perl's opcode dispatch is very simple to inline this way, since runops
essentially just walks a graph:

PL_op = CALL_FPTR(PL_op->op_ppaddr)(aTHX);

unrolling this would amount to creating:

PL_op = pp_foo(aTHX);
PL_op = pp_bar(aTHX);

For simple arithmetic, compiling a perl optree to this form yielded a
performance improvement of about 15% compared to runops_standard for the
LLVM compiled perl.

LLVM can produce a function pointer which is trivial to install in the XSUB
field of an SV.

This means that jitted subroutines are upgraded to XSUBs, while unjitted
(or unjittable) subroutines are executed using the runops function, with
no code changes.




An incremental plan for P5 on LLVM
==================================

In order to properly tap into the potential of LLVM as a target, several
changes must be made.

This section provides an incremental plan for deliverables that should
demonstrate the usefulness or futility of the project early on, before
sweeping code changes are committed to.

1. JITable ops
--------------

To allow LLVM to aggressively optimize the ops, data should be
transferred using C calling conventions instead of Perl's various stacks.

For instance, using pp_add currently involves pushing two SVs onto the
stack, and then invoking a C function and providing it with a pointer to
the stack using aTHX.

A more direct approach would be a r_add function:

SV * r_add(pTHX_ SV *l, SV *r);

pp_add could be defined in terms of r_add by simply popping off
parameters and calling the C function with them.

The possible tradeoffs of this conversion are discussed in more detail
below.

The conversion should be carried out piecemeal, starting with a handful
of ops for testing. Only ops with a perceived benefit should be
converted.

2. JITtable emitted code
------------------------

Using these ops is a little more involved than simply unrolling runops.

The implicit stack parameters for a given optree must be deduced.

Fortunately arity data is annotated in the opcode table.

By starting with simple jittable ops and working up towards more complex
ones, this approach can be usable without supporting all of perl's
operator calling conventions.

By processing the optree and annotating it with an SSA form (using a
lookup table on the side, to avoid changing any struct definitions), the
data flow is easily converted to LLVM assembly.

The boundary between jittable ops and non-jittable ops is bridged by
generating appropriate stack pushes and pops.

Effectively, as far as the regular Perl runloop is concerned, the
sequence of jitted ops has been converted to a custom single op.

The generated code for e.g. 3 + $a will look something like:

SV *t1 = r_const(aTHX_ op1); /* extract the sv inside op1 */
SV *t2 = r_padsv(aTHX_ op2); /* extract the lexical variable $a */
SV *t3 = r_add(aTHX_ op3, t1, t2);
PL_op = op4; /* whatever is the ->op_next of the add op */
XPUSHs(t3); /* push the result on the stack */
/* execution resumes normally using runops */

The r_* functions will look inside an explicit op parameter, instead of
relying on PL_op, to get the various flags and metadata they require.
Data present in the dynamic scope is still reached using the THX or
globals (in this case the pad offset is found using op2, but the value
inside the pad is not reachable via just the optree; that part remains
the same).

Each subroutine is compiled separately. Initially only subroutines that
are easy to compile (made of fixed arity ops, no stack boundaries, etc.)
can be compiled, with additional functionality added later.

By "upgrading" pure perl subs into XSUBs after successful compilation, no
special hooking is required.

For the uninitiated - SSA is Static Single Assignment. In this form every
*value* has a symbol, and symbols are only assigned once. This is a common
intermediate representation for many languages.

LLVM assembly is based on SSA so this form will map directly to the emitted
output.

Generating an SSA form from a stack is fairly simple, using a stack of
symbols and a counter to generate new symbols. That way each stack
position is assigned a symbol on push, and the symbol is used on peek
(and of course discarded on pop).

This symbol stack is duplicated on branches, and merged at joins
similarly to the way the SSA phi function works.

3. Control ops
--------------

Branches can initially be handled simply by comparing PL_op:

PL_op = pp_foo(aTHX); /* some conditional/loop op */

if ( PL_op == if_op->op_next ) {
    ...
} else { /* == op_other */
    ...
}

simplifying the task of emitting "native" branches such as

if ( SvTRUE(t1) ) {
    ...
} else {
    ...
}

This is easy to do because all of the conditional ops are handled very
consistently by Perl.

The only exception to this rule is explicit vs. implicit return, but
that's a relatively simple case to handle.


Long term
---------

If/when the project is at the stage where trivial subroutines can be
compiled to LLVM assembly, jitted into a function pointer, and installed
into an XSUB, then it will be time to convert the remaining ops, starting
with pp_hot, to parameter based definitions.

Other corners to be rounded off:

Command line options and/or pragmas for controlling the compilation
should be introduced, including parameters for automatic jitting (see
below for profile based compilation).

Stress testing, especially to demonstrate limitations of the C based
runtime, needs to be done. Since this potentially uses the C stack more
aggressively, new limitations might be found and will need to be worked
around. See below for Cheney on the MTA and other ideas.

New unit tests should be written (or at least a new harness to use the
existing tests with various JIT options).

A marker for jitted CVs (as opposed to real XSUBs) could be introduced,
to allow disabling the JIT temporarily (e.g. for debugger tracing,
re-enabling pluggable runops, etc.). The LLVM optree should probably be
stored either in magic or in the SV ANY slot of the CV.

Lots of work will be needed to make LLVM an optional dependency. LLVM is
not as portable as Perl, and will only be a desirable target on platforms
it supports native JIT for. This incurs a big penalty on the already high
complexity of the code (many build variants, lots of preprocessing), and
must be managed very carefully to avoid becoming a big maintenance
burden.


Tradeoffs
---------

Splitting up pp_ ops into smaller functions is a big manual effort.
Fortunately it can be done incrementally, and it's not that hard to
convert each function.

The drawbacks are that it will either slow down normal C compilation, or
it will require even more crazy preprocessing of the C code to make sure
the r_* functions are not defined unless compilation is targeting LLVM.

Aside from this the drawbacks of using LLVM involve:

1. more complex compilation toolchain (LLVM assembly output must be dealt
   with)

2. slower startup times for llvm based perls

3. inability to use the debugger or pluggable runops modules on jitted code

4. more memory usage (compiled optrees, llvm library, llvm intermediate
   structures, etc)

5. this plan is not guaranteed to work well. Preliminary results are
   vague (but still somewhat promising). While not as committing as a new
   virtual machine for perl 5, this is still a big undertaking.

   I estimate at least 1-2 months of full time work by someone who is at
   least somewhat familiar with both projects until we can judge whether
   or not this project can succeed at all.

   This is not *that* much work, but is still more than anyone involved
   in p5 internals currently has. I suspect GSoC might be a good fit for
   this.

   I doubt we can estimate how long it will take from the point where
   it's deemed viable to the point where it's a real performance
   improvement, to the point where it's production ready. Even though
   this is not as ambitious as ponie, this is still a big undertaking.

Advantages of this plan compared to other approaches of jitting perl 5:

1. no code changes: the same intricate definitions of all of perl's
   opcodes are reused. This means that all the accumulated wisdom is not
   lost, the data structures are the same, and in theory the system is
   still binary compatible with compiled XS modules.

   This means a lot less work than developing a real VM backend for
   perl 5.

   LLVM's C interoperability features are key to this.

2. LLVM is backed by many organizations; it's stable, well funded, and
   has a lively community, so it is unlikely to die off.

   This provides us with a real, feature complete JIT system with very
   advanced optimizations that can target many platforms, without any
   maintenance burden on the p5 developers.



However, in the long run this is still a micro optimization. Ruby is
reportedly about twice as slow as perl for most things. That doesn't keep
anyone from writing successful applications in Ruby.

Java is probably many times faster than Perl. None of us seem to be
switching to it because you can crunch numbers faster in it.

A JIT can't make dramatic high level performance improvements.

It'll speed up tight loops, simple arithmetic, opcode dispatch, etc., but
for apps whose performance is bound by memory management overhead or IO
this effort will go almost unnoticed.

Furthermore, IMHO most apps are not even bound by those limitations, but
rather by programmer laziness/sloppiness. If better performance is
low-hanging fruit that remains unpicked, JIT is really not going to make
much of a difference.

Since typical Perl code probably won't be dramatically affected by the
performance benefits of this project, funding is probably unlikely.

Even if it runs 2x faster, it doesn't mean it'll be 2x cheaper to run.
This is the "you are not Google" axiom of scalability. Even if people
want this performance, for most of them it won't actually help their
business.

This project might allow Perl to become an appropriate choice for more
performance critical applications, or alleviate the urge to write XS based
optimizations, but that effect will probably be minor at best.

I suspect the main benefit would actually be in marketing (look how excited
we're getting about python being "5x faster"), and in demonstrating that the
runtime is able to adapt and be modernized, perhaps paving the path to more
ambitious language and runtime improvements.




***** THAT LAST CHUNK WAS THE IMPORTANT BIT, IF YOU SKIPPED IT THE REST
IS EVEN MORE BORING *****








Future directions
=================

Some ideas on how to take this further, based on other interesting projects.

These are ideas for further experimentation, some of which I've tried
pursuing out of curiosity, none of which were anywhere near done.

None of these ideas are relevant until a stable and useful LLVM emitter can
actually be demonstrated for p5, with good performance on artificial
benchmarks.

They could theoretically be used to bridge the gap between good
performance in the lab and good performance in the wild.



1. Profile based compilation
----------------------------

To avoid too much memory usage for generated code, and too much CPU usage
generating it, code tracing should probably be used to identify hot
paths.

This can be done by modifying ENTERSUB to trigger JIT compilation for
commonly used CVs.

To avoid cache spilling while tracing, a hashing variant of a bloom
filter may be used to count CVs. This loses accuracy, but given that this
is a stochastic process anyway the tradeoff is probably fair.

A hash function with N salts can be used to sequentially flip the bits of
the bloom filter. The hash is computed over the salt and the CV address.

The first hash is used to find the actual bloom filter. By doing this
lookup once, we avoid a cache spill once per hash function.

The hashing is then repeated for the CV address. If the bloom filter bit is
off, it is flipped on and the loop is aborted.

If all the bits are turned on, the subroutine is jitted.

By increasing N, the time before jitting is increased (a higher threshold
for what counts as a hot path).

By increasing the number or size of the bloom filters, collisions are
avoided.

By using the hashing variant of a bloom filter (sub-partitions), instead
of spilling the cache up to N times, it is spilled at most once.

The advantage over a hash of counters keyed by SV is significant: there
is only one pointer indirection, the memory usage for the table is
constant, and the operations to update the counters are simple and always
take constant time.

Yet another variant is introducing random decay to the filter. Once a
certain fill percentage is reached, random bits are turned off. This
allows the system to keep jitting code without getting false positives
from filling the filter up too much. This approach is also sensitive to
changing hot paths, without needing historical data, using stochastic
approximation.



2. Trace based compilation
--------------------------

Another interesting direction is using type information to explode the code
path, much like psyco does.

This technique allows tight loops in dynamic code to have the dynamic
typing semantics moved out of the loops as invariants, producing dramatic
performance improvements for simple arithmetic.

This can be done by generating guard clauses that inspect the SV flags,
and then producing code that treats these parameters as invariants:

if ( SvIOK(sv) ) {
    /* generate code in which this condition is an invariant */
}

This technique has been used by the psyco project to produce python code
that computes simple arithmetic at speeds competitive with C/Java,
without any code changes in the python.

The drawback is potentially huge memory usage due to the many variants
being generated. Combining this with profile based compilation could be
an interesting future direction.

LLVM's optimizers are supposed to be clever enough to perform high
quality constant detection, even across procedure boundaries, and to use
that for branch elimination, so it may be a matter of just generating
const variants.

Otherwise, further dumbed down versions of the ops are in order (for
instance, r_add on IVs, with two code paths, one for overflows and one
without, and a separate r_add for NVs which does floating point math).

By having these variants, the emitted machine code could potentially be
much smaller/faster, and the various sub opcodes would be better
candidates for inlining by LLVM.



3. Inter procedure optimization
-------------------------------

entersub ops can be analyzed in depth.

If the CV of the target is known at compile time (the call is made using
a GV pointer instead of a symbol name), the LLVM variant of that sub can
be called directly, skipping entersub altogether.

This could be facilitated using some aggressive inlining pragma.



4. Stack overflow guards
------------------------

To prevent overflowing the C stack, entersub ops can have an additional
guard clause added, to fall back to "traditional" interpretation if the C
stack is running out of room.

This allows the system to make use of slower but safe code as necessary,
if the recursion limits are reached.

A related approach is using Cheney on the MTA techniques to swap out the
C stack, or possibly overflow it into the perl stack, to avoid this limit
without the performance tradeoffs of switching back to interpreting
optrees.



5. Other targets
----------------

The work on refactoring the pp functions, creating an emitter, etc. will
likely pave the path towards enabling other VMs in the future too.

The hardest part of targeting Perl to other VMs is that the opcodes are
very intricately defined, and are highly dependent on the execution
context (the state of the stacks, the data in the optree, the op flags,
etc.).

LLVM should be able to provide better performance by refactoring large pp
functions into smaller, simpler ones.

This is a much bigger undertaking than using LLVM with the existing ops,
because it requires reimplementing, not just refactoring, the various
opcode implementations; but immediate performance benefits might provide
an additional incentive towards making these changes.

This is an incremental step towards cleaning up the runtime so that the
existing OP* structures can be sanely compiled to other targets.

This will also greatly simplify generating optrees with B::Generate, and
analyzing existing optrees.

This is the most pie in the sky idea, but in the long run probably the most
important.

My interest in this work was as a project that could potentially catalyze
some improvement to the hackability of the optree; the performance
aspects are minor.


Prior Art
=========

LuaJIT 2.x has trace based compilation that produces assembly code only
for exercised code paths. It's very clever and clean.
http://lua-users.org/lists/lua-l/2008-02/msg00051.html contains lots of
info on the internals.

psyco uses trace caching to generate static variants of dynamic python
code, to eliminate branches early on: http://psyco.sourceforge.net/

http://home.pipeline.com/~hbaker1/CheneyMTA.html describes a technique
for compiling continuation passing style code to native C code that
prevents the C stack from growing without using a trampoline function.



