
More results from llvm-gcc

Yuval Kogman
June 3, 2008 19:18

A while ago Claes compiled Perl 5.6 with llvm-gcc and got some
performance improvements. This is a continuation of that effort
using the 5.10 source tree.

Executive summary: this gives even better results than just plain
llvm-gcc, and theoretically opens up the way for much more. Results
are PerlBench improvements of 15% over standard gcc compilation,
using llvm's optimizing linker.

And now for the details:

llvm-gcc is basically gcc 4.2 with the backend switched to use
llvm's native code generation, instead of gcc's.

Normally running

	llvm-gcc -c -o foo.o foo.c

generates native code, which is then linked by the normal linker.

However, if you run it as

	llvm-gcc -emit-llvm -c -o foo.o foo.c

then foo.o is not a real object file, but an llvm bytecode file.

This file is then linkable with llvm-ld, allowing interprocedural
optimizations at link time.

The result of linking perl like this

	llvm-ld -native -O5 -o perl blah blah blah

is an executable that is on average 15-20% faster as measured by
PerlBench on my machine, than the perl I compiled with gcc 4.0 and
use normally.

I attribute this 10% improvement over plain llvm-gcc (linking native
.o files instead of bytecode) to LLVM's extensive link time
optimizations.

When linking without -native the perl executable is actually a
shell script that runs lli on perl.bc. This has a very slow startup
(about 3.5 seconds) but after that it's just as fast and sometimes
faster than the -native executable. Unfortunately it cannot be used
with the -e command line option (filter_del emits an error about
removing filters). I haven't debugged this yet.

In order for dynamic loading of modules to work llvm-ld has to be
told to -disable-internalize (basically it needs to keep all the
external symbols still available for dynamic linking) and then the
Perl test suite passes except for one error relating to sdbm (output
below). Without this fix the results are faster, but of course the
test suite fails when loading XS code. I suppose whatever it takes
to build a static perl could fix this but I haven't actually tried.

Apple's iPhone SDK ships with an llvm-gcc that does the linking part
automatically but exhibits some breakage. I filed a bug report, and
once they fix it theoretically you could get the same speed
improvements by using llvm-gcc -O4 and changing nothing else.

The steps to run this are replacing ld and cc with the attached
script, and making sure that ar is llvm-ar. I couldn't get this to
work consistently without editing myself (Configure didn't
respect changing ar or ld, I don't know what the right solution is).

And now for the bad news:

	malloc: *** error for object 0x200f07: Non-aligned pointer being
	*** set a breakpoint in malloc_error_break to debug
	Use of uninitialized value $Dfile in stat at
	../ext/SDBM_File/t/sdbm.t line 47.
	Use of uninitialized value $mode in bitwise and (&) at
	../ext/SDBM_File/t/sdbm.t line 49.
	perl(83924) malloc: *** error for object 0x200f67: Non-aligned
	pointer being freed
	*** set a breakpoint in malloc_error_break to debug

I haven't looked into this yet. This repeats with several other DBM
related tests, but other than that the whole test suite passes.

The iPhone SDK llvm-gcc -O4 compilation exhibits a few other test
failures.

And lastly, the future directions:

I hope to embed LLVM's bytecode loading and JIT support in the perl
executable, and patch XSLoader and DynaLoader to support loading of
LLVM bytecode, allowing LLVM based XS modules (could be interesting
for PAR like efforts, not just for optimization), and to retain the
bytecode output of compiling Perl itself so that it's also available
for the JIT.

When I have the opcode definitions (pp_*) available as llvm bytecode
functions I want to try and emit very naive threaded bytecode from
the optree on a per subroutine basis, and transform these
subroutines to XSUBs with the function pointer returned from the
llvm JIT.

For example, the body of sub { $x + 3 } would become similar to the
definition of:

	/* PL_op == cv->START; the nextstate op*/
	PL_op = pp_nextstate(aTHX);
	PL_op = pp_padsv(aTHX);
	PL_op = pp_const(aTHX);
	PL_op = pp_add(aTHX);
	return pp_leave(aTHX);

assuming that all the op->pp_addr == PL_pp_addr[op->type]. Hopefully
LLVM will be able to perform interprocedural optimizations between
the definitions of the various pp_*.
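
To make the difference between the normal op dispatch loop and the
unrolled form concrete, here is a minimal, self-contained toy model in
C. It is only a sketch: every name prefixed with toy_ is hypothetical,
plain longs stand in for SVs, and none of this is perl's real OP or
pp_* machinery. It shows the same three ops for "$x + 3" run once
through an indirect-call loop and once as a straight sequence of
direct calls.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: hypothetical stand-ins, not perl's real OP or pp_* API. */
typedef struct toy_op toy_op;
typedef toy_op *(*toy_pp)(void);

struct toy_op {
    toy_pp  pp_addr;  /* function that executes this op */
    toy_op *next;     /* op to run afterwards, NULL at the end */
    long    value;    /* payload, used only by toy_pp_const */
};

static long    toy_stack[16];
static size_t  toy_sp;     /* number of values currently on toy_stack */
static toy_op *toy_PL_op;  /* current op, playing the role of PL_op */
static long    toy_pad_x;  /* stands in for the lexical $x */

static void toy_push(long v) { assert(toy_sp < 16); toy_stack[toy_sp++] = v; }
static long toy_pop(void)    { assert(toy_sp > 0);  return toy_stack[--toy_sp]; }

static toy_op *toy_pp_padsv(void) { toy_push(toy_pad_x);        return toy_PL_op->next; }
static toy_op *toy_pp_const(void) { toy_push(toy_PL_op->value); return toy_PL_op->next; }
static toy_op *toy_pp_add(void)
{
    long right = toy_pop();
    long left  = toy_pop();
    toy_push(left + right);
    return toy_PL_op->next;
}

/* The ops for "$x + 3", linked in execution order. */
static toy_op toy_add   = { toy_pp_add,   NULL,       0 };
static toy_op toy_const = { toy_pp_const, &toy_add,   3 };
static toy_op toy_padsv = { toy_pp_padsv, &toy_const, 0 };

/* Interpreter style: one indirect call per op, chosen at run time. */
long toy_run_interpreted(long x)
{
    toy_pad_x = x;
    toy_sp = 0;
    toy_PL_op = &toy_padsv;
    while (toy_PL_op)
        toy_PL_op = toy_PL_op->pp_addr();
    return toy_pop();
}

/* Threaded style: the same ops as direct calls in a fixed order,
 * which is the shape an optimizer could inline across. */
long toy_run_threaded(long x)
{
    toy_pad_x = x;
    toy_sp = 0;
    toy_PL_op = &toy_padsv;
    toy_PL_op = toy_pp_padsv();
    toy_PL_op = toy_pp_const();
    toy_PL_op = toy_pp_add();
    return toy_pop();
}
```

Both toy_run_interpreted(4) and toy_run_threaded(4) compute 4 + 3. The
only difference is that the second form names its callees statically,
so the compiler can see the whole sequence at once.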

After that is in place the bytecode emitter can be extended, by
refactoring pp_* into smaller, non stack based functions, that are
not as reliant on the global environment, so that the above code can
actually become more like:

	SV *tmp1 = opcode_padsv(aTHX_ pad_op);
	stack_push(opcode_add(tmp1, const_op_sv));
	free_tmp(tmp1); /* free if it has a PV */
	PL_op = next_op;

Allowing simple ops to avoid the overhead of pushing/popping data on
the stack, mortalizing, etc.
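
As a rough illustration of what avoiding the stack buys, the following
toy C sketch computes the same sum twice: once with every operand
pushed and popped through an array, once with operands passed and
returned as plain values. All names are hypothetical and longs stand
in for SVs. Real ops would also have to deal with reference counts
and mortal temporaries, which this sketch ignores.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: longs stand in for SVs; every name here is hypothetical. */
static long   toy_stack[16];
static size_t toy_sp;  /* number of values currently on toy_stack */

static void toy_push(long v) { assert(toy_sp < 16); toy_stack[toy_sp++] = v; }
static long toy_pop(void)    { assert(toy_sp > 0);  return toy_stack[--toy_sp]; }

/* Stack-based ops: every operand and result travels through memory. */
static void stack_padsv(long pad_value) { toy_push(pad_value); }
static void stack_const(long value)     { toy_push(value); }
static void stack_add(void)
{
    long right = toy_pop();
    long left  = toy_pop();
    toy_push(left + right);
}

/* Value-passing op: operands arrive as arguments, the result is returned. */
static long value_add(long left, long right) { return left + right; }

/* "$x + 3" the stack-based way: three pushes and three pops in total. */
long add_via_stack(long x)
{
    toy_sp = 0;
    stack_padsv(x);
    stack_const(3);
    stack_add();
    return toy_pop();
}

/* "$x + 3" the value-passing way: no stack traffic at all. */
long add_via_values(long x)
{
    long tmp1 = x;  /* what a non stack based padsv would hand back */
    return value_add(tmp1, 3);
}
```

add_via_stack(4) and add_via_values(4) give the same result, but only
the second is simple enough for a compiler to reduce to a single
addition.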

Lastly, I hope to base this emitter on Runops::Trace's recently
added features to get trace caching like compilation for just the
hotpath, to avoid unnecessary JIT optimization of seldom used
code.
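
The general idea of compiling only what is hot can be sketched with a
much simpler mechanism than trace caching: a per-subroutine call
counter with a threshold. The C below is only that simplified
stand-in. It does not use or model Runops::Trace, all names are
hypothetical, and the "compiled" body is an ordinary C function
standing in for JIT output.

```c
#include <assert.h>

/* Hypothetical sketch: none of these names come from perl or Runops::Trace. */
typedef long (*toy_body)(long);

typedef struct {
    toy_body      body;      /* implementation currently in use */
    toy_body      fast;      /* variant to switch to once the sub is hot */
    unsigned long calls;     /* calls counted while still on the slow path */
    int           compiled;  /* nonzero once body has been swapped */
} toy_sub;

enum { TOY_HOT_THRESHOLD = 3 };

static long slow_add3(long x) { return x + 3; }  /* stands in for interpreting */
static long fast_add3(long x) { return x + 3; }  /* stands in for JIT output */

/* Build a sub that starts on the slow path. */
toy_sub toy_make_sub(void)
{
    toy_sub s = { slow_add3, fast_add3, 0, 0 };
    return s;
}

/* Call the sub, switching to the fast body once it has been called
 * TOY_HOT_THRESHOLD times. Seldom used subs never pay for compilation. */
long toy_call(toy_sub *sub, long arg)
{
    if (!sub->compiled && ++sub->calls >= TOY_HOT_THRESHOLD) {
        sub->body = sub->fast;
        sub->compiled = 1;
    }
    return sub->body(arg);
}
```

With the threshold set to 3, the first two calls go through the slow
body and the third call onwards uses the fast one. A real trace-based
scheme would record the ops actually executed instead of counting
whole subroutine calls.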


P.S. my new favourite command is make clean -j50 (yes, fifty).

  Yuval Kogman  0xEBD27418
