Front page | perl.perl6.internals |
Postings from December 2001
Re: JIT me some speed!
From: Nicholas Clark
December 26, 2001 08:59
Re: JIT me some speed!
Message ID: 20011226164034.A11082@Bagpuss.unfortu.net
On Fri, Dec 21, 2001 at 12:03:51AM +0000, Tom Hughes wrote:
> In message <firstname.lastname@example.org>
> Dan Sugalski <email@example.com> wrote:
> > To run a program with the JIT, pass test_parrot the -j flag and watch it
> > scream. Well, scream if you're on x86 Linux or BSD (I get a speedup on
> > mops.pbc of 35x) but it's a darned good place to start.
> It does seem to be quite impressively fast. Faster even than the
> compiled version of mops on my machine...
> It looks like it is going to need some work before it can work for
> other instruction sets though, at least for RISC systems where the
> operands are typically encoded with the opcode as part of a single
> word and the range of immediate constants is often restricted.
> I'm thinking it will need some way of indicating field widths and
> shifts for the operands and opcode so they can be merged into an
> instruction word and also some way of handling a constant pool so
> that arbitrary addresses can be loaded using PC relative loads.
I've been thinking about this, and worrying somewhat if I'm in danger of trying
to make a something run before it can walk, but in trying to think how to make
an ARM jit. I don't know how typical ARM is of RISC CPUs, but it's probably
enough food for thought. I may have made some minor technical errors in this:
All our JIT code would be running in user mode on ARM, so we only get to play
with the user mode bank of 16 "general purpose" registers, named r0-r15
"general purpose" in quotes because r15 is the program counter, one (now always
r13) is the stack pointer and the ABI specifies uses for some others. I
assume that we're not generating shared library code (if we are, the ABI
claims r9 from us)
The ABI needs 6 registers to call another function, a called function has the
first 4 integer arguments passed in using r0-r3, and returns a result in r0.
However in the body of a function we can arrange to use 2 of those reserved 6,
so we have 12 registers for our use, of which r4-r9 will be preserved across
Memory loads and stores have to be done relative to a base register (which can
be r15, the PC), +- a constant (in the range 0-4095) or another register
(optionally shifted by a constant).
There is no direct way to load a register with an arbitrary 32 bit constant.
Registers can be loaded with an immediate constant stored in 12 bits as
<8 bit value> rotate right 2 * <4 bit value>
(So you can load 0xFF in one instruction (4 bytes), for something like 0xFFF
it's best to express it in 2 instructions (8 bytes) as load 0xFF, add 0xF00,
and for complex values stuff it in a data word within 4092 bytes of the PC and
load it PC relative, which is also 8 bytes, but possibly slower)
The upshot of this is that loading integer register values by substituting
addresses into template code is not an easy way to go. Much better would be
to load up r4 to r7 with the base addresses of the integer, floating point,
string and PMC registers and do a "load integer register 16" into r0 as
ldr r0, [r4, #60] @ parrot registers start at 1, don't they?
[maybe I am already in danger of trying to run before I can walk, as I suspect
that the address substitution can be shoe-horned as
ldr r0, [pc, #-4] @ oops - may have got the pipeling wrong here
.word XXXX @ substitute your address here
..L1 ldr r0, [r0]
but that's something like 5 CPU cycles rather than 2.]
I'm in danger of wanting to run before I can walk. And maybe I should shut up
for now as I'm describing a possible next generation JIT, rather than this one:
The reason I'd feel I'd naturally want to define Parrot_set_i_i not as 1
code snippet, but as
load parrot integer register (from memory) into CPU register n
# do nothing here
store CPU register n to parrot integer register (to memory)
is because even with a more complex JIT syntax that lets me translate
set I2, I4
ldr r0, [r4, #12]
str r0, [r4, #4]
with r4-r7 loaded with those base addresses, I'm wasting most of the ARM CPU:
I've got r1-r3, r8, r9, r12 and r14 that I'm not even using.
I've got to be careful in what I'm suggesting here:
I'm *not* suggesting that I (or anyone else) immediately writes a JIT that
maps CPU registers to parrot registers and attempts to keep values in the CPU
I *am* suggesting that if Parrot_set_i_i is defined as
input: load parrot integer register (from memory) into CPU register n
output: store CPU register n to parrot integer register (to memory)
where the JIT provides that load and store code (rather than each op)
then the work that goes into writing body code for the ops is still useful
and usable with a second (third?) generation JIT that does know how to combine
2 or more adjacent Parrot ops.
Possibly also useful thoughts:
For things like generating constants for ARM, it would be useful to be able
to specify alternative code generation snippets, with some sort of constraint
on when they should be used.
So the first to be tried would be
mov r0, #XXX
with the constraint that XXX must be expressible as
<8 bit value> rotate right 2 * <4 bit value>
mvn r0, #XXX
where XXX must be logical not of <8 bit value> rotate right 2 * <4 bit value>
and then a third that copes with arbitrary 32 bit values.
Note that currently it's the assembler at assemble time that works out how to
decompose XXX into the 8 bit constant and 4 bit shift. Our JIT code generator
is going to have to be able to do this.
The constraints might get quite complicated - it may be useful if a
snippet of parrot assembler could be used to express them.
The code generation itself could be quite complex - it might be useful if
as an alternative to the existing fast fixed system one could also define for
any parrot op a snippet of parrot assembler to call to try to generate code.
I presume that we can JIT the code for the JIT (running the JITting code as
interpreted code while building the JIT to avoid infinite recursion), and that
if that's still not fast enough we can write custom parrot ops in C for the
(not sure what API would be needed, sorry, but I suspect it's going to be
important to distinguish between an empty string result (0 bytes of code will
do the job) and an undef result [constraint failure, try the next alternative].
And it may be useful to also return a "quality" result to allow a -O2 type
optimiser to try different alternatives and then use the best)
In the few times I've written inline assembler for gcc, the three things I've
missed most are
1: being able to ask for temporary registers (that are neither input nor
output, and don't need to be explicitly initialised to zero or used in a
dummy way to avoid compiler warnings)
2: being able to specify constraints about which input registers can't be the
same as which output registers.
3: being able to specify alternatives, when I knew I could code the same
assembler in 2 ways depending on whether (say) the compiler's optimiser
wanted input and output in the same CPU register, or whether it would
benefit from them being different, and so getting a "free" copy into