Front page | perl.perl6.internals |
Postings from July 2002
ARM Jit v2
From: Nicholas Clark
Date: July 29, 2002 15:06
Subject: ARM Jit v2
Message ID: 20020729220305.GC354@Bagpuss.unfortu.net
Here's a very minimal ARM jit framework. It does work (at least as far as
passing all 10 t/op/basic.t subtests, and running mops.pbc).
As you can see from the patch, all it does is implement the end and noop ops.
Every other op is still run by calling its C function. Interestingly, JITing
like this is slower than computed goto:
computed goto:
$ ./parrot examples/assembly/mops.pbc
Iterations: 100000000
Estimated ops: 200000000
Elapsed time: 37.209835
M op/s: 5.374923
no computed goto:
$ ./parrot -g examples/assembly/mops.pbc
Iterations: 100000000
Estimated ops: 200000000
Elapsed time: 71.245085
M op/s: 2.807211
JIT:
$ ./parrot -j examples/assembly/mops.pbc
Iterations: 100000000
Estimated ops: 200000000
Elapsed time: 53.474880
M op/s: 3.740074
JIT with ARM_K_BUG, to generate code that doesn't tickle the page faulting
related bug in the K StrongARM:
$ ./parrot -j examples/assembly/mops.pbc
Iterations: 100000000
Estimated ops: 200000000
Elapsed time: 56.142425
M op/s: 3.562368
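
As a rough cross-check of the timings above: the M op/s figure is simply the estimated op count divided by the elapsed time. The sketch below recomputes it and the ratios between runs. The helper names mops and slowdown are made up for illustration and are not part of Parrot.

```c
#include <stdio.h>

/* Millions of ops per second, computed the way mops.pbc reports it:
   estimated ops divided by elapsed seconds, scaled to millions. */
double mops(double estimated_ops, double elapsed_seconds)
{
    return estimated_ops / elapsed_seconds / 1e6;
}

/* Ratio of two elapsed times: how many times longer run A took than
   run B for the same amount of work. */
double slowdown(double elapsed_a, double elapsed_b)
{
    return elapsed_a / elapsed_b;
}

/* Print one comparison line, e.g. for pasting into a report. */
void report(const char *label, double elapsed_a, double elapsed_b)
{
    printf("%s: %.2fx\n", label, slowdown(elapsed_a, elapsed_b));
}
```

Fed the elapsed times quoted above, slowdown() gives about 1.44 for JIT against computed goto, about 0.75 for JIT against no computed goto, and about 1.05 for the ARM_K_BUG build against the plain JIT.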
I doubt that in its current form this is quite ready to go in. Points I'd
like to raise:
0: I've only implemented generator code fully for 1 class of instructions
(load/store multiple registers), partially for a second (load/store
single registers), and hard coded the minimal set of other things I
needed. I'll replace these with fully featured versions, now that I'm
happy that the concept works.
1: The most optimal code I could think of to call external functions sets
everything up by loading arguments into registers and the function address
into PC with a single load multiple instruction (plus setting the return
address in the link register, by using the link register as the base
register for the load). All that in 1 instruction, plus a second to prime
LR for the load. (This is why I like it)
However, this is the form of instruction that can trigger bugs on the
(very early) K version StrongARMs. (if it page faults midway) Probably
the rest of the world doesn't have these (unless they have machines
dating from 1996 or so) but I do have one, so it is an important itch for
me. ARM_K_BUG is a symbol to define to generate code that cannot cause
the bug.
2: This code probably is the ARM assembler version of a JAPH, in that I've
not actually found the need (yet) to use any branch instructions. They
do exist! It's just that I find I can do it all so far with loads. :-)
3: The code as is issues casting warnings and 3 warnings about unprototyped
functions. (which I think can be static)
4: I'd really like the type of the pointer for the native code to be
machine chosen. char* isn't the most appropriate type for ARM code -
all instructions are word sized (32 bits) and must all be word aligned,
so I'd really like to be fabricating them in ints, and writing to an int*
in one blat.
5: The symbol TESTING was so that I could #include "jit_emit.h" in a test C
program to check my generator (by spitting a buffer out into a $file, and
then disassembling it with objdump -b binary -m arm -D $file).
6: ARMs with separate I and D caches need to sync them before running code.
(else it all goes SEGV shaped with really really weird backtraces)
I don't think there's any official Linux function wrapper round the
ARM Linux syscall necessary to do this, hence the function with the inline
assembler. I'm not sure if there is a better way to do this.
[optional .s file in the architecture's jit directory, which the jit
installer can copy if it finds?]
7: Debian define the archname on their perl as "arm", whereas building from
the source tree gets me armv4l (from uname) hence the substitution for
armv[34]l? down to arm. I do have a machine with an ARM3 here (which I
think would be armv2) but it is 14 years old, and doesn't currently have
Linux on it (or a compiler for RISC OS, and I'm not feeling up to
attempting a RISC OS port for parrot just to experiment with JITs)
It's probably quite feasible to make the JIT work on everything back to
the ARM2 (ARM1 was the prototype and I believe was never used in any
hardware available outside Acorn, and IIRC all ARM1 doesn't have is the
multiply instruction, so it could be done)
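
To make points 1, 4 and 5 concrete, here is a minimal sketch of what fabricating whole instruction words and writing them through a 32-bit pointer might look like. It is not the code in the patch: the names encode_ldm_stm, encode_mov, emit_call and dump_code are invented for illustration, and addresses are assumed to fit in 32 bits, as they do on ARM Linux of this era. The bit layout is the standard ARM block data transfer encoding (P = bit 24, U = bit 23, W = bit 21, L = bit 20).

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* LDM/STM addressing modes: bit 1 is P (pre/post), bit 0 is U (up/down). */
enum { LDM_DA = 0, LDM_IA = 1, LDM_DB = 2, LDM_IB = 3 };

#define COND_AL 0xEu  /* "always" condition, top nibble of the word */

/* Build one load/store multiple instruction as a single 32-bit word. */
uint32_t encode_ldm_stm(unsigned cond, int is_load, unsigned mode,
                        int writeback, unsigned base, unsigned regmask)
{
    assert(cond < 16 && mode < 4 && base < 16 && regmask <= 0xFFFFu);
    return ((uint32_t)cond << 28)
         | (0x4u << 25)                         /* bits 27-25 = 100 */
         | ((uint32_t)(mode >> 1) << 24)        /* P */
         | ((uint32_t)(mode & 1) << 23)         /* U */
         | ((uint32_t)(writeback != 0) << 21)   /* W */
         | ((uint32_t)(is_load != 0) << 20)     /* L */
         | ((uint32_t)base << 16)
         | regmask;
}

/* mov dest, src (unconditional, no shift). */
uint32_t encode_mov(unsigned dest, unsigned src)
{
    assert(dest < 16 && src < 16);
    return 0xE1A00000u | ((uint32_t)dest << 12) | src;
}

/* The call sequence from point 1 (the variant without ARM_K_BUG):
       mov   lr, pc          ; lr = address of the two data words
       ldmia lr!, {r0, pc}   ; r0 = arg0, pc = func, lr = return point
       <arg0> <func>
   Returns the next free slot in the buffer. */
uint32_t *emit_call(uint32_t *pc, uint32_t arg0, uint32_t func)
{
    *pc++ = encode_mov(14, 15);
    *pc++ = encode_ldm_stm(COND_AL, 1, LDM_IA, 1, 14, (1u << 0) | (1u << 15));
    *pc++ = arg0;
    *pc++ = func;
    return pc;
}

/* Write the buffer out in host byte order so that it can be inspected
   with objdump -b binary -m arm -D. Returns 0 on success. */
int dump_code(const char *path, const uint32_t *buf, size_t nwords)
{
    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return -1;
    size_t written = fwrite(buf, sizeof *buf, nwords, f);
    return (fclose(f) == 0 && written == nwords) ? 0 : -1;
}
```

Calling emit_call on a four-word buffer and passing the result to dump_code gives a file that can be fed to the objdump command from point 5. Building each instruction as one word also means the byte order is decided in a single place, at the store, rather than in every emit macro.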
Apart from all of that, the JIT version 2 looks much more flexible than
JIT version 1 - thanks Daniel.
I'll start writing some real JIT ops over the next few days, although
possibly only for the ops mops and life use :-)
[although I strongly suspect that JITting the ops the regexps compile down
to would be the first real world JIT priority. How fast would perl6 regexps
be with that?]
Oh, and prepare an acceptable version of this patch once people decide what
is acceptable.
Nicholas Clark
--
Even better than the real thing: http://nms-cgi.sourceforge.net/
--- /dev/null Mon Jul 16 22:57:44 2001
+++ jit/arm/core.jit Mon Jul 29 00:14:30 2002
@@ -0,0 +1,26 @@
+;
+; arm/core.jit
+;
+; $Id: core.jit,v 1.4 2002/05/20 05:32:58 grunblatt Exp $
+;
+
+Parrot_noop {
+ emit_nop(jit_info->native_ptr);
+}
+
+; ldmea fp, {r4, r5, r6, r7, fp, sp, pc}
+; but K bug Grr if I load pc direct.
+
+Parrot_end {
+ jit_info->native_ptr = emit_ldmstm (jit_info->native_ptr,
+ cond_AL, is_load, dir_EA, 0, 0,
+ REG11_fp,
+ reg2mask(4) | reg2mask(REG11_fp)
+ | reg2mask(REG13_sp)
+ #ifndef ARM_K_BUG
+ | reg2mask(REG15_pc));
+ #else
+ | reg2mask(REG14_lr));
+ emit_mov(jit_info->native_ptr, REG15_pc, REG14_lr);
+ #endif
+}
--- /dev/null Mon Jul 16 22:57:44 2001
+++ jit/arm/jit_emit.h Mon Jul 29 22:23:37 2002
@@ -0,0 +1,293 @@
+/*
+** jit_emit.h
+**
+** ARM (v3 and later - maybe this can easily be unified to v1)
+**
+** $Id: jit_emit.h,v 1.3 2002/07/04 21:32:12 mrjoltcola Exp $
+**/
+
+/* I'll use mov r0, r0 as my NOP for now. */
+
+typedef enum {
+ cond_EQ = 0x00,
+ cond_NE = 0x10,
+ cond_CS = 0x20,
+ cond_CC = 0x30,
+ cond_MI = 0x40,
+ cond_PL = 0x50,
+ cond_VS = 0x60,
+ cond_VC = 0x70,
+ cond_HI = 0x80,
+ cond_LS = 0x90,
+ cond_GE = 0xA0,
+ cond_LT = 0xB0,
+ cond_GT = 0xC0,
+ cond_LE = 0xD0,
+ cond_AL = 0xE0,
+/* cond_NV = 0xF0, */
+ cond_HS = 0x20,
+ cond_LO = 0x30
+} cond_t;
+
+typedef enum {
+ REG10_sl = 10,
+ REG11_fp = 11,
+ REG12_ip = 12,
+ REG13_sp = 13,
+ REG14_lr = 14,
+ REG15_pc = 15
+} arm_register_t;
+
+#define emit_nop(pc) emit_mov (pc, 0, 0)
+
+#define emit_mov(pc, dest, src) { \
+ *(pc++) = 0x00 | src; \
+ *(pc++) = dest << 4; \
+ *(pc++) = 0xA0; \
+ *(pc++) = cond_AL | 1; }
+
+#define emit_sub4(pc, dest, src) { \
+ *(pc++) = 0x04; \
+ *(pc++) = dest << 4; \
+ *(pc++) = 0x40 | src; \
+ *(pc++) = cond_AL | 2; }
+
+#define emit_add4(pc, dest, src) { \
+ *(pc++) = 0x04; \
+ *(pc++) = dest << 4; \
+ *(pc++) = 0x80 | src; \
+ *(pc++) = cond_AL | 2; }
+
+#define emit_dcd(pc, word) { \
+ *((int *)pc) = word; \
+ pc+=4; }
+
+#define reg2mask(reg) (1<<(reg))
+
+#define is_store 0x00
+#define is_load 0x10
+#define is_writeback 0x20
+#define is_caret 0x40 /* assembler syntax is ^ - load sets status flags in
+ USR mode, or load/store use user bank registers
+ in other mode. IIRC. */
+#define is_byte 0x40
+#define is_pre 0x01 /* pre index addressing. */
+#define is_post 0x00 /* post indexed addressing. ie arithmetic for free */
+
+/* multiple register transfer direction.
+ D = decrease, I = increase
+ A = after, B = before
+ or the stack notation
+ FD = full descending (the usual)
+ ED = empty descending
+ FA = full ascending
+ EA = empty ascending
+ values for stack notation are 0x10 | (ldm type) << 2 | (stm type)
+*/
+typedef enum {
+ dir_DA = 0,
+ dir_IA = 1,
+ dir_DB = 2,
+ dir_IB = 3,
+ dir_FD = 0x10 | (1 << 2) | 2,
+ dir_FA = 0x10 | (0 << 2) | 3,
+ dir_ED = 0x10 | (3 << 2) | 0,
+ dir_EA = 0x10 | (2 << 2) | 1
+} ldm_stm_dir_t;
+
+typedef enum {
+ dir_Up = 0x80,
+ dir_Down = 0x00
+} ldr_str_dir_t;
+
+char *
+emit_ldmstm(char *pc,
+ int cond,
+ int l_s,
+ ldm_stm_dir_t direction,
+ int caret,
+ int writeback,
+ int base,
+ int regmask) {
+ if ((l_s == is_load) && (direction & 0x10))
+ direction >>= 2;
+
+ *(pc++) = regmask;
+ *(pc++) = regmask >> 8;
+ /* bottom bit of direction is the up/down flag. */
+ *(pc++) = ((direction & 1) << 7) | caret | writeback | l_s | base;
+ /* binary 100x is code for stm/ldm. */
+ /* Top bit of direction is pre/post increment flag. */
+ *(pc++) = cond | 0x8 | ((direction >> 1) & 1);
+ return pc;
+}
+
+char *
+emit_ldrstr(char *pc,
+ int cond,
+ int l_s,
+ ldr_str_dir_t direction,
+ int pre,
+ int writeback,
+ int byte,
+ int dest,
+ int base,
+ int offset_type,
+ unsigned int offset) {
+
+ *(pc++) = offset;
+ *(pc++) = ((offset >> 8) & 0xF) | (dest << 4);
+ *(pc++) = direction | byte | writeback | l_s | base;
+ *(pc++) = cond | 0x4 | offset_type | pre;
+ return pc;
+}
+
+char *
+emit_ldrstr_offset (char *pc,
+ int cond,
+ int l_s,
+ int pre,
+ int writeback,
+ int byte,
+ int dest,
+ int base,
+ int offset) {
+ ldr_str_dir_t direction = dir_Up;
+#ifndef TESTING
+ if (offset > 4095 || offset < -4095) {
+ internal_exception(JIT_ERROR,
+ "Unable to generate offsets > 4095\n" );
+ }
+#endif
+ if (offset < 0) {
+ direction = dir_Down;
+ offset = -offset;
+ }
+ return emit_ldrstr(pc, cond, l_s, direction, pre, writeback, byte, dest,
+ base, 0, offset);
+}
+
+void Parrot_jit_dofixup(Parrot_jit_info *jit_info,
+ struct Parrot_Interp * interpreter)
+{
+ /* Todo. */
+}
+/* My entry code creates a stack frame:
+ mov ip, sp
+ stmfd sp!, {r4, fp, ip, lr, pc}
+ sub fp, ip, #4
+ Then store the first parameter (pointer to the interpreter) in r4.
+ mov r4, r0
+*/
+
+void
+Parrot_jit_begin(Parrot_jit_info *jit_info,
+ struct Parrot_Interp * interpreter)
+{
+ emit_mov (jit_info->native_ptr, REG12_ip, REG13_sp);
+ jit_info->native_ptr = emit_ldmstm (jit_info->native_ptr,
+ cond_AL, is_store, dir_FD, 0,
+ is_writeback,
+ REG13_sp,
+ reg2mask(4) | reg2mask(REG11_fp)
+ | reg2mask(REG12_ip)
+ | reg2mask(REG14_lr)
+ | reg2mask(REG15_pc));
+ emit_sub4 (jit_info->native_ptr, REG11_fp, REG12_ip);
+ emit_mov (jit_info->native_ptr, 4, 0);
+}
+
+/* I'm going to load registers to call functions in general like this:
+ adr r14, .L1
+ ldmia r14!, {r0, r1, r2, pc} ; register list built by jit
+ .L1: r0 data
+ r1 data
+ r2 data
+ <where ever> ; address of function.
+ .L2: ; next instruction - return point from func.
+
+ # here I'm going to do
+
+ mov r1, r4 ; current interpreter is arg 1
+ adr r14, .L1
+ ldmia r14!, {r0, pc}
+ .L1: address of current opcode
+ <where ever> ; address of function for op
+ .L2: ; next instruction - return point from func.
+*/
+
+/*
+XXX no.
+need to adr beyond:
+
+ mov r1, r4 ; current interpreter is arg 1
+ adr r14, .L1
+ ldmda r14!, {r0, ip}
+ mov pc, ip
+ .L1 address of current opcode
+ dcd <where ever> ; address of function for op
+ .L2: ; next instruction - return point from func.
+*/
+void
+Parrot_jit_normal_op(Parrot_jit_info *jit_info,
+ struct Parrot_Interp * interpreter)
+{
+ emit_mov (jit_info->native_ptr, 1, 4);
+#ifndef ARM_K_BUG
+ emit_mov (jit_info->native_ptr, REG14_lr, REG15_pc);
+#else
+ emit_add4 (jit_info->native_ptr, REG14_lr, REG15_pc);
+#endif
+ jit_info->native_ptr = emit_ldmstm (jit_info->native_ptr,
+ cond_AL, is_load, dir_IA, 0,
+ is_writeback,
+ REG14_lr,
+ reg2mask(0)
+#ifndef ARM_K_BUG
+ | reg2mask(REG15_pc)
+#else
+ | reg2mask(REG12_ip)
+#endif
+ );
+#ifdef ARM_K_BUG
+ emit_mov (jit_info->native_ptr, REG15_pc, REG12_ip);
+#endif
+ emit_dcd (jit_info->native_ptr, (int) jit_info->cur_op);
+ emit_dcd (jit_info->native_ptr,
+ (int) interpreter->op_func_table[*(jit_info->cur_op)]);
+}
+
+/* We get back address of opcode in bytecode.
+ We want address of equivalent bit of jit code, which is stored as an
+ address at the same offset in a jit table. */
+void Parrot_jit_cpcf_op(Parrot_jit_info *jit_info,
+ struct Parrot_Interp * interpreter)
+{
+ Parrot_jit_normal_op(jit_info, interpreter);
+
+ /* This is effectively the pseudo-opcode ldr - ie load relative to PC.
+ So offset includes pipeline. */
+ jit_info->native_ptr = emit_ldrstr_offset (jit_info->native_ptr, cond_AL,
+ is_load, is_pre, 0, 0,
+ REG14_lr, REG15_pc, 0);
+ /* ldr pc, [r14, r0] */
+ /* lazy. this is offset type 0, 0x000 which is r0 with zero shift */
+ jit_info->native_ptr = emit_ldrstr (jit_info->native_ptr, cond_AL,
+ is_load, dir_Up, is_pre, 0, 0,
+ REG15_pc, REG14_lr, 2, 0);
+ /* and this "instruction" is never reached, so we can use it to store
+ the constant that we load into r14 */
+ emit_dcd (jit_info->native_ptr,
+ ((long) jit_info->op_map) -
+ ((long) interpreter->code->byte_code));
+}
+
+/*
+ * Local variables:
+ * c-indentation-style: bsd
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ *
+ * vim: expandtab shiftwidth=4:
+ */
--- jit.c~ Tue Jul 23 19:18:41 2002
+++ jit.c Mon Jul 29 21:46:44 2002
@@ -128,6 +128,63 @@ optimize_jit(struct Parrot_Interp *inter
return optimizer;
}
+#ifdef ARM
+static void
+arm_sync_d_i_cache (void *start, void *end) {
+/* Strictly this is only needed for StrongARM and later (not sure about ARM8)
+ because earlier cores don't have separate D and I caches.
+ However there aren't that many ARM7 or earlier devices around that we'll be
+ running on. */
+#ifdef __linux
+#ifdef __GNUC__
+ int result;
+ /* swi call based on code snippet from Russell King. Description
+ verbatim: */
+ /*
+ * Flush a region from virtual address 'r0' to virtual address 'r1'
+ * _inclusive_. There is no alignment requirement on either address;
+ * user space does not need to know the hardware cache layout.
+ *
+ * r2 contains flags. It should ALWAYS be passed as ZERO until it
+ * is defined to be something else. For now we ignore it, but may
+ * the fires of hell burn in your belly if you break this rule. ;)
+ *
+ * (at a later date, we may want to allow this call to not flush
+ * various aspects of the cache. Passing '0' will guarantee that
+ * everything necessary gets flushed to maintain consistency in
+ * the specified region).
+ */
+
+ /* The value of the SWI is actually available as
+ __ARM_NR_cacheflush defined in <asm/unistd.h>, but quite how to
+ get that to interpolate as a number into the ASM string is beyond
+ me. */
+ /* I'm actually passing in exclusive end address, so subtract 1 from
+ it inside the assembler. */
+ __asm__ __volatile__ (
+ "mov r0, %1\n"
+ "sub r1, %2, #1\n"
+ "mov r2, #0\n"
+ "swi 0x9f0002\n"
+ "mov %0, r0\n"
+ : "=r" (result)
+ : "r" ((long)start), "r" ((long)end)
+ : "r0","r1","r2");
+
+ if (result < 0) {
+ internal_exception(JIT_ERROR,
+ "Synchronising I and D caches failed with errno=%d\n",
+ -result);
+ }
+#else
+#error "ARM needs to sync D and I caches, and I don't know how to embed assembler on this C compiler"
+#endif
+#else
+/* Not strictly true - on RISC OS it's OS_SynchroniseCodeAreas */
+#error "ARM needs to sync D and I caches, and I don't know how to on this OS"
+#endif
+}
+#endif
/*
** build_asm()
@@ -214,6 +271,9 @@ build_asm(struct Parrot_Interp *interpre
}
}
+#ifdef ARM
+ arm_sync_d_i_cache (jit_info.arena_start, jit_info.native_ptr);
+#endif
return (jit_f)jit_info.arena_start;
}
--- config/auto/jit.pl.orig Sat Jul 13 22:39:40 2002
+++ config/auto/jit.pl Mon Jul 29 00:08:22 2002
@@ -42,11 +42,14 @@ sub runstep {
$cpuarch = 'i386';
}
+ $cpuarch =~ s/armv[34]l?/arm/i;
+
Configure::Data->set(
archname => $archname,
cpuarch => $cpuarch,
osname => $osname,
);
+
my $jitarchname = "$cpuarch-$osname";
$jitarchname =~ s/i[456]86/i386/i;
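
On the question in point 6 about a better way to sync the caches: one possibility, assuming a toolchain that provides it (newer GCC and Clang do, though it postdates this patch), is the compiler builtin __builtin___clear_cache, which takes a begin and an exclusive end address. The sketch below is an untested alternative to the inline swi, not what the patch does, and the wrapper name sync_code_region is hypothetical.

```c
#include <stddef.h>

/* Hypothetical wrapper: make freshly written machine code in
   [start, start + len) visible to instruction fetch.
   Assumes a GCC-compatible compiler that has __builtin___clear_cache;
   on targets with coherent caches the builtin expands to nothing. */
void sync_code_region(void *start, size_t len)
{
#if defined(__GNUC__)
    char *begin = (char *)start;
    __builtin___clear_cache(begin, begin + len);
#else
#error "no known way to sync D and I caches with this compiler"
#endif
}
```

In build_asm() this would be called with the arena start and the number of bytes emitted so far, in place of arm_sync_d_i_cache. The builtin has no return value, so unlike the swi version there is no errno to report.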