Front page | perl.perl6.internals |
Postings from August 2001
Re: Opcode Dispatch
Thread Previous
|
Thread Next
From:
Bryan C . Warnock
Date:
August 5, 2001 22:48
Subject:
Re: Opcode Dispatch
Message ID:
0108060147170A.29900@wakko.idiocity.nut
On Saturday 04 August 2001 03:21 pm, Dan Sugalski wrote:
> At 01:04 PM 7/30/2001 -0400, Bryan C. Warnock wrote:
> >I'll also rework my code generator and the logic to ramp it up to four or
> >five hundred opcodes, which would be a little better test, I think.
> >Anything else I should add or tweak?
>
> See what happens if you do a mixed switch/indirect dispatch table. I think
> we may stuff all the non-overridable opcodes (flow control, register ops,
> stack manipulations, and suchlike things) into the switch as entries with
> the rest as indirect calls in the default fallthrough. That'd get us the
> faster switch dispatch and still leave the flexibility of indirect
> functions.
Okay, next set of results.
508 opcodes, arranged in a primary set of 255, and a secondary set of 253.
The primary set contained the escape opcode (opcode 0), the null opcode
(opcode 254), and the program termination opcode (opcode 255). The
secondary set was mapped onto opcodes 1 - 254. (The opcodes themselves
mapped onto the same set of 255 functions - the data, however, precluded
calling the escape opcode from within an escaped opcode, and there was only
one termination opcode present.) Each of 254 distinct functions were one of
7 randomly generated code sets, each of which consisted of code that was
essential to the final output. Functions 0 and 255 handled the secondary
dispatch and the program termination, respectively.
The data was pseudo-randomly generated, with approximately 25% of the
opcodes being an escape sequence (with the escaped opcode being
pseudo-randonly generated from the escape opcodes), and, due to a typo, null
ops occurred no more often than any other opcode.
The "program" was 80,000 bytes long, and for each run this "program" was
read in, then "run" 4,000 times. The particular iteration was displayed to
the terminal. There was no discernable delay for either I/O operation.
(Even if there were, it's moot, as we're comparing relative times, and these
would be fixed.
For each set of tests, the benchmark was compiled for both debugging, and
gcc's -O2 optimizations. The program was then run five times, and the
average user time of the program was recorded. (System time was negligible.)
For tests that provided unexpected results, the run was repeated.
The two systems were:
Athlon 1000MHz, 512 MB RAM, running Linux Mandrake 8.0, using gcc 2.96
(with -586 instructions).
Sun Blade 100, UltraSPARC-IIe at 502MHz, 128 MB RAM, running Solaris 8,
using gcc 2.95.2.
Tests on the Athlon use 49% of the CPU. Tests on the Sun used 99%.
The following loop configurations were tested:
SWITCH/SWITCH: Both the primary and secondary opcodes were dispatched via
their own switch only.
LOOKUP/LOOKUP: Both the primary and secondary opcodes were dispatched via
one lookup table. [1]
SWITCH/LOOKUP: The primary opcodes were dispatched via a switch only, while
the secondary opcodes were dispatched via the lookup table.
LOOKUP/SWITCH: The primary opcodes were dispatched via the lookup table,
while the secondary opcodes were dispatched via a switch.
MIXED/SWITCH: The primary opcodes were dispatched by a non-contiguous
switch/lookup combination. [2] The secondary opcodes were dispatched via a
single switch.
MIXED/LOOKUP: The primary opcodes were dispatched by a non-contiguous
switch/lookup combination. [2] The secondary opcodes were dispatched via
the lookup table.
BLOCKED-MIXED/LOOKUP/NULL_OPS: The primary opcodes were dispatched by a
contiguous switch/lookup combination. [3] The secondary opcodes were
dispatched via the lookup table. The number of null opcodes was increased
to approximate 7% of the program.
[1] 256 pointers to 256 functions. The primary and secondary dispatches
used the same table. This could potentially skew the results. [Addendum: I
hand-tweaked the code to make it a single table of 512 pointers to 512
functions. The secondary lookup as its on pointer to offset 256.
Rerunning the LL tests actually resulted in a slight improvement across the
board.]
[2] 16 opcodes were fixed for switch dispatch - the escape, the null, and
the termination opcodes, and 13 randomly chosen ones. The default case
resulted in an opcode lookup. (Unfortunately, this results in an extra test
and jump before the opcode lookup. I could not come up with a more
efficient approach to handle the mixed case, however.)
[3] 16 opcodes were fixed switch dispatch - opcodes 0 (escape) though 13,
and, unfortunately, 254 and 255 (since we relied on their specific
functions). The default case resulted in an opcode lookup. (Unfortunately,
this results in an extra test and jump before the opcode lookup. I could
not come up with a more efficient approach to handle the mixed case,
however.)
The Results (times given in user seconds, lower is better)
x86 SPARC
-g -O2 -g -O2
+--------------+--------------+
SS| 16.2 9.8 | 53.5 22.8 |
LL| 14.4 8.6 | 54.2 20.6 |
SL| 15.4 9.2 | 51.7 19.1 |
LS| 15.0 9.2 | 54.1 22.4 |
MS| 16.8 10.7 | 51.3 22.5 |
ML| 15.5 10.6 | 49.8 19.1 |
BMLN| 15.4 10.9 | 46.8 17.9 |
+--------------+--------------+
LOOKUP/LOOKUP was the fastest approach on the x86. The mixed switch and
lookup method provided the worst performance.
The mixed methods (where a switch and memory lookups were both used) were
the fastest on the SPARC, particularly once the switch/lookup combination
was restructured.)
For both machines, the fastest method was ~14% and ~22% faster than the
slowest, for unoptimized and optimized code, respectively.
Conclusions.
None. It's still too early, particularly because it's a loop that really
goes nowhere and does nothing. Event dispatch and the actual opcode
processing will almost certainly flatten out the time differentials.
However, even at 75% normalization, that's still a 3% to 4% performance
difference between the different methods, and that may be worth at least
consideration for having a pluggable dispatch for different architectures.
What may also need to be considered is what the expected growth of opcodes
may actually be in future versions of Perl. In order to maintain upwards
compatability, future opcodes will need to be added around the current set.
If the initial set fills the entire primary table, for instance, then any
future opcodes will need to buried behind an escape code. It may be better
to reserve space in the primary block for future use. In addition, if the
number of opcodes grows beyond 511, you'll need to extend three levels deep
for dispatch, unless you allocate a small block on the primary table for
multiple escape characters.
Anything else to consider? Anything faulty in my approach or conclusions?
--
Bryan C. Warnock
bwarnock@capita.com
Thread Previous
|
Thread Next