
Re: Opcode Dispatch

From: Bryan C. Warnock
Date: August 5, 2001 22:48
Subject: Re: Opcode Dispatch
Message ID: 0108060147170A.29900@wakko.idiocity.nut
On Saturday 04 August 2001 03:21 pm, Dan Sugalski wrote:
> At 01:04 PM 7/30/2001 -0400, Bryan C. Warnock wrote:
> >I'll also rework my code generator and the logic to ramp it up to four or
> >five hundred opcodes, which would be a little better test, I think.
> >Anything else I should add or tweak?
>
> See what happens if you do a mixed switch/indirect dispatch table. I think
> we may stuff all the non-overridable opcodes (flow control, register ops,
> stack manipulations, and suchlike things) into the switch as entries with
> the rest as indirect calls in the default fallthrough. That'd get us the
> faster switch dispatch and still leave the flexibility of indirect
> functions.

Okay, next set of results.

508 opcodes, arranged in a primary set of 255, and a secondary set of 253.  
The primary set contained the escape opcode (opcode 0), the null opcode 
(opcode 254), and the program termination opcode (opcode 255).  The 
secondary set was mapped onto opcodes 1 - 254.  (The opcodes themselves 
mapped onto the same set of 255 functions - the data, however, precluded 
calling the escape opcode from within an escaped opcode, and there was only 
one termination opcode present.)  Each of the 254 distinct functions was one 
of 7 randomly generated code sets, each of which consisted of code essential 
to the final output.  Functions 0 and 255 handled the secondary dispatch and 
the program termination, respectively.
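
For concreteness, a rough sketch of the opcode numbering just described 
(hypothetical names; this illustrates only the layout and the escape 
mechanism, not any particular dispatch strategy):

    /* Hypothetical sketch of the opcode numbering described above. */
    enum {
        OP_ESCAPE = 0,     /* next byte selects an opcode from the secondary set */
        /* 1 .. 253: ordinary primary opcodes */
        OP_NULL   = 254,   /* null opcode */
        OP_HALT   = 255    /* program termination */
    };
    /* An escaped opcode occupies two bytes: 0x00 followed by a byte in
       1..254, which is mapped onto the secondary set of opcodes. */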

The data was pseudo-randomly generated, with approximately 25% of the 
opcodes being an escape sequence (with the escaped opcode being 
pseudo-randomly generated from the escaped opcodes), and, due to a typo, 
null ops occurred no more often than any other opcode.

The "program" was 80,000 bytes long, and for each run this "program" was 
read in, then "run" 4,000 times.  The particular iteration was displayed to 
the terminal.  There was no discernable delay for either I/O operation.  
(Even if there were, it's moot, as we're comparing relative times, and these 
would be fixed.
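
For reference, the rough shape of the harness (a hypothetical 
reconstruction, not the actual benchmark code; error checking omitted):

    #include <stdio.h>
    #include <stdlib.h>

    extern void run(const unsigned char *prog);   /* dispatch loop under test */

    int main(void)
    {
        unsigned char *prog = malloc(80000);
        FILE *f = fopen("program.bin", "rb");     /* hypothetical file name */
        int i;

        fread(prog, 1, 80000, f);                 /* read the 80,000-byte "program" once */
        fclose(f);
        for (i = 0; i < 4000; i++) {
            printf("%d\n", i);                    /* iteration number shown on the terminal */
            run(prog);                            /* one pass over the whole "program" */
        }
        free(prog);
        return 0;
    }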

For each set of tests, the benchmark was compiled both for debugging and 
with gcc's -O2 optimizations.  The program was then run five times, and the 
average user time of the program was recorded.  (System time was negligible.)
For tests that provided unexpected results, the run was repeated.

The two systems were:
    Athlon 1000MHz, 512 MB RAM, running Linux Mandrake 8.0, using gcc 2.96
      (with -586 instructions).
    Sun Blade 100, UltraSPARC-IIe at 502MHz, 128 MB RAM, running Solaris 8,
      using gcc 2.95.2.

Tests on the Athlon used 49% of the CPU.  Tests on the Sun used 99%.

The following loop configurations were tested:

SWITCH/SWITCH: Both the primary and secondary opcodes were dispatched via 
their own switch only.

LOOKUP/LOOKUP:  Both the primary and secondary opcodes were dispatched via 
one lookup table. [1]

SWITCH/LOOKUP:  The primary opcodes were dispatched via a switch only, while 
the secondary opcodes were dispatched via the lookup table.

LOOKUP/SWITCH:  The primary opcodes were dispatched via the lookup table, 
while the secondary opcodes were dispatched via a switch.

MIXED/SWITCH: The primary opcodes were dispatched by a non-contiguous 
switch/lookup combination.  [2]  The secondary opcodes were dispatched via a 
single switch.

MIXED/LOOKUP: The primary opcodes were dispatched by a non-contiguous 
switch/lookup combination.  [2]  The secondary opcodes were dispatched via 
the lookup table.

BLOCKED-MIXED/LOOKUP/NULL_OPS: The primary opcodes were dispatched by a 
contiguous switch/lookup combination.  [3]  The secondary opcodes were 
dispatched via the lookup table.  The number of null opcodes was increased 
to approximate 7% of the program.

[1] 256 pointers to 256 functions.  The primary and secondary dispatches 
used the same table.  This could potentially skew the results.  [Addendum: I 
hand-tweaked the code to make it a single table of 512 pointers to 512 
functions, with the secondary lookup using its own pointer to offset 256.  
Rerunning the LL tests actually resulted in a slight improvement across the 
board.]

[2] 16 opcodes were fixed for switch dispatch - the escape, the null, and 
the termination opcodes, and 13 randomly chosen ones.  The default case 
resulted in an opcode lookup.  (Unfortunately, this results in an extra test 
and jump before the opcode lookup.  I could not come up with a more 
efficient approach to handle the mixed case, however.)

[3] 16 opcodes were fixed for switch dispatch - opcodes 0 (escape) through 
13, and, unfortunately, 254 and 255 (since we relied on their specific 
functions).  The default case resulted in an opcode lookup.  (Unfortunately, 
this results in an extra test and jump before the opcode lookup.  I could 
not come up with a more efficient approach to handle the mixed case, 
however.)
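
For concreteness, a minimal sketch (hypothetical names and types, not the 
actual test code) of the two basic loop shapes being compared.  The pure 
LOOKUP loop is just the indirect call through the table; the pure SWITCH 
loop is the same switch with a case for every opcode and no table; the MIXED 
loop keeps a handful of fixed cases and falls through to the table in the 
default, which is where the extra test and jump mentioned in [2] and [3] 
comes from.

    /* Hypothetical sketch of the dispatch loop shapes compared here. */
    typedef const unsigned char *(*op_fn)(const unsigned char *pc);
    extern op_fn op_table[256];           /* function-pointer table, as in [1] */

    /* LOOKUP: every opcode goes through the table. */
    void run_lookup(const unsigned char *pc)
    {
        while (*pc != 255)                /* opcode 255 = program termination */
            pc = op_table[*pc](pc + 1);   /* entry 0 (escape) does the secondary dispatch */
    }

    /* MIXED: a few fixed opcodes in the switch, the rest through the table. */
    void run_mixed(const unsigned char *pc)
    {
        unsigned char op;
        for (;;) {
            op = *pc++;
            switch (op) {
            case 0:   pc = op_table[0](pc); break;   /* escape -> secondary dispatch */
            case 254:                       break;   /* null op */
            case 255: return;                        /* program termination */
            /* ... the other fixed cases ... */
            default:  pc = op_table[op](pc);         /* extra test and jump, then the
                                                        indirect call, per [2]/[3] */
            }
        }
    }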

The Results (times given in user seconds, lower is better)

           x86           SPARC    
        -g    -O2      -g    -O2
    +--------------+--------------+
  SS|  16.2   9.8  |  53.5  22.8  |
  LL|  14.4   8.6  |  54.2  20.6  |
  SL|  15.4   9.2  |  51.7  19.1  |
  LS|  15.0   9.2  |  54.1  22.4  |
  MS|  16.8  10.7  |  51.3  22.5  |
  ML|  15.5  10.6  |  49.8  19.1  |
BMLN|  15.4  10.9  |  46.8  17.9  |
    +--------------+--------------+

LOOKUP/LOOKUP was the fastest approach on the x86.  The mixed switch and 
lookup method provided the worst performance.

The mixed methods (where a switch and memory lookups were both used) were 
the fastest on the SPARC, particularly once the switch/lookup combination 
was restructured.

For both machines, the fastest method was ~14% and ~22% faster than the 
slowest, for unoptimized and optimized code, respectively.

Conclusions.

None.  It's still too early, particularly because it's a loop that really 
goes nowhere and does nothing.  Event dispatch and the actual opcode 
processing will almost certainly flatten out the time differentials.  
However, even at 75% normalization, that's still a 3% to 4% performance 
difference between the methods, which may make a pluggable dispatch for 
different architectures at least worth considering.

What may also need to be considered is the expected growth in the number of 
opcodes in future versions of Perl.  In order to maintain upward 
compatibility, future opcodes will need to be added around the current set.
If the initial set fills the entire primary table, for instance, then any 
future opcodes will need to be buried behind an escape code.  It may be 
better to reserve space in the primary block for future use.  In addition, 
if the number of opcodes grows beyond 511, you'll need to extend three 
levels deep for dispatch, unless you allocate a small block in the primary 
table for multiple escape characters.
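
(Rough arithmetic: with a single escape opcode, a 256-entry primary table 
can address at most 255 + 256 = 511 opcodes in two levels of dispatch, which 
is where the 511 ceiling comes from.  Reserving, say, four primary slots as 
escape prefixes instead raises that to 252 + 4 * 256 = 1276, still only two 
levels deep.)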

Anything else to consider? Anything faulty in my approach or conclusions?

-- 
Bryan C. Warnock
bwarnock@capita.com
