perl.perl5.porters | Postings from October 2017

Re: full post: Porting/bench -- utf8 to code point speed up

From:
Dave Mitchell
Date:
October 30, 2017 10:51
Subject:
Re: full post: Porting/bench -- utf8 to code point speed up
Message ID:
20171030105119.GU3083@iabyn.com
On Sun, Oct 29, 2017 at 10:17:19PM -0600, Karl Williamson wrote:
> People who aren't really into the nitty gritty of instruction timing, such
> as myself, may have a hard time of getting to the bottom line of what the
> output signifies.  I suggested that we move the discussion to the
> perl5-porter list so that Dave could do that and anyone, not just me, could
> benefit from his knowledge.

Well, the benchmarks seem to indicate that this is a win in every
circumstance which matters. Presumably we don't care about the slowdown on
oversized chars.

> Key:
>     Ir   Instruction read
>     Dr   Data read
>     Dw   Data write
>     COND conditional branches
>     IND  indirect branches
>     _m   branch predict miss
>     _m1  level 1 cache miss
>     _mm  last cache (e.g. L3) miss
>     -    indeterminate percentage (e.g. 1/0)
> 
> The numbers represent relative counts per loop iteration, compared to
> blead at 100.0%.
> Higher is better: for example, using half as many instructions gives 200%,
> while using twice as many gives 50%.
> 
> unicode::0x1
> unicode code point 0x1 translation
> 
>         blead inline
>        ------ ------
>     Ir 100.00 133.88
>     Dr 100.00 135.71
>     Dw 100.00 155.56
>   COND 100.00 153.85
>    IND 100.00 100.00
> 
> COND_m 100.00 100.00
>  IND_m 100.00 100.00
> 
>  Ir_m1 100.00 100.00
>  Dr_m1 100.00 100.00
>  Dw_m1 100.00 100.00
> 
>  Ir_mm 100.00 100.00
>  Dr_mm 100.00 100.00
>  Dw_mm 100.00 100.00

My take on the above: the misses are all unchanged (and were probably very
low in both cases). These can usually be ignored for micro-benchmarks
(like measuring 'chr($x)') - they only really become relevant when
executing a large chunk of code and things start not fitting in caches.

Generally the most important numbers are COND and IND, which represent
points where branch prediction may fail and so stall execution and
screw up cache pre-loading.

IND is usually the worst - it represents cases where it's hard to know in
advance what's next - typically you get 1 IND per perl op executed,
due to calling PL_op->pp_addr, plus one per case statement where there
are enough cases to require the use of a jump table.

COND represents conditional branches. How likely one is to stall depends
on how predictable the branch is. For example in a while loop, the
branch will be taken the same way for each iteration except the last, and
prediction quickly starts getting it right. If the branching follows an
unpredictable pattern, then that can slow things down.

In the case of the utf8 code:

    *codep = (*state != UTF8_ACCEPT) ?
        (byte & 0x3fu) | (*codep << 6) :
        (0xff >> type) & (byte);

it will work very well for ASCII chars, since it always takes the same
branch. For others, I guess it's down to the CPU. For example if most chars
are 2 bytes, the condition will regularly alternate and it's possible
a branch predictor will see this (I'm just speculating here).
For a random mixture of 1,2,3 and 4 byte chars it will probably
start stalling a lot. That could be overcome by eliminating the condition,
e.g. something like

    U32 accept = *state == UTF8_ACCEPT;

    *codep =
              ( accept * ((0xff >> type) & (byte)))
            | (!accept * ((byte & 0x3fu) | (*codep << 6)));

but you'd probably need to look at the assembly output to see what the
compiler does with those two variants - it's possible the compiler already
converts the first to the second.

The downside of that is that it now executes more code - the question is
whether that compensates for any branch stalls.

So in general the DFA code looks better than the current perl variant.
The only possible downside I can think of is if it causes bloating of the
binary - you'd have to look at the sizes of various functions in the perl
executable. If it's got a lot larger, that could hurt the instruction
cache.

-- 
You're only as old as you look.
