develooper Front page | perl.perl5.porters | Postings from October 2017

Re: Porting/bench -- utf8 to code point speed up

Dave Mitchell
October 31, 2017 11:48
Re: Porting/bench -- utf8 to code point speed up
On Mon, Oct 30, 2017 at 06:48:17PM -0600, Karl Williamson wrote:
> > COND represents conditional branches. How likely this is to stall depends
> > on how how predictable the branch is. For example in a while loop, the
> > branch will be taken the same way for each iteration except the last, and
> > prediction quickly starts getting it right. If the branching follows an
> > unpredictable pattern, then that can slow things down.
> But is this a problem if both sides of the branch are in the cache? as in
> this case?

Yes. Branch prediction isn't about the instructions being in the cache
(at least not after the first couple of iterations) - it's about keeping
the instruction pipeline going. A modern CPU will be reading in several
instructions in advance, decoding them, determining data dependencies
(such as which instructions use which results from previous instructions),
reordering them if necessary, then farming them off to potentially
multiple ALUs for parallel execution. It can also be triggering
pre-fetches of cache data lines which it predicts are going to be accessed.

However, once it sees a conditional branch instruction ahead, it has to
make a guess as to which way the branch will go, and continue with the
pipeline in that direction. If, once it actually reaches the branch
instruction, it finds that it guessed wrong, it has to throw away all the
work done so far on the wrong branch, and restart the pipeline on the
other branch, stalling the CPU.

The actual details will depend on the particular CPU, and I don't claim to
have any extensive knowledge in this area.

I tried just now to mock up a little C program that runs a loop containing
code similar to the dfa, but which, based on a command-line arg, either
always takes branch 0, always takes branch 1, or branches randomly each
time (driven by a 256-byte table containing all 0s, all 1s, or random
data).

The results were not at all what I expected. Using 'perf stat' to view the
CPU's hardware count of cycles and stalled cycles, I got:

    0:    388 stalled Mcycles out of 20,748 Mcycles; 20M branch misses
    1:    518 stalled Mcycles out of 16,234 Mcycles; 20M branch misses
    rand:  68 stalled Mcycles out of 17,885 Mcycles; 0.2M branch misses
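
(For reference, a typical invocation for reading these counters might look
like the following - the binary name is hypothetical, and the exact event
names available vary by kernel and CPU:)

```shell
# count total cycles, stalled cycles and branch misses for each mode
perf stat -e cycles,stalled-cycles-frontend,branch-misses ./branchtest 0
perf stat -e cycles,stalled-cycles-frontend,branch-misses ./branchtest 1
perf stat -e cycles,stalled-cycles-frontend,branch-misses ./branchtest rand
```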

Changing the code to eliminate the branch as I suggested:

> >      *codep =
> >                ( accept * ((byte & 0x3fu) | (*codep << 6)))
> >              | (!accept * ((0xff >> type) & (byte)));


    0:    6,796 stalled Mcycles out of 37,278 Mcycles; 20M branch misses
    1:    6,775 stalled Mcycles out of 37,262 Mcycles; 20M branch misses
    rand: 6,722 stalled Mcycles out of 37,203 Mcycles; 20M branch misses

So I think the conclusion is that trying to second-guess the CPU is
probably a waste of time!

> The total size of ./perl increased 100K bytes.  There were 98 calls in it
> that got inlined; only 6 needed the full generality.

Hmmm, that seems a bit excessive. I would be tempted to see what
performance drop you get by not inlining the function, or only inlining
it in a few critical places, or (if technically feasible) only inlining
a smaller subset of the function.

> I have pushed the branch as
> in case anyone wants to look at it.

It looks good, except I don't like the 'real' function being named the
same as the inline function but with an extra _ prefix. It confused me a
lot trying to review the diff. Maybe rename the 'real' function to
Perl_utf8n_to_uvchr_full() (non-API but exported)?

> If we did go with this, I don't know how it would work with the license,
> which reads:

IANAL but I can't see any reason not to include it the way you have.

Diplomacy is telling someone to go to hell in such a way that they'll
look forward to the trip
