Front page | perl.perl5.porters |
Postings from October 2017
Re: Porting/bench -- utf8 to code point speed up
From: Karl Williamson
October 31, 2017 00:48
Re: Porting/bench -- utf8 to code point speed up
Message ID: email@example.com
On 10/30/2017 04:51 AM, Dave Mitchell wrote:
> On Sun, Oct 29, 2017 at 10:17:19PM -0600, Karl Williamson wrote:
>> People who aren't really into the nitty gritty of instruction timing, such
>> as myself, may have a hard time of getting to the bottom line of what the
>> output signifies. I suggested that we move the discussion to the
>> perl5-porter list so that Dave could do that and anyone, not just me, could
>> benefit from his knowledge.
> Well, the benchmarks seem to indicate that this is a win in every
> circumstance which matters. Presumably we don't care about the slowdown on
> oversized chars.
That's my position, and if someone does care, they could expand the table.
I went ahead and implemented this to allow all utf8 to code point
translation functions to work. There needed to be an extra test before
starting the translation to ensure there is at least one byte to look
at. This has decreased the actual win, output attached.
It may be that most of the win is that this dfa is small enough that it
can be inlined, with handling any errors found farmed out to a helper
function. There is a function that is called when the UTF-8 is known to
be well formed (like when it has previously been generated by us). It
is already inlined, and is faster by a little than this new one.
>> Ir Instruction read
>> Dr Data read
>> Dw Data write
>> COND conditional branches
>> IND indirect branches
>> _m branch predict miss
>> _m1 level 1 cache miss
>> _mm last cache (e.g. L3) miss
>> - indeterminate percentage (e.g. 1/0)
>> The numbers represent relative counts per loop iteration, compared to
>> blead at 100.0%.
>> Higher is better: for example, using half as many instructions gives 200%,
>> while using twice as many gives 50%.
>> unicode code point 0x1 translation
>> blead inline
>> ------ ------
>> Ir 100.00 133.88
>> Dr 100.00 135.71
>> Dw 100.00 155.56
>> COND 100.00 153.85
>> IND 100.00 100.00
>> COND_m 100.00 100.00
>> IND_m 100.00 100.00
>> Ir_m1 100.00 100.00
>> Dr_m1 100.00 100.00
>> Dw_m1 100.00 100.00
>> Ir_mm 100.00 100.00
>> Dr_mm 100.00 100.00
>> Dw_mm 100.00 100.00
> My take on the above: the misses are all unchanged (and were probably very
> low in both cases). These can usually be ignored for micro-benchmarks
> (like measuring 'chr($x)') - they only really become relevant when
> executing a large chunk of code and things start not fitting in caches.
> Generally the most important numbers are COND and IND, which represent
> points where branch prediction may fail and so stall execution and
> screw up cache pre-loading.
> IND is usually the worst - it represents cases where its hard to know in
> advance what's next - typically you get 1 IND per perl op executed,
> due to calling PL_op->pp_addr, plus one per case statement where there
> are enough cases to require the use of a jump table.
> COND represents conditional branches. How likely this is to stall depends
> on how how predictable the branch is. For example in a while loop, the
> branch will be taken the same way for each iteration except the last, and
> prediction quickly starts getting it right. If the branching follows an
> unpredictable pattern, then that can slow things down.
But is this a problem if both sides of the branch are in the cache? as
in this case?
> In the case of the utf8 code:
> *codep = (*state != UTF8_ACCEPT) ?
> (byte & 0x3fu) | (*codep << 6) :
> (0xff >> type) & (byte);
> it will work very well for ASCII chars, since it always takes the same
> branch. For others, I guess its down to the CPU. For example if most chars
> are 2 bytes, the condition will regularly alternate and its possible
> a branch predictor will see this (I'm just speculating here).
> For a random mixture of 1,2,3 and 4 byte chars it will probably
> start stalling a lot. That could be overcome by eliminating the condition,
> e.g. something like
> U32 accept = *state == UTF8_ACCEPT;
> *codep =
> ( accept * ((byte & 0x3fu) | (*codep << 6)))
> | (!accept * ((0xff >> type) & (byte)));
> but you'd probably need to look at the assembly output to see what the
> compiler does with those two variants - its possible the compiler already
> converts the first to the second.
> The downside of that is that it now executes more code - the question is
> whether that compensates for any branch stalls.
> So in general the dfa code looks better than the current perl variant.
> The only possible downside I can think of is if it causes bloating of the
> binary - you'd have to look at the sizes of various functions in the perl
> executable. If its got a lot larger, that could hurt the instruction
The total size of ./perl increased 100K bytes. There were 98 calls in
it that got inlined; only 6 needed the full generality.
I have pushed the branch as
in case anyone wants to look at it.
If we did go with this, I don't know how it would work with the license,
Copyright (c) 2008-2009 Bjoern Hoehrmann <firstname.lastname@example.org>
Permission is hereby granted, free of charge, to any person obtaining a
copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:
The above copyright notice and this permission notice shall be included
in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.