full post: Porting/bench -- utf8 to code point speed up

From: Karl Williamson
Date: October 30, 2017 04:17
Subject: full post: Porting/bench -- utf8 to code point speed up
Message ID: 6d3d2668-a274-cc45-101d-f10f335e0fc6@khwilliamson.com

I hit a key combination that accidentally sent the draft I was starting.
This is the whole thing:

I think the API in 5.26 for handling UTF-8 is finally good enough that 
it's time to work on speeding it up.

http://bjoern.hoehrmann.de/utf-8/decoder/dfa/

is a free DFA decoder that has long been known to various people on 
this list, and is considered the one to beat by various people on the 
Unicode mailing list.

In fact someone recently came up with a branchless decoder which did not 
beat this one.

I set out to compare it against our current, klunky one using 
Porting/bench.pl.  It showed what seemed to me to be barely an 
improvement.  I then asked Dave Mitchell off-list if I was doing anything 
wrong.  It turns out I was including all the overhead from the setup, 
etc. in the calculations.  He showed me how to isolate just the ord() 
function, like this:

     push @benchmarks,
         "unicode::$cp" => {
             desc    => "unicode code point $cp translation",
             setup   => 'no warnings; my ($a);',
             pre     => "\$a = chr($cp); utf8::upgrade(\$a);",
             code    => 'ord($a);',
         };

and to create a copy of this for each of several $cp's.  He recommended 
using the first and last code points of each UTF-8 length: 1 byte, 
2 bytes, and so on.  Doing this led to much more decisive results, 
which are attached.

The DFA decoder is designed to only work on non-surrogate Unicode code 
points, whereas Perl uses an extended UTF-8 that includes many more code 
points.  It was easy to change the tables so that surrogates are 
accepted, and I wrote a wrapper function that calls our klunky function 
when the decoder fails.  This allows us to handle Perl's extended UTF-8, 
at the cost that code points above the legal Unicode max incur the 
overhead of the DFA decoder failing before the current function gets 
called as a fallback.  In principle, it would be possible to extend the 
DFA to handle larger code points, but it's not an itch that I think 
needs to be scratched anytime soon.
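The shape of that wrapper can be sketched in Perl as follows.  This is only an illustration of the fast-path-then-fallback idea, not the actual C code: fast_decode() is a hand-written stand-in for the table-driven DFA (it accepts surrogates and rejects anything above U+10FFFF), slow_decode() stands in for the existing general decoder by leaning on utf8::decode, and all three function names are made up here.  Each function takes the bytes of exactly one encoded character:

```perl
use strict;
use warnings;

# Stand-in for the fast decoder (NOT the DFA tables): accepts well-formed
# UTF-8 up to U+10FFFF, surrogates included, and fails on anything else.
sub fast_decode {
    my ($bytes) = @_;
    my ($first, @rest) = unpack 'C*', $bytes;
    return unless defined $first;
    return $first if $first < 0x80 && !@rest;

    my ($extra, $cp, $min);
    if    ($first >= 0xC2 && $first <= 0xDF) { ($extra, $cp, $min) = (1, $first & 0x1F, 0x80) }
    elsif ($first >= 0xE0 && $first <= 0xEF) { ($extra, $cp, $min) = (2, $first & 0x0F, 0x800) }
    elsif ($first >= 0xF0 && $first <= 0xF4) { ($extra, $cp, $min) = (3, $first & 0x07, 0x10000) }
    else                                     { return }    # e.g. Perl-extended lead bytes

    return unless @rest == $extra;
    for my $byte (@rest) {
        return if ($byte & 0xC0) != 0x80;           # not a continuation byte
        $cp = ($cp << 6) | ($byte & 0x3F);
    }
    return if $cp < $min || $cp > 0x10FFFF;         # overlong, or above Unicode
    return $cp;
}

# Stand-in for the existing general decoder: utf8::decode also accepts
# Perl's extended UTF-8, so code points above U+10FFFF succeed here.
sub slow_decode {
    my ($bytes) = @_;
    my $copy = $bytes;
    return unless utf8::decode($copy) && length($copy) == 1;
    return ord $copy;
}

# The wrapper: try the fast path first and pay for the fallback only
# when the fast path fails.
sub decode_one {
    my ($bytes) = @_;
    my $cp = fast_decode($bytes);
    return $cp if defined $cp;
    return slow_decode($bytes);
}

{
    no warnings 'utf8';    # silence any above-Unicode warnings in the demo
    for my $cp (0x41, 0xD800, 0x10FFFF, 0x200000) {
        my $bytes = chr($cp);
        utf8::encode($bytes);
        my $via = defined fast_decode($bytes) ? 'fast path' : 'fallback';
        printf "U+%X decoded as U+%X via the %s\n", $cp, scalar decode_one($bytes), $via;
    }
}
```

Only the last value in the loop, which lies above the Unicode maximum, goes through both decoders; that double pass is the overhead mentioned above.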

People who aren't really into the nitty gritty of instruction timing, 
such as myself, may have a hard time getting to the bottom line of 
what the output signifies.  I suggested that we move the discussion to 
the perl5-porters list so that Dave could do that and anyone, not just 
me, could benefit from his knowledge.

