Front page | perl.perl5.porters |
Postings from October 2017
full post: Porting/bench -- utf8 to code point speed up
From:
Karl Williamson
Date:
October 30, 2017 04:17
Subject:
full post: Porting/bench -- utf8 to code point speed up
Message ID:
6d3d2668-a274-cc45-101d-f10f335e0fc6@khwilliamson.com
I hit a key combination that accidentally sent the draft I was starting.
This is the whole thing:
I think the API in 5.26 for handling UTF-8 is finally good enough that
it's time to work on speeding it up.
http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
is a free DFA-based decoder that has long been known to various people
on this list, and is considered the one to beat by various people on the
Unicode mailing list.
In fact someone recently came up with a branchless decoder which did not
beat this one.
I set out to compare it with our current, klunky one using
Porting/bench.pl. It showed what seemed to me to be barely an
improvement. I then asked Dave Mitchell offlist if I was doing anything
wrong. It turns out I was including all the overhead from the setup,
etc. in the calculations. He showed me how to isolate just the ord()
function, like this:
    push @benchmarks,
        "unicode::$cp" => {
            desc  => "unicode code point $cp translation",
            setup => 'no warnings; my ($a);',
            pre   => "\$a = chr($cp); utf8::upgrade(\$a);",
            code  => 'ord($a);',
        };
and to create copies of this for various values of $cp. He recommended
using the first and last code points of each UTF-8 representation
length: 1 byte, 2 bytes, and so on. Doing this led to much more
decisive results; attached.
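One way the per-length copies could be generated is sketched below. This is an illustration only, not the code that was actually run: the @ranges table and the utf8_length() helper are hypothetical names, only lengths 1 to 4 are covered, and the last 4-byte value assumes Perl's extended UTF-8 rather than the Unicode maximum.

```perl
use strict;
use warnings;

# First and last code point for each UTF-8 length from 1 to 4 bytes.
# 0x1FFFFF is above the Unicode maximum (0x10FFFF); it is the last
# 4-byte value only in Perl's extended UTF-8.
my @ranges = (
    [ 0x00,    0x7F     ],    # 1 byte
    [ 0x80,    0x7FF    ],    # 2 bytes
    [ 0x800,   0xFFFF   ],    # 3 bytes
    [ 0x10000, 0x1FFFFF ],    # 4 bytes
);

# Hypothetical helper: number of bytes Perl uses to encode $cp.
sub utf8_length {
    my ($cp) = @_;
    no warnings 'utf8';       # boundary values include non-Unicode ones
    my $s = chr($cp);
    utf8::encode($s);         # characters -> UTF-8 bytes, in place
    return length $s;
}

my @benchmarks;
for my $i (0 .. $#ranges) {
    my $want = $i + 1;
    for my $cp (@{ $ranges[$i] }) {
        # Sanity-check the table against what Perl actually produces.
        die sprintf("U+%X is not %d bytes\n", $cp, $want)
            unless utf8_length($cp) == $want;

        push @benchmarks,
            "unicode::$cp" => {
                desc  => "unicode code point $cp translation",
                setup => 'no warnings; my ($a);',
                pre   => "\$a = chr($cp); utf8::upgrade(\$a);",
                code  => 'ord($a);',
            };
    }
}

printf "%d benchmark entries\n", @benchmarks / 2;
```

Checking each boundary with utf8_length() before pushing the entry guards against a typo in the table silently benchmarking the wrong length.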
The DFA decoder is designed to work only on non-surrogate Unicode code
points, whereas Perl uses an extended UTF-8 that includes many more code
points. It was easy to change the tables so that surrogates are
accepted, and I wrote a wrapper function that calls our klunky function
whenever the DFA decoder fails. This allows us to handle Perl's
extended UTF-8, at a cost: code points above the legal Unicode maximum
incur the overhead of the DFA decoder failing before the current
function gets called as a fallback. In principle, it would be possible
to extend the DFA to handle larger code points, but it's not an itch
that I think needs to be scratched anytime soon.
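The control flow of such a wrapper can be sketched in Perl as follows. Everything here is a hypothetical stand-in: fast_decode() plays the role of the DFA (it is a plain hand-written decoder, not the table-driven one), slow_decode() plays the role of the existing function and only goes up to 6-byte sequences, and decode_limited() and %calls exist purely for illustration. The real code is C inside the perl core.

```perl
use strict;
use warnings;

# Decode the single UTF-8-style sequence in $bytes. Sequences longer than
# $max_len bytes, or results above $max_cp, are rejected (undef).
# Surrogates are accepted, as with the modified tables described above.
sub decode_limited {
    my ($bytes, $max_len, $max_cp) = @_;
    my @b = unpack 'C*', $bytes;
    return undef unless @b;

    # The lead byte determines the sequence length.
    my $lead = $b[0];
    my $len  = $lead < 0x80 ? 1
             : $lead < 0xC0 ? 0      # stray continuation byte
             : $lead < 0xE0 ? 2
             : $lead < 0xF0 ? 3
             : $lead < 0xF8 ? 4
             : $lead < 0xFC ? 5
             : $lead < 0xFE ? 6
             :                0;     # 0xFE/0xFF: not handled in this sketch
    return undef if $len == 0 || $len > $max_len || $len != @b;

    # Payload bits of the lead byte, then 6 bits per continuation byte.
    my $cp = $len == 1 ? $lead : $lead & (0x7F >> $len);
    for my $cont (@b[1 .. $len - 1]) {
        return undef unless ($cont & 0xC0) == 0x80;
        $cp = ($cp << 6) | ($cont & 0x3F);
    }

    # Reject overlong encodings and values beyond the caller's limit.
    my @min = (0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000);
    return undef if $cp < $min[$len] || $cp > $max_cp;
    return $cp;
}

my %calls = (fast => 0, slow => 0);  # counters to make the fallback visible

# Stand-in for the DFA: Unicode range only, at most 4 bytes.
sub fast_decode { $calls{fast}++; return decode_limited($_[0], 4, 0x10FFFF) }

# Stand-in for the existing general-purpose decoder.
sub slow_decode { $calls{slow}++; return decode_limited($_[0], 6, 0x7FFFFFFF) }

# The wrapper: try the fast path, fall back only when it fails.
sub decode {
    my ($bytes) = @_;
    my $cp = fast_decode($bytes);
    return $cp if defined $cp;
    return slow_decode($bytes);      # this input has now paid for both
}

printf "U+%X\n", decode("\xC3\xA9");          # fast path only
printf "U+%X\n", decode("\xF4\x90\x80\x80");  # above Unicode: falls back
printf "fast=%d slow=%d\n", @calls{qw(fast slow)};
```

The counters show the trade-off: input within the Unicode range never reaches the fallback, while anything above it (and anything malformed) goes through both decoders.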
People who aren't really into the nitty gritty of instruction timing,
such as myself, may have a hard time getting to the bottom line of
what the output signifies. I suggested that we move the discussion to
the perl5-porters list so that Dave could explain it and anyone, not
just me, could benefit from his knowledge.