Front page | perl.perl5.porters |
Postings from February 2007
Re: unicode regex performance (was Re: Future Perl development)
February 8, 2007 04:06
Message ID: email@example.com
On 2/8/07, Gerard Goossen <firstname.lastname@example.org> wrote:
> On Wed, Feb 07, 2007 at 11:52:48AM +0000, Nicholas Clark wrote:
> > On Wed, Feb 07, 2007 at 10:44:55AM +0100, demerphq wrote:
> > > On 2/7/07, Marvin Humphrey <email@example.com> wrote:
> > > >
> > > >
> > > >However, all that encode/decode overhead would kill the performance
> > > >of these libraries, rendering them far less useful. It would be nice
> > > >if Perl's internal encoding were always, officially, UTF-8 -- then
> > > >there wouldn't be a conflict. But I imagine that might be very hard
> > > >to pull off on EBCDIC systems, so maybe it's better this way -- I get
> > > >to choose not to support EBCDIC systems (along with systems that
> > > >don't use IEEE 754 floats, and systems where chars are bigger than a
> > > >byte).
> > >
> > > I for one would argue that if we were going to go to a single internal
> > > encoding, utf8 would be the wrong one. Utf-16 would be much
> > > better. It would allow us to take advantage of the large amount of
> > > utf-16 code out there, ranging from DFA regexp engines to other
> > > algorithms and libraries. On Win32 the OS natively does utf-16, so much
> > > of the work would be done by the OS. I'd bet that this was also a
> > > reason why other languages chose to use utf-16. In fact I wouldn't be
> > > surprised if we were the primary language using utf8 internally at
> > > all.
> > Jarkko's view, based on the battle scars from the dragons in the regexp
> > engine, was that fixed width 32 bit was better than anything 16 bit variable
> > width. Doing the latter properly *still* requires dealing with surrogates.
> If you mean easier to implement, then fixed width is much easier,
> especially if you try to add it to a regex engine which assumes
> characters have a fixed width.
Much easier is a massive understatement.
> Unfortunately UTF-32 would be very bad for performance. The problem is
> that your strings use much more memory. It is not that I care that much
> about memory usage, but, for example, when doing a regex for a fixed string
> like m/aap/ in UTF-8 you just have to do a memory search for the bytes
> representing 'aap'; in UTF-32 you can do the same, but you have much more
> memory to search through.
This is a false argument: almost every CPU that perl runs on can
handle word-sized data chunks more efficiently than byte-sized data
chunks. The extra memory usage will not make things slower except in
respect of the original allocation of the memory, and even that would
depend on the OS, I should think.
Not only that, the very nature of utf8 means that it's not unlikely that
high codepoints will be distributed across two separate words, which
makes the CPU do even more work: to get the bytes in a word it needs to
load the word, mask off the appropriate bytes, then load the next word,
get some bytes from that, and so on.
> Whether UTF-8 or UTF-16 is faster depends on the content of your
> strings: if you mostly have ASCII data, UTF-8 will be shorter and thus
> faster; if you are dealing with non-western languages, the UTF-16
> encoding will probably be shorter and thus faster.
You should forget this idea that a smaller per-character memory
representation means faster processing. It doesn't, at all: CPUs are
built to move aligned words around, not bytes.
> The current regex engine does not work very fast for unicode, because it
> was built on top of the latin1 regex engine.
> Some things that can greatly improve the performance of the current
> regex engine:
> - The regex engine does not know whether the regex will be used for
> latin1 or unicode, knowing that we are going to match UTF-8 we can
> optimize for that.
And throw out all the really nice optimisations that use latin_1, like
FBM matching. And FBM matching is the secret sauce that makes our engine goooooooo!
> - For character classes: A bitmap is only used for codepoints < 256 to
> see whether they match; the bitmap could be extended for larger
> codepoints. (When the bitmap does not match, a very expensive swatch
> lookup is done.)
> - For character classes: For the bitmap test the UTF-8 is decoded
> to a codepoint and then looked up. Instead the codepoints of the
> character class could be encoded to UTF-8 only once during compilation
> and then used in the bitmap.
You are correct that our charclass representation is particularly
sucky. Jarkko and I have discussed this in the past and have done some
research into how to do it right.
It looks like the right approach to doing unicode char classes is to
use inversion lists.
Bitmaps are obviously out: we can't have a ~136KB bitmap for each
character class (there are 1,114,112 unicode codepoints). And I think
it's obvious that any way to ameliorate this problem effectively
destroys the utility of using a bitmap at all.
There are other approaches, specifically using double folder tries.
But they have other properties that make them less desirable than
using inversion lists.
I welcome some bright spark recoding our unicode charclass handling to
use inversion lists and getting us up to full Level 2 compliance for
unicode char class set operations. It's something on my todo list, but
as I rarely use unicode it's not an itch for me, just a todo item.
PLEASE PLEASE do some research on this stuff before forming your
hypothesis; there is lots of published work on this subject.
> - Counting/tracking codepoints is expensive, so we should not do it when it
> is not required.
I don't understand this comment. What do you mean by it?
> In my branch I already started working on these things (mostly to
> separate the unicode and byte matching, and not really optimizing for
> speed), and I already got more than a 20% speed increase.
Yes, removing the repeated "are we utf8" check on practically every op
would be an improvement. But a 20% speed increase sounds high. Do you
mean 20% faster when doing utf8 matches? Or 20% faster than doing
latin_1 matches?
> Of course when doing Level 2 or 3 unicode regular expression
> (see http://unicode.org/reports/tr18/ for the levels), things get a lot
> more complicated and thus slower.
We ARE mostly Level-2 compliant.
I'll just add: there is certainly room for improving some of the
unicode support in the regex engine to make it more efficient.
But this is a separate point from the original, which was that we
would be better off using a 32-bit representation -- a point you
haven't really addressed, or have addressed using (AFAIK) provably
incorrect assumptions about CPU design.
I appreciate your boldness in tackling this issue, but I think you
have some core misunderstandings that are leading you to make
incorrect evaluations. Operating on bytes is not more efficient than
operating on words, and characters are not bytes, despite the choice
of name for byte-sized data types in C. The cost of reading a single
character from a utf8 stream is not going to be less than reading a
character from a utf32 stream. Have a look at what is required to
convert a utf8 sequence into its equivalent codepoint and ask yourself
how that could ever be faster than reading an aligned word.
perl -Mre=debug -e "/just|another|perl|hacker/"