develooper Front page | perl.perl5.porters | Postings from February 2007

unicode regex performance (was Re: Future Perl development)

Gerard Goossen
February 7, 2007 17:19
unicode regex performance (was Re: Future Perl development)
Message ID:
On Wed, Feb 07, 2007 at 11:52:48AM +0000, Nicholas Clark wrote:
> On Wed, Feb 07, 2007 at 10:44:55AM +0100, demerphq wrote:
> > On 2/7/07, Marvin Humphrey <> wrote:
> > >
> > >
> > >However, all that encode/decode overhead would kill the performance
> > >of these libraries, rendering them far less useful.  It would be nice
> > >it Perl's internal encoding was always, officially UTF-8 -- then
> > >there wouldn't be a conflict.  But I imagine that might be very hard
> > >to pull off on EBCDIC systems, so maybe it's better this way -- I get
> > >to choose not to support EBCDIC systems (along with systems that
> > >don't use IEEE 754 floats, and systems where chars are bigger than a
> > >byte).
> > 
> > I for one would argue that if we were going to go to a single internal
> > encoding that utf8 would be the wrong one. Utf-16 would be much
> > better. It would allow us to take advantage of the large amount of
> > utf-16 code out there, ranging from DFA regexp engines to other
> > algorithms and libraries. On Win32 the OS natively does utf-16 so much
> > of the work would be done by the OS. Id bet that this was also a
> > reason why other languages choose to use utf-16. In fact i wouldnt be
> > surprised if we were the primary language using utf8 internally at
> > all.
> Jarkko's view, based on the battle scars from the dragons in the regexp
> engine, was that fixed width 32 bit was better than anything 16 bit variable
> width. Doing the latter properly *still* requires dealing with surrogates.

If you mean easier to implement, then fixed width is much easier,
especially if you try to add it to a regex engine which assumes
characters have a fixed width.
Unfortunatly UTF-32 would be very bad for performance. The problem is
that your strings use much more memory. It is not that I care that much
about memory usage. But for example when doing a regex for a fixed string
like m/aap/ in UTF-8 you just have to do a memory search for the bytes 
representing 'aap' in UTF-32 you can do the same, but you have much more
memory to search through. 
Whether UTF-8 or UTF-16 is faster depends on the content of your
strings, if you mostly have ASCII data UTF-8 would be shorter and thus
faster, if you are dealing with non western languages, the UTF-16
encoding will probably be shorter and thus faster.

The current regex engine does not work very fast for unicode, because it
was build on top of the latin1 regex engine.
Some things that can greatly improve the performance of the current
regex engine:
- The regex engine does not know whether the regex will be used for
  latin1 or unicode, knowing that we are going to match UTF-8 we can
  optimize for that.
- For character classes: A bitmap is only used for codepoints < 256 to
  see wheter they match, the bitmap could be extended for larger
  codepoints. (when the bitmap does not match a very expensive swatch is
- For character classes: For the bitmap test the UTF-8 is decoded
  to a codepoint and then looked up. Instead the codepoints of the
  character class could be encoded to UTF-8 only once during compilation
  and then used in the bitmap.
- Counting/tracking codepoints is expensive, so we should not do it when it 
  is not required.

In my branch I already started working on these things (mostly to
seperate the unicode and byte matching, and not really optimizing for
speed), and I already got more then 20% speed increase.

Of course when doing Level 2 or 3 unicode regular expression
(see for the levels), things get a lot
more complicated and thus slower.

> The *best* solution might well be fixed 7/8/16/32, using the smallest that
> fits.

Find the smallest fixed width and converting your string to it, is
probably more overhead, then to just use the standard encoding.
And not knowing during compilation of the regex what the encoding will also
make things very difficult.

Gerard Goossen Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About