develooper Front page | perl.perl5.porters | Postings from February 2007

Re: unicode regex performance (was Re: Future Perl development)

February 8, 2007 11:52
On Thu, Feb 08, 2007 at 04:57:20PM +0100, Juerd Waalboer wrote:
> Which was part of my proposal: upgrade both the string and the pattern
> to UTF8 (if necessary), and then do naive byte matching. This should be
> explicitly enabled, because it causes havoc if you're not aware of the
> internals. Optimizations like this are very nice to have, but should
> only be used in extreme cases. Any use of such an optimization (unless
> it can safely be done automatically) is probably premature.

FYI, Juerd: I agree with you. In glib, I have written C code that
searches for characters such as '\n' using the utf-8 functions to walk
through the string. Glib does a good job of optimizing the walk: it
uses a lookup table to determine the length in bytes of the next
character, and so on. But however you work it, if you know that the
utf-8 is well formed, searching for a '\n' by scanning for the byte
0x0A is faster than walking through a variable number of bytes at a
time. It is also safe: in well-formed utf-8 every byte of a multi-byte
character has its high bit set, so a 0x0A byte can only be a real '\n'.
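
To make the comparison concrete, here is a minimal sketch in plain C of
the two approaches. It does not use glib; utf8_char_len() is a
hypothetical stand-in for the kind of lookup table described above, and
both functions assume the input is already known to be well-formed
utf-8.

```c
#include <stddef.h>
#include <string.h>

/* Length in bytes of the utf-8 character whose lead byte is b.
   Hypothetical helper; assumes well-formed input, so it never
   sees a continuation byte in lead position. */
static size_t utf8_char_len(unsigned char b)
{
    if (b < 0x80) return 1;             /* 0xxxxxxx: ASCII        */
    if ((b & 0xE0) == 0xC0) return 2;   /* 110xxxxx: 2-byte char  */
    if ((b & 0xF0) == 0xE0) return 3;   /* 1110xxxx: 3-byte char  */
    return 4;                           /* 11110xxx: 4-byte char  */
}

/* Walk character by character, a variable number of bytes at a time.
   Returns the byte offset of the first '\n', or -1 if there is none. */
static long find_newline_walk(const char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = (unsigned char)s[i];
        if (b == 0x0A) return (long)i;
        i += utf8_char_len(b);
    }
    return -1;
}

/* Ignore the encoding and scan for the byte 0x0A directly.
   Returns the byte offset of the first '\n', or -1 if there is none. */
static long find_newline_bytes(const char *s, size_t len)
{
    const char *p = memchr(s, 0x0A, len);
    return p ? (long)(p - s) : -1;
}
```

Both functions return the same byte offset on well-formed input. The
second one leaves the whole scan to memchr(), which never has to look
at a lead byte to decide how far to step.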

Depending on the regular expression, at least the *tests* should often
be possible to do without treating the string as utf-8. A single
post-processing walk after a successful match might be required to
ensure that pos() and $n are all correct.
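
As a rough illustration of that post-processing walk, the sketch below
converts a byte offset found by a byte-level match into a character
offset. The function name is made up for this example, and it says
nothing about how the perl RE engine stores or computes its offsets; it
only assumes well-formed utf-8, where every byte that is not a
continuation byte (10xxxxxx) starts a new character.

```c
#include <stddef.h>

/* Number of utf-8 characters in the first byte_off bytes of s.
   Hypothetical helper; assumes s is well-formed utf-8 and that
   byte_off falls on a character boundary. */
static size_t utf8_byte_to_char_offset(const char *s, size_t byte_off)
{
    size_t chars = 0;
    for (size_t i = 0; i < byte_off; i++) {
        /* Count every byte that is not a continuation byte. */
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}
```

This is one linear pass over the bytes before the match, and it is only
paid on a successful match rather than on every test.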

A bit of hand waving. I'm not going to open up the perl RE engine to
work on it. I think it would scare me. :-)


-- / /     __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
                       and in the darkness bind them...
