
Re: unicode regex performance (was Re: Future Perl development)

February 8, 2007 11:52
On Thu, Feb 08, 2007 at 04:57:20PM +0100, Juerd Waalboer wrote:
> Which was part of my proposal: upgrade both the string and the pattern
> to UTF8 (if necessary), and then do naive byte matching. This should be
> explicitly enabled, because it causes havoc if you're not aware of the
> internals. Optimizations like this are very nice to have, but should
> only be used in extreme cases. Any use of such an optimization (unless
> it can safely be done automatically) is probably premature.

FYI: Juerd, I agree with you. In glib, I have written C code that
searches for characters such as '\n' using the utf-8 functions to walk
through the string. Glib does a good job of optimizing the walk. It
uses a lookup table to determine the length of the next character in
bytes, and so on. But however you work it, if you know that the utf-8
is well formed, searching for a '\n' is faster by searching for the
byte 0x0A than by walking through a variable number of bytes at a
time.
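
To make the comparison concrete, here is a minimal sketch in plain C
(not glib, and not the code referred to above). The function names are
hypothetical, and the input is assumed to be well-formed utf-8. The
byte search is safe because, in well-formed utf-8, a byte below 0x80
never appears inside a multi-byte sequence.

```c
#include <stddef.h>
#include <string.h>

/* Length in bytes of the sequence that starts with lead byte b.
 * Assumes well-formed input, so b is never a continuation byte. */
static size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;           /* ASCII */
    if ((b & 0xE0) == 0xC0) return 2; /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3; /* 1110xxxx */
    return 4;                         /* 11110xxx */
}

/* Character-by-character walk: byte offset of the first '\n', or -1. */
static ptrdiff_t find_newline_walk(const char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = (unsigned char)s[i];
        if (b == 0x0A)
            return (ptrdiff_t)i;
        i += utf8_seq_len(b);         /* skip a variable number of bytes */
    }
    return -1;
}

/* Naive byte search: same result on well-formed input, no decoding. */
static ptrdiff_t find_newline_bytes(const char *s, size_t len)
{
    const char *p = memchr(s, 0x0A, len);
    return p ? (ptrdiff_t)(p - s) : -1;
}
```

Both functions return the same byte offset on well-formed input; the
second one simply hands the whole job to memchr(), which never has to
look at character boundaries.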

Depending on the regular expression, at least the *tests* should often
be possible to do without treating the string as utf-8. A single
post-process walk on a successful match, to ensure that pos() and the
$n captures are all correct, might be required.
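
A post-process walk of this kind could look roughly like the sketch
below, again in plain C with a hypothetical function name. It assumes
well-formed utf-8 and a byte offset that falls on a character boundary,
and it assumes the caller wants offsets counted in characters. It is
not how the perl RE engine does it, only an illustration of the idea:
count the bytes that are not continuation bytes.

```c
#include <stddef.h>

/* Number of characters that start before byte offset byte_off.
 * Continuation bytes have the form 10xxxxxx, so every other byte
 * marks the start of a character. */
static size_t utf8_chars_before(const char *s, size_t byte_off)
{
    size_t chars = 0;
    for (size_t i = 0; i < byte_off; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}
```

The walk is only paid once per successful match, and only over the
bytes up to the match position, rather than on every test the matcher
performs.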

A bit of hand waving. I'm not going to open up the perl RE engine to
work on it. I think it would scare me. :-)


-- / /     __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
                       and in the darkness bind them...

