
Re: unicode regex performance (was Re: Future Perl development)

Juerd Waalboer
February 8, 2007 07:57
Gerard Goossen wrote on 2007-02-08 16:49 (+0100):
> use strict;

Whoops! You got me there!

> If you fix your script you will see that the Unicode matching is a lot slower.

Then my point still stands.

> But it is a lot slower not because the matching is in Unicode, but because Perl 5
> has to do a lot to make sure all strings are Unicode; for example, E probably
> has to be upgraded to latin1.

There is no upgrading to latin1. AFAIK, Perl never downgrades
automatically. Can anyone confirm or negate this?
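
One way to probe this is to look at the internal UTF8 flag before and after an operation. The following is only a minimal sketch: the sample strings are made up for illustration, and it shows an automatic upgrade on concatenation plus an explicit upgrade. It does not by itself settle whether some other operation downgrades behind your back.

```perl
use strict;
use warnings;

# All characters fit in one byte, so this literal is stored
# without the internal UTF8 flag.
my $narrow = "caf\xE9";

# A character above 0xFF forces the UTF8 internal encoding.
my $wide = "\x{263A}";

# Concatenating with a wide string yields an upgraded result;
# $narrow itself is left untouched.
my $joined = $narrow . $wide;

# An explicit upgrade changes the internal encoding only,
# the characters stay the same.
my $copy = $narrow;
utf8::upgrade($copy);

printf "narrow flagged: %d\n", utf8::is_utf8($narrow) ? 1 : 0;
printf "joined flagged: %d\n", utf8::is_utf8($joined) ? 1 : 0;
printf "copy flagged:   %d\n", utf8::is_utf8($copy)   ? 1 : 0;
printf "copy eq narrow: %d\n", $copy eq $narrow       ? 1 : 0;
```

Running the same kind of check around a suspect operation would show whether the flag is ever dropped without an explicit utf8::downgrade.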

> If you turn off mixing latin1 and Unicode matching, things get a _lot_
> simpler and you can do better optimizations.

Which was part of my proposal: upgrade both the string and the pattern
to UTF8 (if necessary), and then do naive byte matching. This should be
explicitly enabled, because it causes havoc if you're not aware of the
internals. Optimizations like this are very nice to have, but should
only be used in extreme cases. Any use of such an optimization (unless
it can safely be done automatically) is probably premature.
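
A rough sketch of what I mean, for literal substrings only. The helper name naive_utf8_index is hypothetical and not part of Perl; it uses utf8::encode on copies to get the UTF-8 octets of both sides and then compares bytes with index. A real implementation would live inside the regex engine and would have to handle far more than literals.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Perl: returns the UTF-8 byte offset
# of the first occurrence of $needle in $haystack, or -1 if none.
sub naive_utf8_index {
    my ($haystack, $needle) = @_;   # copies; the caller's strings are untouched
    utf8::encode($haystack);        # characters -> UTF-8 octets
    utf8::encode($needle);          # works whether or not the UTF8 flag was set
    return index($haystack, $needle);   # plain byte comparison
}

my $string  = "na\xEFve caf\xE9 \x{263A}";  # contains a wide character
my $pattern = "caf\xE9";                    # stored without the UTF8 flag

print "byte offset: ", naive_utf8_index($string, $pattern), "\n";  # 7
print "char offset: ", index($string, $pattern), "\n";             # 6
```

The two offsets differ because the i-diaeresis takes two octets in UTF-8. That is one small example of the havoc: anything that counts, such as positions, lengths, or a dot in a pattern, now works in bytes rather than characters.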

> my branch:

Unfortunately, you use a similar thing by default. If I understand
correctly, your branch does UTF8, not Unicode. This is a bit like PHP's
mb_ functions. Real Perl does Unicode, while internally encoding it as

> When referring to my branch, I will do so explicitly (by saying something
> like my branch, my patch).

Thanks for clarifying that. I was confused by your reference to \x[].

Kind regards,

  juerd waalboer:  perl hacker
  convolution:     ict solutions and consultancy

I do not trust voting computers.