perl.perl5.porters | Postings from February 2007

Re: unicode regex performance (was Re: Future Perl development)

From: Gerard Goossen
Date: February 8, 2007 08:29
Subject: Re: unicode regex performance (was Re: Future Perl development)
Message ID: 20070208163150.GD4898@ostwald

On Thu, Feb 08, 2007 at 04:57:20PM +0100, Juerd Waalboer wrote:
> Gerard Goossen skribis 2007-02-08 16:49 (+0100):
> > use strict;
> 
> Whoops! You got me there!
> 
> > If you fix your script you will see that the unicode matching is a lot slower.
> 
> Then my point still stands.
> 
> > But it is a lot slower not because the matching is in unicode, but because Perl 5
> > has to do a lot to make sure all strings are unicode; for example, E probably
> > has to be upgraded to latin1.
> 
> There is no upgrading to latin1. AFAIK, Perl never downgrades
> automatically. Can anyone confirm or negate this?

Sorry, I meant upgraded to utf8. Perl does sometimes downgrade automatically,
for example in 'crypt', but I don't think the regex engine does it.
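
To make the upgrade/downgrade terminology concrete, here is a minimal
sketch using the core utf8:: functions (utf8::is_utf8, utf8::upgrade,
utf8::downgrade). The sample string and variable names are made up for
illustration; this only shows the internal representation changing while
the characters stay the same, and says nothing about what the regex
engine does internally.

```perl
use strict;
use warnings;

# "\xE9" is e-acute. A literal like this starts out in Perl's
# one-byte-per-character (latin1) representation, UTF8 flag off.
my $str    = "caf\xE9";
my $before = utf8::is_utf8($str) ? 1 : 0;

# Upgrade a copy: same characters, but stored internally as UTF-8.
my $up = $str;
utf8::upgrade($up);
my $after = utf8::is_utf8($up) ? 1 : 0;

# The strings are still equal at the character level.
my $same = ($str eq $up) ? 1 : 0;

# Downgrade a copy back to the byte representation. The true second
# argument makes utf8::downgrade return false instead of dying when
# a character does not fit in one byte.
my $down = $up;
my $ok   = utf8::downgrade($down, 1) ? 1 : 0;

# A string with a character above 0xFF cannot be downgraded.
my $wide    = "\x{263A}";
my $wide_ok = utf8::downgrade($wide, 1) ? 1 : 0;

print "before=$before after=$after same=$same ok=$ok wide_ok=$wide_ok\n";
```

The point is that upgrading costs a copy and a re-encoding of every
character above 0x7F, which is the kind of work that shows up when
latin1 and UTF-8 strings are mixed.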
 
> > If you turn off mixing latin1 and unicode matching, things get a _lot_
> > simpler and you can do better optimizations.
> 
> Which was part of my proposal: upgrade both the string and the pattern
> to UTF8 (if necessary), and then do naive byte matching. This should be
> explicitly enabled, because it causes havoc if you're not aware of the
> internals. Optimizations like this are very nice to have, but should
> only be used in extreme cases. Any use of such an optimization (unless
> it can safely be done automatically) is probably premature.
> 
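
A minimal sketch of the "upgrade both, then match bytes" idea quoted
above, restricted to literal (non-regex) needles. The helper name
byte_index and the sample strings are hypothetical; this is not how
perl's regex engine or either branch implements anything, just an
illustration of the technique and of why it has to be opt-in.

```perl
use strict;
use warnings;

# Hypothetical helper: find a literal $needle in $haystack by comparing
# UTF-8 bytes. Returns a BYTE offset, or -1 if there is no match.
sub byte_index {
    my ($haystack, $needle) = @_;   # copies; the caller's strings are untouched

    # utf8::encode turns the characters into their UTF-8 octets,
    # whatever the internal representation was (latin1 or UTF-8).
    utf8::encode($haystack);
    utf8::encode($needle);

    # Plain byte search. For well-formed UTF-8 a match always starts
    # on a character boundary, because a needle starts with a lead
    # byte and lead bytes never look like continuation bytes.
    return index($haystack, $needle);
}

# Haystack in latin1 representation, needle upgraded to UTF-8:
# exactly the mixed case that forces extra work today.
my $haystack = "na\xEFve caf\xE9";
my $needle   = "caf\xE9";
utf8::upgrade($needle);

my $byte_pos = byte_index($haystack, $needle);   # offset in bytes
my $char_pos = index($haystack, $needle);        # offset in characters

print "byte offset $byte_pos, character offset $char_pos\n";
```

The two offsets differ as soon as a character above 0x7F precedes the
match, and anything character-aware ("." , character classes, case
folding) would see bytes instead of characters. That is the havoc
referred to above.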
> > my branch:
> 
> Unfortunately, you use a similar thing by default. If I understand
> correctly, your branch does UTF8, not Unicode. This is a bit like PHP's
> mb_ functions. Real Perl does Unicode, while internally encoding it as
> UTF8.

I do Unicode, using UTF-8 internally. The only thing is that (for now) I
have turned it off by default. But the script used C<use utf8>, which turns
on Unicode, so it does real Unicode matching. (All of this is about my branch.)


Gerard Goossen

