On Thu, Feb 08, 2007 at 04:57:20PM +0100, Juerd Waalboer wrote:
> Gerard Goossen skribis 2007-02-08 16:49 (+0100):
> > use strict;
>
> Whoops! You got me there!
>
> > If you fix your script you will see that the unicode matching is a lot slower.
>
> Then my point still stands.
>
> > But it is a lot slower not because the matching is in unicode, but because Perl 5
> > has to do a lot to make sure all strings are unicode, for example E probably
> > has to be upgraded to latin1.
>
> There is no upgrading to latin1. AFAIK, Perl never downgrades
> automatically. Can anyone confirm or negate this?

Sorry, I meant upgraded to utf8.
Perl does sometimes downgrade automatically, for example in 'crypt'. But I
don't think the regex engine does it.

> > If you turn off mixing latin1 and unicode matching, things get a _lot_
> > simpler and you can do better optimizations.
>
> Which was part of my proposal: upgrade both the string and the pattern
> to UTF8 (if necessary), and then do naive byte matching. This should be
> explicitly enabled, because it causes havoc if you're not aware of the
> internals. Optimizations like this are very nice to have, but should
> only be used in extreme cases. Any use of such an optimization (unless
> it can safely be done automatically) is probably premature.
>
> > my branch:
>
> Unfortunately, you use a similar thing by default. If I understand
> correctly, your branch does UTF8, not Unicode. This is a bit like PHP's
> mb_ functions. Real Perl does Unicode, while internally encoding it as
> UTF8.

I do Unicode, internally by using UTF-8. The only thing is that (for now) I
have turned it off by default. But the script used C<use utf8>, which turns on
Unicode, so it does real Unicode matching. (all about my branch).

Gerard Goossen