
Re: unicode regex performance (was Re: Future Perl development)

Juerd Waalboer
February 8, 2007 05:46
Gerard Goossen wrote on 2007-02-08 14:22 (+0100):
> > You'd look for \x{20ac}, you'd be looking for \xe2\x82\xac.
> Do NOT use \xe2\x82\xac to create bytes. Use pack (or
> \x[e2]\x[82]\x[ac]) to create bytes.

As if all previous discussion never happened. Tiresome.

I've also said to avoid \x for creating bytes. Later I learned that it
is a safe way to create bytes, as long as you're not under "use

\xff HAS TO BE a safe way to create a single byte, because otherwise it
would not be backwards compatible with a decade of pre-existing code.

Of course, these bytes are upgraded to UTF8 (**INTERNALLY**!!) if you
use them with strings that are also in UTF8 (again, internally).
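
A minimal sketch of that behaviour, assuming a plain perl with no
encoding-related pragmas in effect. The variable names are only for
illustration; utf8::is_utf8 reports the internal flag and is used here
purely to make the upgrade visible.

```perl
use strict;
use warnings;

# "\xff" on its own is a one-character string holding the byte 0xFF.
my $byte = "\xff";
my $flag_before = utf8::is_utf8($byte) ? 1 : 0;   # internal flag on the lone byte

# Concatenating with a string that contains U+20AC forces the result
# to be stored as UTF8 internally.
my $text  = "\x{20ac}";
my $mixed = $byte . $text;
my $flag_after = utf8::is_utf8($mixed) ? 1 : 0;   # internal flag on the result

# What the program sees: character count and the first character's value.
my $len   = length $mixed;
my $first = ord substr($mixed, 0, 1);

printf "flag before: %d, flag after: %d, length: %d, first: 0x%02x\n",
    $flag_before, $flag_after, $len, $first;
```

The internal representation changes, but the first character is still
0xFF as far as the program can tell.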

That's perfectly okay, because one cannot mix byte strings like
"\xe2\x82\xac" with text strings like "3,00: goedkóóp!", in any
meaningful way, because "3,00: goedkóóp" makes no sense in the context
of bytes, if you do not encode it.
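
To illustrate the encode step, here is a small sketch using the Encode
module. It assumes UTF-8 is the wanted encoding; the variable names are
made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text  = "\x{20ac}";               # text string: one character, U+20AC
my $bytes = encode("UTF-8", $text);   # byte string: the UTF-8 encoding of it

# The encoded form is exactly the byte sequence written with \x escapes.
my $same_bytes = $bytes eq "\xe2\x82\xac" ? 1 : 0;

my $text_len  = length $text;         # counts characters
my $bytes_len = length $bytes;        # counts bytes

# Decoding turns the byte string back into the original text string.
my $round_trip = decode("UTF-8", $bytes) eq $text ? 1 : 0;

print "same bytes: $same_bytes, chars: $text_len, bytes: $bytes_len, ",
      "round trip: $round_trip\n";
```

Once the text is encoded, both sides are byte strings and combining or
comparing them is meaningful.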

\x[] does not exist in Real Perl, mind you!

> But looking for this byte sequence is already what the current regex
> engine does:

I don't know exactly how your benchmark produces those results. Maybe
because the single match is at position 0, maybe because you "use
bytes", maybe for some other reason.

But I'll just show you one of the benchmarks that I did before:

    juerd@lanova:~$ perl -MBenchmark=cmpthese -MEncode -e'
        my $unicode = "f\x{20ac}oo"; 
        Encode::_utf8_off(my $utf8 = $unicode); 
        my $re_unicode = qr/\x{20ac}/; 
        my $re_utf8 = qr/\xe2\x82\xac/; 
        cmpthese -1, { 
            unicode => sub { (my $dummy = $unicode) =~ s/$re_unicode/E/; },
            utf8    => sub { (my $dummy = $utf8) =~ s/$re_utf8/E/; }
        };'
                Rate unicode    utf8
    unicode 314139/s      --    -27%
    utf8    428740/s     36%      --


    unicode: Unicode string ("text string", "character string")
    utf8: The same unicode string, encoded to utf8 (by the ugly means of
    removing the UTF8 flag from the aforementioned unicode string). It
    is now a byte string.
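
For comparison, a sketch of how the same byte string could be obtained
without touching the flag directly, assuming Encode::encode_utf8 is
acceptable in place of _utf8_off. The names $flag_off and $encoded are
only illustrative.

```perl
use strict;
use warnings;
use Encode ();

my $unicode = "f\x{20ac}oo";

# The approach from the benchmark: copy the string, then clear the
# internal UTF8 flag on the copy.
Encode::_utf8_off(my $flag_off = $unicode);

# The documented route: ask Encode for the UTF-8 encoded bytes.
my $encoded = Encode::encode_utf8($unicode);

my $identical = $flag_off eq $encoded ? 1 : 0;
my $chars     = length $unicode;   # characters in the text string
my $bytes     = length $encoded;   # bytes in the encoded string

print "identical: $identical, chars: $chars, bytes: $bytes\n";
```

Both routes end up with the same six bytes here; the second one does not
depend on how perl happens to store the string internally.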

By the way, when you say "current perl", do you refer to stable, blead,
or your own branch? I'm currently using 5.8.8.

Kind regards,

  juerd waalboer:  perl hacker
  convolution:     ict solutions and consultancy

I do not trust voting computers.