Re: unicode regex performance (was Re: Future Perl development)

Juerd Waalboer
February 8, 2007 04:51
demerphq wrote 2007-02-08 13:38 (+0100):
> Not sure. Since pos would only be updated on a successful match, and
> both the pattern and string would be in utf8 doesn't it mean that the
> pos would be set to a valid utf8 sequence start?

The trouble is that pos() normally returns a *character* index, not a
*byte* index.

    juerd@lanova:~$ perl -le'use utf8; ($a = "f€oo") =~ /o/g; print pos($a)'

If $a is the 4-character Unicode string, pos must be 3. But if you
ignore that it's UTF-8 internally and count bytes, the result will be 5,
or it has to be recalculated after the successful match (which is a
performance hit again).

    juerd@lanova:~$ perl -le'($a = "f€oo") =~ /o/g; print pos($a)'
    # $a is a 6 byte string, no euro sign there, but three bytes
    # representing it in UTF8.

Note that I say all this without knowledge of the actual internals.
Perhaps the actual internal pos value is already in bytes only, and
recalculated to characters only when needed.

Kind regards,

  juerd waalboer:  perl hacker
  convolution:     ict solutions and consultancy

I do not trust voting computers.