
Re: unicode regex performance (was Re: Future Perl development)

From: Juerd Waalboer
Date: February 8, 2007 04:51
Subject: Re: unicode regex performance (was Re: Future Perl development)
Message ID: 20070208125159.GX25362@c4.convolution.nl

demerphq wrote 2007-02-08 13:38 (+0100):
> Not sure. Since pos would only be updated on a successful match, and
> both the pattern and string would be in utf8, doesn't it mean that
> pos would be set to a valid utf8 sequence start?

The trouble is that pos() normally returns a *character* index, not a
*byte* index.

    juerd@lanova:~$ perl -le'use utf8; ($a = "f€oo") =~ /o/g; print pos($a)'
    3

If $a is the 4-character Unicode string, pos must be 3. But if you
ignore that it's UTF-8 internally, pos will be 5 (a byte offset), or it
has to be recalculated into a character offset after the successful
match (which is a performance hit again).

    juerd@lanova:~$ perl -le'($a = "f€oo") =~ /o/g; print pos($a)'
    5
    # $a is a 6 byte string, no euro sign there, but three bytes
    # representing it in UTF8.
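
The two one-liners can be put side by side in a small script. This is
only a sketch: it uses Encode to build the byte string explicitly, and
the variable names are mine. The last step shows the recalculation I
mean: turning a byte offset into a character offset requires decoding
(or at least scanning) everything before it.

```perl
use strict;
use warnings;
use utf8;                          # string literals below are UTF-8 source text
use Encode qw(encode decode);

my $chars = "f€oo";                        # 4 characters
my $bytes = encode('UTF-8', $chars);       # 6 bytes: f, 3 bytes for the euro sign, o, o

$chars =~ /o/g;                            # match the first "o" in each string
$bytes =~ /o/g;

my $char_pos = pos($chars);                # character index after the match
my $byte_pos = pos($bytes);                # byte index after the match

# Recalculate the character index from the byte index by decoding the
# prefix that precedes it. The cost grows with the length of the prefix.
my $prefix       = substr($bytes, 0, $byte_pos);
my $recalculated = length decode('UTF-8', $prefix);

printf "char pos: %d, byte pos: %d, recalculated: %d\n",
    $char_pos, $byte_pos, $recalculated;
```

Only numbers are printed, so no output layer is needed for the euro sign.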

Note that I say all this without knowledge of the actual internals.
Perhaps the actual internal pos value is already in bytes only, and
recalculated to characters only when needed.
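
What can be checked from the outside is that pos() reports the same
character index whether or not the string happens to be stored as UTF-8
internally. A sketch, assuming utf8::upgrade and bytes::length are
acceptable tools for peeking at the storage; it says nothing about how
perl keeps the value internally.

```perl
use strict;
use warnings;
use bytes ();                      # makes bytes::length available, no byte semantics

my $narrow = "f\xE9oo";            # 4 characters, e-acute stored as a single byte
my $wide   = $narrow;
utf8::upgrade($wide);              # same 4 characters, now stored as UTF-8 internally

$narrow =~ /o/g;                   # match the first "o" in each copy
$wide   =~ /o/g;

my $narrow_pos = pos($narrow);     # character index in the byte-stored copy
my $wide_pos   = pos($wide);       # character index in the UTF-8-stored copy

printf "pos: %d vs %d, internal bytes: %d vs %d\n",
    $narrow_pos, $wide_pos,
    bytes::length($narrow), bytes::length($wide);
```

Both positions come out equal even though the internal byte lengths
differ, so any byte-based bookkeeping would have to be hidden behind pos().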
-- 
kind regards,

  juerd waalboer:  perl hacker  <juerd@juerd.nl>  <http://juerd.nl/sig>
  convolution:     ict solutions and consultancy <sales@convolution.nl>

I do not trust voting computers.
See <http://www.wijvertrouwenstemcomputersniet.nl/>.


