Re: unicode regex performance (was Re: Future Perl development)

From: Gerard Goossen
Date: February 10, 2007 05:01
Subject: Re: unicode regex performance (was Re: Future Perl development)
Message ID: 20070210130409.GE1868@ostwald

On Thu, Feb 08, 2007 at 01:51:59PM +0100, Juerd Waalboer wrote:
> demerphq skribis 2007-02-08 13:38 (+0100):
> > Not sure. Since pos would only be updated on a successful match, and
> > both the pattern and string would be in utf8 doesn't it mean that the
> > pos would be set to a valid utf8 sequence start?
> 
> The trouble is that pos() normally returns a *character* index, not a
> *byte* index.
> 
>     juerd@lanova:~$ perl -le'use utf8; ($a = "f€oo") =~ /o/g; print pos($a)'
>     3
> 
> If $a is the 4 character unicode string, pos must be 3. But if you
> ignore that it's UTF8 internally, it will be 5, or has to be
> re-calculated after the successful match (which is a performance hit again).
> 
>     juerd@lanova:~$ perl -le'($a = "f€oo") =~ /o/g; print pos($a)'
>     5
>     # $a is a 6 byte string, no euro sign there, but three bytes
>     # representing it in UTF8.
> 
> Note that I say all this without knowledge of the actual internals.
> Perhaps the actual internal pos value is already in bytes only, and
> recalculated to characters only when needed.

It is. From pp_pos:
                if (DO_UTF8(sv))
                    sv_pos_b2u(sv, &i);

Also, byte-to-codepoint conversions are cached, so the codepoint offset
might not need to be recalculated at all, or only for a very small part
of the string.
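
To illustrate the idea (this is only a rough sketch, not perl's actual
sv_pos_b2u code): the names offset_cache, count_chars and
byte_to_char_offset below are made up for the example. It assumes valid
UTF-8 and that every offset passed in falls on a character boundary. The
one-entry cache means that a later conversion only has to scan the bytes
between the cached offset and the new one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical one-entry cache: the last byte offset converted and the
 * character offset it mapped to. Initialise it to {0, 0}. */
typedef struct {
    size_t byte_off;
    size_t char_off;
} offset_cache;

/* Count the UTF-8 characters that start in s[from, to). Every byte that
 * is not a continuation byte (10xxxxxx) starts a character. */
static size_t count_chars(const unsigned char *s, size_t from, size_t to)
{
    size_t n = 0;
    for (size_t i = from; i < to; i++) {
        if ((s[i] & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Convert a byte offset into a character offset. If the cached offset
 * lies at or before byte_off, only the bytes after it are scanned;
 * otherwise the count restarts from the beginning of the string. */
static size_t byte_to_char_offset(const unsigned char *s, size_t byte_off,
                                  offset_cache *cache)
{
    size_t chars;

    assert(s != NULL && cache != NULL);

    if (cache->byte_off <= byte_off)
        chars = cache->char_off + count_chars(s, cache->byte_off, byte_off);
    else
        chars = count_chars(s, 0, byte_off);

    /* Remember this result for the next call. */
    cache->byte_off = byte_off;
    cache->char_off = chars;
    return chars;
}
```

For the 6 byte string from Juerd's example ("f", the three bytes of the
euro sign, "o", "o"), a byte offset of 5 converts to a character offset
of 3, and a following conversion of byte offset 6 scans only one more
byte.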


Gerard Goossen