On Thu, Feb 08, 2007 at 01:51:59PM +0100, Juerd Waalboer wrote:
> demerphq skribis 2007-02-08 13:38 (+0100):
> > Not sure. Since pos would only be updated on a successful match, and
> > both the pattern and string would be in utf8 doesn't it mean that the
> > pos would be set to a valid utf8 sequence start?
>
> The trouble is that pos() normally returns a *character* index, not a
> *byte* index.
>
>   juerd@lanova:~$ perl -le'use utf8; ($a = "f€oo") =~ /o/g; print pos($a)'
>   3
>
> If $a is the 4 character unicode string, pos must be 3. But if you
> ignore that it's UTF8 internally, it will be 5, or has to be
> re-calculated after the successful match (which is a performance hit again).
>
>   juerd@lanova:~$ perl -le'($a = "f€oo") =~ /o/g; print pos($a)'
>   5
>   # $a is a 6 byte string, no euro sign there, but three bytes
>   # representing it in UTF8.
>
> Note that I say all this without knowledge of the actual internals.
> Perhaps the actual internal pos value is already in bytes only, and
> recalculated to characters only when needed.

It is. From pp_pos:

    if (DO_UTF8(sv))
        sv_pos_b2u(sv, &i);

Also, byte-to-codepoint conversions are cached, so the codepoint offset
might not need to be recalculated at all, or only for a very small part of
the string.

Gerard Goossen
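
To illustrate the byte-to-character conversion and the caching mentioned above, here is a minimal sketch in C. It is not Perl's actual sv_pos_b2u() or its cache. The names utf8_count_chars, pos_cache and byte_to_char_cached are hypothetical, and the code assumes the buffer holds valid UTF-8, that offsets fall on character boundaries, and that one cache is only ever used with one unchanged buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of Perl: counts the UTF-8 characters that
 * start in buf[from .. to-1]. Assumes valid UTF-8 and that both offsets
 * fall on character boundaries. */
static size_t utf8_count_chars(const unsigned char *buf, size_t from, size_t to)
{
    size_t chars = 0;
    for (size_t i = from; i < to; i++) {
        /* Continuation bytes look like 10xxxxxx; any other byte starts a character. */
        if ((buf[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}

/* Hypothetical cache: remembers the last converted (byte, character) pair. */
struct pos_cache {
    size_t byte_off;
    size_t char_off;
};

/* Converts a byte offset to a character offset. If the cached position is at
 * or before byte_off, only the bytes between the two are scanned; otherwise
 * the scan restarts from the beginning of the buffer. */
static size_t byte_to_char_cached(const unsigned char *buf, size_t byte_off,
                                  struct pos_cache *cache)
{
    assert(buf != NULL && cache != NULL);

    size_t start = 0;
    size_t chars = 0;
    if (cache->byte_off <= byte_off) {
        start = cache->byte_off;
        chars = cache->char_off;
    }
    chars += utf8_count_chars(buf, start, byte_off);

    cache->byte_off = byte_off;
    cache->char_off = chars;
    return chars;
}
```

For the 6-byte encoding of "f€oo" (0x66, 0xE2 0x82 0xAC, 0x6F, 0x6F) and a cache initialised to {0, 0}, byte offset 5 converts to character offset 3, matching the two pos() results quoted above. A following conversion of byte offset 6 then scans only the final byte.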