demerphq wrote 2007-02-08 13:38 (+0100):
> Not sure. Since pos would only be updated on a successful match, and
> both the pattern and string would be in utf8 doesn't it mean that the
> pos would be set to a valid utf8 sequence start?

The trouble is that pos() normally returns a *character* index, not a
*byte* index.

juerd@lanova:~$ perl -le'use utf8; ($a = "f€oo") =~ /o/g; print pos($a)'
3

If $a is the 4-character Unicode string, pos must be 3. But if you ignore
that it's UTF-8 internally, pos will be 5, or it has to be recalculated
after the successful match (which is a performance hit again).

juerd@lanova:~$ perl -le'($a = "f€oo") =~ /o/g; print pos($a)'
5
# $a is a 6-byte string, no euro sign there, but three bytes
# representing it in UTF-8.

Note that I say all this without knowledge of the actual internals.
Perhaps the actual internal pos value is already in bytes only, and
recalculated to characters only when needed.
-- 
kind regards,

juerd waalboer:  perl hacker  <juerd@juerd.nl>  <http://juerd.nl/sig>
convolution:     ict solutions and consultancy <sales@convolution.nl>

I don't trust voting computers. See <http://www.wijvertrouwenstemcomputersniet.nl/>.
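
For anyone who wants to reproduce the two one-liners above without depending on the terminal or source encoding, here is a minimal sketch. The helper names pos_after_first_o and byte_to_char_offset are made up for illustration and are not part of Perl or of the regex engine. The sketch shows only what is visible from Perl space; it says nothing about how pos is stored internally.

```perl
use strict;
use warnings;

# Hypothetical helper: match /o/g once and return pos(),
# i.e. the offset just past the first "o".
sub pos_after_first_o {
    my ($str) = @_;
    $str =~ /o/g;    # a successful //g match sets pos($str)
    return pos($str);
}

# Hypothetical helper: turn a byte offset into a character offset by
# decoding the prefix. Assumes the offset falls on a UTF-8 sequence
# boundary. The whole prefix has to be scanned, which is the
# recalculation cost mentioned above.
sub byte_to_char_offset {
    my ($bytes, $byte_offset) = @_;
    my $prefix = substr($bytes, 0, $byte_offset);
    utf8::decode($prefix);
    return length($prefix);
}

# Character string: "f", U+20AC EURO SIGN, "o", "o" (4 characters).
my $chars = "f\x{20ac}oo";

# The same text as UTF-8 encoded bytes (1 + 3 + 1 + 1 = 6 bytes).
my $bytes = $chars;
utf8::encode($bytes);

printf "characters: length=%d pos=%d\n",
    length($chars), pos_after_first_o($chars);    # length=4 pos=3
printf "bytes:      length=%d pos=%d\n",
    length($bytes), pos_after_first_o($bytes);    # length=6 pos=5
printf "byte offset 5 is character offset %d\n",
    byte_to_char_offset($bytes, 5);               # 3
```

The escape \x{20ac} is used instead of a literal euro sign so that the result does not depend on "use utf8" or on how the file is saved.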