develooper Front page | perl.perl5.porters | Postings from March 2011

Re: [perl #85034] String match position on utf8 upgrade

Aristotle Pagaltzis
March 1, 2011 21:20
* Ton Hospel <> [2011-02-28 23:50]:
> In article <>,
> 	demerphq <> writes:
> >> [Please describe your issue here]
> >>
> >> perl -wle '$_="\xce" x 20; pos($_) = 12; utf8::upgrade($_); print pos $_'
> >> 6
> >>
> >> This is because the PERL_MAGIC_regex_global value is in
> >> bytes even if the string is internally UTF8. If the string
> >> gets upgraded this position ought to be recalculated
> >
> > Or should it be treated as a character count?
> >
> byte count is probably more practical so that you immediately
> know where to continue matching even if you lose or don't have
> the utf8 offset cache. No utf8 offset cache seems to be pretty
> normal if you get PERL_MAGIC_regex_global due to a //g match
> instead of explicitly setting pos():
> perl -wle 'use Devel::Peek; $_=join("\xce", "a" .. "z"); utf8::upgrade($_); /q/g; Dump($_)'

Ideally there would be a byte offset stored internally but the
`pos` function would return and expect a character offset. (The
user should never be exposed to the underlying implementation.)
That means recalculating the byte offset when up- or downgrading
a string (which is almost zero extra cost since you have to scan
it anyway) and doing a char→byte conversion when the user sets it
using `pos`.
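
A rough sketch of the two conversions, written in plain Perl rather
than in the core's C. The helper names char_to_byte_offset and
byte_to_char_offset are made up for illustration, and this is not how
perl actually stores or updates the pos magic; it only shows the
arithmetic, assuming the string ends up UTF-8 encoded internally.

```perl
use strict;
use warnings;

# Hypothetical helper: given a character offset into $str, return the
# matching byte offset in the UTF-8 encoding of $str.
sub char_to_byte_offset {
    my ($str, $chars) = @_;
    my $prefix = substr($str, 0, $chars);
    utf8::encode($prefix);    # $prefix now holds the UTF-8 octets
    return length $prefix;
}

# Hypothetical helper: given a byte offset into the UTF-8 encoding of
# $str, return the matching character offset.
sub byte_to_char_offset {
    my ($str, $bytes) = @_;
    my $octets = $str;
    utf8::encode($octets);
    my $prefix = substr($octets, 0, $bytes);
    # Each character contributes exactly one byte that is not a
    # continuation byte (0x80..0xBF), so count those.
    my $count = () = $prefix =~ /[^\x80-\xBF]/g;
    return $count;
}

my $s = "\xce" x 20;
print char_to_byte_offset($s, 12), "\n";    # 24
print byte_to_char_offset($s, 12), "\n";    # 6
```

The second line of output is the 6 from the original report: a stored
offset of 12 that is left untouched by the upgrade gets read back as
12 bytes of UTF-8, which is only 6 characters. Recalculating on
upgrade would store 24 instead, and `pos` would keep returning 12.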

Aristotle Pagaltzis // <>
