
Re: [perl #85034] String match position on utf8 upgrade

From: Aristotle Pagaltzis
Date: March 1, 2011 21:20
Subject: Re: [perl #85034] String match position on utf8 upgrade
Message ID: 20110302052031.GV19109@klangraum.plasmasturm.org

* Ton Hospel <perl5-porters@ton.iguana.be> [2011-02-28 23:50]:
> In article <AANLkTimREgEmsxbGN5_Gt+pJk9svLu1u972VQ0JxBZKR@mail.gmail.com>,
> 	demerphq <demerphq@gmail.com> writes:
> >> [Please describe your issue here]
> >>
> >> perl -wle '$_="\xce" x 20; pos($_) = 12; utf8::upgrade($_); print pos $_'
> >> 6
> >>
> >> This is because the PERL_MAGIC_regex_global value is in
> >> bytes even if the string is internally UTF8. If the string
> >> gets upgraded, this position ought to be recalculated.
> >
> > Or should it be treated as a character count?
> >
> Byte count is probably more practical so that you immediately
> know where to continue matching even if you lose or don't have
> the utf8 offset cache. No utf8 offset cache seems to be pretty
> normal if you get PERL_MAGIC_regex_global due to a //g match
> instead of explicitly setting pos():
>
> perl -wle 'use Devel::Peek; $_=join("\xce", "a" .. "z"); utf8::upgrade($_); /q/g; Dump($_)'

Ideally, a byte offset would be stored internally, but the `pos`
function would return and expect a character offset. (The user
should never be exposed to the underlying implementation.) That
means recalculating the byte offset when upgrading or downgrading
a string (which costs almost nothing extra, since the string has
to be scanned anyway) and doing a char→byte conversion when the
user sets the position using `pos`.
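
To make the two conversions concrete, here is a rough pure-Perl
sketch. The helper names `char_to_byte_offset` and
`byte_to_char_offset` are made up for illustration; the real fix
would live in the C internals, and this only models the arithmetic
on the string from the bug report.

```perl
use strict;
use warnings;

# Hypothetical helper: given a character offset into $str, return the
# matching byte offset in the UTF-8 encoding of $str.
sub char_to_byte_offset {
    my ($str, $char_pos) = @_;
    my $prefix = substr($str, 0, $char_pos);
    utf8::encode($prefix);    # $prefix now holds UTF-8 octets
    return length $prefix;    # number of octets before the offset
}

# Hypothetical helper: given a byte offset into the UTF-8 encoding of
# $str, return the matching character offset. Dies if the byte offset
# falls in the middle of a multi-byte character.
sub byte_to_char_offset {
    my ($str, $byte_pos) = @_;
    my $octets = $str;
    utf8::encode($octets);
    my $prefix = substr($octets, 0, $byte_pos);
    utf8::decode($prefix)
        or die "byte offset $byte_pos is not on a character boundary\n";
    return length $prefix;
}

my $s = "\xce" x 20;                       # 20 characters, 2 bytes each in UTF-8
print char_to_byte_offset($s, 12), "\n";   # 24
print byte_to_char_offset($s, 12), "\n";   # 6
```

The second result matches the 6 in the report: a stored value of 12
that is left unchanged across the upgrade gets read as a byte offset
into the UTF-8 buffer, which is only 6 characters in.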

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>
