perl.perl5.porters

Re: Does the range operator still have the Unicode Bug?

Karl Williamson
December 14, 2016 18:21
On 11/20/2016 11:03 AM, Karl Williamson wrote:
> On 11/20/2016 08:20 AM, Aaron Crane wrote:

I think the patch should be committed.

>> Sawyer X wrote:
>>> On 10/30/2016 07:10 PM, Aristotle Pagaltzis wrote:
>>>> I would prefer to see this just fixed, for everyone, with cleaner code.
>>>> And it’s very *likely* that that can be done… just not *known*. A cycle
>>>> or two with warnings would give us data to calibrate the guess.
>>> Again, I'm not necessarily against that. I'm trying to add more
>>> considerations here. Perhaps the feature is the right place for it,
>>> using "unicode_strings".
>> On the assumption that a concrete change is easier to reason about
>> than the abstract situation, I attach a proposed patch for the Unicode
>> Bug in the range operator.
>> The patch itself is fairly straightforward; its guts look like this:
>> --- a/pp_ctl.c
>> +++ b/pp_ctl.c
>> @@ -1222,6 +1222,8 @@ PP(pp_flop)
>>             const char * const tmps = SvPV_nomg_const(right, len);
>>             SV *sv = newSVpvn_flags(lpv, llen, SvUTF8(left)|SVs_TEMP);
>> +            if (DO_UTF8(right) && IN_UNI_8_BIT)
>> +                len = sv_len_utf8_nomg(right);
>>             while (!SvNIOKp(sv) && SvCUR(sv) <= len) {
>>                 XPUSHs(sv);
>>                 if (strEQ(SvPVX_const(sv),tmps))
>> (Except twice, because "foreach ($x .. $y)" has an independent
>> implementation that takes constant memory.)
>> That is, this change makes stringy $x..$y honour the unicode_strings
>> feature, without any warning.
>> FWIW, my own view is that this change is simply a bugfix for ranges
>> under the unicode_strings feature, and that the current behaviour is
>> so bizarre and unpredictable that no warning is necessary. (Or even
>> entirely useful, since we can't distinguish between code that wants
>> the current behaviour (but neglected to utf8::decode the RHS) and code
>> that's been updated to take advantage of the new behaviour.)
> As I believe has been pointed out before, use of that feature
> implies that the user wants proper handling of Unicode strings.  That is
> why, in earlier releases, it was extended to cover more things, like
> quotemeta, as they were unearthed, instead of creating extra features.
> Thus, treating this as a bug fix follows the existing paradigm.
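
For anyone following along, here is a rough pure-Perl model of the loop
the patch touches. It is only a sketch under stated assumptions:
model_range and its $count_chars flag are hypothetical names made up for
illustration, it covers only the string case (not the numeric check that
pp_flop performs first), and it compares with eq where the C code uses
strEQ. The flag stands in for the DO_UTF8(right) && IN_UNI_8_BIT
condition in the patch.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of perl: models the string branch of
# pp_flop. With $count_chars false, the upper bound is the byte length
# of the right operand's internal representation (the old behaviour).
# With it true, the bound is the character length (the patched
# behaviour under unicode_strings).
sub model_range {
    my ($left, $right, $count_chars) = @_;

    my $len;
    if ($count_chars || !utf8::is_utf8($right)) {
        $len = length $right;        # length in characters
    }
    else {
        my $bytes = $right;
        utf8::encode($bytes);        # expose the internal UTF-8 bytes
        $len = length $bytes;        # length in bytes
    }

    my @out;
    my $sv = $left;
    while (length($sv) <= $len) {
        push @out, $sv;
        last if $sv eq $right;
        $sv++;                       # magical string increment
    }
    return @out;
}

# "\x{e9}" is one character, but two bytes once upgraded to UTF-8.
my $end = "\x{e9}";
utf8::upgrade($end);

my @old = model_range('a', $end, 0);
my @new = model_range('a', $end, 1);

print scalar(@old), "\n";    # 702: 'a' .. 'zz'
print scalar(@new), "\n";    # 26:  'a' .. 'z'
```

The same right-hand character gives 26 or 702 elements depending only on
whether the string happens to be stored as UTF-8 internally, which is the
surprise the patch removes when unicode_strings is in effect.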
