perl.perl5.porters | Postings from November 2016

Re: Does the range operator still have the Unicode Bug?

From: Karl Williamson
Date: November 20, 2016 18:04
Subject: Re: Does the range operator still have the Unicode Bug?
Message ID: 4e5dabae-8021-16c8-2297-af30c7d640a1@khwilliamson.com

On 11/20/2016 08:20 AM, Aaron Crane wrote:
> Sawyer X <xsawyerx@gmail.com> wrote:
>> On 10/30/2016 07:10 PM, Aristotle Pagaltzis wrote:
>>> I would prefer to see this just fixed, for everyone, with cleaner code.
>>> And it’s very *likely* that that can be done… just not *known*. A cycle
>>> or two with warnings would give us data to calibrate the guess.
>>
>> Again, I'm not necessarily against that. I'm trying to add more
>> considerations here. Perhaps the feature is the right place for it,
>> using "unicode_strings".
>
> On the assumption that a concrete change is easier to reason about
> than the abstract situation, I attach a proposed patch for the Unicode
> Bug in the range operator.
>
> The patch itself is fairly straightforward; its guts look like this:
>
> --- a/pp_ctl.c
> +++ b/pp_ctl.c
> @@ -1222,6 +1222,8 @@ PP(pp_flop)
>             const char * const tmps = SvPV_nomg_const(right, len);
>
>             SV *sv = newSVpvn_flags(lpv, llen, SvUTF8(left)|SVs_TEMP);
> +            if (DO_UTF8(right) && IN_UNI_8_BIT)
> +                len = sv_len_utf8_nomg(right);
>             while (!SvNIOKp(sv) && SvCUR(sv) <= len) {
>                 XPUSHs(sv);
>                 if (strEQ(SvPVX_const(sv),tmps))
>
> (Except twice, because "foreach ($x .. $y)" has an independent
> implementation that takes constant memory.)
>
> That is, this change makes stringy $x..$y honour the unicode_strings
> feature, without any warning.
>
> FWIW, my own view is that this change is simply a bugfix for ranges
> under the unicode_strings feature, and that the current behaviour is
> so bizarre and unpredictable that no warning is necessary. (Or even
> entirely useful, since we can't distinguish between code that wants
> the current behaviour (but neglected to utf8::decode the RHS) and code
> that's been updated to take advantage of the new behaviour.)
>

As I believe has been pointed out before, the use of that feature
implies that the user wants proper handling of Unicode strings.  That is
why, in earlier releases, it was enhanced to cover more things, like
quotemeta, as they were unearthed, instead of creating extra features.
Thus, treating this as a bug fix follows the paradigm we have already
established.
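
To make the effect of the quoted patch concrete, here is a minimal
standalone C sketch.  It is not perl's code: utf8_char_count() and
range_items_from_a() are hypothetical helpers invented for illustration,
standing in for the byte length that the loop in pp_flop currently
compares against and the character length that sv_len_utf8_nomg()
supplies under the patch.  It assumes the range starts at "a" and that
the right-hand value is never produced by the magical increment, so the
loop ends only when the current string grows longer than the limit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count the code points in a NUL-terminated UTF-8
 * string by skipping continuation bytes (bit pattern 10xxxxxx).
 * Assumes the input is valid UTF-8. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Hypothetical helper: number of strings generated when starting at "a"
 * and stopping once the current string is longer than `limit`, under the
 * assumption that the end value is never matched:
 * 26 + 26^2 + ... + 26^limit. */
unsigned long range_items_from_a(size_t limit)
{
    unsigned long total = 0;
    unsigned long width = 1;
    size_t i;

    for (i = 0; i < limit; i++) {
        width *= 26;
        total += width;
    }
    return total;
}
```

For a right-hand side of two "é" characters stored as UTF-8
("\xC3\xA9\xC3\xA9"), strlen() gives 4 while utf8_char_count() gives 2.
In this model, using the byte length as the limit yields
range_items_from_a(4) == 475254 items, whereas the character length
yields range_items_from_a(2) == 702, which is the difference the patch
is meant to remove under unicode_strings.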

