develooper Front page | perl.perl5.porters | Postings from December 2014

Re: What should tainting behavior be? Was [perl #120675] Unexpected tainting via regex using locale

From: Karl Williamson
Date: December 29, 2014 20:54
Subject: Re: What should tainting behavior be? Was [perl #120675] Unexpected tainting via regex using locale
Message ID: 54A1BF60.8090201@khwilliamson.com

On 01/27/2014 11:30 AM, Karl Williamson wrote:
> On 01/25/2014 04:06 PM, Aristotle Pagaltzis wrote:
>> * Father Chrysostomos <sprout@cpan.org> [2014-01-16 15:30]:
>>> Zefram wrote:
>>>> This sounds broken: in your example Latin-7 locale you now have two
>>>> representations of U+3c7 Greek small letter chi (at 0xf7 and 0x3c7)
>>>> and no representations of U+f7 division sign.
>>>
>>> But that is the only sane model that sufficiently preserves backward
>>> compatibility.
>>
>> So far, I agree.
>>
>>>> I think we should have a big notice in the documentation to the
>>>> effect that non-UTF-8 locales and Unicode don't mix.
>>>
>>> Agreed. But we cannot expect people to prevent strings from being
>>> upgraded, since perl does that transparently. (E.g., put your locale
>>> and Unicode string in the same data store, and then extract them.
>>> Flags could flip either way depending on how the data are stored.)
>>
>> No, we cannot expect people to prevent strings from being upgraded. But
>> we *can* expect them not to perform locale-based processing on character
>> strings (rather than octet strings). Now, while we cannot know which
>> strings are octet strings, we do have at least one class of strings that
>> cannot possibly be octet strings: those that contain elements > 0xFF. So
>> when a locale-based regexp sees one of those, something is unambiguously
>> broken.
>>
>>>> Possibly more sensible behaviour for this situation would be that
>>>> 0x0 to 0xff get locale behaviour and 0x100 upwards get null
>>>> behaviour (don't match any properties, case convert to self, as if
>>>> unassigned by Unicode).
>>>
>>> That would be something completely new, which would likely break
>>> existing programs.
>>
>> I agree this isn’t feasible. But maybe we can follow the precedent of
>> other octet-based functions: have locale-based regexps throw a “wide
>> character” warning any time they encounter one.
>
> That seems reasonable to me.
>


Now done in 613abc6d16e99bd9834fe6afd79beb61a3a4734d
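
For readers following the thread, the check discussed above comes down to this: a string containing any element above 0xFF cannot be an octet string, so locale-based processing of it deserves a warning. The following is a minimal sketch of that idea in Python, used purely as an illustration. It is not the code in the commit above, and the helper names `has_wide_char` and `locale_match_guard` are hypothetical.

```python
import warnings


def has_wide_char(s: str) -> bool:
    """Return True if any code point in s is above 0xFF.

    Such a string cannot be a plain octet string.
    """
    return any(ord(ch) > 0xFF for ch in s)


def locale_match_guard(s: str) -> bool:
    """Hypothetical guard to call before locale-based processing of s.

    Emits a warning and returns False when s contains a wide character;
    returns True when every element fits in a single octet.
    """
    if has_wide_char(s):
        warnings.warn("Wide character in locale-based match", RuntimeWarning)
        return False
    return True
```

With Zefram's Latin-7 example, the octet 0xF7 passes the guard and gets locale treatment, while U+03C7 triggers the warning.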

