On 01/27/2014 11:30 AM, Karl Williamson wrote:
> On 01/25/2014 04:06 PM, Aristotle Pagaltzis wrote:
>> * Father Chrysostomos <sprout@cpan.org> [2014-01-16 15:30]:
>>> Zefram wrote:
>>>> This sounds broken: in your example Latin-7 locale you now have two
>>>> representations of U+3c7 Greek small letter chi (at 0xf7 and 0x3c7)
>>>> and no representations of U+f7 division sign.
>>>
>>> But that is the only sane model that sufficiently preserves backward
>>> compatibility.
>>
>> So far, I agree.
>>
>>>> I think we should have a big notice in the documentation to the
>>>> effect that non-UTF-8 locales and Unicode don't mix.
>>>
>>> Agreed. But we cannot expect people to prevent strings from being
>>> upgraded, since perl does that transparently. (E.g., put your locale
>>> and Unicode string in the same data store, and then extract them.
>>> Flags could flip either way depending on how the data are stored.)
>>
>> No, we cannot expect people to prevent strings from being upgraded. But
>> *can* expect them not to perform locale-based processing on character
>> strings (rather than octet strings). Now we cannot know what strings are
>> octet strings, we do have at least one class of strings that cannot
>> possibly be octet strings: those that contain elements > 0xFF. So when
>> a locale-based regexp sees one of those, something is unambiguously
>> broken.
>>
>>>> Possibly more sensible behaviour for this situation would be that
>>>> 0x0 to 0xff get locale behaviour and 0x100 upwards get null
>>>> behaviour (don't match any properties, case convert to self, as if
>>>> unassigned by Unicode).
>>>
>>> That would be something completely new, which would likely break
>>> existing programs.
>>
>> I agree this isn’t feasible. But maybe we can follow the precedent of
>> other octet-based functions: have locale-based regexps throw a “wide
>> character” warning any time they encounter one.
>
> That seems reasonable to me.
>
Now done in 613abc6d16e99bd9834fe6afd79beb61a3a4734d
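
For anyone skimming the thread, the rule agreed on above can be sketched roughly as follows. This is only an illustration in Python, not the code in that commit: the function names and the warning text are made up for the example, and the real check lives inside perl's regexp engine. The idea is simply that a string containing any element above 0xFF cannot be an octet string, so locale-based processing of it should warn.

```python
import warnings


def has_wide_char(s: str) -> bool:
    """Return True if any code point in s is above 0xFF.

    Such a string cannot possibly be an octet string.
    """
    return any(ord(ch) > 0xFF for ch in s)


def warn_if_wide(s: str) -> bool:
    """Hypothetical stand-in for the check a locale-based regexp would make.

    Emits a warning (the message text is an assumption, not perl's actual
    diagnostic) and returns True when s contains a wide character.
    """
    if has_wide_char(s):
        warnings.warn(
            "Wide character in locale-based match",
            RuntimeWarning,
            stacklevel=2,
        )
        return True
    return False


if __name__ == "__main__":
    # 0xF7 fits in one octet, so the locale decides what it means: no warning.
    warn_if_wide("abc\xf7")
    # U+03C7 (Greek small letter chi) is above 0xFF: this one warns.
    warn_if_wide("abc\u03c7")
```

Note that the check says nothing about strings whose elements are all at or below 0xFF; as discussed above, those may or may not be octet strings, and there is no way to tell.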