Re: RFC: Handling utf8 locales
From: Karl Williamson
June 26, 2011 21:03
Message ID: 4E0800F0.firstname.lastname@example.org
On 06/26/2011 04:32 PM, Zefram wrote:
> Karl Williamson wrote:
>> People still use locales to get, e.g., the proper date format,
>> for their location. But they can't currently do that if the locale is
>> utf8 because regex matching and casing don't work well with those.
> Are you saying that LC_TIME et al aspects of locales don't work if the
> locale's character set can't be handled? If so, I support making LC_TIME
> et al work independently of character encoding.
I am essentially saying that, and that is what this proposal was meant
>> in which the parameter indicates that the user promises that they won't
>> use a non-utf8 locale, and so Perl can ignore CTYPE for locale purposes,
>> and in fact that strings can be assumed to be encoded as Unicode
> In this you appear to be restating the fundamental brokenness that I
> initially perceived. If the locale prefers UTF-8 encoding, that means
> we're liable to see UTF-8-encoded text in the input. For example,
> the orange one's name, which includes an e-acute character (U+e9), will
> appear as the string "L\xc3\xa9on". Ideally we'd like the /l modifier in
> this situation to make "\xc3\xa9" match as a single alphabetic character,
> and match /\xc3\x89/ if we also apply the /i modifier. As I understand
> it, you are proposing that /l behave identically to /u, and thus that
> we treat the "L\xc3\xa9on" string as containing an A-tilde (U+c3) and
> copyright symbol (U+a9), which in character classes and case-insensitive
> matching will give a very different effect.
> The only locale charset for which we can ignore locale encoding is
> Latin-1. This is because Latin-1 encoding and decoding, as viewed
> by Perl, is the identity function (modulo encoding range errors).
> UTF-8 encoding and decoding are distinctly non-identity operations.
>> What I was originally proposing would work well with the :locale I/O
> Surely any particular behaviour for regexp character classes will be
> utterly broken by any change in the encoding regime that generates
> the strings on which it operates. /u works naturally with :locale, by
> virtue of having character strings always represented in native Unicode
> form where visible to the program. Any form of locale-encoded-string
> handling, such as /l, however, is fundamentally predicated on *not*
> decoding inputs that are expected to be locale-encoded.
Currently, under locale, the user is warranting that the strings are
correctly encoded in the specified locale. If the real encoding is
Hebrew and we are told that it is Greek, the results are almost certain
to be wrong. My proposal had nothing to do with input/output. "use
locale" as far as documented has no current effect on that. What I was
saying was that under utf8 locales, which are currently documented as
not working, the regex engine and the casing functions would assume that
their strings were properly Unicode-encoded. It's up to the user to get
them that way. The :locale layer would be a convenient way to do this.
So, sure, if the string is in utf8, but the utf8 flag is not set (or
vice versa), the results will be wrong. The proposal is not an
end-to-end solution where suddenly "use locale" takes on more meaning
than it currently does. It is "if you are in a utf8 locale, and you've
arranged things so that the strings are Unicode-encoded, then operations
on them will work correctly", which is not the case currently.
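
To make the distinction concrete, here is a minimal sketch of the example
string from the quoted text, handled both ways. It is written in Python
purely as an analogy: it does not model Perl's /l or /u modifiers or the
utf8 flag, and the variable names are made up for illustration. It only
shows that the same bytes give different character-class and case-insensitive
results depending on whether they were decoded as UTF-8 first.

```python
import re

# Bytes as they would arrive from a UTF-8 locale, not yet decoded.
raw = b"L\xc3\xa9on"

# Identity-style decoding: each byte becomes the code point of the same value.
# This yields U+00C3 (A-tilde) followed by U+00A9 (copyright sign).
as_latin1 = raw.decode("latin-1")

# Proper decoding: the two bytes \xc3\xa9 become the single character U+00E9.
as_utf8 = raw.decode("utf-8")

# A Unicode-aware "all word characters" test and a case-insensitive match
# against the upper-case form, which contains U+00C9 (E-acute).
word = re.compile(r"^\w+$")
upper = re.compile("L\u00c9ON", re.IGNORECASE)

latin1_is_word = bool(word.match(as_latin1))   # False: U+00A9 is a symbol
utf8_is_word = bool(word.match(as_utf8))       # True: all four are letters
latin1_matches_upper = bool(upper.match(as_latin1))  # False
utf8_matches_upper = bool(upper.match(as_utf8))      # True
```

The undecoded string has five characters and fails both tests; the decoded
one has four and passes both. That is the sense in which getting the strings
properly Unicode-encoded is left to the user.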