perl.perl5.porters | Postings from June 2011

Re: RFC: Handling utf8 locales

From: Zefram
Date: June 26, 2011 15:32
Subject: Re: RFC: Handling utf8 locales
Message ID: 20110626223214.GI14337@lake.fysh.org

Karl Williamson wrote:
>         People still use locales to get, e.g., the proper date format,  
>for their location.  But they can't currently do that if the locale is  
>utf8 because regex matching and casing don't work well with those.

Are you saying that LC_TIME et al aspects of locales don't work if the
locale's character set can't be handled?  If so, I support making LC_TIME
et al work independently of character encoding.
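
As a rough sketch of what "independently of character encoding" could
look like from user code, the POSIX module already lets a program set
the LC_TIME category on its own while leaving LC_CTYPE alone.  The "C"
locale is used here only so that the output is predictable; a real
program would name the user's locale instead.

```perl
use strict;
use warnings;
use POSIX qw(setlocale strftime LC_TIME LC_CTYPE);

# Query (do not change) the character-type category.
my $ctype_before = setlocale(LC_CTYPE);

# Change only the time category.  "C" is an illustrative choice.
my $time_locale = setlocale(LC_TIME, 'C');

# LC_CTYPE should be exactly what it was before.
my $ctype_after = setlocale(LC_CTYPE);

# Epoch 0 (1970-01-01 UTC) was a Thursday.
my $weekday = strftime('%A', gmtime(0));

print "LC_TIME is now $time_locale\n";
print "weekday of epoch 0: $weekday\n";
print "LC_CTYPE unchanged\n" if $ctype_before eq $ctype_after;
```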

>in which the parameter indicates that the user promises that they won't  
>use a non-utf8 locale, and so Perl can ignore CTYPE for locale purposes,  
>and in fact that strings can be assumed to be encoded as Unicode  
>characters.

In this you appear to be restating the fundamental brokenness that I
initially perceived.  If the locale prefers UTF-8 encoding, that means
we're liable to see UTF-8-encoded text in the input.  For example,
the orange one's name, which includes an e-acute character (U+e9), will
appear as the string "L\xc3\xa9on".  Ideally we'd like the /l modifier in
this situation to make "\xc3\xa9" match as a single alphabetic character,
and to match /\xc3\x89/ (the UTF-8 encoding of capital E-acute, U+c9) if
we also apply the /i modifier.  As I understand
it, you are proposing that /l behave identically to /u, and thus that
we treat the "L\xc3\xa9on" string as containing an A-tilde (U+c3) and
copyright symbol (U+a9), which in character classes and case-insensitive
matching will give a very different effect.
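
The difference can be seen with a short script.  This is only a sketch
of the behaviour described above, not of any proposed implementation.
It assumes Perl 5.14 or later (for the /u modifier) and uses the Encode
module to stand in for whatever decoding step a program would apply.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Undecoded input as it would arrive in a UTF-8 locale: five bytes.
my $bytes = "L\xc3\xa9on";

# Under /u each byte is read as a Unicode code point, so the string is
# L, A-tilde (U+c3), copyright sign (U+a9), o, n.
my $tilde_is_word  = "\xc3" =~ /\A\w\z/u           ? 1 : 0;  # 1: a letter
my $all_word_bytes = $bytes =~ /\A\w+\z/u          ? 1 : 0;  # 0: U+a9 is not \w
my $ci_bytes       = $bytes =~ /\Al\xc3\x89on\z/iu ? 1 : 0;  # 0: U+89 is not U+a9

# After decoding, the same input is four characters: L, e-acute, o, n.
my $chars          = decode('UTF-8', $bytes);
my $char_count     = length $chars;                           # 4
my $all_word_chars = $chars =~ /\A\w+\z/u         ? 1 : 0;   # 1
my $ci_chars       = $chars =~ /\Al\x{c9}on\z/iu  ? 1 : 0;   # 1: E-acute folds

print "undecoded: all word chars=$all_word_bytes, /i match=$ci_bytes\n";
print "decoded:   all word chars=$all_word_chars, /i match=$ci_chars\n";
```

The two "undecoded" results are what /l would give if it simply behaved
like /u on locale-encoded input.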

The only locale charset for which we can ignore locale encoding is
Latin-1.  This is because Latin-1 encoding and decoding, as viewed
by Perl, is the identity function (modulo encoding range errors).
UTF-8 encoding and decoding are distinctly non-identity operations.
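
A small check of that claim, again using Encode purely for
illustration: decoding all 256 byte values as ISO-8859-1 gives back the
same string, whereas UTF-8 turns a single e-acute into two bytes.  The
FB_CROAK call shows the kind of "encoding range error" meant above.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Every byte value 0x00..0xff as one 256-character string.
my $all_bytes = join '', map { chr } 0 .. 255;

# Latin-1 decoding maps byte N to code point N: the identity.
my $latin1_identity = decode('ISO-8859-1', $all_bytes) eq $all_bytes ? 1 : 0;

my $e_acute    = "\x{e9}";
my $latin1_enc = encode('ISO-8859-1', $e_acute);   # still "\xe9"
my $utf8_enc   = encode('UTF-8', $e_acute);        # "\xc3\xa9"

# A character above U+ff cannot be encoded as Latin-1 at all.
my $smiley = "\x{263a}";
my $out_of_range_ok =
    eval { encode('ISO-8859-1', $smiley, Encode::FB_CROAK()); 1 } ? 1 : 0;

printf "Latin-1 decode is identity: %d\n", $latin1_identity;
printf "Latin-1 length %d, UTF-8 length %d\n",
    length $latin1_enc, length $utf8_enc;
printf "U+263a encodable as Latin-1: %d\n", $out_of_range_ok;
```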

>What I was originally proposing would work well with the :locale I/O  
>layer.

Surely any particular behaviour for regexp character classes will be
utterly broken by any change in the encoding regime that generates
the strings on which it operates.  /u works naturally with :locale, by
virtue of having character strings always represented in native Unicode
form where visible to the program.  Any form of locale-encoded-string
handling, such as /l, however, is fundamentally predicated on *not*
decoding inputs that are expected to be locale-encoded.
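
To make the dependence on the encoding regime concrete, here is a
sketch that reads the same bytes twice from an in-memory file, once raw
and once through a decoding layer, and applies the same /u pattern to
both.  The :encoding(UTF-8) layer is an assumption standing in for what
:locale would select under a UTF-8 locale; it keeps the example
independent of the environment it runs in.

```perl
use strict;
use warnings;

my $raw = "L\xc3\xa9on\n";

# No decoding: the program sees the locale-encoded bytes, which is the
# regime that locale-encoded-string handling such as /l relies on.
open my $fh_raw, '<:raw', \$raw or die "open raw: $!";
my $undecoded = <$fh_raw>;
close $fh_raw;
chomp $undecoded;

# With a decoding layer: the program sees Unicode characters.
open my $fh_dec, '<:encoding(UTF-8)', \$raw or die "open decoded: $!";
my $decoded = <$fh_dec>;
close $fh_dec;
chomp $decoded;

my $undecoded_len   = length $undecoded;                   # 5
my $decoded_len     = length $decoded;                     # 4
my $undecoded_match = $undecoded =~ /\A\w+\z/u ? 1 : 0;    # 0
my $decoded_match   = $decoded   =~ /\A\w+\z/u ? 1 : 0;    # 1

print "raw:     length $undecoded_len, /\\A\\w+\\z/u matches: $undecoded_match\n";
print "decoded: length $decoded_len, /\\A\\w+\\z/u matches: $decoded_match\n";
```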

-zefram
