perl.perl5.porters | Postings from June 2011

Re: RFC: Handling utf8 locales

Karl Williamson
June 27, 2011 17:09
On 06/27/2011 10:07 AM, Zefram wrote:
> Karl Williamson wrote:
>>                                                            In all cases,
>> the programmer is warranting that the string is correctly encoded in the
>> specified locale.  It's just that UTF-8 locales ARE in native Unicode
>> form.
> This sounds wrong.  Perl's native format for Unicode characters is
> the *decoded* form.  Any encoded form (except Latin-1 where the encoding
> is identity) is not native.  (I'm referring to Perl-visible encoding,
> of course, not the internal encoding.)
>> I don't know how to explain it more clearly.
> Try specific examples.  Taking the example I used in my previous message,
> suppose Acme inputs his name in the locale-appropriate manner, in
> an ISO-646-FR locale and in a UTF-8 locale.  Perl will read the name
> from STDIN and store the input without altering it, thus yielding a
> locale-encoded string.  Then consider what character semantics a /u
> regexp would assign to it.
>                         ISO-646-FR     UTF-8
>                         ----------     -----
> octets on STDIN        4c 7b 6f 6e    4c c3 a9 6f 6e
> Perl-visible string    "L{on"         "L\xc3\xa9on"
> how /u interprets it   open brace     A-tilde, copyright
> In both cases, Perl's native representation of the name is "L\xe9on",
> which is different from the locale-encoded representation.  In both cases,
> /u perceives a character or characters other than the desired e-acute.
> As I understand it, you are claiming that in the UTF-8 case /u will give
> the correct character semantics to a locale-encoded string.  I claim that
> it will be just as incorrect as it would be with an ISO-646-FR locale.
> If I have misunderstood your proposal, please explain with a worked
> example.
> -zefram

Perhaps you are forgetting about the -C option.

If perl is called with the -C option, it checks the user's locale.  It
distinguishes between utf8 and non-utf8 locales, and adds a UTF8 layer
automatically when the locale calls for one.

I tested your example under a utf8 locale.  The string that is read in 
is in fact the 6 octets: \x4c\xc3\xa9\x6f\x6e\x0a (the test had a 
trailing \n), but since it is marked as encoded in utf8 format, it is 
interpreted correctly as the 5 characters \x4c\xe9\x6f\x6e\x0a.
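To make the octet and character counts concrete, here is a minimal sketch of the same decoding step. It is written in Python rather than Perl purely for illustration, and it only models the effect of a UTF8 layer; it is not what perl does internally. The variable names are made up for the example.

```python
# Octets as they arrive on STDIN in a UTF-8 locale (the example above,
# including the trailing newline).
raw = bytes([0x4C, 0xC3, 0xA9, 0x6F, 0x6E, 0x0A])

# No decoding layer: each octet is taken as one character, which is the
# "A-tilde, copyright" reading from the quoted table.
undecoded = raw.decode("latin-1")

# With a UTF-8 layer: the octet pair c3 a9 becomes the single character U+00E9.
decoded = raw.decode("utf-8")

print(len(raw), len(undecoded), len(decoded))  # 6 6 5
print(decoded == "L\xe9on\n")                  # True
```

The undecoded reading keeps six characters, while the decoded reading yields the five characters listed above.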

In a non-utf8 locale, the -C option should read in the octets you 
mentioned for the non-utf8 case, and we can hope that the platform's 
locale software interprets it correctly.

My original proposal allows the "-C + use locale" combination to work 
correctly for utf8 locales.  Right now it doesn't, because /l doesn't 
work correctly for them.

The "use locale 'NO_CTYPE'" proposal lets a :locale layer work while the
program still gets LC_TIME, etc.
