
Re: RFC: Handling utf8 locales

From: Karl Williamson
Date: June 27, 2011 17:09
Subject: Re: RFC: Handling utf8 locales
Message ID: 4E091B5E.9070002@khwilliamson.com

On 06/27/2011 10:07 AM, Zefram wrote:
> Karl Williamson wrote:
>>                                                            In all cases,
>> the programmer is warranting that the string is correctly encoded in the
>> specified locale.  It's just that UTF-8 locales ARE in native Unicode
>> form.
>
> This sounds wrong.  Perl's native format for Unicode characters is
> the *decoded* form.  Any encoded form (except Latin-1 where the encoding
> is identity) is not native.  (I'm referring to Perl-visible encoding,
> of course, not the internal encoding.)
>
>> I don't know how to explain it more clearly.
>
> Try specific examples.  Taking the example I used in my previous message,
> suppose Acme inputs his name in the locale-appropriate manner, in
> an ISO-646-FR locale and in a UTF-8 locale.  Perl will read the name
> from STDIN and store the input without altering it, thus yielding a
> locale-encoded string.  Then consider what character semantics a /u
> regexp would assign to it.
>
>                         ISO-646-FR     UTF-8
>                         ----------     -----
> octets on STDIN        4c 7b 6f 6e    4c c3 a9 6f 6e
> Perl-visible string    "L{on"         "L\xc3\xa9on"
> how /u interprets it   open brace     A-tilde, copyright
>
> In both cases, Perl's native representation of the name is "L\xe9on",
> which is different from the locale-encoded representation.  In both cases,
> /u perceives a character or characters other than the desired e-acute.
>
> As I understand it, you are claiming that in the UTF-8 case /u will give
> the correct character semantics to a locale-encoded string.  I claim that
> it will be just as incorrect as it would be with an ISO-646-FR locale.
>
> If I have misunderstood your proposal, please explain with a worked
> example.
>
> -zefram
>

Perhaps you are forgetting about the -C option.

If perl is called with the -C option, it takes the appropriate action
based on the user's locale: it distinguishes between utf8 and non-utf8
locales, and adds a :utf8 layer automatically if called for.
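One way to see what -C did is to look at the layers on STDIN.  The
following is only a minimal sketch, not something from the test
described here: the helper name has_utf8_layer is hypothetical, and it
assumes the script is run as "perl -C script.pl" (a bare -C is
documented in perlrun as equivalent to -CSDL, where the L makes the
layers conditional on the locale environment variables).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of perl): true if the given handle
# carries the :utf8 layer, as reported by the built-in PerlIO::get_layers.
sub has_utf8_layer {
    my ($fh) = @_;
    return scalar grep { $_ eq 'utf8' } PerlIO::get_layers($fh);
}

# ${^UNICODE} holds the numeric value of the -C switch (0 if -C was not given).
printf "-C setting: %d\n", ${^UNICODE};

printf "STDIN %s a :utf8 layer\n",
    has_utf8_layer(*STDIN) ? "has" : "does not have";
```

Under a utf8 locale this would be expected to report the layer as
present, and under a non-utf8 locale as absent.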

I tested your example under a utf8 locale.  The string that is read in 
is in fact the 6 octets: \x4c\xc3\xa9\x6f\x6e\x0a (the test had a 
trailing \n), but since it is marked as encoded in utf8 format, it is 
interpreted correctly as the 5 characters \x4c\xe9\x6f\x6e\x0a.
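The difference can be reproduced without relying on any locale by
decoding the octets explicitly.  This is a sketch under assumptions: it
uses Encode::decode to stand in for what the :utf8 layer does on input,
and it needs perl 5.14 or later for the /u modifier.  The variable names
are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The six octets as they arrive on STDIN in a UTF-8 locale.
my $octets = "\x4c\xc3\xa9\x6f\x6e\x0a";

# Decoding stands in for what the :utf8 layer does on input.
my $chars = decode('UTF-8', $octets);

printf "octets: %d, characters: %d\n", length($octets), length($chars);

# Undecoded: /u sees two separate characters, U+00C3 and U+00A9.
my $raw_c3 = substr($octets, 1, 1);
my $raw_a9 = substr($octets, 2, 1);
printf "raw \\xc3 is a word character: %s\n", $raw_c3 =~ /\w/u ? "yes" : "no";
printf "raw \\xa9 is a word character: %s\n", $raw_a9 =~ /\w/u ? "yes" : "no";

# Decoded: /u sees the single character U+00E9 (e-acute).
my $e_acute = substr($chars, 1, 1);
printf "decoded U+%04X is a word character: %s\n",
    ord($e_acute), $e_acute =~ /\w/u ? "yes" : "no";
```

Without the decoding step, the string is the undecoded case from the
table in the quoted message.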

In a non-utf8 locale, the -C option should read in the octets you 
mentioned for the non-utf8 case, and we can hope that the platform's 
locale software interprets it correctly.

My original proposal allows the "-C + use locale" combination to work 
correctly for utf8 locales.  Right now it doesn't, because /l doesn't 
work correctly for them.

The "use locale 'NO_CTYPE'" proposal allows a :locale layer to work
while the program still gets LC_TIME, etc.

