
Re: RFC: Handling utf8 locales

June 27, 2011 09:07
Karl Williamson wrote:
>                                                           In all cases,  
>the programmer is warranting that the string is correctly encoded in the  
>specified locale.  It's just that UTF-8 locales ARE in native Unicode  

This sounds wrong.  Perl's native format for Unicode characters is
the *decoded* form.  Any encoded form (except Latin-1 where the encoding
is identity) is not native.  (I'm referring to Perl-visible encoding,
of course, not the internal encoding.)
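A minimal sketch of the decoded/encoded distinction, assuming the core
Encode module; the variable names are illustrative only.  It shows that
UTF-8 octets and the decoded string differ at the Perl-visible level,
while the Latin-1 encoding of the same name is element-for-element
identical to the decoded form.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Octets as they would arrive from a UTF-8 locale (not yet decoded).
my $utf8_octets = "L\xc3\xa9on";

# Decoding yields the decoded form: one string element per character.
my $native = decode('UTF-8', $utf8_octets);

printf "octets: %d elements, decoded: %d elements\n",
    length($utf8_octets), length($native);

# Latin-1 is the identity case: encoding the decoded string
# produces the same sequence of element values.
my $latin1_octets = encode('ISO-8859-1', $native);
print "Latin-1 encoding is identity\n" if $latin1_octets eq $native;
```

The UTF-8 octet string has five elements and the decoded string has
four, so the two cannot be treated interchangeably.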

>I don't know how to explain it more clearly.

Try specific examples.  Taking the example I used in my previous message,
suppose Acme inputs his name in the locale-appropriate manner, in
an ISO-646-FR locale and in a UTF-8 locale.  Perl will read the name
from STDIN and store the input without altering it, thus yielding a
locale-encoded string.  Then consider what character semantics a /u
regexp would assign to it.

                       ISO-646-FR     UTF-8
                       ----------     -----
octets on STDIN        4c 7b 6f 6e    4c c3 a9 6f 6e
Perl-visible string    "L{on"         "L\xc3\xa9on"
how /u interprets it   open brace     A-tilde, copyright

In both cases, Perl's native representation of the name is "L\xe9on",
which is different from the locale-encoded representation.  In both cases,
/u perceives a character or characters other than the desired e-acute.
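The table can be reproduced with a short script.  This is only a sketch,
assuming perl 5.14 or later (for the /u modifier) and the core Encode
module; the hash and variable names are made up for illustration, and the
strings are hard-coded rather than read from STDIN.

```perl
use 5.014;
use warnings;
use Encode qw(decode);

# Undecoded input, exactly as in the table above.
my %input = (
    'ISO-646-FR' => "L\x7bon",        # 4c 7b 6f 6e
    'UTF-8'      => "L\xc3\xa9on",    # 4c c3 a9 6f 6e
);

for my $locale (sort keys %input) {
    my $str = $input{$locale};
    # Under /u each string element is taken as a Unicode code point.
    my @code_points = map { sprintf 'U+%04X', ord } split //, $str;
    my $all_word    = $str =~ /\A\w+\z/u ? 'yes' : 'no';
    print "$locale: @code_points; all \\w under /u? $all_word\n";
}

# Only the decoded form contains the intended e-acute (U+00E9).
my $native = decode('UTF-8', $input{'UTF-8'});
print "decoded form ",
    ($native =~ /\AL\xe9on\z/u ? "matches" : "does not match"),
    " L\\xe9on\n";
```

Neither locale-encoded string consists entirely of word characters under
/u, because U+007B (open brace) and U+00A9 (copyright sign) are not
letters, whereas the decoded string matches as intended.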

As I understand it, you are claiming that in the UTF-8 case /u will give
the correct character semantics to a locale-encoded string.  I claim that
it will be just as incorrect as it would be with an ISO-646-FR locale.

If I have misunderstood your proposal, please explain with a worked
example.

