On 06/27/2011 10:07 AM, Zefram wrote:
> Karl Williamson wrote:
>> In all cases,
>> the programmer is warranting that the string is correctly encoded in the
>> specified locale. It's just that UTF-8 locales ARE in native Unicode
>> form.
>
> This sounds wrong. Perl's native format for Unicode characters is
> the *decoded* form. Any encoded form (except Latin-1 where the encoding
> is identity) is not native. (I'm referring to Perl-visible encoding,
> of course, not the internal encoding.)
>
>> I don't know how to explain it more clearly.
>
> Try specific examples. Taking the example I used in my previous message,
> suppose Acme inputs his name in the locale-appropriate manner, in
> an ISO-646-FR locale and in a UTF-8 locale. Perl will read the name
> from STDIN and store the input without altering it, thus yielding a
> locale-encoded string. Then consider what character semantics a /u
> regexp would assign to it.
>
>                         ISO-646-FR     UTF-8
>                         ----------     -----
> octets on STDIN         4c 7b 6f 6e    4c c3 a9 6f 6e
> Perl-visible string     "L{on"         "L\xc3\xa9on"
> how /u interprets it    open brace     A-tilde, copyright
>
> In both cases, Perl's native representation of the name is "L\xe9on",
> which is different from the locale-encoded representation. In both cases,
> /u perceives a character or characters other than the desired e-acute.
>
> As I understand it, you are claiming that in the UTF-8 case /u will give
> the correct character semantics to a locale-encoded string. I claim that
> it will be just as incorrect as it would be with an ISO-646-FR locale.
>
> If I have misunderstood your proposal, please explain with a worked
> example.
>
> -zefram
>

Perhaps you are forgetting about the -C option. If perl is called with
the -C option, it will take the appropriate action based on the user's
locale. It distinguishes between utf8 and non-utf8 locales, adding a
UTF8 layer automatically if called for. I tested your example under a
utf8 locale.
The string that is read in is in fact the 6 octets:
\x4c\xc3\xa9\x6f\x6e\x0a (the test had a trailing \n), but since it is
marked as encoded in utf8 format, it is interpreted correctly as the 5
characters \x4c\xe9\x6f\x6e\x0a. In a non-utf8 locale, the -C option
should read in the octets you mentioned for the non-utf8 case, and we
can hope that the platform's locale software interprets it correctly.

My original proposal allows the "-C + use locale" combination to work
correctly for utf8 locales. Right now it doesn't, because /l doesn't
work correctly for them. The "use locale 'NO_CTYPE'" proposal allows a
:locale layer to work and still get LC_TIME, etc.
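
For anyone who wants to reproduce the octet/character difference without changing locale, here is a rough sketch. It is not the test described above: it assumes Perl 5.14 or later (for /u) and uses Encode::decode on an in-memory string as a stand-in for the UTF8 layer that -C would put on STDIN. The variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The six octets from the utf8-locale case ("L", e-acute as c3 a9,
# "o", "n", newline), exactly as they arrive with no decoding layer.
my $octets = "\x4c\xc3\xa9\x6f\x6e\x0a";

# Assumption: decoding by hand approximates what the UTF8 layer does.
my $chars = decode('UTF-8', $octets);

for my $case ([ 'undecoded', $octets ], [ 'decoded', $chars ]) {
    my ($label, $str) = @$case;
    printf "%-9s length=%d  /\\xe9/u: %-3s  /^\\w+\$/u: %s\n",
        $label,
        length($str),
        ($str =~ /\xe9/u   ? 'yes' : 'no'),   # is there an e-acute?
        ($str =~ /^\w+$/u  ? 'yes' : 'no');   # is the name all word chars?
}
```

The undecoded string has length 6 and /u sees A-tilde followed by a copyright sign, so the \w+ match fails on the copyright sign. The decoded string has length 5 and contains a real \xe9, which is the case -C is meant to produce in a utf8 locale.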