Re: RFC: Handling utf8 locales

From: Zefram
Date: June 26, 2011 01:33
Subject: Re: RFC: Handling utf8 locales
Message ID: 20110626083336.GA10075@lake.fysh.org

Karl Williamson wrote:
>It's a lot of work to handle multi-byte locales in general, but Perl  
>already knows how to handle Unicode utf8.  This leads to my proposal: If  
>under "use locale", a locale name ends in '.utf8', then Perl treats it  
>for purposes of ctype-only as regular Unicode.

This sounds wrong: it'll be a source of double-encoding bugs.
Locale-encoded text input will, for a UTF-8 locale, be a sequence
of octets obeying UTF-8 syntactic rules.  If you treat those octets
as Unicode characters, using Perl's aliasing of octets to characters
U+0000 to U+00FF, then they'll look like very strange character sequences
(with lots of C1 controls), on which case folding (for example) won't
give locale-correct results.  Outputting Unicode text will often not
generate correct locale-encoded text output.
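
A minimal sketch of the failure mode described above, assuming the
input is the single character "É" (U+00C9) arriving from a UTF-8
locale.  The variable names are illustrative only, and utf8::upgrade()
is used here merely to force Unicode semantics on the undecoded string:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "É" (U+00C9) as a UTF-8 locale would deliver it: two octets.
my $input = "\xC3\x89";

# Undecoded: Perl sees the characters U+00C3 and U+0089 (a C1 control).
my $raw = $input;
utf8::upgrade($raw);                      # force Unicode semantics for lc()
my $raw_lc  = lc $raw;                    # U+00E3 U+0089, not a lowercase "é"
my $doubled = encode('UTF-8', $raw);      # octets C3 83 C2 89: double-encoded

# Decoded first: one character, and lc() gives the expected result.
my $char    = decode('UTF-8', $input);    # U+00C9
my $char_lc = lc $char;                   # U+00E9
my $output  = encode('UTF-8', $char_lc);  # octets C3 A9

printf "undecoded: %d chars, lc = %vX, re-encoded = %vX\n",
    length($raw), $raw_lc, $doubled;
printf "decoded:   %d char,  lc = %vX, encoded    = %vX\n",
    length($char), $char_lc, $output;
```

The undecoded path yields a lowercased string that is no longer valid
UTF-8 when written back as octets, and encoding it as Unicode text
doubles the encoding instead.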

We should discourage the use of locale-encoded strings within Perl space.
We should encourage decoding on input, encoding on output, and using
native Unicode representation in the middle.  To this end, there should
be a PerlIO layer :locale, which {de,en}codes according to the locale's
preferred encoding.  The locale's encoding may perfectly well be UTF-8,
and in *this* context we can handle it in an entirely regular manner,
on a par with ISO-8859-*.
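
A rough sketch of the decode-on-input, encode-on-output pattern, using
the existing :encoding(...) layer as a stand-in for the proposed
:locale layer.  The helpers locale_encoding(), read_text() and
write_text() are hypothetical names, and in-memory filehandles stand
in for real locale-encoded streams:

```perl
use strict;
use warnings;
use feature 'unicode_strings';
use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);

# Hypothetical helper: the current locale's preferred encoding name,
# as reported by the C library.  Not called below, so that the example
# does not depend on the environment it runs in.
sub locale_encoding {
    setlocale(LC_CTYPE, '');    # adopt the locale from the environment
    return langinfo(CODESET);
}

# Decode on input: locale-encoded octets in, Perl characters out.
sub read_text {
    my ($octets, $encoding) = @_;
    open my $in, "<:encoding($encoding)", \$octets or die "open: $!";
    local $/;                   # slurp the whole stream
    my $text = <$in>;
    close $in;
    return $text;
}

# Encode on output: Perl characters in, locale-encoded octets out.
sub write_text {
    my ($text, $encoding) = @_;
    my $octets = '';
    open my $out, ">:encoding($encoding)", \$octets or die "open: $!";
    print {$out} $text;
    close $out or die "close: $!";
    return $octets;
}

# The same "É" in two encodings is one character, U+00C9, in the middle.
for my $case (["UTF-8", "\xC3\x89"], ["ISO-8859-1", "\xC9"]) {
    my ($encoding, $octets) = @$case;
    my $text   = read_text($octets, $encoding);
    my $folded = lc $text;                        # U+00E9 either way
    my $result = write_text($folded, $encoding);
    printf "%-10s in=%vX char=%vX out=%vX\n",
        $encoding, $octets, $folded, $result;
}
```

With a real handle, the name returned by locale_encoding() could be
passed to binmode() in an :encoding(...) layer, on the assumption that
Encode recognises the codeset name the C library reports.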

-zefram
