develooper Front page | perl.perl5.porters | Postings from June 2011

Re: RFC: Handling utf8 locales

Karl Williamson
June 26, 2011 12:37
On 06/26/2011 02:33 AM, Zefram wrote:
> Karl Williamson wrote:
>> It's a lot of work to handle multi-byte locales in general, but Perl
>> already knows how to handle Unicode utf8.  This leads to my proposal: If
>> under "use locale", a locale name ends in '.utf8', then Perl treats it
>> for purposes of ctype only as regular Unicode.
> This sounds wrong: it'll be a source of double-encoding bugs.
> Locale-encoded text input will, for a UTF-8 locale, be a sequence
> of octets obeying UTF-8 syntactic rules.  If you treat those octets
> as Unicode characters, using Perl's aliasing of octets to characters
> U+00 to U+ff, then they'll look like very strange character sequences
> (with lots of C1 controls), on which case folding (for example) won't
> give locale-correct results.  Outputting Unicode text will often not
> generate correct locale-encoded text output.
> We should discourage the use of locale-encoded strings within Perl space.
> We should encourage decoding on input, encoding on output, and using
> native Unicode representation in the middle.  To this end, there should
> be a PerlIO layer :locale, which {de,en}codes according to the locale's
> preferred encoding.  The locale's encoding may perfectly well be UTF-8,
> and in *this* context we can handle it in an entirely regular manner,
> on a par with ISO-8859-*.
> -zefram

I think you're misunderstanding my proposal, or I don't grok what you're
saying.  People still use locales to get, e.g., the proper date format
for their location.  But they can't currently do that if the locale is
UTF-8, because regex matching and casing don't work well with UTF-8
locales.  I was proposing something that would fix that (and not have
the downsides that you think it does).  But rather than push that for
now, a much easier-to-implement proposal that would work for most people
would be to modify the locale pragma to take a single parameter,
something like either

use locale "NO_CTYPE";

or

use locale "utf8";

in which the parameter indicates that the user promises not to use a
non-UTF-8 locale, so Perl can ignore CTYPE for locale purposes and can
in fact assume that strings are encoded as Unicode characters.  That
would mean that a regex compiled under such a pragma would automatically
have /u instead of /l.  Casing would also assume Unicode characters.
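
To make the /u point concrete, here is a small sketch of what Unicode
rules mean for matching.  It does not use the proposed pragma parameter,
which is only a suggestion in this thread; it spells out the /u modifier
by hand (available since Perl 5.14), with /a shown for contrast.  The
variable names are made up for the illustration.

```perl
use 5.014;
use strict;
use warnings;

# U+00E9 LATIN SMALL LETTER E WITH ACUTE, held as one character.
my $e_acute = "\xE9";

# /u applies Unicode rules, so U+00E9 counts as a word character.
my $is_word_u = $e_acute =~ /\A\w\z/u ? 1 : 0;

# /a restricts \w to ASCII, so the same character does not match.
my $is_word_a = $e_acute =~ /\A\w\z/a ? 1 : 0;

# Under /u, case-insensitive matching folds U+00C9 to U+00E9.
my $folds_u = "\xC9" =~ /\A\xE9\z/iu ? 1 : 0;

print "u: $is_word_u, a: $is_word_a, fold: $folds_u\n";
```

Under /l the first and third results would instead depend on whatever
locale happens to be in effect at run time, which is the behavior the
parameter would switch off.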

The consequence is that if the user used a non-UTF-8 locale, or switched
to one at run time, Perl wouldn't notice as far as matching and casing
are concerned.  I don't think that's a big loss.

What I was originally proposing would work well with the :locale I/O
layer.  The simpler proposal above is more restricted, but would work
for most practical situations, and the original proposal could be
implemented later, if desired.
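
For reference, the decode-on-input, encode-on-output pattern that Zefram
describes can be sketched by hand with the core Encode module.  This is
only an illustration: it hard-codes UTF-8 rather than asking the locale
for its encoding, as a :locale layer would, and the variable names are
invented for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Locale-encoded input: the UTF-8 octets for "caf" plus U+00E9.
my $octets = "caf\xC3\xA9";

# Decode on input: five octets become four characters.
my $chars = decode('UTF-8', $octets);

# Casing now follows Unicode rules: U+00E9 becomes U+00C9.
my $upper = uc $chars;

# Encode on output: back to octets for the outside world.
my $out = encode('UTF-8', $upper);

printf "%d octets in, %d characters, output bytes %s\n",
    length($octets), length($chars), unpack('H*', $out);
```

Calling uc directly on $octets would treat the two trailing octets as
separate characters instead of as U+00E9, which is the kind of
double-encoding problem raised in the quoted text.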
