
Re: RFC: Handling utf8 locales

From: Zefram
Date: June 28, 2011 05:00
Subject: Re: RFC: Handling utf8 locales
Message-ID: 20110628120012.GW9463@lake.fysh.org

Karl Williamson wrote:
>Perhaps you are forgetting about the -C option.

Ah, this is what I was missing.  You didn't mention -C before.

>If perl is called with the -C option, it will take the appropriate  
>action based on the user's locale.

Not in the general case.  The "appropriate action based on the user's
locale" would involve the :locale layer.  -C doesn't do that: the only
layer it ever applies is the :utf8 layer.  Its locale awareness consists
of making the addition of the :utf8 layer conditional on whether the
locale is a UTF-8 locale.
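To make that distinction concrete, here is a rough sketch (not the actual
-C implementation, and the UTF-8-locale test is simplified; perl really
inspects the LC_ALL/LC_CTYPE/LANG environment) of what -CL arranges on a
stream, versus what a genuinely locale-aware default would do:

    use I18N::Langinfo qw(langinfo CODESET);

    # What -CL effectively arranges: add :utf8 if the locale charset is
    # UTF-8, otherwise do nothing at all.
    if (langinfo(CODESET) =~ /\Autf-?8\z/i) {
        binmode STDOUT, ':utf8';
    }

    # A locale-aware default would instead be closer to
    #     use open ':std', ':locale';
    # i.e. encode/decode for whatever the locale charset is, not just UTF-8.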

So in this respect Perl is already treating locales inconsistently,
specifically treating UTF-8 locales in a qualitatively different way
from other locales.  Your proposal to treat UTF-8 locales differently
in regexps now makes some sense.  Your model is that all I/O must take
place through a :utf8-or-identity layer mediated by -CL.  For non-UTF-8
locales strings will internally be locale-encoded, but for UTF-8 locales
strings will internally be decoded (native Unicode).  In this situation it
would be sane for regexps to operate on the locale charset for non-UTF-8
locales but on native Unicode for UTF-8 locales.

I have two areas of concern about this scheme.  Firstly about the
practicality of the -C option, and secondly about what happens when -C
is not used.

-C does both too much and too little.  It does too much because, where it
is applied to streams, its effect is global and so applies whether the
code opening/using a stream is expecting it or not.  This would be fine
if all streams ever opened were for text, and text were always encoded
according to the prevailing standard of the host system.  But not only
are these conditions not true, they're not even close to being true.
Unix involves quite a lot of binary files.  A default :utf8 layer is only
OK if the whole program, including all loaded modules, is expecting it,
either by virtue of doing only text I/O or by taking explicit action to
squash the default layer for binary I/O.
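For instance, with a global default :utf8 layer in force, every piece of
code that opens a binary file (the filename here is just illustrative) has
to remember to squash that layer explicitly:

    # Without clearing the default text layer, reads of binary data get
    # mangled by the UTF-8 decoding.
    open my $fh, '<', 'archive.tar.gz' or die "open: $!";
    binmode $fh, ':raw';                 # squash the default :utf8 layer
    read $fh, my $buf, 8192;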

Meanwhile, -C does too little because it only affects streams and @ARGV.
To maintain a consistent picture, any I/O by other routes has to apply
a matching layer.  That's not too difficult when the layer required is
just UTF-8 encoding, but in the -CL case it needs to be UTF-8 encoding
or nothing, conditional on the locale.  Do we even have a name for the
UTF-8-encode-or-nothing transform?  It's looking rather as though a
program using -C, and especially -CL, needs to be very aware of that
option; it's not something you can freely turn on for a transparent
effect.
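The best I can offer is a hand-rolled helper; something like this (the
helper name is my invention, and again the UTF-8-locale test is
simplified) is what a -CL program would need to apply on every non-stream
route of I/O:

    use Encode qw(encode);
    use I18N::Langinfo qw(langinfo CODESET);

    # UTF-8-encode-or-nothing, conditional on the locale, mirroring what
    # -CL does to streams.  There is no standard name for this transform.
    sub maybe_encode_utf8 {
        my ($string) = @_;
        return langinfo(CODESET) =~ /\Autf-?8\z/i
            ? encode('UTF-8', $string)
            : $string;
    }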

If -CL or a similarly favoured option were to apply a default :locale I/O
layer, rather than :utf8-or-nothing, this would ameliorate the problem
with non-stream routes of I/O.  It would mean that the {en,de}coding that
the program needs to apply to other kinds of I/O is consistently locale
charset {en,de}coding.  (For which there ought to be {en,de}code_locale()
functions, and may well already be.)  But it wouldn't help with the
general problem of needing to know whether there is a default text
transformation on I/O, nor of having to selectively disable it.
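For what it's worth, the CPAN module Encode::Locale (not in core)
registers a "locale" encoding with Encode, so assuming that module the
consistent locale-charset {en,de}coding for non-stream I/O would look
something like:

    use Encode::Locale;                      # CPAN: registers "locale"
    use Encode qw(encode decode);

    my $bytes = "caf\xE9";                   # bytes in the locale charset
    my $text  = decode('locale', $bytes);    # to characters
    my $out   = encode('locale', $text);     # back to locale-charset bytes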

So all in all -C, and especially -CL, seems lacking in the usability
department.  This raises the obvious question of whether we should be
encouraging its use.  The proposal for /l to conditionally behave like
/u would constitute an encouragement to use -CL.

And this brings us to the fundamental issue that -C is an option.
Not only is it not the default, it's not necessary or even particularly
useful in dealing with either UTF-8 or locales.  It is normal for
Unicode-processing, locale-aware programs to not use -C (or especially
-CL).  The correctness of the proposed /l behaviour is predicated
entirely on the use of -CL, and so enshrining that /l behaviour would
force the use of this complicated mechanism where it might reasonably
be otherwise eschewed.  When the maybe-locale-encoded mechanism is not
being used, the /l behaviour would be wrong and cause subtle bugs.

As I've previously said, I think we should encourage the use of :locale
and /u, which would make /l irrelevant.  The question about the proper
behaviour of /l is relevant only for programs that do not do the :locale
decode-on-input encode-on-output dance.  Do these programs do the -CL
conditional-default-{en,de}coding, or do they not {en,de}code their I/O
at all?  I don't have numbers for this, but I'm inclined to expect that
they'd favour the traditional consistently-locale-encoded arrangement.
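For reference, the arrangement I'm advocating (decode at the boundary,
Unicode semantics inside) is roughly:

    # Decode input and encode output per the locale on the standard
    # handles, then use Unicode rules (/u) internally; /l never enters
    # into it.
    use open ':std', ':locale';

    while (my $line = <STDIN>) {             # arrives as decoded characters
        print "matched\n" if $line =~ /\w+/u;
    }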

-zefram
