perl.perl5.porters | Postings from March 2011

RESOLVED: C<use locale> Considered Harmful

From: Tom Christiansen
Date: March 11, 2011 20:09
Subject: RESOLVED: C<use locale> Considered Harmful
Message ID: 19770.1299902960@chthon

After playing around with the new /l pattern modifier a bit,
I’d a hunch that we’re making a mistake by telling people ever
to C<use locale> at all.

I’ve now come to the conclusion that it’s far worse than that.

To start with, the following little manpage is both wrong and
misleading, and in several separate places and ways — which is
pretty darned tough for so short a manpage:

    =head1 NAME

    locale - Perl pragma to use and avoid POSIX locales for built-in operations

    =head1 SYNOPSIS

        @x = sort @y;       # ASCII sorting order
        {
            use locale;
            @x = sort @y;   # Locale-defined sorting order
        }
        @x = sort @y;       # ASCII sorting order again

    =head1 DESCRIPTION

    This pragma tells the compiler to enable (or disable) the use of POSIX
    locales for built-in operations (LC_CTYPE for regular expressions, and
    LC_COLLATE for string comparison).  Each "use locale" or "no locale"
    affects statements to the end of the enclosing BLOCK.

    See L<perllocale> for more detailed information on how Perl supports
    locales.

If I am correct, a bunch of documentation needs updating:

    1.  There is no such thing as ASCII sorting order.  It’s sorting
        according to numeric code point. The locale manpage needs
        correcting.

    2.  Perl only (correctly) supports single‐byte locales.  Multibyte
        ‘locales’ like UTF‑8 don’t work, yet no warning is given.
        This should be very strongly stated in all five of the locale,
        perlop, perlfunc/sort, perllocale, and POSIX/setlocale manpages.
        Probably also in those from 3 immediately below.

    3.  Even system locales are broken on UTF‑8.  This needs to be
        talked about in all those places plus also in perlunicode,
        perlre, and perlrecharclass.  At least.

    4.  Only Unicode::Collate works correctly on Unicode data.
        It needs to be mentioned in lieu of locales, or at least
        in addition to the same.
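
To make point 1 concrete, here is a minimal sketch of what plain sort
does when no locale is in effect.  The word list is made up for the
illustration; the only claim is that the default comparison goes by
numeric code point, which just happens to agree with ASCII order for
ASCII-only data.

```perl
use strict;
use warnings;

binmode(STDOUT, ":encoding(UTF-8)");

# Hypothetical sample words, chosen to straddle the ASCII boundary.
my @words = ("zebra", "Zulu", "apple", "\x{E9}clair", "\x{17E}ena");

# Plain sort (no "use locale") compares strings code point by code point:
# "Z" (U+005A) < "a" (U+0061) < "z" (U+007A) < U+00E9 < U+017E.
my @by_code_point = sort @words;

# Show the code point of each word's first character next to the word.
for my $word (@by_code_point) {
    printf "U+%04X  %s\n", ord($word), $word;
}
```

The uppercase word lands ahead of every lowercase one, and the two
non-ASCII words land after "zebra", which is code point order and
nothing more.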

I believe that we should not be telling people to use locales at all.
They nearly never work!  At best we should call them Legacy Locales.
I think that *that’s* what /l must really stand for.  Or maybe, Lame.

I’ve created a minimal test set to demonstrate the problem. I’ve used
sorting, but it also applies to things like eq, lt, case conversion,
regexes, etc.  Tested under 5.12.3 on Linux, Solaris, and Darwin.
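
The same difference shows up in plain comparisons.  The following is a
sketch, not part of the test set above: it contrasts the built-in eq
and lt operators with the eq and lt methods of a default
Unicode::Collate object, using the two spellings of the a-with-diaeresis
word from the tables below.  It assumes Unicode::Normalize is available
(it ships with perl), so that the collator normalizes its input.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $precomposed = "c\x{E4}t";    # U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
my $decomposed  = "ca\x{308}t";  # "a" followed by U+0308 COMBINING DIAERESIS

my $collator = Unicode::Collate->new;

# Built-in operators compare code points, so the two spellings differ
# and "czt" sorts ahead of the precomposed spelling (U+007A < U+00E4).
my $builtin_eq     = ($precomposed eq $decomposed) ? 1 : 0;
my $builtin_czt_lt = ("czt" lt $precomposed)       ? 1 : 0;

# The collator compares collation keys: canonically equivalent strings
# tie, and the accented "a" still sorts ahead of "z".
my $uca_eq     = $collator->eq($precomposed, $decomposed) ? 1 : 0;
my $uca_lt_czt = $collator->lt($precomposed, "czt")       ? 1 : 0;

printf "built-in: equal=%d  czt-first=%d\n", $builtin_eq, $builtin_czt_lt;
printf "UCA:      equal=%d  czt-last=%d\n",  $uca_eq,     $uca_lt_czt;
```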

First, a ‘regular’ Unicode situation.

  * The first column in each set is by code point order — and,
    thanks to the design of UTF‑8, also by byte order.

  * The second column in each set uses the system sort(1) program.

  * The third column in each set uses perl’s sort under use locale.

  * The fourth column uses the sort method from Unicode::Collate 
    and Unicode::Collate::Locale, respectively.

Columns underlined with "===" I consider correct; those with "---",
incorrect.  I’ve added sequence numbers so you can more easily follow
where each record moved to.

    -------------      ==============     --------------    ============
    A: sort by         B: en_US.UTF-8     C: en_US.UTF-8    D: UCA sort
       code point         sys sort            perl sort        no locale
    -------------      ==============     --------------    ============
  1 bat              1 bat              1 bat              1 bat
  2 cat              2 cat              3 ca\x{308}t       2 cat
  3 ca\x{308}t       3 ca\x{308}t       2 cat              3 ca\x{308}t
  4 czt              5 c\x{E4}t         5 c\x{E4}t         5 c\x{E4}t
  5 c\x{E4}t         4 czt              4 czt              4 czt
  6 dat              6 dat              6 dat              6 dat

That shows that, as Karl’s documentation for /l states, Perl can
handle only 8‑bit locales, and that UTF‑8 is not an 8‑bit locale.

                                    (Please, no ‘Duh!’ awards. :)

Yep, that’s right:  Perl’s C<use locale> sort here is even *worse* than
a mere code‐point sort.  Talk about adding insult to injury!  This must
be what Karl means about Perl’s supporting 8‑bit locales only.

That’s under Linux; under Darwin and Solaris, columns B and C are
the same as A, but D remains correct.

THEREFORE: You can count only on D to guarantee correct behavior.
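
For anyone who wants to reproduce columns A and D, here is a minimal
sketch.  It is not the script that produced the table; the variable
names and the escaped() helper are made up for the illustration.  It
needs only Unicode::Collate and Unicode::Normalize, both of which ship
with perl.

```perl
use strict;
use warnings;
use Unicode::Collate;
use Unicode::Normalize qw(NFC);

# The six records from the table, in sequence-number order.
my @records = ("bat", "cat", "ca\x{308}t", "czt", "c\x{E4}t", "dat");

# Column A: plain sort, which compares by code point.
my @by_code_point = sort @records;

# Column D: the default UCA ordering, no locale tailoring.
my @by_uca = Unicode::Collate->new->sort(@records);

# Hypothetical helper: render non-ASCII characters as \x{...} escapes.
sub escaped {
    my ($string) = @_;
    $string =~ s/([^\x00-\x7F])/sprintf("\\x{%X}", ord($1))/ge;
    return $string;
}

print "A: ", join(" ", map { escaped($_) } @by_code_point), "\n";
print "D: ", join(" ", map { escaped($_) } @by_uca), "\n";
```

The two spellings of the accented word are canonically equivalent, so
the collator treats them as a tie; which of the two comes out first is
up to the underlying sort, but both land between "cat" and "czt".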

Ok, then.

Now for ‘real’ locales.  The only thing you need to know is that in
Swedish, an Ä — that is, an A with a diaeresis (well, umlaut here, but
same difference) — is *not* deemed a letter with a diacritic but rather
a separate letter altogether just as Ñ is in Spanish, and that that 
new letter Ä now sorts *after* the letter Z in Swedish.  

Watch what happens:

    -------------      --------------     --------------    ============
    A/W: sort by       X: sv_SE.utf8      Y: sv_SE.utf8     Z: UCA sort
       code point         sys sort           perl sort         locale=sv
    -------------      --------------     --------------    ============
  1 bat              1 bat              1 bat             1 bat
  2 cat              2 cat              3 ca\x{308}t      2 cat
  3 ca\x{308}t       3 ca\x{308}t       2 cat             4 czt
  4 czt              4 czt              4 czt             5 c\x{E4}t
  5 c\x{E4}t         5 c\x{E4}t         5 c\x{E4}t        3 ca\x{308}t
  6 dat              6 dat              6 dat             6 dat

Here again, Perl’s C<use locale> is even worse than the already
execrable code‐point sort.

Again, that’s on Linux. There’s no way to guess a correct locale name
without inspecting the ones installed, and Linux uses different locale
names than Darwin.  Under Darwin, that same locale is called "sv_SE.UTF-8". There,
column Y becomes the same as column X.  Z of course remains the same.
And there *are* no Swedish locales under Solaris. :(
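
Since the names cannot be guessed, about the best a portable program
can do is probe for them.  Here is a sketch of that, assuming only
POSIX::setlocale, which returns undef when the system rejects a name.
The first_available_locale() function is made up for the illustration,
and the two candidate names are simply the ones mentioned above.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_COLLATE);

# Hypothetical helper: return the first candidate name the system accepts
# for LC_COLLATE, or undef if it accepts none of them.  The locale that
# was in effect beforehand is restored before returning.
sub first_available_locale {
    my @candidates = @_;
    my $original = setlocale(LC_COLLATE);    # one argument: query only
    my $found;
    for my $name (@candidates) {
        if (defined setlocale(LC_COLLATE, $name)) {
            $found = $name;
            last;
        }
    }
    setlocale(LC_COLLATE, $original) if defined $original;
    return $found;
}

my $swedish = first_available_locale("sv_SE.utf8", "sv_SE.UTF-8");
print defined $swedish
    ? "Swedish locale found: $swedish\n"
    : "no Swedish locale installed\n";
```

Even when the probe succeeds, it says only that the name exists, not
that sorting under it will be correct.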

THEREFORE: You can count only on Z to guarantee correct behavior.
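
Column Z can be reproduced with something like the following sketch,
which assumes a perl recent enough to ship Unicode::Collate::Locale
and uses its "sv" tailoring.  As before, the variable names and the
escaped() helper are made up for the illustration.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;
use Unicode::Normalize qw(NFC);

# The six records from the table, in sequence-number order.
my @records = ("bat", "cat", "ca\x{308}t", "czt", "c\x{E4}t", "dat");

# UCA with the Swedish tailoring; no system locale is involved.
my $swedish    = Unicode::Collate::Locale->new(locale => "sv");
my @by_swedish = $swedish->sort(@records);

# Hypothetical helper: render non-ASCII characters as \x{...} escapes.
sub escaped {
    my ($string) = @_;
    $string =~ s/([^\x00-\x7F])/sprintf("\\x{%X}", ord($1))/ge;
    return $string;
}

print "Z: ", join(" ", map { escaped($_) } @by_swedish), "\n";
```

Both spellings of the accented word now sort after "czt" and before
"dat".  They are canonically equivalent, so their order relative to
each other is a tie and may come out either way.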

CONCLUSION: Do *NOT* use system locales.  If you need reliable
            sorting of Unicode text data, you *must* use the UCA.
            (And if you don’t need reliability, why are you bothering?)

I really don’t think we should be talking about Lamentably Lame locales
at all — except perhaps in disparaging terms like broken, problematic,
and Legacy.  They just don’t work.  We should be warning people away
from things that don’t work, and instead suggesting things that do.

Let’s do that.  I volunteer.

I feel like this is something we all have pretty much always known, but
that this community lore hasn’t made it into the perl documentation.
I’d like to fix that.

If I have understood the problem correctly, and there is no disagreement,
then I’ll go ahead and prepare appropriate patches for the various bits
of documentation I mentioned at the beginning.

Thanks for your patience in reading this far.

--tom
