
Re: use locale

From: Tom Christiansen
Date: March 12, 2011 15:12
Subject: Re: use locale
Message ID: 21699.1299971404@chthon

Ambrus <ambrus@math.bme.hu> wrote:

> On Sat, Mar 12, 2011 at 12:37 PM, demerphq <demerphq@gmail.com> wrote:
>> I consider "use locale" broken and preserved only for backwards
>> compatibility. IMO we should get rid of it, deprecate it, whatever.

> Er what?  How do I do a locale-aware string compare without it?  I'm
> not supposed to call POSIX::strcoll?

Looking at the 146 locale tailorings for Hungarian found in
Unicode/Collate/Locale/hu.pl, I can certainly see why this
is important to you!  Wikipedia says the Hungarian alphabet
has 46(!) different letters:

    A       Á       B       C       Cs      D       Dz      Dzs     
    E       É       F       G       Gy      H       I       Í       J
    K       L       Ly      M       N       Ny      
    O       Ó       Ö       Ő       P       (Q)     R
    S       Sz      T       Ty      U       Ú       Ü       Ű       V       
    (W)     (X)     (Y)     Z       Zs

> Sorry, it seems the other thread answers this question.  It claims I
> should be using Unicode::Collate and somehow tell it what locale to
> use.  I don't quite understand how that works (how do I tell it what
> locale to use) but I'll try to check it.

If you do not specify a locale, then Unicode::Collate::Locale falls
back to the normal untailored UCA algorithm from Unicode::Collate.
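
A minimal sketch of the locale selection and the fallback, assuming a Unicode::Collate::Locale that ships the "hu" tailoring. The word list is invented for illustration; getlocale() reports which tailoring was actually loaded.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate::Locale;

binmode(STDOUT, ":encoding(UTF-8)");

# Illustrative word list only; any strings will do.
my @words = qw(cukor csak cél alma);

# A Hungarian-tailored collator and a plain, untailored UCA one.
my $hu  = Unicode::Collate::Locale->new(locale => "hu");
my $uca = Unicode::Collate::Locale->new();

# getlocale() names the tailoring in use ("default" when there is none).
print "tailoring: ", $hu->getlocale,  "\n";
print "tailoring: ", $uca->getlocale, "\n";

# In Hungarian, "cs" is a letter of its own that sorts after "c".
my @hu_sorted  = $hu->sort(@words);
my @uca_sorted = $uca->sort(@words);
print "hu:  @hu_sorted\n";
print "uca: @uca_sorted\n";
```

The two orderings should differ only in where "csak" lands relative to "cukor".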

You do have to choose what sort of comparison you want, though.
The last one below (level 4) is the default, and each level includes
everything the previous levels consider.

    use Unicode::Collate::Locale;

    # level 1: compare letters only, not diacritics, case, or noise
    my $hucol__diacritic_insensitive = Unicode::Collate::Locale->new(
        locale => "hu",
        level  => 1,
    );

    # level 2: now diacritic SENsitive, but still ignores case and noise
    my $hucol__case_insensitive = Unicode::Collate::Locale->new(
        locale => "hu",
        level  => 2,
    );

    # level 3: now case SENsitive too, but still ignores noise
    my $hucol__noise_insensitive = Unicode::Collate::Locale->new(
        locale => "hu",
        level  => 3,
    );

    # level 4: also considers noise (variable characters) to break ties
    my $hungarian_collator = Unicode::Collate::Locale->new(
        locale => "hu",
        level  => 4,
    );


More exposition follows.  Also, I include a (nearly) undocumented rough
draft of a ucsort program to help you play around with the UCA and all its
options, including locales.  I use it every single day, actually.   It
accepts a

    --locale=xyz

argument, but doesn't require it.  In truth, I almost always call it
without any arguments, just because the UCA does such great things with
regular text.

That said, I do use the --preprocess option a good bit, to cheat on the
sort.  The --preprocess option is nicer at the command-line than in the
module:

    % ucsort --preprocess='s/^(a|an|the)\s+//i'  /tmp/titles

Isn't that nice and easy?  That way you can feed it 

    Asfaloth
    An Apple a Day
    A Bridge Too Far
    Born Free
    The Big Bad Wolf

and get back

    An Apple a Day
    Asfaloth
    The Big Bad Wolf
    Born Free
    A Bridge Too Far
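
For comparison, here is a rough sketch of the same trick done in the module, using the preprocess parameter of Unicode::Collate. It is not taken from ucsort; the regex is simply the one from the command line above.

```perl
use strict;
use warnings;
use Unicode::Collate;

my @titles = (
    "Asfaloth",
    "An Apple a Day",
    "A Bridge Too Far",
    "Born Free",
    "The Big Bad Wolf",
);

# preprocess sees each string before its sort key is built;
# sort() still returns the original, unmodified strings.
my $collator = Unicode::Collate->new(
    preprocess => sub {
        my $title = shift;
        $title =~ s/^(?:a|an|the)\s+//i;   # drop a leading article
        return $title;
    },
);

my @sorted = $collator->sort(@titles);
print "$_\n" for @sorted;
```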

Sorting fields right-to-left is also useful, yielding in that case:

    % ucsort --reverse-fields /tmp/titles
    Asfaloth
    An Apple a Day
    A Bridge Too Far
    Born Free
    The Big Bad Wolf

--tom

=============

MORE EXPOSITION ON Unicode::Collate objects

If you add

    normalization => undef,

to the constructor arguments, then you can call methods like match() and
gmatch() on your object and get your whatever-insensitive strings back.
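
A small sketch of what that looks like, assuming a level-1 (letters only) collator. The sample text is invented for illustration.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

binmode(STDOUT, ":encoding(UTF-8)");

# The matching methods croak unless normalization is switched off.
my $loose = Unicode::Collate->new(
    normalization => undef,
    level         => 1,      # letters only: ignore accents and case
);

my $text = "Résumé, resume, RESUME";

# gmatch returns every stretch of $text that equals "resume" at level 1,
# exactly as it is spelled in $text.
my @found = $loose->gmatch($text, "resume");
print "$_\n" for @found;
```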

If collation strength/weight 3 or 4 isn't quite to your liking, you can
play around with the "variable" parameter, which takes four possible
values: "blanked", "non-ignorable", "shifted", or "shift-trimmed".  It
defaults to shifted, which means that punctuation (not diacritics) and
symbols and such get ignored at the first three levels but considered
at the fourth.
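
To see the effect of "variable", here is a short sketch comparing the same pair of strings at level 3 under two of those settings. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Same strength (level 3), two different treatments of punctuation.
my $shifted = Unicode::Collate->new(level => 3, variable => "shifted");
my $strict  = Unicode::Collate->new(level => 3, variable => "non-ignorable");

my ($x, $y) = ("re-invent", "reinvent");

# shifted: the hyphen is ignored at levels 1 to 3, so the strings tie.
my $tie_shifted = $shifted->eq($x, $y) ? 1 : 0;

# non-ignorable: the hyphen carries a primary weight, so they differ.
my $tie_strict = $strict->eq($x, $y) ? 1 : 0;

print "shifted:       $tie_shifted\n";
print "non-ignorable: $tie_strict\n";
```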

The Unicode Standard only requires 4 levels. One can actually have more
than 4 levels, but I seem to recall that this will blow up your
memory profile.  The four (or more) levels are:

     1 Primary     consider only alphabetic ordering,
		   so ignore diacritics, case distinctions, and non-letters

     2 Secondary   also consider diacritics,
		   so ignore case distinctions and non-letters

     3 Tertiary    also consider case distinctions
		   so ignore non-letters

     4 Quaternary  consider other code points for tie-breaking

Here's an example using only English words for simplicity's sake:

                            When Compared at Collation Strength...
                        ______________________________________________
                          Primary     Secondary   Tertiary  Quaternary
    String#1  String#2  (alphabetic)  (accents)    (case)     (etc)
    ‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾  ‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
     Bob       Alice     DIFFER         --         --         --
     resume    résumé     SAME        DIFFER       --         --
     Bob       bob        SAME         SAME      DIFFER       --
     I'll      ill        SAME         SAME      DIFFER       --
     I'll      Ill        SAME         SAME       SAME      DIFFER
     re-invent reinvent   SAME         SAME       SAME      DIFFER
     re-invent reïnvent   SAME        DIFFER       --         --

        (sure hope you're reading this in a fixed-width font! :)
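
The table can be reproduced with a short sketch like the following, which reports the first level at which each pair differs. first_difference() is a hypothetical helper written for this illustration, not part of Unicode::Collate.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

binmode(STDOUT, ":encoding(UTF-8)");

# One collator per strength, 1 (primary) through 4 (quaternary).
my @collators = map { Unicode::Collate->new(level => $_) } 1 .. 4;

# Return the first level at which two strings differ, or 0 if they
# are still equal at level 4.
sub first_difference {
    my ($left, $right) = @_;
    for my $level (1 .. 4) {
        return $level unless $collators[$level - 1]->eq($left, $right);
    }
    return 0;
}

my @pairs = (
    ["Bob", "Alice"], ["resume", "résumé"], ["Bob", "bob"],
    ["I'll", "ill"], ["I'll", "Ill"],
    ["re-invent", "reinvent"], ["re-invent", "reïnvent"],
);

for my $pair (@pairs) {
    printf "%-10s %-10s differ at level %d\n",
        @$pair, first_difference(@$pair);
}
```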


More exposition:

    if  (the letters are the same)		# level 1
    and (the accents are the same)		# level 2
    and (the casing   is the same)		# level 3
    and (the non-ignorables are the same)	# level 4

The thing about the UCA (which includes collating and (g)matching
and such) is that it is always a four-level thing and in that
order.  That means that with a pair of words like "Renée" and
"renege", we never consider casing differences unless the
accents are the same, and also that we never consider accents
unless the letters are the same.

Going one at a time, with parens around the parts not considered:

    "R"(enée)    and
    "r"(enege)   are the same letters.
     ‾
    (R)"e"(née)  and
    (r)"e"(nege) are the same letters.
        ‾
    (Re)"n"(ée)  and
    (re)"n"(ege) are the same letters.
         ‾
    (Ren)"é"(e)  and
    (ren)"e"(ge) are the same letters.
          ‾
    (René)"e"    and
    (rene)"g"(e) are different letters.
           ‾

We use collator objects when we are sorting or comparing
text and we don't want trivial variations in the precise
representation to produce results different from those
that noncomputer people are expecting.

If I make four collators

    $col_1 = Unicode::Collate->new(level => 1);
    $col_2 = Unicode::Collate->new(level => 2);
    $col_3 = Unicode::Collate->new(level => 3);
    $col_4 = Unicode::Collate->new(level => 4);

I can then say

    if ($col_1->eq($a, $b)) {
        say "$a and $b have the same letters";

        if ($col_2->eq($a, $b)) {
            say "$a and $b have the same diacritics";

            if ($col_3->eq($a, $b)) {
                say "$a and $b have the same casing";

                if ($col_4->eq($a, $b)) {
                    say "$a and $b have the same nonignorables";

                    if ($a eq $b) {
                        say "$a and $b have the same code points";
                    }
                }
            }
        }
    }

Notice you can't just compare diacritics but ignore letter distinctions,
just as you can't compare casing but ignore diacritic distinctions nor
compare punctuation but ignore casing.  It's a cascade.

People get into trouble because they just compare two strings using "eq",
which only does a very literal one-for-one code point comparison.  It fails
on forms that are supposed to be considered canonically identical but whose
code points differ, like:

    Ren\N{LATIN SMALL LETTER E WITH ACUTE}e
    Rene\N{COMBINING ACUTE ACCENT}e

When you're comparing "grapheme units", "\N{LATIN SMALL LETTER E WITH
ACUTE}" and "e\N{COMBINING ACUTE ACCENT}" should be the same.  Those are
the same letters because they have the same canonical forms even though
the code points differ.  This is always what people want.  Code point
comparisons suck for the human experience.
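
A minimal sketch of the difference, using the core Unicode::Normalize module to put both spellings into the same canonical form before comparing. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);
use Unicode::Collate;

my $precomposed = "Ren\N{U+00E9}e";    # LATIN SMALL LETTER E WITH ACUTE
my $decomposed  = "Rene\N{U+0301}e";   # e + COMBINING ACUTE ACCENT

# Raw code point comparison: the strings differ.
my $raw_equal = ($precomposed eq $decomposed) ? 1 : 0;

# Normalizing both sides to NFD first lets eq see canonical equivalence.
my $nfd_equal = (NFD($precomposed) eq NFD($decomposed)) ? 1 : 0;

# A default collator treats canonically equivalent strings as equal.
my $uca_equal = Unicode::Collate->new->eq($precomposed, $decomposed) ? 1 : 0;

print "raw: $raw_equal, NFD: $nfd_equal, collator: $uca_equal\n";
```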

Naïve code point comparison also fails to take the expected cascading
comparison I've given above into account.  What part of the cascade you
want to stop at is what the collator strength (or level) is all about:

A normal collator->eq comparison:

    % perl -Mcharnames=:full -MUnicode::Collate -E 'say Unicode::Collate->new->eq("Ren\N{LATIN SMALL LETTER E WITH ACUTE}e", "Rene\N{COMBINING ACUTE ACCENT}e") || 0'
    1

For the rest of these, I've uc()d the LHS and lc()d the RHS.
  
A normal collator->eq comparison of uc vs lc:

    % perl -Mcharnames=:full -MUnicode::Collate -E 'say Unicode::Collate->new->eq(uc "Ren\N{LATIN SMALL LETTER E WITH ACUTE}e", lc "Rene\N{COMBINING ACUTE ACCENT}e") || 0'
    0

A collator->eq checking all same letters (level 1):

    % perl -Mcharnames=:full -MUnicode::Collate -E 'say Unicode::Collate->new(level=>1)->eq(uc "Ren\N{LATIN SMALL LETTER E WITH ACUTE}e", lc "Rene\N{COMBINING ACUTE ACCENT}e") || 0'
    1

A collator->eq checking all same diacritics (level 2):

    % perl -Mcharnames=:full -MUnicode::Collate -E 'say Unicode::Collate->new(level=>2)->eq(uc "Ren\N{LATIN SMALL LETTER E WITH ACUTE}e", lc "Rene\N{COMBINING ACUTE ACCENT}e") || 0'
    1

A collator->eq checking all same casing (level 3):

    % perl -Mcharnames=:full -MUnicode::Collate -E 'say Unicode::Collate->new(level=>3)->eq(uc "Ren\N{LATIN SMALL LETTER E WITH ACUTE}e", lc "Rene\N{COMBINING ACUTE ACCENT}e") || 0'
    0

Note that if there are various combining characters, like both a
cedilla and an acute on the same letter, their order should not
matter:

    % perl -Mcharnames=:full -MUnicode::Collate -E 'say Unicode::Collate->new->eq("c\N{COMBINING ACUTE ACCENT}\N{COMBINING CEDILLA}", "c\N{COMBINING CEDILLA}\N{COMBINING ACUTE ACCENT}") || 0'
    1

    % perl -Mcharnames=:full -E 'say "c\N{COMBINING ACUTE ACCENT}\N{COMBINING CEDILLA}" eq "c\N{COMBINING CEDILLA}\N{COMBINING ACUTE ACCENT}" || 0'
    0

This is only true because an acute accent goes on top and a
cedilla beneath, which puts them in different combining classes.
If you had two in the same combining class, like a tilde with an
acute, then the order *does* matter:

    % perl -Mcharnames=:full -MUnicode::Collate -E 'say Unicode::Collate->new->eq("a\N{COMBINING ACUTE ACCENT}\N{COMBINING TILDE}", "a\N{COMBINING TILDE}\N{COMBINING ACUTE ACCENT}") || 0'
    0

That's because the order matters within the same canonical
combining class, because the glyphs should look different:

    % perl -Mcharnames=:full -E 'say for "a\N{COMBINING ACUTE ACCENT}\N{COMBINING TILDE}", "a\N{COMBINING TILDE}\N{COMBINING ACUTE ACCENT}"'
    á̃
    ã́

Here's all four:

    % perl -Mcharnames=:full -E 'say for "c\N{COMBINING CEDILLA}\N{COMBINING ACUTE ACCENT}", "c\N{COMBINING ACUTE ACCENT}\N{COMBINING CEDILLA}", "a\N{COMBINING ACUTE ACCENT}\N{COMBINING TILDE}", "a\N{COMBINING TILDE}\N{COMBINING ACUTE ACCENT}"'
    ḉ
    ḉ
    á̃
    ã́

Which is the same when uniquoted as what I put into it:

    % perl -Mcharnames=:full -E 'say for "c\N{COMBINING CEDILLA}\N{COMBINING ACUTE ACCENT}", "c\N{COMBINING ACUTE ACCENT}\N{COMBINING CEDILLA}", "a\N{COMBINING ACUTE ACCENT}\N{COMBINING TILDE}", "a\N{COMBINING TILDE}\N{COMBINING ACUTE ACCENT}"' | uniquote -v
    c\N{COMBINING CEDILLA}\N{COMBINING ACUTE ACCENT}
    c\N{COMBINING ACUTE ACCENT}\N{COMBINING CEDILLA}
    a\N{COMBINING ACUTE ACCENT}\N{COMBINING TILDE}
    a\N{COMBINING TILDE}\N{COMBINING ACUTE ACCENT}

However, notice how the first two have the same canonical
decompositions, but the second two do not:

    % perl -Mcharnames=:full -E 'say for "c\N{COMBINING CEDILLA}\N{COMBINING ACUTE ACCENT}", "c\N{COMBINING ACUTE ACCENT}\N{COMBINING CEDILLA}", "a\N{COMBINING ACUTE ACCENT}\N{COMBINING TILDE}", "a\N{COMBINING TILDE}\N{COMBINING ACUTE ACCENT}"' | nfd | uniquote -v
    c\N{COMBINING CEDILLA}\N{COMBINING ACUTE ACCENT}
    c\N{COMBINING CEDILLA}\N{COMBINING ACUTE ACCENT}
    a\N{COMBINING ACUTE ACCENT}\N{COMBINING TILDE}
    a\N{COMBINING TILDE}\N{COMBINING ACUTE ACCENT}
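
The reordering comes straight from the canonical combining classes. As a sketch, getCombinClass() from the core Unicode::Normalize module shows the class of each mark, and NFD shows which spellings converge.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD getCombinClass);

my %mark = (
    "COMBINING ACUTE ACCENT" => 0x0301,
    "COMBINING TILDE"        => 0x0303,
    "COMBINING CEDILLA"      => 0x0327,
);

# Print each mark's canonical combining class.
printf "%-24s ccc=%d\n", $_, getCombinClass($mark{$_}) for sort keys %mark;

# Different classes: NFD reorders the marks, so both spellings converge.
my $cedilla_case = (NFD("c\x{301}\x{327}") eq NFD("c\x{327}\x{301}")) ? 1 : 0;

# Same class: NFD leaves the order alone, so the spellings stay distinct.
my $tilde_case = (NFD("a\x{301}\x{303}") eq NFD("a\x{303}\x{301}")) ? 1 : 0;

print "cedilla/acute equal after NFD: $cedilla_case\n";
print "tilde/acute equal after NFD:   $tilde_case\n";
```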

Similarly therefore for canonical composition, too:

    % perl -Mcharnames=:full -E 'say for "c\N{COMBINING CEDILLA}\N{COMBINING ACUTE ACCENT}", "c\N{COMBINING ACUTE ACCENT}\N{COMBINING CEDILLA}", "a\N{COMBINING ACUTE ACCENT}\N{COMBINING TILDE}", "a\N{COMBINING TILDE}\N{COMBINING ACUTE ACCENT}"' | nfc | uniquote -v
    \N{LATIN SMALL LETTER C WITH CEDILLA AND ACUTE}
    \N{LATIN SMALL LETTER C WITH CEDILLA AND ACUTE}
    \N{LATIN SMALL LETTER A WITH ACUTE}\N{COMBINING TILDE}
    \N{LATIN SMALL LETTER A WITH TILDE}\N{COMBINING ACUTE ACCENT}
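
If you don't have the nfd, nfc, and uniquote filters handy, the same view can be approximated in-process. This is only a sketch: spell_out() is a hypothetical helper that roughly imitates the "uniquote -v" output shown above, built on the core charnames and Unicode::Normalize modules.

```perl
use strict;
use warnings;
use charnames ();
use Unicode::Normalize qw(NFC NFD);

# Spell non-ASCII code points out as \N{...} names; ASCII passes through.
sub spell_out {
    my ($string) = @_;
    return join "", map {
        my $cp = ord;
        $cp < 0x80 ? $_ : '\N{' . charnames::viacode($cp) . '}';
    } split //, $string;
}

my @inputs = (
    "c\x{327}\x{301}", "c\x{301}\x{327}",
    "a\x{301}\x{303}", "a\x{303}\x{301}",
);

for my $input (@inputs) {
    print "NFD: ", spell_out(NFD($input)), "\n";
    print "NFC: ", spell_out(NFC($input)), "\n";
}
```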

I haven't talked about "default ignorable code points".  These are just
internal formatting noise that shouldn't interfere with a match.  That means
things can be the same at level 4 but different code-point-wise, such as:

    % perl -Mcharnames=:full -MUnicode::Collate -E 'say Unicode::Collate->new->eq("abcd", "ab\N{ZWJ}cd") || 0'
    1

    % perl -Mcharnames=:full -E 'say "abcd" eq "ab\N{ZWJ}cd" || 0'
    0

That's because ZWJ (ZERO WIDTH JOINER) is a default ignorable
code point. Others like that include the SOFT HYPHEN.  If you're
searching for "posthole", and somebody has "post\N{SOFT HYPHEN}hole"
or "post\N{ZWNJ}hole" (for a ZERO WIDTH NON-JOINER) you still want
to find it.  Default ignorable code points include things like:

    00AD \p{Cf} SOFT HYPHEN

    200B \p{Cf} ZERO WIDTH SPACE
    FEFF \p{Cf} ZERO WIDTH NO-BREAK SPACE
    200C \p{Cf} ZERO WIDTH NON-JOINER

    2060 \p{Cf} WORD JOINER
    200D \p{Cf} ZERO WIDTH JOINER
    034F \p{Mn} COMBINING GRAPHEME JOINER

    200E \p{Cf} LEFT-TO-RIGHT MARK
    200F \p{Cf} RIGHT-TO-LEFT MARK
    206C \p{Cf} INHIBIT ARABIC FORM SHAPING
    206D \p{Cf} ACTIVATE ARABIC FORM SHAPING

    FE00 \p{Mn} VARIATION SELECTOR-1
    FE01 \p{Mn} VARIATION SELECTOR-2
     ...
    FE0F \p{Mn} VARIATION SELECTOR-16
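
Here is a sketch of the "posthole" search using the collator's index() method, which needs normalization switched off as described earlier. The haystack string is invented for illustration.

```perl
use strict;
use warnings;
use Unicode::Collate;

# normalization => undef is required for index() and friends.
my $collator = Unicode::Collate->new(normalization => undef);

my $needle   = "posthole";
my $haystack = "dig a post\x{AD}hole here";   # contains a SOFT HYPHEN

# Perl's built-in index() compares code points and misses it.
my $builtin_pos = index($haystack, $needle);

# The collator's index() returns (position, length) in list context,
# or an empty list when there is no match.
my ($pos, $len) = $collator->index($haystack, $needle);
my $hit = defined $pos ? substr($haystack, $pos, $len) : undef;

print "built-in index: $builtin_pos\n";
printf "collator found %d code points at offset %d\n", length($hit), $pos
    if defined $hit;
```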

Do also remember that gmatch and the other methods in the Unicode::Collate
and Unicode::Collate::Locale classes are for literal strings only, not for
regexes.  You can compare and search for strings in a very loose way with
them, but it is not the same as regexes.

