
FMTEYEWTK about Unicode Grapheme Matching (was: Solving the *real* Dot Problem)

From:
Tom Christiansen
Date:
July 7, 2011 10:13
Subject:
FMTEYEWTK about Unicode Grapheme Matching (was: Solving the *real* Dot Problem)
Message ID:
19627.1310058640@chthon

    /* 
     *
     * Please please please read this, 
     * although ok fine not necessarily all 
     * at one sitting.  
     * 
     * But do please read it.
     *
     */

SUMMARY: When you start thinking about dot, you have to think about
         graphemes and when and whether it’s tolerable to break them.
         With that comes far more: things like case-insensitivity
         and letter-equivalence -- WHICH ARE NOT WHAT YOU THINK THEY ARE!!!

CULPA MEA: This is late.  I fell asleep while typing the reply.  I mean 
           that completely literally: I awoke with my head face down on
           the keyboard. Ouch!  

Johan Vromans <jvromans@squirrel.nl> wrote on Thu, 07 Jul 2011 09:17:25 +0200: 

Tom Christiansen <tchrist@perl.com> writes:

>> The Dot Problem will never be solved until people start thinking in
>> Unicode not ASCII. Otherwise you’ll “solve” the “wrong” “problem”.

> Not quite. I think you had the tiger by the tail one sentence earlier:

>> […] let’s please step back and evaluate the original sense of “.” […]

> This is what matters. What is the intended purpose of “.”?

> Originally, the intention was to be able to match ‘lines’ in a blob of
> data slurped from a disk file. Files at the time were newline separated
> streams of single-byte characters, so “.” matched any byte except \x0a
> (newline). That this assumption would not hold in the longer term became
> apparent when Windows, Mac, VMS and EBCDIC files came into
> consideration.

But it’s worse than that.  You cannot write /.*(jose)/i because
if José is coming, you just grabbed only part of the name, word, 
glyph, grapheme, character, thingamajig.
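To see the failure concretely, here is a minimal sketch (assuming only Unicode::Normalize): in NFD, the é of José is an e followed by a combining acute accent, so the naive capture splits the final grapheme and strands the accent.

```perl
use utf8;
use v5.14;
use open qw( :encoding(UTF-8) :std );
use Unicode::Normalize;

# In NFD, é decomposes to "e" followed by U+0301 COMBINING ACUTE ACCENT.
my $str = NFD("José");                  # "Jose\x{301}", five code points

if ($str =~ /.*(jose)/i) {
    my $grabbed = $1;                   # "Jose": four code points only
    say "grabbed ‹$grabbed›, stranding the combining acute"
        if $str =~ /\Q$grabbed\E\x{301}/;
}
```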

It isn’t even good enough to write 

    NFD($str) =~ / \X*  ( (?=j)\X (?=o)\X (?=s)\X (?=e)\X ) /xi

Now ok, that *would* work here, but it does not work in the general or
even a very familiar case.   Before I consider our dear own porter "Ævar
Arnfjörð Bjarmason", our v5.13.10 pumpking, let’s look at something
easier, something you *can* do, like trying to find which languages
start the viceroy’s name with "Cristo" or with "Colon". 

    use utf8;
    use v5.14;

    my %viceroy = (
        Latin       => "Christophorus Columbus",
        Italian     => "Cristoforo    Colombo",
        Spanish     => "Cristóbal     Colón",
        Portuguese  => "Cristóvão     Colombo",
        Catalan     => "Cristòfor     Colom",
        English     => "Christopher   Columbus",
    );

    my $mask = " … the viceroy in %-10s has ‹%s› in %s.";

    say "Checking christo…";
    for my $lang (keys %viceroy) {
        my $name = $viceroy{$lang};
        if ($name =~ / \b christo /pix) {
            say sprintf $mask, $lang, ${^MATCH}, $name;
        } 
    } 

    say "Checking colon…";
    for my $lang (keys %viceroy) {
        my $name = $viceroy{$lang};
        if ($name =~ / \b colon /pix) {
            say sprintf $mask, $lang, ${^MATCH}, $name;
        } 
    } 

That won’t work, and for several reasons, the most glaring being that a
literal ‹o› in the pattern can never match the ‹ó› in Colón or the ‹ò›
in Cristòfor.

You might think the right answer is to do this, which is what
everyone does, but shouldn’t:

    use utf8;
    use v5.14;
    use strict;
    use warnings;
    use open qw( :encoding(UTF-8) :std );

    use Unicode::Normalize;

    my %viceroy = (
        Latin       => "Christophorus Columbus",
        Italian     => "Cristoforo Colombo",
        Spanish     => "Cristóbal Colón",
        Portuguese  => "Cristóvão Colombo",
        Catalan     => "Cristòfor Colom",
        English     => "Christopher Columbus",
    );

    my $mask = " … the viceroy in %-10s has ‹%s› in %s.";

    say "Checking christo…";
    for my $lang (keys %viceroy) {
        my $name = (NFD $viceroy{$lang}) =~ s/\pM//rg;
        if ($name =~ / \b cristo /pix) {
            say sprintf $mask, $lang, ${^MATCH}, $name;
        } 
    } 

    say "Checking colon…";
    for my $lang (keys %viceroy) {
        my $name = (NFD $viceroy{$lang}) =~ s/\pM//rg;
        if ($name =~ / \b colon /pix) {
            say sprintf $mask, $lang, ${^MATCH}, $name;
        } 
    } 

They shouldn’t, because it gives lame/wrong/broken/stupid output:

    Checking christo…
     … the viceroy in Catalan    has ‹Cristo› in Cristofor Colom.
     … the viceroy in Portuguese has ‹Cristo› in Cristovao Colombo.
     … the viceroy in Italian    has ‹Cristo› in Cristoforo Colombo.
     … the viceroy in Spanish    has ‹Cristo› in Cristobal Colon.
    Checking colon…
     … the viceroy in Spanish    has ‹Colon› in Cristobal Colon.


And you aren’t really helping things if you do change the mutilated
name back to the original:

    say sprintf $mask, $lang, ${^MATCH}, $viceroy{$lang};

Because now you get

    Checking christo…
     … the viceroy in Catalan    has ‹Cristo› in Cristòfor Colom.
     … the viceroy in Portuguese has ‹Cristo› in Cristóvão Colombo.
     … the viceroy in Italian    has ‹Cristo› in Cristoforo Colombo.
     … the viceroy in Spanish    has ‹Cristo› in Cristóbal Colón.
    Checking colon…
     … the viceroy in Spanish    has ‹Colon› in Cristóbal Colón.

See?  The strings in the angle-quotes are wrong.

*This* is the output you want:

    christo check…
     … the viceroy in Catalan    has ‹Cristò› in Cristòfor Colom.
     … the viceroy in Portuguese has ‹Cristó› in Cristóvão Colombo.
     … the viceroy in Italian    has ‹Cristo› in Cristoforo Colombo.
     … the viceroy in Spanish    has ‹Cristó› in Cristóbal Colón.
    colon check…
     … the viceroy in Spanish    has ‹Colón› in Cristóbal Colón.

And one way to get it is like this:

    use utf8;
    use v5.14;
    use strict;
    use warnings;
    use open qw( :encoding(UTF-8) :std );

    use Unicode::Normalize;

    my %viceroy = (
        Latin       => "Christophorus Columbus",
        Italian     => "Cristoforo Colombo",
        Spanish     => "Cristóbal Colón",
        Portuguese  => "Cristóvão Colombo",
        Catalan     => "Cristòfor Colom",
        English     => "Christopher Columbus",
    );

    my $O = qr/(?=o)\X/i;

    my $mask = " … the viceroy in %-10s has ‹%s› in %s.";

    say "christo check…";
    for my $lang (keys %viceroy) {
        my $name = $viceroy{$lang};
        if (NFD($name) =~ / \b crist${O} /pix) {
            say NFC sprintf $mask, $lang, ${^MATCH}, $name;
        } 
    } 

    say "colon check…";
    for my $lang (keys %viceroy) {
        my $name = $viceroy{$lang};
        if (NFD($name) =~ / \b col${O}n /pix) {
            say NFC sprintf $mask, $lang, ${^MATCH}, $name;
        } 
    } 


I’m the first to admit that it gets a little tedious to write
(?:(?=C)\X) for each letter C.  That does lead to stuff like this:

    my $V = qr/(?=[aeiouy])\X/i;

    for my $lang (keys %viceroy) {
        my $name = NFD $viceroy{$lang};
        say NFC "\n  …Check $lang $name…";
        while ($name =~ /$V/gp) {
            my $spot = pos($name) - length(${^MATCH});
            say NFC sprintf "    …found ‹%s› at %d in %s.", ${^MATCH}, $spot, $name;
        } 
    } 

which leads to things like this:

  …Check Portuguese Cristóvão Colombo…
    …found ‹i› at 2 in Cristóvão Colombo.
    …found ‹ó› at 5 in Cristóvão Colombo.
    …found ‹ã› at 8 in Cristóvão Colombo.
    …found ‹o› at 10 in Cristóvão Colombo.
    …found ‹o› at 13 in Cristóvão Colombo.
    …found ‹o› at 15 in Cristóvão Colombo.
    …found ‹o› at 18 in Cristóvão Colombo.

You may think you have solved your problem, but you haven’t.
The pos() positions are by code point, not by grapheme.
I can solve that, too, but that’s not the point here.
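For the record, one illustrative way to patch that up (a sketch, using a hypothetical helper named grapheme_spot) is to count \X clusters in the prefix that precedes the code point offset:

```perl
use utf8;
use v5.14;
use Unicode::Normalize;

# Hypothetical helper: convert a code point offset, as reported by
# pos(), into a grapheme offset by counting \X clusters before it.
sub grapheme_spot {
    my ($string, $cp_spot) = @_;
    my $prefix    = substr($string, 0, $cp_spot);
    my $graphemes = () = $prefix =~ /\X/g;
    return $graphemes;
}

# In NFD "Cristóvão", code point 8 is the "a" of ã, but only seven
# graphemes precede it, because ó took two code points.
say grapheme_spot(NFD("Cristóvão Colombo"), 8);    # prints 7
```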

Consider rather how you would search for "smørrebrød", 
a noun that the OED says to pronounce /ˈsmœrəbrœð/
and which is said to mean a Danish open sandwich.

*NOW* you have *REALLY* big problems.  How are you going to allow
people to search for "brod" and get that word?  And yes, you *do* have
to be able to do that.  Here’s the problem:

    ORIG: smørrebrød
    NFD:  sm\x{F8}rrebr\x{F8}d
    NFC:  sm\x{F8}rrebr\x{F8}d

And no, K-decomps won’t help you, because here NFD is the same as NFKD,
NFC the same as NFKC.   So what do you do?  
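You can verify that dead end directly; a minimal check assuming only Unicode::Normalize.  U+00F8 ø has no decomposition at all, canonical or compatibility, so every normalization form returns the word unchanged and a literal /brod/i can never match:

```perl
use utf8;
use v5.14;
use Unicode::Normalize;

my $word = "smørrebrød";

# ø (U+00F8) does not decompose, so all four forms are identical.
say NFD($word)  eq $word ? "NFD:   unchanged" : "NFD:   differs";
say NFC($word)  eq $word ? "NFC:   unchanged" : "NFC:   differs";
say NFKD($word) eq $word ? "NFKD:  unchanged" : "NFKD:  differs";
say $word =~ /brod/i     ? "matched"          : "/brod/i never matches";
```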

Even normalization won’t help you here.  I hope you see now that
normalization is NECESSARY BUT *NOT* SUFFICIENT.  You must do more:
*much* more.

Which I shall now do.

I’d like to go back to Ævar to better illustrate not just the problem but
indeed *the* solution.  The issue is that you really, really want people
to be able to search for him using things like "aevar" or "jord", and no
regex match as currently implemented by core Perl will do that.
Consider:

    ORIG: Ævar Arnfjörð Bjarmason
    NFD:  \x{C6}var Arnfjo\x{308}r\x{F0} Bjarmason
    NFKD: \x{C6}var Arnfjo\x{308}r\x{F0} Bjarmason

Here again the K-versions are no different from those without the K.

The problem, and it is a completely legit problem, is how would you
search for his name (or pieces of it) using *letter distinctions alone*,
without regard to anything else including case and certainly without
considering diacritics or ligatures? The point is that if it counts as
the same letter, it *counts*.

You really can’t do (?i:(?=[aeoiuy])\X), because while you could approach
the trivial case of Cristóbal Colón that way if you’re careful enough,
you could *not* get smørrebrød or Ævar Arnfjörð Bjarmason that way.

That’s because there’s no NF${whatever} that maps Æ => AE, ð => d + mark,
or ø => o + mark.  So if you want to find a "d" or an "o", let alone an
"ae", no matter the diacritic or case, the **ONLY** way to do that
currently is this:

    use v5.14;
    use utf8;
    use Unicode::Normalize;
    use Unicode::Collate;

    my $collator = Unicode::Collate->new(
        level           => 1,
        normalization   => undef,
           ### pick one of: blanked, non-ignorable, shifted, shift-trimmed
        ## variable        => "non-ignorable",
    );


    my $name = "Ævar Arnfjörð Bjarmason";

    my @searches = ( 
        qw<aevar jord bjarm>,  # partial strings
        qw<AR JO OR RD>, 
        qw<DB RAR>,            # these match because I didn't set variable
    );

    say "Searching $name…";

    for my $search (@searches) {
        next unless my @hits = $collator->gmatch($name, $search);
        say " … found /$search/ at ", join(", " => map { "‹$_›" } @hits), ".";
    } 


When run, that rightly produces:

    Searching Ævar Arnfjörð Bjarmason…
     … found /aevar/ at ‹Ævar›.
     … found /jord/ at ‹jörð›.
     … found /bjarm/ at ‹Bjarm›.
     … found /AR/ at ‹ar›, ‹Ar›, ‹ar›.
     … found /JO/ at ‹jö›.
     … found /OR/ at ‹ör›.
     … found /RD/ at ‹rð›.
     … found /DB/ at ‹ð B›.
     … found /RAR/ at ‹r Ar›.

Isn’t that just spiffy?  Yeah, fine, if you don’t like the last
two answers, then uncomment the (variable => "non-ignorable") pair.
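To see what that pair changes, here is a minimal sketch.  Unicode::Collate defaults to variable => "shifted", under which a level-1 search gives spaces no primary weight at all, which is exactly how "DB" managed to match across the word boundary in ‹ð B›:

```perl
use utf8;
use v5.14;
use open qw( :encoding(UTF-8) :std );
use Unicode::Collate;

my $name = "Ævar Arnfjörð Bjarmason";

# Default variable treatment, "shifted": spaces are ignored at level 1.
my $shifted = Unicode::Collate->new(level => 1, normalization => undef);

# "non-ignorable": the space gets a primary weight of its own.
my $nonign  = Unicode::Collate->new(level => 1, normalization => undef,
                                    variable => "non-ignorable");

my @loose  = $shifted->gmatch($name, "DB");
my @strict = $nonign->gmatch($name, "DB");

say @loose  ? "shifted:       found ‹@loose›"  : "shifted:       nothing";
say @strict ? "non-ignorable: found ‹@strict›" : "non-ignorable: nothing";
```

Run against the v5.14-era Unicode::Collate, this should report ‹ð B› for the shifted collator and nothing for the non-ignorable one, matching the output above.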

And yes, it’s not quite a regex, but *it* *does* *work*.  

We have to do it this way all because we do not yet implement
UTS#18’s RL3.4 on Tailored Loose Matches.

    http://www.unicode.org/reports/tr18/#Tailored_Loose_Matches

If we did, we might write those as

    /\v{PRIMARY}jord/

and be done with it.  However, I am not very fond of regex pragmas that
turn on like that.  I’d prefer things like

    / \F{uca=primary} jord  /x
    / \F{case=uca1}   jord  /x
    / (?u{case=uca1}: jord) /x

Or as lexicals, things like this:

    use re case => "insensitive";
    use re case => "insensitive", folding => "full";

    use re compare => "uca";
    use re compare => "uca", uca => "primary";
    use re compare => "uca", uca => "primary", uca_locale => "en";
    use re compare => "uca", uca => "primary", uca_locale => "is";

Note that in uca_locale "is" ‹d› and ‹ð› are no longer the same the way
they are in normal Unicode, nor are ‹Æ› and ‹AE›, etc.  But under the
default UCA, they most certainly are.
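You can already watch that locale tailoring happen with Unicode::Collate::Locale; a minimal sketch at primary strength (level 1):

```perl
use utf8;
use v5.14;
use open qw( :encoding(UTF-8) :std );
use Unicode::Collate;
use Unicode::Collate::Locale;

my $ducet   = Unicode::Collate->new(level => 1, normalization => undef);
my $iceland = Unicode::Collate::Locale->new(locale => "is",
                                            level  => 1,
                                            normalization => undef);

# Under the default UCA, ð counts as d and Æ counts as AE …
say $ducet->eq("d", "ð")    ? "DUCET: d eq ð"  : "DUCET: d ne ð";
say $ducet->eq("Æ", "AE")   ? "DUCET: Æ eq AE" : "DUCET: Æ ne AE";

# … but Icelandic tailors ð and Æ into letters of their own.
say $iceland->eq("d", "ð")  ? "is:    d eq ð"  : "is:    d ne ð";
say $iceland->eq("Æ", "AE") ? "is:    Æ eq AE" : "is:    Æ ne AE";
```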

Here are more demos of where the problems lie, and the data.  The issue is
that there are many, many, many more code points that "count as letter C"
than just /C/i, even when simple and full casefolding are *both* taken
into account.  Why, just for the simple letter "d" alone, there are 64
matching UCA1 code points:

    % unichars -a 'UCA eq UCA("d")'
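In terms of the module, the predicate that unichars is evaluating amounts to roughly this sketch (the UCA function here is a stand-in I am defining, not part of any API): two strings count as the same letter when their level-1 sort keys agree.

```perl
use utf8;
use v5.14;
use open qw( :encoding(UTF-8) :std );
use Unicode::Collate;

# Stand-in for the predicate: compare primary-only sort keys.
my $uca1 = Unicode::Collate->new(level => 1, normalization => undef);
sub UCA { $uca1->getSortKey($_[0]) }

# Both eth and d-with-stroke carry the primary weight of plain d.
say UCA("ð") eq UCA("d") ? "ð counts as d" : "ð does not count as d";
say UCA("Đ") eq UCA("d") ? "Đ counts as d" : "Đ does not count as d";
```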

Although that becomes more manageable if you throw out those whose
NFKD includes "d", such as all the MATHEMATICAL ones:

            Double-Struck: 𝔻 𝕕
                Monospace: 𝙳 𝚍
               Sans-Serif: 𝖣 𝖽
        Sans-Serif Italic: 𝘋 𝘥
          Sans-Serif Bold: 𝗗 𝗱
   Sans-Serif Bold Italic: 𝘿 𝙙
                   Script: 𝒟 𝒹
                   Italic: 𝐷 𝑑
                     Bold: 𝐃 𝐝
              Bold Italic: 𝑫 𝒅
                  Fraktur: 𝔇 𝔡
             Bold Fraktur: 𝕯 𝖉

Those I do not care about, because they all have NFKD
forms that return us real d’s.

*However*, many do not, and this is the big issue.  If you run:

    % unichars -a 'UCA eq UCA("d")' 'NFKD !~ /d/i'
     Ð  U+000D0 LATIN CAPITAL LETTER ETH
     ð  U+000F0 LATIN SMALL LETTER ETH
     Đ  U+00110 LATIN CAPITAL LETTER D WITH STROKE
     đ  U+00111 LATIN SMALL LETTER D WITH STROKE
     ◌ͩ  U+00369 COMBINING LATIN SMALL LETTER D
     ᶞ  U+01D9E MODIFIER LETTER SMALL ETH
     ◌ᷘ  U+01DD8 COMBINING LATIN SMALL LETTER INSULAR D
     ◌ᷙ  U+01DD9 COMBINING LATIN SMALL LETTER ETH
     Ꝺ  U+0A779 LATIN CAPITAL LETTER INSULAR D
     ꝺ  U+0A77A LATIN SMALL LETTER INSULAR D
     🅓  U+1F153 NEGATIVE CIRCLED LATIN CAPITAL LETTER D
     🅳  U+1F173 NEGATIVE SQUARED LATIN CAPITAL LETTER D
     🇩  U+1F1E9 REGIONAL INDICATOR SYMBOL LETTER D

Those are all D’s, but there is no way to get at them but through the UCA.

You may tell unichars what locale to use.  Note that EN is the same
as the default locale, although not out of overt cultural bigotry:
it’s just that English uses the standard rules.  So does Dutch.

That means that:

    % unichars 'UCA eq UCA("ae")'

is the same as

    % unichars --locale=en 'UCA eq UCA("ae")'
     Æ  U+00C6 LATIN CAPITAL LETTER AE
     æ  U+00E6 LATIN SMALL LETTER AE
     Ǣ  U+01E2 LATIN CAPITAL LETTER AE WITH MACRON
     ǣ  U+01E3 LATIN SMALL LETTER AE WITH MACRON
     Ǽ  U+01FC LATIN CAPITAL LETTER AE WITH ACUTE
     ǽ  U+01FD LATIN SMALL LETTER AE WITH ACUTE
     ᴭ  U+1D2D MODIFIER LETTER CAPITAL AE
     ◌ᷔ  U+1DD4 COMBINING LATIN SMALL LETTER AE

But not, it apparently turns out, for Ævar himself:

    % unichars --locale=is 'UCA eq UCA("ae")'
     ◌ᷔ  U+1DD4 COMBINING LATIN SMALL LETTER AE

Although if you think *that* is nutty, just try this:

    % unichars --locale=de__phonebook 'UCA eq UCA("ae")' |
        ucsort --locale=de__phonebook
     Ä  U+00C4 LATIN CAPITAL LETTER A WITH DIAERESIS
     Æ  U+00C6 LATIN CAPITAL LETTER AE
     ä  U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
     æ  U+00E6 LATIN SMALL LETTER AE
     Ǟ  U+01DE LATIN CAPITAL LETTER A WITH DIAERESIS AND MACRON
     ǟ  U+01DF LATIN SMALL LETTER A WITH DIAERESIS AND MACRON
     Ǣ  U+01E2 LATIN CAPITAL LETTER AE WITH MACRON
     ǣ  U+01E3 LATIN SMALL LETTER AE WITH MACRON
     Ǽ  U+01FC LATIN CAPITAL LETTER AE WITH ACUTE
     ǽ  U+01FD LATIN SMALL LETTER AE WITH ACUTE
     ᴭ  U+1D2D MODIFIER LETTER CAPITAL AE
     ◌ᷔ  U+1DD4 COMBINING LATIN SMALL LETTER AE

Isn’t that, um, special? :)

And now you see why I have to rewrite all the standard tools to 
handle Unicode.  Otherwise they don’t work.

I know this was long.  I am sorry.  I hope it was worth it.

--tom

PS:  Did *anybody* really read this???

PPS: I am currently here:

        N    38° 52′ 17.16″
        W   106° 59′ 15.72″  
        Z  9000′ above nominal sea‐level

     So do not expect to hear from me again for a long while.


