perl.perl5.porters | Postings from March 2011

Catalan collation bug in perl or CLDR?

From: Tom Christiansen
Date: March 6, 2011 14:45
Subject: Catalan collation bug in perl or CLDR?
Message ID: 19828.1299451519@chthon

I believe there's a bug in Unicode::Collate::Locale's Catalan set-up.  It
seems to have had too much copied into it from the es__traditional locale.

Here's ca.pl:

    +{
       backwards => 2,
       entry => <<'ENTRY', # for DUCET v6.0.0
    0063 0068 ; [.15D2.0020.0002.0063] # <LATIN SMALL LETTER C, LATIN SMALL LETTER H>
    0063 0048 ; [.15D2.0020.0007.0063][.0000.0000.0002.0000] # <LATIN SMALL LETTER C, LATIN CAPITAL LETTER H>
    0043 0068 ; [.15D2.0020.0007.0043][.0000.0000.0008.0000] # <LATIN CAPITAL LETTER C, LATIN SMALL LETTER H>
    0043 0048 ; [.15D2.0020.0008.0043] # <LATIN CAPITAL LETTER C, LATIN CAPITAL LETTER H>
    006C 006C ; [.16C5.0020.0002.006C][.0000.0000.0001.0000] # <LATIN SMALL LETTER L, LATIN SMALL LETTER L>
    006C 00B7 006C ; [.16C5.0020.0002.006C][.0000.0000.0007.0000] # <LATIN SMALL LETTER L, MIDDLE DOT, LATIN SMALL LETTER L>
    006C 004C ; [.16C5.0020.0007.006C][.0000.0000.0002.0000][.0000.0000.0001.0000] # <LATIN SMALL LETTER L, LATIN CAPITAL LETTER L>
    006C 00B7 004C ; [.16C5.0020.0007.006C][.0000.0000.0002.0000][.0000.0000.0007.0000] # <LATIN SMALL LETTER L, MIDDLE DOT, LATIN CAPITAL LETTER L>
    004C 006C ; [.16C5.0020.0007.004C][.0000.0000.0008.0000][.0000.0000.0001.0000] # <LATIN CAPITAL LETTER L, LATIN SMALL LETTER L>
    004C 00B7 006C ; [.16C5.0020.0007.004C][.0000.0000.0008.0000][.0000.0000.0007.0000] # <LATIN CAPITAL LETTER L, MIDDLE DOT, LATIN SMALL LETTER L>
    004C 004C ; [.16C5.0020.0008.004C][.0000.0000.0001.0000] # <LATIN CAPITAL LETTER L, LATIN CAPITAL LETTER L>
    004C 00B7 004C ; [.16C5.0020.0008.004C][.0000.0000.0007.0000] # <LATIN CAPITAL LETTER L, MIDDLE DOT, LATIN CAPITAL LETTER L>
    ENTRY
    };

And here's es_trad.pl:

    +{
       entry => <<'ENTRY', # for DUCET v6.0.0
    0063 0068 ; [.15D2.0020.0002.0063] # <LATIN SMALL LETTER C, LATIN SMALL LETTER H>
    0043 0068 ; [.15D2.0020.0007.0043] # <LATIN CAPITAL LETTER C, LATIN SMALL LETTER H>
    0043 0048 ; [.15D2.0020.0008.0043] # <LATIN CAPITAL LETTER C, LATIN CAPITAL LETTER H>
    006C 006C ; [.16C5.0020.0002.006C] # <LATIN SMALL LETTER L, LATIN SMALL LETTER L>
    004C 006C ; [.16C5.0020.0007.004C] # <LATIN CAPITAL LETTER L, LATIN SMALL LETTER L>
    004C 004C ; [.16C5.0020.0008.004C] # <LATIN CAPITAL LETTER L, LATIN CAPITAL LETTER L>
    00F1      ; [.1703.0020.0002.00F1] # LATIN SMALL LETTER N WITH TILDE
    006E 0303 ; [.1703.0020.0002.00F1] # LATIN SMALL LETTER N WITH TILDE
    00D1      ; [.1703.0020.0008.00D1] # LATIN CAPITAL LETTER N WITH TILDE
    004E 0303 ; [.1703.0020.0008.00D1] # LATIN CAPITAL LETTER N WITH TILDE
    ENTRY
    };

However, my bilingual Castilian-Catalan dictionary (Pere Elies i Busqueta;
Barcelona, 1983) draws specific attention to how Catalan does *not* treat
"ll" and "ch" as separate letters for alphabetization the way Castilian
does/did.  Throughout the dictionary the Catalan entries follow the ordinary
letter-by-letter order, and this is clearly deliberate, because the Castilian
entries follow the traditional order instead.

So is this a Perl module bug, or is it really a CLDR bug?
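
One way to see what an installed copy of the module actually does is to sort
a few strings under both locales.  The following is only a sketch: the test
strings are arbitrary (not dictionary entries), sort_for_locale is a
hypothetical helper, and the result for 'ca' depends on which version of
Unicode::Collate::Locale is installed, so no expected output is given.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate::Locale;

binmode STDOUT, ':encoding(UTF-8)';

# Arbitrary test strings.  If "ll" and "ch" count as separate letters,
# "llum" sorts after "luxe" and "chor" sorts after "cuina".
my @words = qw(luxe llum lona cuina chor casa);

# Hypothetical helper: sort a list with the tailoring for the named locale.
sub sort_for_locale {
    my ($locale, @list) = @_;
    my $collator = Unicode::Collate::Locale->new(locale => $locale);
    return $collator->sort(@list);
}

for my $locale (qw(ca es__traditional)) {
    print "$locale: @{[ sort_for_locale($locale, @words) ]}\n";
}
```

If the two lines come out identical, the Catalan tailoring is treating "ch"
and "ll" the way traditional Castilian does.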

Also, I can find no support for the assertion that Catalan uses French-style
backwards ordering at collation level 2 (the "backwards => 2" in ca.pl).
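
For reference, this is what backwards ordering at level 2 changes.  The
sketch below uses plain Unicode::Collate with the standard French example
words from UTS #10; it does not use the Catalan tailoring at all, and the
variable names are only illustrative.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

my @words = qw(côté coté côte cote);

# Default: accent differences are compared from the start of the word.
my $forward = Unicode::Collate->new();

# French-style: accent differences are compared from the end of the word.
my $backward = Unicode::Collate->new(backwards => 2);

print "forward:   @{[ $forward->sort(@words) ]}\n";
print "backwards: @{[ $backward->sort(@words) ]}\n";
```

The two collators disagree only about the middle pair: forwards, "coté"
precedes "côte"; backwards, "côte" precedes "coté".  A language needs word
pairs of this kind before the setting can make any visible difference.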

Catalan words *can* have grave or acute accent marks (eg: còdex,
místic), diaereses (eg: genuïnament, oïble), cedillas (eg: jovença,
providença), or middle dots (eg: col·lapse, imbecil·litat).

You can't have two stress marks on the same word, and stress marks are all
the two accents are, so I haven't been able to find any words with more than
one of the two accents or the diaeresis, let alone minimal pairs to contrast.

And although there are words with either the middle dot or the cedilla plus
one of the accents (eg: il·lícit, col·lecció), here again I can find
no minimal pairs to allow me to see which way the algorithm runs.
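
A search for candidate pairs could be automated along these lines.  This is
a rough sketch, not a tested tool: accent_variants is a hypothetical helper,
the sample words are only illustrative, and a real run would need a full
Catalan word list.  It groups words that become identical once combining
marks are stripped, which removes the cedilla as well as the accents and
diaeresis but leaves the middle dot alone.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFD);

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: return array refs of words that differ only in
# combining marks (accents, diaeresis, cedilla), ordered by the bare form.
sub accent_variants {
    my @words = @_;
    my %by_base;
    for my $word (@words) {
        # Decompose, then drop every nonspacing mark to get the bare letters.
        (my $base = NFD($word)) =~ s/\p{Mn}+//g;
        push @{ $by_base{$base} }, $word;
    }
    my @groups;
    for my $base (sort keys %by_base) {
        push @groups, $by_base{$base} if @{ $by_base{$base} } > 1;
    }
    return @groups;
}

# Illustrative sample only.
my @sample = qw(mes casa dóna més dona);
print join(' / ', @$_), "\n" for accent_variants(@sample);
```

Each group it reports is only a candidate.  To tell forwards from backwards
ordering, both words of a pair must carry marks, in different positions.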

Plus I doubt they would count the middle dot the same way (see the ENTRY),
nor even the cedilla, since they (sometimes) consider c and ç different
letters altogether.
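
To see at which level the module separates such pairs, something like the
following sketch could be used.  first_differing_level is a hypothetical
helper, and the answers for the 'ca' locale depend on the installed version
of Unicode::Collate::Locale, so none are shown here.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate::Locale;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: report the first collation level (1 = base letters,
# 2 = accents, 3 = case) at which two strings differ under a locale, or 0
# if they compare equal at levels 1 through 3.
sub first_differing_level {
    my ($locale, $x, $y) = @_;
    for my $level (1 .. 3) {
        my $collator = Unicode::Collate::Locale->new(
            locale => $locale,
            level  => $level,
        );
        return $level if $collator->ne($x, $y);
    }
    return 0;
}

for my $pair (['c', 'ç'], ['ll', 'l·l']) {
    printf "%s vs %s: level %d\n", @$pair, first_differing_level('ca', @$pair);
}
```

A result of 1 for c versus ç would mean the tailoring treats them as
different letters; 2 would mean the cedilla is weighted like an accent.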

--tom


