perl.perl5.porters | Postings from December 2012

Re: Does Unicode mandate a collation order?

From: Tom Christiansen
Date: December 4, 2012 01:23
Subject: Re: Does Unicode mandate a collation order?
Message ID: 21472.1354584140@chthon

Karl Williamson <public@khwilliamson.com> wrote
   on Mon, 03 Dec 2012 12:29:02 MST: 

> I have not used Unicode::Collate, but the docs say you can download 
> prior versions of Unicode.  I believe that blead's copy has been updated 
> to Unicode 6.2.  It does not necessarily have to track Perl's version, 
> as it has its own data files from Unicode.  But, I believe it has 
> recently anyway, been kept up to date.

*Actually*, on this occasion, the answer is a bit more complicated 
than for nearly anything else in how Perl and Unicode work together.

It turns out that the way the collation works can be tailored to a particular
version of the DUCET via the class constructor's UCA_Version parameter. From
the manpage for the version of Unicode::Collate currently in blead, which
is v0.94 if you're keeping track:

    UCA_Version
        If the revision (previously "tracking version") number of UCA is
        given, behavior of that revision is emulated on collating. If
        omitted, the return value of "UCA_Version()" is used.

        The following revisions are supported. The default is 24.

             UCA       Unicode Standard         DUCET (@version)
           -------------------------------------------------------
              8              3.1                3.0.1 (3.0.1d9)
              9     3.1 with Corrigendum 3      3.1.1 (3.1.1)
             11              4.0                4.0.0 (4.0.0)
             14             4.1.0               4.1.0 (4.1.0)
             16              5.0                5.0.0 (5.0.0)
             18             5.1.0               5.1.0 (5.1.0)
             20             5.2.0               5.2.0 (5.2.0)
             22             6.0.0               6.0.0 (6.0.0)
             24             6.1.0               6.1.0 (6.1.0)
             26             6.2.0               6.2.0 (6.2.0)

        * Noncharacters (e.g. U+FFFF) are not ignored, and can be
          overridden since "UCA_Version" 22.

        * Fully ignorable characters were ignored, and would not interrupt
          contractions with "UCA_Version" 9 and 11.

        * Treatment of ignorables after variables and some behaviors were
          changed at "UCA_Version" 9.

        * Characters regarded as CJK unified ideographs (cf. "overrideCJK")
          depend on "UCA_Version".

        * Many hangul jamo are assigned at "UCA_Version" 20, that will
          affect "hangul_terminator".

This dual-lifed module can be updated even on older Perls by pulling
it in from CPAN.  It will use the new DUCET if you ask it to do so, 
and it will use the old one if you ask it to do that, too.
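
As a minimal sketch of what asking for a particular revision looks like
(the word list is made up for illustration, and which revisions are 
accepted depends on the Unicode::Collate you actually have installed):

```perl
use strict;
use warnings;
use Unicode::Collate;

# Default collator: emulates whatever revision this copy of the module defaults to.
my $default = Unicode::Collate->new();

# Pinned collator: explicitly emulate UCA revision 24
# (Unicode 6.1.0 according to the manpage table quoted above).
my $pinned = Unicode::Collate->new(UCA_Version => 24);

# UCA_Version() reports the revision each collator is emulating.
printf "default revision: %d\n", $default->UCA_Version;
printf "pinned revision:  %d\n", $pinned->UCA_Version;

# Illustrative data only.
my @words  = qw(banana Apple cherry apple);
my @sorted = $pinned->sort(@words);
print join(" ", @sorted), "\n";
```

The pinned collator keeps reporting 24 no matter what the module's
default moves to later, which is the whole point of saying it out loud.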

This is absolutely critical because we do not currently have a 
symbolic way of handling DUCET overrides.  For example, suppose you
wanted to provide this sort of collation override:

       0063 0068 ; [.1000.0020.0002.0063] # ch
       0043 0068 ; [.1000.0020.0007.0043] # Ch
       0043 0048 ; [.1000.0020.0008.0043] # CH
       006C 006C ; [.10F5.0020.0002.006C] # ll
       004C 006C ; [.10F5.0020.0007.004C] # Ll
       004C 004C ; [.10F5.0020.0008.004C] # LL
       00F1      ; [.112B.0020.0002.00F1] # n-tilde  (NFC)
       006E 0303 ; [.112B.0020.0002.00F1] # n-tilde  (NFD)
       00D1      ; [.112B.0020.0008.00D1] # N-tilde  (NFC)
       004E 0303 ; [.112B.0020.0008.00D1] # N-tilde  (NFD)

See those "magic" numbers in square brackets?  Those numbers in the
middle *only make sense* on a particular version of the DUCET; in 
this case, 14, in fact.
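
Here is a rough sketch of how such entries get handed to the constructor,
using only the three "ch" lines from above.  It assumes the collator is
also loading the DUCET those weights were written for; against a different
allkeys.txt the primary weight 1000 may well not fall between "c" and "d",
so treat the resulting order as unverified.  The word list is invented.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $collator = Unicode::Collate->new(
    # Say which revision the hand-written weights below belong to.
    UCA_Version => 14,
    # Raw override lines, in the same format as allkeys.txt.
    entry       => <<'END_OF_OVERRIDES',
0063 0068 ; [.1000.0020.0002.0063] # ch
0043 0068 ; [.1000.0020.0007.0043] # Ch
0043 0048 ; [.1000.0020.0008.0043] # CH
END_OF_OVERRIDES
);

# viewSortKey shows the weights actually used, which is handy for
# checking that "ch" is being treated as one collation element.
print $collator->viewSortKey("ch"), "\n";

# Illustrative data only; where "chico" lands depends on the table in use.
my @sorted = $collator->sort(qw(cuna chico dedo));
print join(" ", @sorted), "\n";
```

Dumping the sort key like this is the quickest way I know to see whether
an override took effect before trusting the sorted output.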

Yes, it is a messy, messy way to do it.  You would like to just
give it override rules, maybe something like these:

    c < ch < d
    l < ll < m
    n <  ñ < o

But that isn't the way the low-level API works.  I've actually seen
one high-level API that seems to take rules like those above, but 
it is in ICU, not Perl:

    http://userguide.icu-project.org/collation/customization

Fortunately, for "simple" things like the one given above, it is now
possible to use Unicode::Collate::Locale instead, passing it a locale 
of "es__traditional".  At least then you don't have to hack the DUCET 
on your own.  But if you ever do, it is absolutely indispensable that 
you specify which UCA_Version you are hacking. If you don't, it will 
mess everything up, and it will not be at all obvious why.  (Yes, this 
is indeed the voice of unhappy experience talking here. :)
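
A small sketch of the Unicode::Collate::Locale route, with a made-up
word list.  Under traditional Spanish rules "ch" and "ll" count as 
letters of their own, so "chico" should come after "cuna" and "llama" 
after "luz", assuming your installed module carries that tailoring:

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# Ask for the traditional Spanish tailoring instead of hand-editing weights.
my $es = Unicode::Collate::Locale->new(locale => "es__traditional");

# getlocale reports the tailoring actually selected
# ("default" if nothing matched the requested locale).
print "tailoring in use: ", $es->getlocale, "\n";

# Illustrative data only.
my @words  = qw(chico cuna dedo cama llama luz lana);
my @sorted = $es->sort(@words);
print join(" ", @sorted), "\n";
```

Checking getlocale is worth the extra line: a misspelled locale name 
quietly falls back to the default ordering rather than failing.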

--tom
