
Re: Matching multi-character folds, and FMTEYEWTK on troubles thereof

From: Tom Christiansen
Date: December 2, 2008 00:26
Subject: Re: Matching multi-character folds, and FMTEYEWTK on troubles thereof
Message ID: 15494.1228206346@chthon

"Rafael Garcia-Suarez" <rgarciasuarez@gmail.com>
   wrote on "Wed, 26 Nov 2008 08:49:02 +0100.":

> 2008/11/25 Tom Christiansen <tchrist@perl.com>:

( >> Me, I'm wondering the same thing.  See below.  Sometimes I feel lost )
( >> in one of Borges's labyrinths--or Eco's, though these are the same.  )

>> *** Unicode CLDR Project: Common Locale Data Repository
>>        http://unicode.org/cldr/
>>
>> *** CVS Snapshots for CLDR:
>>        ftp://ftp.unicode.org/Public/cldr/cldr-repository-daily.tgz

>> Yves, there're also French versions of some of the above, s'il te plaît, 
>> but I had trouble getting them to download.

>> The last, CLDR, contains *VERY* interesting stuff.  I wish I could figure
>> out how to auto-translate these into Unicode::Collation objects.  For
>> example, here's cldr/common/collation/fr.xml, fycnrdths:

> Strange, it doesn't seem to contain collations for the French "Œ" ("e
> dans l'o"), which sorts exactly as "oe".

That I found odd, too.  But I'm thinking that OE and the digraph have
different titlecase renderings.  Is that correct?  See here:

      % perl -E 'say ucfirst "oeuf"' 
    Oeuf
      % perl -E 'say ucfirst "\x{152}uf"' 
    Œuf

Reminds me of the old joke:

    Q: Why are the French so svelte?
    A: Light breakfasts, where un œuf is always enough.  :-)

So œ and Œ work more like the English ae digraph, written Æ and æ,
which was once a separate letter for the vowel of "cat" or "sat".  Back
then it wasn't considered a ligature as it is today; it still stands as
a letter in Icelandic, and in Old English you find Ǣ and ǣ or Ǽ and ǽ.

That is, neither French œ nor English æ has the tripartite case
system of Hungarian:

     % perl -E 'say for chr(0x01F3), ucfirst(chr 0x01F3), uc(chr 0x01F3)'
    dz
    Dz
    DZ
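
To make the contrast with œ explicit, here is a minimal sketch that prints
the code points behind the lower, title, and upper forms of each letter.
The helper name case_forms is made up for illustration; it uses only the
lc, ucfirst, and uc builtins, and assumes a perl whose Unicode tables carry
these mappings.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lowercase, titlecase, and uppercase
# forms of a single code point, each rendered as a "U+XXXX" string.
sub case_forms {
    my ($cp) = @_;
    my $ch = chr $cp;
    return map { sprintf("U+%04X", ord $_) } (lc($ch), ucfirst($ch), uc($ch));
}

my @dz = case_forms(0x01F3);   # LATIN SMALL LETTER DZ
my @oe = case_forms(0x0153);   # LATIN SMALL LIGATURE OE

print "dz: @dz\n";   # three distinct code points
print "oe: @oe\n";   # titlecase and uppercase coincide
```

The dz digraph comes back as three different code points (U+01F3, U+01F2,
U+01F1), whereas œ yields U+0152 for both its titlecase and uppercase forms.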

Still, people get confused about capitalizing ligatures and digraphs
(like "th" or "ch" in English) even in places they don't go.  Think of
road signs that say that "LLeida" is this way, for example, or "LLiçà
de Vall".  [Can you tell I was once lost in Andorra, hitchhiking, and
driving around Catalunya, equally lost? :-)]

At least we don't have to decompose "ß" so that "ß" =~ /ſs/i, or
"Œ" =~ /oe/i, or perhaps even "ÿ" =~ /ij/i (ducks from Johan and Abigail :-).

I can just see people wanting weird matches on this sort of thing:

    Loſt be yᵉ, and on so trafficked a way?

> Does that mean that the CLDR still have bugs too ?

Yes, I think you are correct.  You can read that in their XML, where they
mention bugfixes every now and then.  And I now know how to write them for
French, using the tabular approach I used for the Iberian tongues.

> Anyway. I don't think that it's the core's job to handle localisation
> data and collations. There are too many of them, not counting the ones
> you might invent for specific purposes (like, where to put Planck's
> constant in a quantum physics book index?) Let us begin by trying to
> get the Turkish capitalisation right. And even for this, I'm not sure
> we want it really in the core.

True enough, and you are almost certainly correct; I still am amazed we
get as much right as we do.

But I still long for [=e=] though.  I know, I know: modules.

After the Iberian stuff I did, I **really** wonder whether bending
over backwards for ß to cope with SS and Ss may prove to have
been a bad idea in the long run.  

I wonder what the perl6 folks are thinking re this?  

And, um, er, if? :-(

> A system is nothing more than the subordination of all aspects of
> the universe to any one such aspect.
>    -- Borges

I haven't played with the default DUCET, just my modified one.  
I should change that.  

--tom
-- 

    Como todo poseedor de una biblioteca, Aureliano 
    se sabía culpable de no conocerla hasta el fin. 
	    --Jorge Luis Borges, 'Los teólogos' in _El Aleph_

  ~ Like all those possessing a library, Aurelian was aware 
    that he was guilty of not knowing his in its entirety. ~

