perl.perl5.porters | Postings from February 2000

Unicode character composition

From:
Jarkko Hietaniemi
Date:
February 13, 2000 11:54
Subject:
Unicode character composition
Message ID:
14503.3075.617619.529131@beta.hut.fi

One thing that may need some consideration, now or later, is precomposed
versus decomposed characters, that is, "a grave" as a single character
versus "a" plus a combining "grave".

For example, when searching for "agrave" you would probably also want
"a" plus "grave" to match, and vice versa.  Well, you would want that
most of the time, anyway.
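
To make the matching problem concrete, here is a minimal sketch in Python
(used only for illustration, not Perl). It assumes that "agrave" means
U+00E0 and that "a" plus "grave" means "a" followed by U+0300 COMBINING
GRAVE ACCENT. The helper name contains_canonical is hypothetical; the
only library call is the standard unicodedata.normalize.

```python
import unicodedata

PRECOMPOSED = "\u00e0"   # LATIN SMALL LETTER A WITH GRAVE
DECOMPOSED = "a\u0300"   # "a" followed by COMBINING GRAVE ACCENT


def contains_canonical(haystack: str, needle: str) -> bool:
    """Substring search after normalizing both sides to NFD (hypothetical helper)."""
    nfd_haystack = unicodedata.normalize("NFD", haystack)
    nfd_needle = unicodedata.normalize("NFD", needle)
    return nfd_needle in nfd_haystack


text = "voil" + DECOMPOSED          # the text uses the decomposed spelling

# A plain code-point comparison does not find the precomposed character.
naive_match = PRECOMPOSED in text

# Normalizing both sides first makes the two spellings match.
canonical_match = contains_canonical(text, PRECOMPOSED)
```

One reason "most of the time" may be the right hedge: once the text is
decomposed, a search for a bare "a" also matches the base letter of an
"a" plus "grave" pair, which may or may not be wanted.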

Food for thought: should Perl always keep its utf8 data in the
decomposed form, so that it is canonical?  Or, the other way, should it
always try to find the composite form (to be more compact)?  A canonical
form would make searching the data rather easier.
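
The compactness trade-off can be sketched as follows, in Python for
illustration only. It assumes the two candidate forms correspond to the
Unicode normalization forms NFD (decomposed) and NFC (composed), which
the post does not name. The helper name sizes and the sample word are
illustrative.

```python
import unicodedata


def sizes(s: str) -> tuple:
    """Return (number of code points, number of UTF-8 bytes) for s."""
    return len(s), len(s.encode("utf-8"))


word = "d\u00e9j\u00e0"                       # "deja" with e-acute and a-grave, precomposed

composed = unicodedata.normalize("NFC", word)    # composite form
decomposed = unicodedata.normalize("NFD", word)  # base letters plus combining marks

composed_sizes = sizes(composed)      # 4 code points, 6 UTF-8 bytes
decomposed_sizes = sizes(decomposed)  # 6 code points, 8 UTF-8 bytes

# Whichever form is chosen, picking one consistently makes the two
# spellings compare equal.
same_after_nfd = unicodedata.normalize("NFD", composed) == decomposed
```

For this sample the composed form is smaller, but either form serves as
a canonical one as long as all data is brought to the same form before
comparison.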

Then again, canonicalizing the data like that would be bad on output: if
an incoming "odiaeresis" became "o" plus "diaeresis" on the way
out, some external entity could become confused.
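
The output concern can be illustrated with a small round-trip check, in
Python for illustration only. It assumes "odiaeresis" is U+00F6 and that
canonicalizing means applying NFD or NFC on input. The helper name
survives_round_trip is hypothetical.

```python
import unicodedata


def survives_round_trip(s: str, form: str) -> bool:
    """True if s comes back unchanged after being normalized to `form`."""
    return unicodedata.normalize(form, s) == s


odiaeresis = "\u00f6"        # LATIN SMALL LETTER O WITH DIAERESIS
o_plus_diaeresis = "o\u0308" # "o" followed by COMBINING DIAERESIS
angstrom_sign = "\u212b"     # ANGSTROM SIGN

# Storing everything decomposed changes precomposed input on the way out.
nfd_keeps_precomposed = survives_round_trip(odiaeresis, "NFD")

# Storing everything composed changes decomposed input instead.
nfc_keeps_decomposed = survives_round_trip(o_plus_diaeresis, "NFC")

# Even composed input is not always preserved: the ANGSTROM SIGN comes
# back as U+00C5 (A WITH RING ABOVE) after NFC.
nfc_keeps_angstrom = survives_round_trip(angstrom_sign, "NFC")
```

One possible design, under these assumptions, is to leave stored data as
it arrived and normalize only transient copies used for comparison, so
that output is byte-for-byte what came in.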

-- 
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen
