Bryan C Warnock <bwarnock@capita.com> writes:

> Some additional stuff to ponder over, and maybe Unicode addresses these
> - I haven't been able to read *all* the Unicode stuff yet.  (And, yes,
> Simon, you will see me in class.)

> Some languages don't have upper or lower case.  Are tests and
> translations on caseless characters true or false?  (Or undefined?)

Caseless characters should be guaranteed unchanged by conversion to upper
or lower case, IMO.  Case is a normative property of characters in
Unicode, so case mappings should actually be pretty well-defined.  Note
that there are actually three cases in Unicode, upper, lower, and title
case, since there are some characters that require the third distinction
(stuff like Dz is generally used as an example).

> Should the same Unicode character, when used in two different languages,
> be string equivalent?

The way to start solving this whole problem is probably through
normalization; Unicode defines two separate normalizations, one of which
collapses more similar characters than the other.  One is designed to
preserve formatting information while the other loses formatting
information.  (The best example of how they differ is that one leaves the
ffi ligature alone and the other breaks it down into three separate
characters.)  Perl should allow programmers to choose their preferred
normalization schemes or none at all.

(There are really four normalization schemes; in two of them, you leave
things fully decomposed, and in the other two you recompose characters as
much as possible.)

-- 
Russ Allbery (rra@stanford.edu)             <http://www.eyrie.org/~eagle/>
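
For anyone who wants to poke at the behaviour described above, here is a minimal sketch. It uses Python's standard `unicodedata` module purely as a stand-in, since the thread does not settle on a Perl interface; nothing here is a proposed Perl API. The helper names `case_forms`, `normalize_all`, and `equivalent` are hypothetical, made up for this illustration. It assumes the "Dz" example refers to the DZ-with-caron digraph (U+01C4 to U+01C6) and uses the ffi ligature U+FB03.

```python
import unicodedata

# The four Unicode normalization forms: canonical (NFD/NFC) keeps
# formatting distinctions such as ligatures; compatibility (NFKD/NFKC)
# folds them away. The "D" forms stay decomposed, the "C" forms recompose.
FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def case_forms(text):
    """Return (lower, upper, title) for text. Hypothetical helper."""
    return (text.lower(), text.upper(), text.title())


def normalize_all(text):
    """Map each normalization form name to the normalized text."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}


def equivalent(a, b, form=None):
    """Compare two strings, optionally after normalizing both.

    form=None means "no normalization at all", i.e. a raw comparison,
    which mirrors letting the programmer opt out entirely.
    """
    if form is None:
        return a == b
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


if __name__ == "__main__":
    # A caseless character (a CJK ideograph) is unchanged by case mapping.
    print([hex(ord(c)) for c in case_forms("\u4e2d")])

    # The DZ-with-caron digraph has three distinct case forms.
    print([hex(ord(c)) for c in case_forms("\u01c6")])

    # The ffi ligature survives canonical forms, not compatibility forms.
    for form, value in normalize_all("\ufb03").items():
        print(form, [hex(ord(c)) for c in value])

    # Precomposed e-acute vs. e + combining acute accent.
    print(equivalent("\u00e9", "e\u0301"))
    print(equivalent("\u00e9", "e\u0301", "NFC"))
```

The `form=None` default is one way to express "choose a normalization scheme or none at all": string equality stays a raw comparison unless the caller explicitly asks for a form.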