-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Moin,

On Monday 02 April 2007 12:36:31 demerphq wrote:
> On 4/2/07, Tels <nospam-abuse@bloodgate.com> wrote:
> > On Sunday 01 April 2007 22:26:26 demerphq wrote:
> > > On 3/31/07, Glenn Linderman <perl@nevcal.com> wrote:
> > >
> > > > [snipalot]
> > >
> > > So to do case insensitive matching in Unicode you need to do
> > > "foldcase" matching, which means you convert each sequence into a
> > > normalized, folded version and then compare those. Where this gets
> > > tricky is that in some languages, German for example, the folded
> > > version of a particular letter is in fact more than one letter. So
> > > the foldcase of GERMAN-SHARP-ESS aka \x{DF} aka ß is 'ss'. The
> > > uppercase of the letter is ß, and unsurprisingly so is the lowercase.
> > > Now where this gets really annoying is that \x{DF} is the ONLY letter
> > > in Unicode that is in latin_1 that has a multibyte foldcase
> > > representation, yet at the same time Perl has never considered \x{DF}
> > > to match 'ss' in latin_1.
> > >
> > > So if you have a string that contains \x{DF} you'll find it will match
> > > 'ss' case insensitively if the string is in Unicode, but not if it's
> > > in latin_1.
> >
> > As someone with a bit of authority on ß I would like to point out a few
> > bits of trivia :-D
> >
> :-)
>
> > * yes, the lower case version ß is the same as uppercase (there is no
> > uppercase version)
> >
> > * if you do not have an ß, you can write "ss" (like you can
> > write "ae", "ue", or "oe" for "ä", "ü", and "ö"), so it is correct to
> > write "uebermaessig" for "übermäßig". Trivia of the day: "Uber" is often
> > used by English speaking people, but still wrong. You can't just leave
> > off the two dots :-)
>
> What I find interesting is that Unicode doesn't stipulate that
> casefolded ü become 'ue'.
> I /guess/ this is because other languages
> that don't have this equivalency need to be supported, whereas the rules
> for german-sharp-ess are general across all languages that use it.

I found this interesting after you wrote about the casefolding (which I
didn't know about), but it may be because:

* ü has "Ü", so you can just convert to uppercase and compare them
* ß is only used in Germany, anyway. And nobody likes the Germans, much
  (hehe, just kidding)

> > However, "ss" is NOT equal to "ß". And if the regexp matched "ß" to
> > "ss", it would sometimes produce wrong results.
>
> Note that we are talking case insensitive matching, and that Unicode
> stipulates that "ß" *does* match "ss" case insensitively. You can see
> the rule in lib\unicore\CaseFolding.txt where it says
>
>   00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S

Yeah, but it is still wrong sometimes. "Maßen" and "Massen" are two
different words, likewise "Riß" (a river) and "Riss" (a fracture).

Using "sz" would maybe solve that issue. However, I find it strange that
the official German rules now always use "ss" for "ß", except when it
suddenly becomes important to distinguish, then they use "sz". Strange.

(I wrote neither the Unicode casefolding, nor the German spelling
rules, nor the German casefolding rules; I am just observing this from the
peanut gallery :-)

All the best,

Tels

- -- 
 Signed on Mon Apr  2 14:40:27 2007 with key 0x93B84C15.
 Get one of my photo posters: http://bloodgate.com/posters
 PGP key on http://bloodgate.com/tels.asc or per email.

 This email violates U.S.
 patent #6,775,781 <http://tinyurl.com/3khqm>: sudo rm -fR *

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)

iQEVAwUBRhEXlncLPEOTuEwVAQLZxQf+KQW9P7Y2p9sXNLhhuXmEMDVc3hfR+/KR
MNckL4p6YTevbUkXGHG6UBZC6Zb0iohSnHd1ukVBB5VIJbnZBhMJakyhMtFlWpoj
uyEHWikgSc1VDGBn15Ywg0Y3nC+A9H1PASpGnRzoPUDj9go6THTC3k5Ck6k+9l90
QMdwaC9gy2I2Nopz9PEWT1PGQyvvw6Y51ZFfxRObZRzcjnGXhsaBfLztH4bsS6vB
Ks4KuPyA63HevN9ArfcZ2Z4xwwRwR38g5VcrqDlOtIfIce37de5i29f2pNZ7ZcE9
KVla027arTX6WpRTPJy2edIWSJavV1ctS97gAli/GfZiObDYNmtTjg==
=inYg
-----END PGP SIGNATURE-----