On 4/2/07, Tels <nospam-abuse@bloodgate.com> wrote:
> On Sunday 01 April 2007 22:26:26 demerphq wrote:
> > On 3/31/07, Glenn Linderman <perl@nevcal.com> wrote:
> [snipalot]
> > So to do case insensitive matching in unicode you need to do
> > "foldcase" matching, which is that you convert the sequence into a
> > normalized folded version and then compare that. Where this gets
> > tricky is that in some languages, German for example, the folded
> > version of a particular letter is in fact more than one letter. So the
> > foldcase of GERMAN-SHARP-ESS aka \x{DF} aka ß is 'ss'. The uppercase
> > of the letter is ß, and unsurprisingly so is the lowercase.
> > Now where this gets really annoying is that \x{DF} is the ONLY letter
> > in unicode that is in latin_1 that has a multibyte foldcase
> > representation, yet at the same time Perl has never considered \x{DF}
> > to match 'ss' in latin_1.
> >
> > So if you have a string that contains \x{DF} you'll find it will match
> > case insensitively 'ss' if the string is in unicode, but not if it's in
> > latin_1.
>
> As someone with a bit of authority on ß I would like to point out a few
> trivias :-D :-)
> * yes the lower case version ß is the same as uppercase (there is no
> uppercase version)
>
> * if you do not have an ß, you can write "ss" (like you can
> write "ae", "ue", or "oe" for "ä", "ü", and "ö", so it is correct to
> write "uebermaessig" for "übermäßig". Trivia of the day "Uber" is often
> used by English speaking people, but still wrong. You can't just leave off
> the two dots :-)

What I find interesting is that Unicode doesn't stipulate that
casefolded ü become 'ue'. I /guess/ this is because other languages
that don't have this equivalency need to be supported, whereas the
rules for german-sharp-ess are general across all languages that use
it.

> However, "ss" is NOT equal to "ß". And if the regexp matched "ß" to "ss", it
> would produce sometimes wrong results.

Note that we are talking about case insensitive matching, and that
Unicode stipulates that "ß" *does* match "ss" case insensitively. You
can see the rule in lib\unicore\CaseFolding.txt where it says

  00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S

> For instance, either STRASSE or STRAßE are correct ways (after the latest
> reform, you are always required to use "SS", though) to write Straße
> (street), however, the latter form is usually used in official documents
> because if you are named "Peter Böße", you do not want your name misspelled
> and be confused with the bad guy named "PETER BÖSSE" :-)
>
> Likewise, in official Telex you are also required to replace "ß" with "sz",
> to avoid confusion. For instance:
>
> "in Maßen" (only a bit) and "in Massen" (many of them) become
> "in maszen" and "in massen"
>
> (which can really make a difference if your doctor orders you to drink "Wein
> in Maßen" (wine in little quantities) :-)
>
> Using "sz" for "ß" was also a bit popular on the internet before Unicode
> really took off, and one time it was even in the Duden, but it has
> essentially never really caught on and after the latest reform you should
> always write "ss".

Hmm, I did not know that. Interesting. I know that in common
conversation I've heard "ß" referred to as "sz", but I didn't realize
that it was ever an official equivalency.

> All you ever wanted to know about ß and never dared to ask:
>
> http://de.wikipedia.org/wiki/%C3%9F

Cheers,
yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
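
For anyone who wants to poke at the folding rule quoted from CaseFolding.txt, here is a minimal sketch. It uses Python's built-in str.casefold() rather than Perl, purely as a stand-in that applies Unicode full case folding; it says nothing about how Perl's regex engine is implemented. The helper name caseless_equal is made up for this illustration.

```python
# Sketch only: Python's str.casefold() applies Unicode full case folding.
# This is a stand-in for the behaviour discussed in the thread, not Perl's code.

def caseless_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after full case folding."""
    return a.casefold() == b.casefold()

sharp_s = "\u00DF"  # LATIN SMALL LETTER SHARP S

# The "00DF; F; 0073 0073" rule: sharp s folds to the two letters "ss".
folded = sharp_s.casefold()

# Lowercasing, by contrast, leaves the letter unchanged.
lowered = sharp_s.lower()

# Under folding, STRASSE and Straße compare equal...
strasse_matches = caseless_equal("STRASSE", "Stra\u00DFe")

# ...but there is no folding rule that turns u-umlaut into "ue".
umlaut_matches = caseless_equal("\u00FCber", "ueber")

print(repr(folded), repr(lowered), strasse_matches, umlaut_matches)
```

The comparison folds both sides first and only then compares, which is the "convert to a folded version and compare that" approach described earlier in the thread. It also shows the asymmetry noted above: ß folds to 'ss', while ü has no fold to 'ue'.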