
Re: German sharp-s: was Re: perl, the data, and the tf8 flag

April 2, 2007 05:47


On Monday 02 April 2007 12:36:31 demerphq wrote:
> On 4/2/07, Tels <> wrote:
> > On Sunday 01 April 2007 22:26:26 demerphq wrote:
> > > On 3/31/07, Glenn Linderman <> wrote:
> >
> > [snipalot]
> >
> > > So to do case insensitive matching in unicode you need to do
> > > "foldcase" matching, which means you convert each sequence into a
> > > normalized folded version and then compare those. Where this gets
> > > tricky is that in some languages, German for example, the folded
> > > version of a particular letter is in fact more than one letter. So
> > > the foldcase of GERMAN-SHARP-ESS aka \x{DF} aka ß is 'ss'. The
> > > uppercase of the letter is ß, and unsurprisingly so is the lowercase.
> > > Now where this gets really annoying is that \x{DF} is the ONLY letter
> > > in unicode that is in latin_1 that has a multi-character foldcase
> > > representation, yet at the same time Perl has never considered \x{DF}
> > > to match 'ss' in latin_1.
> > >
> > > So if you have a string that contains \x{DF} you'll find it will match
> > > 'ss' case insensitively if the string is in unicode, but not if it's
> > > in latin_1.
> >
> > As someone with a bit of authority on ß I would like to point out a few
> > bits of trivia :-D
> >
> :-)
>
> > * yes the lower case version ß is the same as uppercase (there is no
> > uppercase version)
> >
> > * if you do not have an ß, you can write "ss" (like you can
> > write "ae", "ue", or "oe" for "ä", "ü", and "ö"), so it is correct to
> > write "uebermaessig" for "übermäßig". Trivia of the day: "Uber" is often
> > used by English-speaking people, but it is still wrong. You can't just
> > leave off the two dots :-)
> What I find interesting is that unicode doesn't stipulate that
> casefolded ü become 'ue'. I /guess/ this is because other languages
> that don't have this equivalency need to be supported, whereas the rules
> for german-sharp-ess are general across all languages that use it.

I found this interesting after you wrote about the casefolding (which I
didn't know about), but it may be because:

* ü has "Ü" so you can just convert to uppercase and compare them
* ß is only used in Germany, anyway. And nobody likes the Germans much
(hehe, just kidding)
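
The difference between the two letters can be sketched outside Perl. This is a minimal illustration in Python, not Perl's implementation: it assumes Python 3's str.casefold(), which applies Unicode full case folding, and the helper name caseless_equal is made up for this example.

```python
def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings by their Unicode full case folding."""
    return a.casefold() == b.casefold()

# Ü folds to the single character ü; no "ue" is introduced.
u_folded = "Ü".casefold()

# ß folds to the two-character sequence "ss".
sharp_s_folded = "ß".casefold()

print(repr(u_folded), repr(sharp_s_folded))

# The folded forms agree, so these compare equal.
print(caseless_equal("übermäßig", "ÜBERMÄSSIG"))

# The ASCII transliteration is a spelling convention, not a case fold,
# so these do not compare equal.
print(caseless_equal("übermäßig", "uebermaessig"))
```

So folding treats "ß" and "ss" as equivalent, but it knows nothing about the "ue" convention for "ü".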

> > However, "ss" is NOT equal to "ß". And if the regexp matched "ß" to
> > "ss", it would produce sometimes wrong results.
> Note that we are talking about case insensitive matching, and that unicode
> stipulates that "ß" *does* match "ss" case insensitively. You can see
> the rule in lib\unicore\CaseFolding.txt where it says
> 00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
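
For readers who have not looked at CaseFolding.txt, the quoted rule can be read as "code point; status; folded code points; # name". The sketch below parses just that one line in Python. The names FoldRule and parse_fold_rule are hypothetical helpers for this illustration, not part of Perl or of the Unicode data files. Status "F" marks a full folding, meaning the result may be longer than the original character.

```python
from typing import NamedTuple


class FoldRule(NamedTuple):
    source: str  # the character being folded
    status: str  # status field from CaseFolding.txt (C, F, S or T)
    target: str  # the folded character sequence


def parse_fold_rule(line: str) -> FoldRule:
    """Parse one '<code>; <status>; <mapping>; # <name>' data line."""
    data = line.split("#", 1)[0]  # drop the trailing comment
    code, status, mapping = [field.strip() for field in data.split(";")[:3]]
    source = chr(int(code, 16))
    # The mapping is a space-separated list of hex code points.
    target = "".join(chr(int(cp, 16)) for cp in mapping.split())
    return FoldRule(source, status, target)


rule = parse_fold_rule("00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S")
print(rule)
```

Parsed this way, the rule says that U+00DF folds to the two code points U+0073 U+0073, that is, "ss".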

Yeah, but it is still wrong sometimes. "Maßen" and "Massen" are two 
different words, likewise "Riß" (a river) and "Riss" (a fracture).
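
The collision is easy to reproduce. The following is a small Python sketch, again only an illustration and not Perl's regex engine. It assumes Python 3's str.casefold() for full case folding, and relies on the fact that Python's re module with IGNORECASE does not apply multi-character folds, which makes it behave like the latin_1 case described earlier in the thread.

```python
import re

word_a = "Maßen"
word_b = "Massen"

# Exact comparison keeps the two words apart.
exact_equal = word_a == word_b

# Full case folding maps ß to "ss", so the two words collide.
folded_equal = word_a.casefold() == word_b.casefold()

# re.IGNORECASE compares character by character and does not
# expand ß to "ss", so the pattern does not match here.
regex_match = re.fullmatch("massen", word_a, re.IGNORECASE) is not None

print(exact_equal, folded_equal, regex_match)
```

Whether the folded result is the "right" answer depends on whether the application wants Unicode caseless matching or German spelling distinctions. The fold itself cannot tell the two words apart.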

Using "sz" would maybe solve that issue, however, I find it strange that the 
German official rules now always use "ss" for "ß", except when it suddenly 
becomes important to distinguish, then they use "sz". Strange.

(I wrote neither the Unicode casefolding, nor the German spelling rules,
nor the German casefolding rules; I am just observing this from the peanut
gallery :-)

All the best,



