develooper Front page | perl.perl5.porters | Postings from April 2007

Re: German sharp-s: was Re: perl, the data, and the tf8 flag

April 2, 2007 05:36
On 4/2/07, Tels wrote:
> On Sunday 01 April 2007 22:26:26 demerphq wrote:
> > On 3/31/07, Glenn Linderman wrote:
> [snipalot]
> > So to do case-insensitive matching in Unicode you need to do
> > "foldcase" matching, which is that you convert the sequence into a
> > normalized folded version and then compare that. Where this gets
> > tricky is that in some languages, German for example, the folded
> > version of a particular letter is in fact more than one letter. So the
> > foldcase of GERMAN-SHARP-ESS aka \x{DF} aka ß is 'ss'. The uppercase
> > of the letter is ß, and unsurprisingly so is the lowercase.
> > Now where this gets really annoying is that \x{DF} is the ONLY letter
> > in Unicode that is in latin_1 that has a multi-character foldcase
> > representation, yet at the same time Perl has never considered \x{DF}
> > to match 'ss' in latin_1.
> >
> > So if you have a string that contains \x{DF} you'll find it will match
> > 'ss' case-insensitively if the string is in Unicode, but not if it's in
> > latin_1.
> As someone with a bit of authority on ß I would like to point out a few
> bits of trivia :-D


> * yes the lower case version ß is the same as uppercase (there is no
> uppercase version)
> * if you do not have an ß, you can write "ss" (like you can
> write "ae", "ue", or "oe" for "ä", "ü", and "ö"), so it is correct to
> write "uebermaessig" for "übermäßig". Trivia of the day: "Uber" is often
> used by English speaking people, but still wrong. You can't just leave off
> the two dots :-)

What I find interesting is that Unicode doesn't stipulate that
casefolded ü become 'ue'. I /guess/ this is because other languages
that don't have this equivalency need to be supported, whereas the rules
for german-sharp-ess are general across all languages that use it.
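
A minimal sketch of that point, assuming a Perl with the fc() builtin
(5.16 or later, so newer than this thread). The helper name
same_when_folded is made up for illustration:

```perl
use v5.16;   # fc() needs Perl 5.16 or later; also turns on Unicode string semantics

# U+00FC is LATIN SMALL LETTER U WITH DIAERESIS (ü).
my $u_umlaut = "\x{FC}";

# Hypothetical helper: true if two strings are equal after full case folding.
sub same_when_folded {
    my ($left, $right) = @_;
    return fc($left) eq fc($right) ? 1 : 0;
}

# Case folding relates ü to its uppercase form U+00DC (Ü) ...
say "ue-umlaut vs UE-umlaut: ", same_when_folded($u_umlaut, "\x{DC}");
# ... but not to the German transliteration "ue".
say "ue-umlaut vs 'ue': ", same_when_folded($u_umlaut, "ue");
```

So treating ü as 'ue' has to be a separate, language-specific step layered
on top of case folding.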

> However, "ss" is NOT equal to "ß". And if the regexp matched "ß" to "ss", it
> would sometimes produce wrong results.

Note that we are talking about case-insensitive matching, and that Unicode
stipulates that "ß" *does* match "ss" case-insensitively. You can see
the rule in lib\unicore\CaseFolding.txt where it says

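The behaviour itself can be checked from Perl. This is only a sketch under
assumptions: it uses fc() (Perl 5.16 or later) and the explicit /u regex
modifier (5.14 or later), both newer than this thread, and the variable
names are made up for illustration:

```perl
use v5.16;   # fc() needs Perl 5.16 or later

my $sharp_s = "\x{DF}";   # LATIN SMALL LETTER SHARP S

# Full case folding turns the single character into the two-character "ss".
my $folded = fc($sharp_s);

# Case-insensitive regex match under explicit Unicode rules (/u).
my $regex_hit = $sharp_s =~ /\Ass\z/iu ? 1 : 0;

# The "foldcase" comparison described above: fold both sides, then compare.
my $words_equal = fc("STRASSE") eq fc("stra${sharp_s}e") ? 1 : 0;

say "folded: $folded";
say "regex match: $regex_hit";
say "words equal: $words_equal";
```

The /u modifier asks for Unicode rules regardless of how the string happens
to be stored internally, which is the distinction this thread is about.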

> For instance, either STRASSE or STRAßE are correct ways (after the latest
> reform, you are always required to use "SS", though) to write Straße
> (street), however, the latter form is usually used in official documents
> because if you are named "Peter Böße", you do not want your name misspelled
> and be confused with the bad guy named "PETER BÖSSE" :-)
> Likewise, in official Telex you are also required to replace "ß" with "sz",
> to avoid confusion. For instance:
>         "in Maßen" (only a bit) and "in Massen" (many of them) become
>         "in maszen" and "in massen"
> (which can really make a difference if your doctor orders you to drink "Wein
> in Maßen" (wine in little quantities) :-)
> Using "sz" for "ß" was also a bit popular on the internet before Unicode
> really took off, and one time it was even in the Duden, but it has
> essentially never really caught on and after the latest reform you should
> always write "ss".

Hmm, I did not know that. Interesting. I know that in common
conversation I've heard "ß" referred to as "sz", but I didn't realize
that it was ever an official equivalency.
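
The ambiguity Tels describes is easy to reproduce. A small sketch, with a
made-up helper name replace_sharp_s, not anything Perl ships:

```perl
use v5.16;

# Hypothetical helper: replace sharp s (U+00DF) with a chosen ASCII digraph.
sub replace_sharp_s {
    my ($text, $digraph) = @_;
    $text =~ s/\x{DF}/$digraph/g;
    return $text;
}

my $little = "in Ma\x{DF}en";   # "in Maßen"
my $many   = "in Massen";

# With "ss" the two phrases become indistinguishable.
say replace_sharp_s($little, "ss") eq $many ? "ss: collides" : "ss: distinct";
# With "sz" they stay apart.
say replace_sharp_s($little, "sz") eq $many ? "sz: collides" : "sz: distinct";
```

The first line reports a collision and the second does not, which is the
reason given for the Telex convention.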

> All you ever wanted to know about ß and never dared to ask:


perl -Mre=debug -e "/just|another|perl|hacker/"
