
Re: Unicode regex negated case-insensitivity in 5.14.0-RC1

From:
Father Chrysostomos
Date:
May 1, 2011 13:05
Subject:
Re: Unicode regex negated case-insensitivity in 5.14.0-RC1
Message ID:
D4EE08A3-BB10-4A3C-9B49-861D46D7C44C@cpan.org

On Apr 29, 2011, at 8:46 AM, Karl Williamson wrote:

> On 04/29/2011 07:07 AM, Tom Christiansen wrote:
>> 
>> I'd like to think about how people would use this stuff *in practice*.  The
>> problem  is that in practice, the \xDF case isn't too common, so we don't
>> have many examples to go by.
> 
> Actually, I remember reading the opposite, that ß was the most common and important of the multi-char folds.  I believe that the reason it exists is simply for mathematical completeness.
> 
> I am not a German speaker but we do have some on this list.  My understanding is that ß is already lower case, there is no lower case equivalent to it, but the upper case of ß is 'SS'.  The case fold of 'SS' is 'ss', and therefore by extension so should be the case fold of ß.  But this is in some sense contrary to real German, where there are minimal pair words that differ only by ß and ss and mean different things.  I think maße and masse is an example, or is it müße and müsse (I don't know).
> 
> As an English speaker, I would use /i to try to get the same word in all its possible capitalizations.  (People have pointed out that that isn't really possible in English either due to homonyms and acronyms, but it's something I and others do expect, nonetheless.)  The Unicode practice of assuming transitivity where it really doesn't happen in the native language  leads to the case fold of ß being 'ss', when in fact I don't think it is called for in the language.  I asked Steffen this question on IRC some months ago.
> 
>> there are quite a lot of Greek code
>> points where this arises.  This is due to their weird lowercase Mark,
>> U+0345 COMBINING GREEK YPOGEGRAMMENI, which is both \p{Lower} and \p{Mn}.

It’s the letter iota, which is written as a subscript when it’s not pronounced.

>> 
>>     lower: ᾲ στο διάολο
>>     lower: \x{1FB2} \x{3C3}\x{3C4}\x{3BF} \x{3B4}\x{3B9}\x{3AC}\x{3BF}\x{3BB}\x{3BF}
>> 
>>     title: Ὰͅ Στο Διάολο
>>     title: \x{1FBA}\x{345} \x{3A3}\x{3C4}\x{3BF} \x{394}\x{3B9}\x{3AC}\x{3BF}\x{3BB}\x{3BF}
>> 
>>     upper: ᾺΙ ΣΤΟ ΔΙΆΟΛΟ
>>     upper: \x{1FBA}\x{399} \x{3A3}\x{3A4}\x{39F} \x{394}\x{399}\x{386}\x{39F}\x{39B}\x{39F}
>> 
>> That's because U+1FB2 goes to U+1FBA U+0399 for uppercase, but
>> it goes to U+1FBA U+0345 in titlecase.
>> 
>> I am quite sure that someone would want to use /^\x{1FB2}/i and
>> have it catch all three cases, of
>> 
>>    The lowercase
>>        "\N{GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI}"
>>    becomes this two-codepoint sequence in uppercase:
>>        "\N{GREEK CAPITAL LETTER ALPHA WITH VARIA}\N{GREEK CAPITAL LETTER IOTA}"
>>    but becomes this two-codepoint sequence in TITLECASE not uppercase:
>>        "\N{GREEK CAPITAL LETTER ALPHA WITH VARIA}\N{COMBINING GREEK YPOGEGRAMMENI}"
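
The three forms quoted above can be reproduced from the Unicode case-mapping data. The following is a minimal sketch in Python 3, chosen only because its string methods apply the full (multi-character) mappings; it does not exercise Perl's regex engine. The helper name codepoints and the variable names are made up for this illustration.

```python
# Illustration only: Python 3 string methods apply Unicode full case mappings.
# This shows the mapping data under discussion, not Perl's /i matching.
lower = "\u1fb2"  # GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI

upper = lower.upper()  # full uppercase mapping (two code points)
title = lower.title()  # full titlecase mapping (two code points)


def codepoints(s):
    """Hypothetical helper: render a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


# Print each form next to its full case fold.
for name, form in (("lower", lower), ("title", title), ("upper", upper)):
    print(f"{name}: {codepoints(form)} -> fold: {codepoints(form.casefold())}")

# True when all three forms share one case fold, which is the property a
# case-insensitive match on U+1FB2 would rely on.
all_fold_equal = len({form.casefold() for form in (lower, title, upper)}) == 1
print("same fold:", all_fold_equal)
```

The uppercase and titlecase results differ only in their second code point (capital iota versus the combining ypogegrammeni), yet both fold to the same two-character sequence as the lowercase letter.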

Accents and breathing marks are almost always dropped in all caps. I’ve only seen one publication, from the nineteenth century, that included them. So Unicode’s case-folding rules for Greek have little practical use.

Some Western academic publications of classical texts use a capital iota for the hypogegrammene in titles. But most of the time, it either remains an hypogegrammene or becomes a lowercase iota (that’s right: the lowercase iota is the ‘capital’ hypogegrammene), depending on the publisher’s choice.

> 
>> But what I don't know is whether they are expecting /^[^\x{1FB2}]/i
>> to rule all three of those *out*, that is, be like !/^[\x{1FB2}]/i.

I would never write such a regular expression except by mistake (and I *do* often write regular expressions for Greek text). If I did do it by mistake, I would expect it to match everything except \x{1fb2}.
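
The two readings of the negated class can be contrasted with a small sketch. It is written in Python 3 with plain string operations rather than a regex, so it models the two expectations and says nothing about what any Perl version actually does. The function names excludes_literal and excludes_folded are hypothetical.

```python
# Two hypothetical readings of /^[^\x{1FB2}]/i, modelled without a regex engine.
ALPHA = "\u1fb2"  # GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI

FORMS = {
    "lower": "\u1fb2",
    "title": "\u1fba\u0345",
    "upper": "\u1fba\u0399",
    "plain alpha": "\u03b1",  # unrelated letter, for comparison
}


def excludes_literal(s):
    # Reading from the reply: the first character is anything except
    # U+1FB2 itself; case variants are not excluded.
    return len(s) > 0 and s[0] != ALPHA


def excludes_folded(s):
    # Reading as the negation of the positive match, like !/^[\x{1FB2}]/i:
    # reject any string whose full case fold starts with the fold of U+1FB2.
    return not s.casefold().startswith(ALPHA.casefold())


for name, form in FORMS.items():
    print(f"{name}: literal={excludes_literal(form)} folded={excludes_folded(form)}")
```

Under the literal reading only the lowercase letter is rejected; under the folded reading all three case forms are rejected and only unrelated text gets through.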

(Funny that ᾲ is used in the examples, as I’ve never seen it used in real text. ᾴ and ᾷ are fairly common, though.)

