Re: Unicode regex negated case-insensitivity in 5.14.0-RC1

From: Aristotle Pagaltzis
Date: April 30, 2011 23:41
Subject: Re: Unicode regex negated case-insensitivity in 5.14.0-RC1
Message ID: 20110501064114.GA32332@klangraum.plasmasturm.org

* George Greer <perl@greerga.m-l.org> [2011-04-29 15:45]:
> Correct. Going back to the original (somewhat nonsensical[1])
> regex that triggered this problem:
>
> 	/[^\x00-\x1f\x7f-\xff :]+:/i
>
> So "s" is an acceptable part of the regex but due to
> multi-character case folding "ss" is not. So you have the
> peculiar case that:
>
> 	"s s" =~ /^[^\xDF]+$/i => Y
> 	"ss"  =~ /^[^\xDF]+$/i => N
>
> which can end up very surprising when your word isn't German
> and the only reason \xDF is in the list is because it was
> caught in a range.

It’s surprising even when your word is German. I think the
orthography reform made it so that you can always substitute
a double s for a sharp s. (If memory serves, this was not
always the case before; I’m unsure on both counts.) But you
definitely cannot replace just any double s with a sharp s. The
canonical example is “Wasser”: spelling it “Waßer” has always
been an error, and it remains one.

This means the regex engine cannot make *any* reasonable guess
whatsoever at which match is desired or even acceptable in any
particular case without the user indicating it explicitly.
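
To make the ambiguity concrete, here is a minimal sketch of what
full (multi-character) case folding does to these words. It uses
Python's str.casefold() purely as a neutral stand-in for Unicode
full case folding; it is not Perl's regex engine, and the helper
name fold_equal is hypothetical. The point it illustrates: once
both strings are folded, nothing is left to tell the valid
substitution from the misspelling.

```python
# Illustration only: Python's str.casefold() applies Unicode full case
# folding. This is an assumption-laden stand-in, not Perl's /i matching.

SHARP_S = "\u00df"  # LATIN SMALL LETTER SHARP S


def fold_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings under full case folding."""
    return a.casefold() == b.casefold()


# Full case folding expands the single sharp s into two characters.
sharp_s_folds_to_ss = SHARP_S.casefold() == "ss"

# Simple lowercasing, by contrast, leaves it as one character.
sharp_s_lower_unchanged = SHARP_S.lower() == SHARP_S

# "Strasse" for "Stra\u00dfe": the kind of substitution the post
# believes is acceptable.
valid_pair_matches = fold_equal("Stra\u00dfe", "Strasse")

# "Wa\u00dfer" for "Wasser": a misspelling, yet the fold treats the
# pair exactly like the valid one above.
invalid_pair_matches = fold_equal("Wasser", "Wa\u00dfer")
```

All four flags come out true, so a matcher built on the folded
forms alone has no basis for accepting one pair and rejecting the
other.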

I’m iffy about the entire notion of multi-character case folds
(for regex matching), outside of designated pure ligatures.

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>
