
Re: Unicode regex negated case-insensitivity in 5.14.0-RC1

From: Nicholas Clark
Date: April 29, 2011 14:14
Subject: Re: Unicode regex negated case-insensitivity in 5.14.0-RC1
Message ID: 20110429211401.GV23881@plum.flirble.org

On Fri, Apr 29, 2011 at 12:21:21PM -0600, Karl Williamson wrote:
> On 04/29/2011 11:53 AM, Corzine, Deven wrote:

> > The programmer expects the case-insensitive flag to be a convenience to
> > avoid enumerating all case variations, much like the character class
> > negation is a convenience to avoid enumerating the entire character set
> > without a few unwanted characters.
> 
> Again, I agree.

Except that negation can't actually be equivalent to enumerating the entire
character set less the unwanted characters, else this would match:

$ ./perl -Ilib -lwe '$_ = "ss"; utf8::upgrade($_); print /\A[^ ]\z/i ? "Y" : "N"'
N

because "all of Unicode less space" includes ß, and /ß/i matches "ss".

So negation is behaving equivalently to multiple negative lookahead
assertions followed by a match on qr/./ (i.e. consume exactly one code point)

[which is making sense to me now, but is a surprise if you're thinking in sets]



Aargh. Also, given that this matches:

$ ./perl -Ilib -lwe '$_ = "ss"; utf8::upgrade($_); print /\A[\x80-\xFF]\z/i ? "Y" : "N"'
Y

shouldn't this?

$ ./perl -Ilib -lwe '$_ = "ss"; utf8::upgrade($_); print /\A[\x00-\xFF]\z/i ? "Y" : "N"'
N


(I was trying to test whether [^ ] was equivalent to [\x00-\x1F\x21-\x{1FFFF}]
and found the result a surprise)

Nicholas Clark
