Re: PATCH: partial [perl #58182]: regex case-sensitive matching now utf8ness independent

December 10, 2009 05:12
2009/12/10 Juerd Waalboer:
> demerphq skribis 2009-12-10 13:23 (+0100):
>> And, [[:word:]] is spelled [[:alnum:]].
> juerd@lanova:~$ perl -le'print "foo" =~ /[[:word:]]/'
> 1
> See perlre

See regexec.c and regcomp.c for the source of our mutual confusion.

>> You cannot have both the current behaviour and non buggy implementation.
> Fully agreed. That's certainly not what I'm after, either.
>> Simply put I consider that:
>> [^STUFF] matching the same code points as [STUFF] to be an irrefutable
>> and overwhelming reason why the current behavior of POSIX charclass
>> cannot be preserved.
> What exactly do you mean by "current behaviour"?
> To fix the issue that codepoints 128..255 are included depending on
> internal encoding, there are two options:
> - Ignore anything above 127
> - Provide full unicode semantics.
> The first, ASCII-only, would be a mistake.

No, it wouldn't. There are no "unicode semantics" for POSIX.

It is a fundamental error to speak of there being any.

> Perhaps there is other current behaviour that I am not aware of.

Apparently my hint wasn't strong enough.

Try matching all the legal codepoints against [^POSIX] and against [POSIX].

Note all the cases where both match. Then do it again with the
strings stored internally as utf8. Note all the errors.

These are fundamental errors.
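
For anyone who wants to run that experiment, here is a minimal sketch. The
helper name probe() and the choice of [[:word:]] are mine, purely for
illustration. It only covers codepoints 0..255, which is the range in
dispute, and what it prints depends on the perl version you run it with.

```perl
use strict;
use warnings;

# Hypothetical helper: match one codepoint against a POSIX class and its
# negation, once as a native 8-bit string and once upgraded to utf8.
sub probe {
    my ($cp, $class) = @_;
    my $pos = qr/[[:${class}:]]/;
    my $neg = qr/[^[:${class}:]]/;

    my $bytes = chr($cp);
    utf8::downgrade($bytes);    # force the native 8-bit representation
    my $wide = chr($cp);
    utf8::upgrade($wide);       # same character, utf8 flag on

    return {
        bytes_pos => ($bytes =~ $pos ? 1 : 0),
        bytes_neg => ($bytes =~ $neg ? 1 : 0),
        wide_pos  => ($wide  =~ $pos ? 1 : 0),
        wide_neg  => ($wide  =~ $neg ? 1 : 0),
    };
}

# For a one-character string exactly one of [X] and [^X] should match,
# and the answer should not change with the internal representation.
for my $cp (0 .. 255) {
    my $r = probe($cp, 'word');
    my @notes;
    push @notes, 'class and negation agree (native)'
        if $r->{bytes_pos} == $r->{bytes_neg};
    push @notes, 'class and negation agree (utf8)'
        if $r->{wide_pos} == $r->{wide_neg};
    push @notes, 'result depends on utf8ness'
        if $r->{bytes_pos} != $r->{wide_pos};
    printf "U+%04X: %s\n", $cp, join('; ', @notes) if @notes;
}
```

Swap 'word' for any other POSIX class name (alpha, alnum, punct, ...) to
repeat the check for that class.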

For me this debate is over: POSIX charclasses are not Unicode
charclasses, and any contortion to try to make them so is futile and
doomed to screw stuff over.


perl -Mre=debug -e "/just|another|perl|hacker/"
