POSIX-like syntax or full compliancy? (Was: PATCH: partial [perl#58182] ...)

Juerd Waalboer
December 11, 2009 01:52
demerphq skribis 2009-12-10 14:11 (+0100):
> See regexec.c and regcomp.c for the source of our mutual confusion.

Unfortunately I don't speak C. If I understood Perl's source, I would
probably have been much more specific when suggesting fixes, from the

> > The first, ASCII-only, would be a mistake.
> No it wouldnt. There are no "unicode semantics" for POSIX.

That would be relevant if Perl had POSIX character classes.

However, the question of whether [[:xxx:]] is POSIX-like syntax, or an
actual POSIX character class, remains unanswered or at least unclear.

Certainly Perl's documentation isn't fully definitive. perlre mentions:

1 => "POSIX character class syntax"
2 => "POSIX character classes"
3 => Equivalences to \p{} Unicode constructs

1 and 3 can both be true, but then 2 is not. This is how I (prefer to)
think of it.

However, it could also be that 1 and 2 are true, ruling 3 out. If I
understand correctly, that's how you see the matter.

The mere existence of exceptions to the POSIX standard in [:xxx:], and
the exclusion of [.xxx.] and [=xxx=] lead me to believe that it's just
syntax compatibility, and Perl is free to extend the class definitions
to meet more modern requirements, like acknowledging that é is indeed
alphanumeric. Even if the people who invented the original POSIX bracket
expressions failed to notice.
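
To make the é question concrete, here is a minimal sketch of how one could check it. The helper name is_alnum is hypothetical, and the script assumes no locale and no "use feature" pragmas are in effect. It tests the same character, U+00E9, in both internal representations, because that is where the answer has been reported to differ. I am not stating the output, since it depends on the Perl version.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the one-character string matches [[:alnum:]].
sub is_alnum {
    my ($char) = @_;
    return $char =~ /\A[[:alnum:]]\z/ ? 1 : 0;
}

my $native   = "\xE9";      # U+00E9 stored as a single native byte
my $upgraded = "\xE9";
utf8::upgrade($upgraded);   # same character, stored internally as UTF-8

printf "native   U+00E9: %s\n", is_alnum($native)   ? 'alnum' : 'not alnum';
printf "upgraded U+00E9: %s\n", is_alnum($upgraded) ? 'alnum' : 'not alnum';
```

If the two lines disagree, the class is neither strictly POSIX nor consistently Unicode, which is the ambiguity described above.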

> Try matching all the legal codepoints against [^POSIX] and against [POSIX]
> And note all the cases where you have both matching. Then do it with
> the strings in unicode. Note all the errors.

I wish you had spent the same time trying to explain what happens if you
do try this. Would have saved me some time and failure, because I was
unable to reproduce the errors.

perl -CO -le'(chr($_) =~ /[[:alnum:]]/) and (chr($_) =~ /[^[:alnum:]]/)
and warn sprintf "U+%04x (%s)\n", $_, chr for 1..65000'

doesn't give me anything. It is likely, however, that I misinterpreted
your instructions. (Note: I have no idea which codepoints qualify as
legal for this purpose, so I used the arbitrary limit of 65000.)
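
For anyone who wants to repeat the experiment, here is a longer sketch of the same test that also covers my reading of "do it with the strings in unicode": each character is tried once in its native form and once after utf8::upgrade. The function name overlapping_codepoints and the 0xFFFF limit are my own assumptions, not something taken from the instructions I was given.

```perl
use strict;
use warnings;
no warnings 'utf8';   # surrogates and non-characters are in the tested range

# Return the code points in 0..$max whose character matches BOTH
# [[:alnum:]] and [^[:alnum:]]. With $upgrade true, the string is forced
# into the internal UTF-8 representation before matching.
sub overlapping_codepoints {
    my ($max, $upgrade) = @_;
    my @both;
    for my $cp (0 .. $max) {
        my $char = chr $cp;
        utf8::upgrade($char) if $upgrade;
        push @both, $cp
            if $char =~ /[[:alnum:]]/ and $char =~ /[^[:alnum:]]/;
    }
    return @both;
}

for my $upgrade (0, 1) {
    my @both = overlapping_codepoints(0xFFFF, $upgrade);
    printf "%s strings: %d code point(s) match both classes\n",
        ($upgrade ? 'upgraded' : 'native'), scalar @both;
    printf "  U+%04X\n", $_ for @both;
}
```

Code points above 255 are always stored as UTF-8, so the two passes can only differ for 0..255.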

> For me this debate is over, POSIX charclasses are not Unicode
> charclasses and any contortion to try to make them so is futile and
> doomed to screw stuff over.

The futile, "doomed to screw stuff over" attempt has been ongoing for
almost a decade. You suggest going back, I suggest going forward.
Unfortunately I don't understand the points you're making, except the
one about POSIX simply not having any notion of unicode. I'm okay with
a change that makes Perl's [:x:] charclasses fully POSIX compliant, but
then it needs to be done rigorously, and all Perl exceptions have to be
eradicated. This should then not be seen as a fix of any unicode bug,
but as a design/semantics change.
