develooper Front page | perl.perl5.porters | Postings from December 2009

Re: POSIX-like syntax or full compliancy? (Was: PATCH: partial [perl#58182] ...)

karl williamson
December 10, 2009 21:29
Juerd Waalboer wrote:
> demerphq skribis 2009-12-10 14:11 (+0100):
>> See regexec.c and regcomp.c for the source of our mutual confusion.
> Unfortunately I don't speak C. If I understood Perl's source, I would
> probably have been much more specific when suggesting fixes, from the
> beginning.
>>> The first, ASCII-only, would be a mistake.
>> No, it wouldn't. There are no "unicode semantics" for POSIX.
> That would be relevant if Perl had POSIX character classes.
> However, the question of whether [[:xxx:]] is POSIX-like syntax, or an
> actual POSIX character class, remains unanswered or at least unclear.
> Certainly Perl's documentation isn't fully definitive. perlre mentions:
> 1 => "POSIX character class syntax"
> 2 => "POSIX character classes"
> 3 => Equivalences to \p{} Unicode constructs
> 1 and 3 can both be true, but then 2 is not. This is how I (prefer to)
> think of it.
> However, it could also be that 1 and 2 are true, ruling 3 out. If I
> understand correctly, that's how you see the matter.
> The mere existence of exceptions to the POSIX standard in [:xxx:], and
> the exclusion of [.xxx.] and [=xxx=] lead me to believe that it's just
> syntax compatibility, and Perl is free to extend the class definitions
> to meet more modern requirements, like acknowledging that é is indeed
> alphanumeric. Even if the people who invented the original POSIX bracket
> expressions failed to notice.
>> Try matching all the legal codepoints against [^POSIX] and against [POSIX]
>> And note all the cases where you have both matching. Then do it with
>> the strings in unicode. Note all the errors.
> I wish you had spent the same time trying to explain what happens if you
> do try this. Would have saved me some time and failure, because I was
> unable to reproduce the errors.
> perl -CO -le'(chr($_) =~ /[[:alnum:]]/) and (chr($_) =~ /[^[:alnum:]]/)
> and warn sprintf "U+%04x (%s)\n", $_, chr for 1..65000'
> doesn't give me anything. It is likely, however, that I misinterpreted
> your instructions. (Note: I have no idea which codepoints qualify as
> legal for this purpose, so I used the arbitrary limit of 65000.)
>> For me this debate is over, POSIX charclasses are not Unicode
>> charclasses and any contortion to try to make them so is futile and
>> doomed to screw stuff over.
> The futile, doomed-to-screw-stuff-over attempt has been ongoing for
> almost a decade. You suggest going back, I suggest going forward.
> Unfortunately I don't understand the points you're making, except the
> one about POSIX simply not having any notion of unicode. I'm okay with
> a change that makes Perl's [:x:] charclasses fully POSIX compliant, but
> then it needs to be done rigorously, and all Perl exceptions have to be
> eradicated. This should then not be seen as a fix of any unicode bug,
> but as a design/semantics change.

If you run the attached test file, which I believe Yves wrote, you can
find some of the errors. That said, I do believe that, given enough
work, we could fix things so that these POSIX-like constructs seamlessly
match above the ASCII range without matching a class and its complement
simultaneously (we've invented quantum particle regexes!), BUT with
several exceptions. One is locale (and EBCDIC machines, should they
ever use Perl again): these constructs are supposed to match in the
given locale, whereas the parallel \p ones do not. The other exceptions
are that the Perl/Unicode definitions don't match the POSIX definitions
for two of the constructs. [[:word:]] is a Perl extension, and hence not
relevant here; but Punct differs significantly, and Space differs in one
character, as Juerd pointed out.
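
For readers without the attachment, here is a minimal sketch of the kind
of check being discussed. It is not the attached test file; the helper
name inconsistent_points and the choice of classes and range are
assumptions made for illustration. It reports code points for which a
class and its complement both match, or both fail, in each internal
string representation. Results will depend on the perl version it runs on.

```perl
use strict;
use warnings;

# Hypothetical helper (not taken from the attached test file): returns the
# code points in 0..$max for which [[:class:]] and [^[:class:]] either both
# match or both fail, for the requested string representation.
sub inconsistent_points {
    my ($class, $max, $upgrade) = @_;
    my $pos = qr/[[:${class}:]]/;
    my $neg = qr/[^[:${class}:]]/;
    my @bad;
    for my $cp (0 .. $max) {
        my $s = chr $cp;
        # Force UTF-8 or native-byte storage; downgrade is allowed to fail
        # quietly for code points above 255.
        if ($upgrade) { utf8::upgrade($s) } else { utf8::downgrade($s, 1) }
        my $in  = $s =~ $pos ? 1 : 0;
        my $out = $s =~ $neg ? 1 : 0;
        push @bad, $cp if $in == $out;
    }
    return @bad;
}

# Summarize a few classes over Latin-1 in both representations.
for my $class (qw(alnum alpha punct space)) {
    for my $upgrade (0, 1) {
        my @bad = inconsistent_points($class, 255, $upgrade);
        printf "%-6s %-6s %d inconsistent code points\n",
            $class, ($upgrade ? 'utf8' : 'native'), scalar @bad;
    }
}
```

A count of zero on every line would mean that no class overlaps with its
complement in that range on the perl in use.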

It was our intention that 5.12 would use the strict POSIX definitions
rigorously for all of these, except the Perl-invented extension
[[:word:]], which has no POSIX definition. I think you said you were
OK with that. In a sense it is a partial fix of the Unicode bug, because
it means that these cases will no longer have different semantics if the
internal representation changes.
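
To make "different semantics if the internal representation changes"
concrete, the following sketch matches the same code point against a
bracketed class twice, once stored as native bytes and once upgraded to
UTF-8. The helper name class_matches is an assumption for illustration,
and no particular output is claimed: whether the two answers agree
depends on the perl version and on any regex modifiers or features in
effect.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if [[:class:]] matches the given code point
# when the string is stored as native bytes ($upgrade false) or as UTF-8
# ($upgrade true), and 0 otherwise.
sub class_matches {
    my ($class, $cp, $upgrade) = @_;
    my $s = chr $cp;
    if ($upgrade) { utf8::upgrade($s) } else { utf8::downgrade($s, 1) }
    return $s =~ /[[:${class}:]]/ ? 1 : 0;
}

my $cp = 0xE9;    # LATIN SMALL LETTER E WITH ACUTE
printf "U+%04X [[:alpha:]] native=%d utf8=%d\n",
    $cp, class_matches('alpha', $cp, 0), class_matches('alpha', $cp, 1);
```

If the two numbers differ, the match result is tracking the storage
format rather than the character, which is the behaviour a strict
ASCII-only POSIX definition would remove.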
