develooper Front page | perl.perl5.porters | Postings from February 2015

Re: [perl #123946] assert in /\p^ /

From:
Karl Williamson
Date:
February 27, 2015 23:00
Subject:
Re: [perl #123946] assert in /\p^ /
Message ID:
54F0F700.2070002@khwilliamson.com
On 02/27/2015 06:48 AM, Hugo van der Sanden via RT wrote:
> On Thu Feb 26 17:52:20 2015, hv wrote:
>> My guess is we want to support /\p^L/ but not /\p^ L/; the diff below
>> is a start towards that, but it's not sufficient - I think we need to
>> move the parsing out of the !SIZE_ONLY guard, or we can't be sure to
>> continue at the right point.
>
> I'm not at all sure about that, and the docs are coy - the only mention I can find of using a caret to invert a property is in perlunicode:
>         You can also use negation in both "\p{}" and "\P{}" by introducing a
>         caret (^) between the first brace and the property name: "\p{^Tamil}"
>         is equal to "\P{Tamil}".
>
> A CPAN grep shows braceless \p^x being tested by ShiftJIS::Regexp, and documented in passing by its pod examples, but didn't show any other uses.
>
> Permitting it, but not skipping whitespace after the caret, results in behaviour I don't understand - I think it's successfully matching some kind of null property, so that /\p^ / ends up roughly equivalent to /./s. So maybe it is better to skip whitespace after all; on that assumption I've pushed the branch hv/braceless-property for review, with one commit for the utf8_heavy warnings and a second for the parse issues.
>
> Hugo
>
> ---
> via perlbug:  queue: perl5 status: new
> https://rt.perl.org/Ticket/Display.html?id=123946

I have lately noticed various issues with \p parsing.  I thought it too
late to change them before 5.23, but I'll mention some here.

First, white space is supposed to be significant in patterns except 
under /x.  However, within the braces of \p{}, it is ignored regardless 
of /x, because Unicode suggests/requires doing that in resolving 
property names.  Thus /\p ^ L/ should be an error except possibly under 
/x.  It's a bug that it isn't, and I think the first thing to do is to
make it an error.  I rather think white space should be forbidden here
except within braces, but maybe backcompat says we have to allow it.  If
so, definitely not unless /x is specified.

Also, the white space accepted includes vertical space.  I don't think 
that was intended; I don't think we should be accepting

\p


^



L


with or without braces.  I think those isSPACE calls should be isBLANK 
calls instead.  But if we do want to continue allowing vertical space, 
we should be using Unicode's pattern white space instead of regular SPACE.
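
To make the isSPACE/isBLANK distinction concrete, here is a minimal
sketch.  It does not use the core's own macros; it assumes the C
standard library's isspace() and isblank() as stand-ins, and the helper
names skip_space() and skip_blank() are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Skip only horizontal blanks (space and tab in the "C" locale).
 * Returns a pointer to the first character that is not a blank. */
static const char *skip_blank(const char *s)
{
    assert(s != NULL);
    while (isblank((unsigned char)*s))
        s++;
    return s;
}

/* Skip everything isspace() accepts, which also includes vertical
 * space such as '\n', '\v', '\f' and '\r'. */
static const char *skip_space(const char *s)
{
    assert(s != NULL);
    while (isspace((unsigned char)*s))
        s++;
    return s;
}
```

With the blank-only version, a newline between \p and the caret stops
the skip immediately, so the parser would see an unexpected character
there and could report an error instead of silently continuing.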

Another issue is probably endemic through perl, and that is the use of 
strchr() to find matching characters, in this case the right brace. 
Consider

/\p{foo # This was supposed to be a comment
         # and this
         # and so on
         # including this which contains a }
	# ...
/x

This actually compiles, and when you match against it, you get

Can't find Unicode property definition "foo # This was supposed to be a comment"

It thought this was a user-defined property (with a multi-line name, 
though the error message only prints the first line of that name).  It 
may be hard for the user to associate the error message with the actual 
error if the pattern is buried inside a module.  It should have been a 
compile-time error.
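
The strchr() problem above can be sketched in isolation.  This is not
the code in regcomp.c; the function names are hypothetical, and the rule
that a '#' or a newline can never be part of a property name is an
assumption made only for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Naive approach: everything up to the first '}' is taken as the name.
 * Returns the length of that "name", or -1 if there is no '}'. */
static long name_len_naive(const char *after_brace)
{
    const char *end;

    assert(after_brace != NULL);
    end = strchr(after_brace, '}');
    return end ? (long)(end - after_brace) : -1;
}

/* Stricter approach: scan character by character and give up as soon
 * as the would-be name runs into a '#' or a newline (assumed illegal
 * in a name), or the string ends before a '}' is seen. */
static long name_len_strict(const char *after_brace)
{
    const char *p;

    assert(after_brace != NULL);
    for (p = after_brace; *p != '\0'; p++) {
        if (*p == '}')
            return (long)(p - after_brace);
        if (*p == '\n' || *p == '#')
            return -1;          /* ran into a comment or line end */
    }
    return -1;                  /* unterminated */
}
```

Given the text after the opening brace in the example above, the naive
version happily returns a multi-line "name" ending at the '}' inside
the comment, while the strict version fails at the first '#', which is
the point where a compile-time error could be raised.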

Even more reasonable things are problematic: \p{Any=Y} might make sense
if Any were a Unicode property, all of which are specifiable in a
bipartite manner like this.  But 'Any' is a Perl extension, none of
which can currently be specified this way.  The code thinks it must be a
user-defined property whose name is "Any=Y", and compiles to that
interpretation, and the error is found only upon execution.  The code
should be fixed to look for valid syntax in user-defined property names.
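
A syntax check of that kind might look roughly like the sketch below.
It assumes the rule documented in perlunicode that a user-defined
property name begins with "In" or "Is", optionally preceded by package
qualifiers, and it assumes the name has already been trimmed of white
space.  The function name looks_like_user_property() is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Return 1 if name[0..len) looks like a user-defined property name:
 * optional "Pkg::" qualifiers, then a final component that starts
 * with "In" or "Is"; only word characters and "::" are allowed. */
static int looks_like_user_property(const char *name, size_t len)
{
    size_t i;
    size_t last = 0;            /* start of the final component */

    assert(name != NULL);
    for (i = 0; i < len; i++) {
        if (name[i] == ':') {
            /* A ':' is only valid as the first half of "::". */
            if (i + 1 >= len || name[i + 1] != ':')
                return 0;
            i++;                /* consume the second ':' */
            last = i + 1;
        } else if (!isalnum((unsigned char)name[i]) && name[i] != '_') {
            return 0;           /* e.g. '=', '#', space, newline */
        }
    }
    if (len - last < 2 || name[last] != 'I')
        return 0;
    return name[last + 1] == 'n' || name[last + 1] == 's';
}
```

Under these assumptions "Any=Y" is rejected because of the '=', so the
mistake could be reported when the pattern is compiled rather than when
it is first matched.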

strchr() leads to this kind of problem, which exists in other constructs
besides this one.

To go back to your original question: we need to decide exactly what
should be legal first.  I gave my opinion; others welcome (as long as
they agree with mine, that is).



