perl.perl5.porters | Postings from August 2010

Re: RFC: New regex modifier flags; also the whimsical nature of backward compatibility; new 'r' flag has issues

From: karl williamson
Date: August 6, 2010 11:01
Subject: Re: RFC: New regex modifier flags; also the whimsical nature of backward compatibility; new 'r' flag has issues
Message ID: 4C5C4DCE.1080702@khwilliamson.com

H.Merijn Brand wrote:
> On Fri, 6 Aug 2010 08:14:02 -0400, David Golden <xdaveg@gmail.com>
> wrote:
> 
>> On Fri, Aug 6, 2010 at 7:36 AM, karl williamson <public@khwilliamson.com> wrote:
>>> I did an analysis of this, and it turns out that the only ambiguous case is
>>> 's/foo/bar/le'.  It seems like overkill for this to invent a new temporary
>>> pragma, and forbid all the new modifiers as suffixes, when there is no
>>> ambiguity at all outside of substitutions, and no ambiguity using
>>> substitutions except for one combination out of all those possible.  Why
>>> can't we just say in the pods and warning message that '/le' must be
>>> written as '/el' in 5.14?
>> Help me understand what you mean by ambiguous. If there is really only
>> one case, then great!
>>
>> But hypothetically, what would s/foo/bar/elt1 do?  Would the "l" parse
>> as a modifier or would it parse as part of "lt"?
>>
>> Here's a stupid, but legal example:
>>
>>   $ perl -wE '$_=<>; sub bar { "bar" }; if ( s/foo/bar/elt 1 ) { say "not done" }'
> 
> And mind you that some module might add those flags dynamically. I know
> Abigail does some funky stuff, but I bet others do too. Then an eval of
> code with a generated regex where the l got inserted just before the e
> will suddenly fail.
> 
> FWIW I have no strong opinions here, just pointing to possible places
> of hurt.
> 

I'm not sure where to start.

So I'll start here:  My waking-up-in-the-middle-of-the-night analysis 
was somewhat flawed (the latest is included at the end for you to check).

First, when I said 'le' was the only possible conflict, I should have 
said any combination that has 'le' in it, such as '/gle'.  Actually, I 
think it is only combinations that end in 'le', so '/glex' isn't 
ambiguous.  Anyway, that really doesn't change the original claim.

And the new analysis shows one additional issue I had overlooked if we 
use the 't' modifier, that issue being 'gt'.  If we switch to using 'd' 
instead, as I'd already been leaning towards, it goes back down to the 
single problem ('le') that I identified earlier.

However, the new analysis shows two problems with the recently added 'r' 
modifier: '/or' and '/xor'.  Hence I've retitled the subject of this 
post to include Yves' earlier comment on the whimsical nature of finding 
these backward compatibility issues.  'r' was added with nary a peep, 
IIRC, about such things.  There is a .t patch in the queue somewhere, 
BTW, which if it had ever been applied, I think would have found these.

The reason there are so few charset modifier issues is that we decided 
that using more than one of the mutually exclusive flags would be a 
syntax error.  Thus 'lt' in David's example is not ambiguous.  It has 
to mean the less-than operator, because reading both letters as 
modifiers leads to a syntax error.
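
As a rough illustration of that argument, here is a minimal sketch in 
Python.  It is not how perl's tokenizer works; the function names, the 
flag sets, and the partial operator list are all assumptions made for 
the example.  It enumerates every way a run of letters after the 
closing delimiter could be split into modifiers followed by a word 
operator, rejecting any split that uses two different charset flags.

```python
# Hypothetical sketch, not perl's actual parsing logic.
OLD_FLAGS = set("cegimopsx")   # assumed pre-existing s/// modifiers
CHARSET_FLAGS = set("lut")     # assumed proposed charset modifiers, mutually exclusive
# Partial list for illustration: string comparison operators only.
WORD_OPERATORS = {"lt", "gt", "le", "ge", "eq", "ne", "cmp"}


def valid_modifiers(flags):
    """True if every letter is a known flag and at most one distinct
    charset flag appears (assumed reading of 'mutually exclusive')."""
    allowed = OLD_FLAGS | CHARSET_FLAGS
    if any(ch not in allowed for ch in flags):
        return False
    return len({ch for ch in flags if ch in CHARSET_FLAGS}) <= 1


def readings(run):
    """All (modifiers, operator) splits of `run` that would be legal."""
    found = []
    for i in range(len(run) + 1):
        flags, rest = run[:i], run[i:]
        if valid_modifiers(flags) and (rest == "" or rest in WORD_OPERATORS):
            found.append((flags, rest))
    return found
```

Under these assumptions, readings("elt") yields only ('e', 'lt'), so 
David's example has a single reading, while readings("le") and 
readings("gle") each yield two, and readings("glex") yields one.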

I had come to prefer 'd' over 't' as the modifier meaning the 
traditional behavior, because I think it succinctly describes what is 
happening: the character set used is like a dual-valued variable.  It 
can be native sometimes, and Unicode other times.  (I personally think 
this behavior is crazy.)  'd' really lays out what this means, whereas 
'traditional' is sort of hazy.  I bet there are readers of this list 
who don't realize what is going on, and I, who've looked at the code 
extensively, still get surprised.  And using 'd' reduces the 
compatibility problems in regard to the charset flags to one, assuming 
my light-of-day analysis is correct.

For the analysis, I wrote the attached program and ran it on all the 
keywords in the DATA section of keywords.pl.  It finds all keywords that 
consist only of characters that are regex modifiers.  My claim is that a 
new ambiguity exists only if some legal combination of the new 
modifiers spells a keyword that couldn't be spelled before, and if 
that keyword can occur without a syntax error in the context of 
immediately following a regex.  The program prints out the various 
words spelled.  I then eyeballed the output looking for ones that I 
thought could legally follow a regex.  Perhaps people more familiar 
with the nuances of Perl will see more.  The output follows (note that 
cmp, ge, and x would be forbidden under our new policy):

Pre-existing potential conflicts: cmp, cos, exec, exp, ge, m, pipe, pop,
pos, s, semop, x

New potential conflicts: close, else, grep, lc, le, log, or, our,
sleep, splice, uc, use, xor

New potential conflicts with /t: exists, exit, getc, getpgrp, gmtime,
goto, gt, msgget, oct, reset, semget, setpgrp, sort, tie, time, times, tr

New potential conflicts with /d: die, do, ord, redo, rmdir
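
For readers who want to reproduce the grouping, the filtering step 
might look roughly like the Python sketch below.  It is not the 
attached program, and the flag sets, function names, and sample word 
list are assumptions: 'cegimopsx' as the older modifiers, 'l', 'u', 
and 'r' as the new ones, and 'l', 'u', 't', 'd' as the mutually 
exclusive charset letters.  It covers only the mechanical part; 
deciding which words can legally follow a regex is still done by eye.

```python
# Hypothetical re-creation of the idea, not the attached program.
OLD_FLAGS = set("cegimopsx")   # assumed modifiers before the new additions
NEW_FLAGS = set("lur")         # assumed: 'l', 'u' (charset) and the recent 'r'
CHARSET_FLAGS = set("lutd")    # assumed mutually exclusive charset letters


def classify(keyword, candidate=""):
    """Label a keyword by which modifier set can spell it, or None.

    `candidate` is the letter under discussion for the traditional
    behavior ('t' or 'd'); an empty string means neither is considered.
    """
    letters = set(keyword)
    if letters <= OLD_FLAGS:
        return "pre-existing"
    if len(letters & CHARSET_FLAGS) > 1:
        return None  # two charset flags: a syntax error as modifiers
    if letters <= OLD_FLAGS | NEW_FLAGS:
        return "new"
    if candidate and letters <= OLD_FLAGS | NEW_FLAGS | {candidate}:
        return "new with /" + candidate
    return None


def scan(keywords, candidate=""):
    """Group keywords by label, dropping those no modifier set can spell."""
    groups = {}
    for kw in keywords:
        label = classify(kw, candidate)
        if label:
            groups.setdefault(label, []).append(kw)
    return groups


# A small sample only; the real input was the DATA section of keywords.pl.
SAMPLE = ["cmp", "le", "lt", "gt", "or", "xor", "die", "print"]
```

With these sets, scan(SAMPLE, "t") puts cmp under "pre-existing", le, 
or, and xor under "new", and gt under "new with /t"; lt is dropped 
because it needs two charset flags.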



