
Re: RFC: New regex modifier flags; also the whimsical nature of backward compatibility; new 'r' flag has issues

From:
karl williamson
Date:
August 6, 2010 12:51
Subject:
Re: RFC: New regex modifier flags; also the whimsical nature of backward compatibility; new 'r' flag has issues
Message ID:
4C5C67C6.3090108@khwilliamson.com
I've done even more analysis, and things aren't as bad as I had claimed.

karl williamson wrote:
> Oops, forgot to attach the program
> 
> karl williamson wrote:
>> H.Merijn Brand wrote:
>>> On Fri, 6 Aug 2010 08:14:02 -0400, David Golden <xdaveg@gmail.com>
>>> wrote:
>>>
>>>> On Fri, Aug 6, 2010 at 7:36 AM, karl williamson 
>>>> <public@khwilliamson.com> wrote:
>>>>> I did an analysis of this, and it turns out that the only ambiguous 
>>>>> case is
>>>>> 's/foo/bar/le'.  It seems like overkill for this to invent a new 
>>>>> temporary
>>>>> pragma, and forbid all the new modifiers as suffixes, when there is no
>>>>> ambiguity at all outside of substitutions, and no ambiguity  using
>>>>> substitutions except for one  combination out of all those 
>>>>> possible.  Why
>>>>> can't we just say in the pods and warning message that '/le'  must be
>>>>> written as '/el' in 5.14?
>>>> Help me understand what you mean by ambiguous. If there is really only
>>>> one case, then great!
>>>>
>>>> But hypothetically, what would s/foo/bar/elt1 do?  Would the "l" parse
>>>> as a modifier or would it parse as bar of "lt"?
>>>>
>>>> Here's a stupid, but legal example:
>>>>
>>>>   $ perl -wE '$_=<>; sub bar { "bar" }; if ( s/foo/bar/elt 1 ) { say
>>>> "not done" }'
>>>
>>> And mind you that some module might add those flags dynamically. I know
>>> Abigail does some funky stuff, but I bet others do too. Then an eval of
>>> code with a generated regex where the l got inserted just before the e
>>> will suddenly fail.
>>>
>>> FWIW I have no strong opinions here, just pointing to possible places
>>> of hurt.
>>>
>>
>> I'm not sure where to start.
>>
>> So I'll start here:  My waking-up-in-the-middle-of-the-night analysis 
>> was somewhat flawed (latest included at the end for you to check).
>>
>> First, when I said 'le' was the only possible conflict, I should have 
>> said any combination that has 'le' in it, such as '/gle'.  Actually, I 
>> think, any combination that ends in 'le', so '/glex' isn't ambiguous. 
>> Anyway, that really doesn't change the original claim.
>>
>> And the new analysis shows one additional issue I had overlooked if we 
>> use the 't' modifier, that issue being 'gt'.  If we switch to using 
>> 'd' instead, as I'd already been leaning towards, it goes back down to 
>> the single problem ('le') that I identified earlier.
>>
>> However, the new analysis shows two problems with the recently added 
>> 'r' modifier: '/or' and '/xor'.  Hence I've retitled the subject of 
>> this post to include Yves' earlier comment on the whimsical nature of 
>> finding these backward compatibility issues.  'r' was added with nary 
>> a peep, IIRC, about such things.  There is a .t patch in the queue 
>> somewhere, BTW, which if it had ever been applied, I think would have 
>> found these.

Even though /or and /xor are ambiguous, there really isn't a backward 
compatibility problem here.  I'm sorry I cast aspersions on it.  The 
reason there is no problem is that there was no pre-existing code that 
this made ambiguous. '/xor' or '/or' previously generated a syntax error.

>>
>> The reason there are so few of the charset modifier issues is because 
>> we decided that if there were more than one mutually exclusive flag, 
>> it would be a syntax error.  Thus 'lt' in David's example is not 
>> ambiguous.  It has to mean that the 'lt' is the less-than operator, 
>> because any other reading leads to a syntax error.
>>
>> I had come up with preferring 'd' as the modifier meaning the 
>> traditional behavior instead of 't', because I think it succinctly 
>> describes what is happening: the character set used is like a 
>> dual-valued variable.  It can be native sometimes, and Unicode other 
>> times.  (I personally think this behavior is crazy.)  'd' really lays 
>> out what this means, whereas 'traditional' is sort of hazy.  I bet 
>> there are readers of this list who don't realize what is going on, and 
>> I, who've looked at the code extensively, still get surprised.  And 
>> using 'd' reduces the compatibility problems in regard to the charset 
>> flags to one, assuming my light-of-day analysis is correct.
>>
>> The analysis is that I wrote the attached program and ran it on all 
>> the keywords in the DATA section of keywords.pl.  It finds all 
>> keywords that consist only of characters that are regex modifiers.  My 
>> claim is that a new ambiguity exists only if the addition of any legal 
>> combination of the new modifiers spells a keyword that isn't already 
>> spelled, and if that keyword can occur without a syntax error in the 
>> context of immediately following a regex. 

Actually, it is stricter than that.  Consider a string of letters that 
begins with existing regex modifiers and ends with a sequence, 
immediately following them, that is a Perl keyword valid right after a 
regex.  If the new modifiers make it legal to read that whole string as 
modifiers, then there is an ambiguity.  Otherwise there isn't.
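
A minimal sketch of that rule, read literally.  Everything named here is 
an assumption made for illustration: the old modifier set, the 
hand-picked (and possibly incomplete) list of alphabetic operators that 
can follow a regex, and the function name.  It treats every string as if 
it followed s///, and it is not the attached program.

```perl
use strict;
use warnings;

# Assumed for illustration: modifier letters accepted before the change.
my %OLD = map { $_ => 1 } qw(c e g i m o p s x);

# Assumed, possibly incomplete: alphabetic operators that may follow a regex.
my @FOLLOWERS = qw(and cmp eq ge gt le lt ne or x xor);

# $charset is the string of proposed mutually exclusive charset letters,
# e.g. 'lud' or 'lut'.  'r' is always treated as a new modifier.
sub is_newly_ambiguous {
    my ($str, $charset) = @_;
    my %charset = map { $_ => 1 } split //, $charset;
    my %new     = (%OLD, %charset, r => 1);
    my @letters = split //, $str;

    # Strings spelled entirely with old modifiers are pre-existing
    # conflicts, not new ones.
    return 0 unless grep { !$OLD{$_} } @letters;

    # The whole string must be legal as modifiers: known letters only,
    # and at most one charset modifier.
    return 0 if grep { !$new{$_} } @letters;
    my $charset_count = grep { $charset{$_} } @letters;
    return 0 if $charset_count > 1;

    # It must also split into <old modifiers><keyword>.
    for my $kw (@FOLLOWERS) {
        next unless $str =~ /\A(.*)\Q$kw\E\z/;
        my $prefix = $1;
        return 1 unless grep { !$OLD{$_} } split //, $prefix;
    }
    return 0;
}

for my $case ([le => 'lud'], [gle => 'lud'], [glex => 'lud'],
              [elt => 'lut'], [gt => 'lut'], [gt => 'lud']) {
    my ($str, $charset) = @$case;
    printf "%-5s charset=%s %s\n", $str, $charset,
        is_newly_ambiguous($str, $charset) ? 'ambiguous' : 'ok';
}
```

Under these assumptions the sketch flags 'le' and 'gle' but not 'glex', 
rejects 'elt' because 'l' and 't' would be two charset modifiers, and 
flags 'gt' only when 't' is one of the charset letters.  It also flags 
'or' and 'xor', which are ambiguous but harmless for the reason given 
above.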

>> It prints out the various
>> words spelled.  I then eyeballed the output looking for ones that I 
>> thought legally could follow a regex.  Perhaps people more familiar 
>> with the nuances of Perl will see more.  The output follows (note that 
>> cmp, ge, and x would be forbidden under our new policy):
>>
>> Pre-existing potential conflicts: cmp, cos, exec, exp, ge, m, pipe, 
>> pop, pos, s, semop, x
>>
>> New potential conflicts: close, else, grep, lc, le, log, or, our, 
>> sleep, splice, uc, use, xor
>>
>> New potential conflicts with /t: exists, exit, getc, getpgrp, gmtime, 
>> goto, gt, msgget, oct, reset, semget, setpgrp, sort, tie, time, 
>> times, tr
>>
>> New potential conflicts with /d: die, do, ord, redo, rmdir
>>
> 
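
The keyword scan described in the quoted text can be approximated as 
follows.  This is a sketch, not the attached program: the keyword list 
is a small sample rather than the full DATA section of keywords.pl, and 
the modifier sets and the name `spellable` are assumptions made for 
illustration.

```perl
use strict;
use warnings;

# Sample only; the real run covered every keyword in keywords.pl.
my @keywords = qw(cmp ge x le lc or xor our use gt tr sort die do ord
                  utime dump split print);

my $old     = 'cegimopsx';                   # assumed pre-existing modifiers
my $new     = 'rlu';                         # additions common to both proposals
my %charset = map { $_ => 1 } qw(l u t d);   # mutually exclusive flags

# True if $word uses only letters from $letters and contains at most
# one charset flag.
sub spellable {
    my ($word, $letters) = @_;
    return 0 if $word =~ /[^\Q$letters\E]/;
    my $count = grep { $charset{$_} } split //, $word;
    return $count <= 1 ? 1 : 0;
}

my %bucket;
for my $kw (@keywords) {
    my $where = spellable($kw, $old)            ? 'pre-existing'
              : spellable($kw, "$old$new")      ? 'new'
              : spellable($kw, "$old${new}t")   ? 'new with /t'
              : spellable($kw, "$old${new}d")   ? 'new with /d'
              :                                   'not spellable';
    push @{ $bucket{$where} }, $kw;
}

for my $where ('pre-existing', 'new', 'new with /t', 'new with /d',
               'not spellable') {
    printf "%-15s %s\n", "$where:", join ', ', sort @{ $bucket{$where} || [] };
}
```

On this sample the buckets agree with the lists quoted above, and words 
such as 'utime', 'dump' and 'split' drop out because they would need two 
charset modifiers at once.  Deciding which spellable keywords can 
legally follow a regex is still a manual step.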

I will reiterate my updated claims.  The only ambiguity introduced by 
adding regex modifiers 'r', 'l', 'u', and 'd' is in an s/// when the 
modifiers end in 'le'.  I further claim that this single exception can 
adequately be handled by customizing the wording of the existing 
warning when this case is encountered, and by documenting it.  Using 't' 
instead of 'd' causes another exception: things ending in 'gt'.  I 
prefer 'd' anyway.
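
For code that builds modifier strings dynamically, a concern raised 
earlier in the thread, the '/le' to '/el' rewrite could be applied 
mechanically.  The helper below is hypothetical (it is not part of perl 
or any module) and assumes the modifier string has already been checked 
for legality.

```perl
use strict;
use warnings;

# Hypothetical helper: swap a trailing "le" to "el" so the modifiers
# cannot be read as the 'le' operator.  Modifier order does not change
# what the modifiers mean.
sub reorder_modifiers {
    my ($mods) = @_;
    $mods =~ s/le\z/el/;
    return $mods;
}

for my $mods (qw(le gle el glex)) {
    printf "%-5s -> %s\n", $mods, reorder_modifiers($mods);
}
```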

