Re: RFC: New regex modifier flags; also the whimsical nature of backward compatibility; new 'r' flag has issues
From: karl williamson
Date: August 6, 2010 12:51
Subject: Re: RFC: New regex modifier flags; also the whimsical nature of backward compatibility; new 'r' flag has issues
Message ID: 4C5C67C6.3090108@khwilliamson.com
I've done even more analysis, and things aren't as bad as I had claimed.
karl williamson wrote:
> Oops, forgot to attach the program
>
> karl williamson wrote:
>> H.Merijn Brand wrote:
>>> On Fri, 6 Aug 2010 08:14:02 -0400, David Golden <xdaveg@gmail.com>
>>> wrote:
>>>
>>>> On Fri, Aug 6, 2010 at 7:36 AM, karl williamson
>>>> <public@khwilliamson.com> wrote:
>>>>> I did an analysis of this, and it turns out that the only ambiguous case is
>>>>> 's/foo/bar/le'. It seems like overkill for this to invent a new temporary
>>>>> pragma, and forbid all the new modifiers as suffixes, when there is no
>>>>> ambiguity at all outside of substitutions, and no ambiguity using
>>>>> substitutions except for one combination out of all those possible. Why
>>>>> can't we just say in the pods and warning message that '/le' must be
>>>>> written as '/el' in 5.14?
>>>> Help me understand what you mean by ambiguous. If there is really only
>>>> one case, then great!
>>>>
>>>> But hypothetically, what would s/foo/bar/elt1 do? Would the "l" parse
>>>> as a modifier or would it parse as bar of "lt"?
>>>>
>>>> Here's a stupid, but legal example:
>>>>
>>>> $ perl -wE '$_=<>; sub bar { "bar" }; if ( s/foo/bar/elt 1 ) { say "not done" }'
>>>
>>> And mind you that some module might add those flags dynamically. I know
>>> Abigail does some funky stuff, but I bet others do too. Then an eval of
>>> code with a generated regex where the l got inserted just before the e
>>> will suddenly fail.
>>>
>>> FWIW I have no strong opinions here, just pointing to possible places
>>> of hurt.
>>>
>>
>> I'm not sure where to start.
>>
>> So I'll start here: My waking-up-in-the-middle-of-the-night analysis
>> was somewhat flawed (latest included at the end for you to check).
>>
>> First, when I said 'le' was the only possible conflict, I should have
>> said any combination that has 'le' in it, such as '/gle'. Actually, I
>> think, any combination that ends in 'le', so '/glex' isn't ambiguous.
>> Anyway, that really doesn't change the original claim.
>>
>> And the new analysis shows one additional issue I had overlooked if we
>> use the 't' modifier, that issue being 'gt'. If we switch to using
>> 'd' instead, as I'd already been leaning towards, it goes back down to
>> the single problem ('le') that I identified earlier.
>>
>> However, the new analysis shows two problems with the recently added
>> 'r' modifier: '/or' and '/xor'. Hence I've retitled the subject of
>> this post to include Yves' earlier comment on the whimsical nature of
>> finding these backward compatibility issues. 'r' was added with nary
>> a peep, IIRC, about such things. There is a .t patch in the queue
>> somewhere, BTW, which if it had ever been applied, I think would have
>> found these.
Even though /or and /xor are ambiguous, there really isn't a backward
compatibility problem here. I'm sorry I cast aspersions on it. The
reason there is no problem is that there was no pre-existing code that
this made ambiguous. '/xor' or '/or' previously generated a syntax error.
>>
>> The reason there are so few of the charset modifier issues is because
>> we decided that if there were more than one mutually exclusive flag,
>> it would be a syntax error. Thus 'lt' in David's example is not
>> ambiguous. It has to mean that the 'lt' is the less-than operator,
>> because otherwise it leads to a syntax error.
>>
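A minimal sketch of the mutual-exclusion rule described above, written in Python for illustration only. It is not Perl's tokenizer. The helper name `all_modifiers` is hypothetical, and the set of pre-existing modifier letters is an assumption inferred from the keyword lists later in this message. The sketch also assumes that any two charset flags in one run count as the syntax error.

```python
# Hypothetical helper, not Perl's tokenizer: can a run of letters after the
# closing delimiter be read entirely as regex modifiers?
EXISTING = set("msixpgcoe")   # assumption: modifier letters before the proposal
CHARSET_T = set("lut")        # charset flags if 't' means "traditional"
CHARSET_D = set("lud")        # charset flags if 'd' is used instead


def all_modifiers(run, charset=CHARSET_T, other=EXISTING | {"r"}):
    """True if every letter is a modifier and at most one charset flag appears."""
    if not run or any(ch not in charset | other for ch in run):
        return False
    # More than one mutually exclusive charset flag is treated as a syntax error.
    return sum(ch in charset for ch in run) <= 1


# 'elt' has two charset flags, so it can only be /e followed by the lt operator.
print(all_modifiers("elt"))
# 'le' is a legal modifier run and also the le operator: the ambiguous case.
print(all_modifiers("le"))
# 'egt' is a legal modifier run only when 't' is a charset flag.
print(all_modifiers("egt"), all_modifiers("egt", charset=CHARSET_D))
```

Under these assumptions, `elt` can never be a modifier run, which matches the claim that David's example is not ambiguous. A run ending in `le` can be read both ways.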
>> I had come up with preferring 'd' as the modifier meaning the
>> traditional behavior instead of 't', because I think it succinctly
>> describes what is happening: the character set used is like a
>> dual-valued variable. It can be native sometimes, and Unicode other
>> times. (I personally think this behavior is crazy.) 'd' really lays
>> out what this means, whereas 'traditional' is sort of hazy. I bet
>> there are readers of this list who don't realize what is going on, and
>> I, who've looked at the code extensively, still get surprised. And
>> using 'd' reduces the compatibility problems in regard to the charset
>> flags to one, assuming my light-of-day analysis is correct.
>>
>> The analysis is that I wrote the attached program and ran it on all
>> the keywords in the DATA section of keywords.pl. It finds all
>> keywords that consist only of characters that are regex modifiers. My
>> claim is that a new ambiguity exists only if the addition of any legal
>> combination of the new modifiers spells a keyword that isn't already
>> spelled, and if that keyword can occur without a syntax error in the
>> context of immediately following a regex.
Actually, it is stricter than that. Consider a string of letters that
begins with a run of existing regex modifiers, is followed immediately
by, and ends with, a Perl keyword that is valid right after a regex. If
the new legal modifiers allow that whole string to be read as modifiers,
then there is an ambiguity. Otherwise there isn't.
>> It prints out the various
>> words spelled. I then eyeballed the output looking for ones that I
>> thought legally could follow a regex. Perhaps people more familiar
>> with the nuances of Perl will see more. The output follows (note that
>> cmp, ge, and x would be forbidden under our new policy):
>>
>> Pre-existing potential conflicts: cmp, cos, exec, exp, ge, m, pipe,
>> pop, pos, s, semop, x
>>
>> New potential conflicts: close, else, grep, lc, le, log, or, our,
>> sleep, splice, uc, use, xor
>>
>> New potential conflicts with /t: exists, exit, getc, getpgrp, gmtime,
>> goto, gt, msgget, oct, reset, semget, setpgrp, sort, tie, time,
>> times, tr
>>
>> New potential conflicts with /d: die, do, ord, redo, rmdir
>>
>
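The keyword filter described above can be sketched roughly as follows. This is a Python illustration, not the attached Perl program, which is not reproduced in this message. The function names, the set of pre-existing modifier letters, and the small keyword sample are all assumptions. The real run used the DATA section of keywords.pl.

```python
EXISTING = frozenset("msixpgcoe")   # assumption: pre-proposal modifier letters
PROPOSED = frozenset("rlu")         # 'r' plus the charset flags 'l' and 'u'

# Small illustrative sample only, taken from the lists above plus two
# keywords that cannot be spelled with modifier letters.
SAMPLE_KEYWORDS = ["cmp", "ge", "x", "le", "or", "xor", "gt", "die",
                   "print", "while"]


def spellable(word, letters):
    """True if the word uses only the given letters (repeats allowed)."""
    return set(word) <= letters


def classify(keywords, extra="", existing=EXISTING, proposed=PROPOSED):
    """Bucket keywords by which set of modifier letters first spells them."""
    allowed = existing | proposed
    buckets = {"pre-existing": [], "new": [], "extra": []}
    for word in keywords:
        if spellable(word, existing):
            buckets["pre-existing"].append(word)
        elif spellable(word, allowed):
            buckets["new"].append(word)
        elif extra and spellable(word, allowed | set(extra)):
            buckets["extra"].append(word)
    return buckets


print(classify(SAMPLE_KEYWORDS, extra="t"))
print(classify(SAMPLE_KEYWORDS, extra="d"))
```

The sketch only covers the first, mechanical step. Deciding which of the spelled keywords can legally follow a regex was done by eye.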
I will reiterate my updated claims. The only ambiguity introduced by
adding the regex modifiers 'r', 'l', 'u', and 'd' is in an s/// whose
modifiers end in 'le'. I further claim that this single exception can
adequately be handled by giving a customized rewording of the existing
warning when this case is encountered, and by noting it in the
documentation. Using 't' instead of 'd' causes another exception:
modifiers ending in 'gt'. I prefer 'd' anyway.