develooper Front page | perl.perl5.porters | Postings from November 2012

On deprecating unescaped literal left brace

Karl Williamson
November 15, 2012 19:13
There was a long twisty set of threads 4 months ago about deprecating 
literal left braces in regular expression patterns when they haven't 
been escaped by a preceding backslash.

Most of the threads were about these tickets

blead currently raises a warning when this unescaped "{" happens, and 
this creates problems which must be solved before 5.18 is released.  I 
thought it was time to work on that, and went back over the issues to 
refresh my memory about the nuances.  This post summarizes what I've 
found and asks whether others agree.

The motivation for doing the deprecation is to eventually forbid an 
unescaped literal left brace, and the motivation for doing that is 
really threefold:

a) It is too easy to make a typo in a quantifier, and have it silently 
be turned into a literal.  An example is the programmer thinking that /x 
allows spaces in the quantifier, like "/foo{1, 3}/x".  This matches the 
characters literally.  There are similar examples, like forgetting the 
closing brace.

b) We are prevented from extending the quantifier syntax to, say, allow 
such a space under /x, or to accept "foo{,7}", which now also silently 
matches literally.

c) The left brace (or curly bracket) is the obvious candidate to use to 
extend the language in various ways.  For example, in Unicode there are 
several different types of word boundaries that could reasonably be 
desired.  Perl only knows about its traditional one.  It would be nice 
to be able to specify a different one, like "\b{g}".
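
To make points (a) and (b) concrete, here is a minimal sketch that asks 
the running perl how it treats a few brace constructs.  The helper name 
classify_brace and the test strings are my own, not anything from the 
threads.  It deliberately reports rather than assumes the result, since 
the treatment of malformed quantifiers is exactly what is under 
discussion and may differ from one perl release to another.

```perl
use strict;
use warnings;

# Report how the running perl treats a brace construct in a /x pattern:
# as a quantifier on the final "o", as literal text, or as a compile error.
# (classify_brace is a hypothetical helper written for this illustration.)
sub classify_brace {
    my ($pat) = @_;
    no warnings;    # silence any warnings while probing
    my $re = eval { qr/\A(?:$pat)\z/x };
    return 'rejected' unless defined $re;

    # Under /x a blank in the pattern is ignored, so the literal
    # reading of the pattern is its text with the blanks removed.
    (my $literal = $pat) =~ s/\s+//g;

    return 'quantifier' if 'fooo'   =~ $re;
    return 'literal'    if $literal =~ $re;
    return 'neither';
}

# Well-formed quantifier, blank after the comma, missing lower bound,
# and missing closing brace.
printf "%-10s => %s\n", $_, classify_brace($_)
    for 'foo{1,3}', 'foo{1, 3}', 'foo{,7}', 'foo{1,3';
```

Only the first pattern is a well-formed quantifier in every perl.  For 
the other three, an answer of "literal" with no diagnostic is the silent 
behavior that the deprecation is meant to eliminate.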

The problem with blead is that if the delimiters of the pattern are 
braces, the lexer/tokenizer strips off any backslash the programmer 
added before a brace, and only then hands the pattern to the regex 
parser, so it is futile for the programmer to add a backslash.  There is 
no work-around except to not use braces as the delimiters, and braces 
are very commonly used as delimiters.

A solution to that is to change the lexer/tokenizer so it retains the 
added backslash.  Several people in the thread thought that this is 
really how things should have worked all along, and the existing 
behavior is a bug.  But there is a problem with making this change: it 
silently changes the behavior of existing programs, as Dave Mitchell 
pointed out:

     blead -e 'print "matched\n" if "aa" =~ m{^a\{1,2\}$};'

The backslashes are stripped off by the lexer/tokenizer, and the braces 
become metacharacters.  It turns out, though, that such code is exceedingly 
rare, at least on CPAN.  I didn't find any such occurrences involving 
braces.  (In a post at the time of the earlier discussion, I indicated 
that if we applied this change not just to braces but to all paired 
delimiters, there was one case involving a left bracket "[" that would 
be a problem, but the code involved has changed in the meantime, so it 
would no longer be a problem.)

Note that in the example those backslashes are extraneous, that is, it 
is equivalent to

     blead -e 'print "matched\n" if "aa" =~ m{^a{1,2}$};'
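
The same comparison can be written as a small script that reports which 
behavior the running perl has.  This is only a sketch; the sub names are 
made up for the illustration.  On the blead described here the first 
test is the one expected to succeed, and after the proposed lexer change 
it would be the second.

```perl
use strict;
use warnings;

# Probe what the running perl does with backslashed braces inside
# brace delimiters.  (Both sub names are hypothetical.)
sub escaped_brace_in_brace_delims {
    no warnings;
    return 'backslash stripped: {1,2} is a quantifier'
        if 'aa' =~ m{^a\{1,2\}$};
    return 'backslash kept: {1,2} is literal'
        if 'a{1,2}' =~ m{^a\{1,2\}$};
    return 'neither';
}

# The unescaped form is a quantifier; the nested braces only have to
# balance for the lexer to find the end of the pattern.
sub unescaped_brace_in_brace_delims {
    return 'aa' =~ m{^a{1,2}$} ? 1 : 0;
}

print escaped_brace_in_brace_delims(), "\n";
print "unescaped form matches 'aa': ",
    unescaped_brace_in_brace_delims(), "\n";
```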

That presents a way out of the dilemma.  We revert the patch that 
instituted the current warning.  Then we deprecate adding extraneous 
backslashes, so that "m{^a\{1,2\}$}" would generate a warning.  In 5.22, 
we would change the lexer/tokenizer to not strip off the preceding 
backslashes of paired regex delimiters, and reinstitute the warning 
about unescaped left braces.  We announced in 5.16's perldelta that 
unescaped left braces would change behavior in 5.20.  We should update 
that announcement in the deltas for 5.18 through 5.22 to say that the 
release for this change is 5.24.

The first deprecation message would only be for unnecessary backslashes. 
In "m{ \{1,2 }", the backslash is necessary to indicate that the left 
brace within the pattern is to be taken literally.  No deprecation 
message would be raised.

Although the left brace is the only character that presents this kind of 
problem in our current plans, to avoid a special case, we would apply 
the non-stripping to all paired delimiters.  However, we would have to 
continue to strip the preceding backslash for non-paired delimiters. 
For example, in


the "\?" is necessary to make the "?" not appear as one of the 
delimiters, and it has the effect of making the question mark be a 
metacharacter.  It is impossible AFAIK to insert a literal question mark 
into the pattern here, unless you use something like \N{QUESTION MARK}. 
  In this case, no warning should be raised, and the lexer/tokenizer 
must continue to strip off the backslash, or else there would be no way 
for someone to use a non-paired metacharacter (one of: "\ | ^ $ * + ? 
."), as a delimiter, and be able to specify it as a metacharacter inside 
the pattern.  Note again that you can't insert one of these as a literal.
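
A minimal sketch of the non-paired case follows.  It uses "|" as the 
delimiter rather than "?", because m?...? also has match-once 
semantics, which would muddy the illustration; the stripping of the 
backslash is the same.  The sub names are made up for the example, and 
\N{VERTICAL LINE} plays the role that \N{QUESTION MARK} plays above.

```perl
use strict;
use warnings;
use charnames ':full';    # for \N{...}; loaded automatically on newer perls

# With "|" as the delimiter, "\|" is needed to keep the bar from ending
# the pattern.  The lexer strips the backslash, so the regex parser sees
# a bare "|", which is alternation.
sub bar_is_alternation { return 'b'   =~ m|^(?:a\|b)\z| ? 1 : 0 }

# The same pattern therefore does not match the three characters "a|b".
sub bar_is_literal     { return 'a|b' =~ m|^(?:a\|b)\z| ? 1 : 0 }

# A named character gets a literal bar into the pattern without ever
# writing the delimiter character itself.
sub named_bar_is_literal {
    return 'a|b' =~ m|^a\N{VERTICAL LINE}b\z| ? 1 : 0;
}

print "backslashed bar acts as alternation: ", bar_is_alternation(),   "\n";
print "backslashed bar matches a literal:   ", bar_is_literal(),       "\n";
print "named bar matches a literal:         ", named_bar_is_literal(), "\n";
```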

Thus there is an existing asymmetry in that it is possible to insert a 
literal paired metacharacter as long as it doesn't have a mate, but not 
to insert a literal non-paired metacharacter.

This proposal has the disadvantage that it trades the current asymmetry 
between paired and non-paired delimiters for a different one: preceding 
backslashes would no longer be stripped off paired delimiters, but would 
continue to be stripped off non-paired ones.  I'm coming down on the 
side that it is worth this change.  Too much time has been wasted, and 
will continue to be wasted, on debugging typos in quantifiers that are 
not warned on, until we do something about it.
