On deprecating unescaped literal left brace
From: Karl Williamson
Date: November 15, 2012 19:13
Subject: On deprecating unescaped literal left brace
Message ID: 50A5AF43.4040109@khwilliamson.com
There was a long twisty set of threads 4 months ago about deprecating
literal left braces in regular expression patterns when they haven't
been escaped by a preceding backslash.
Most of the threads were about these tickets:
https://rt.perl.org/rt3/Ticket/Display.html?id=113094
https://rt.perl.org/rt3/Ticket/Display.html?id=113420
https://rt.perl.org/rt3/Ticket/Display.html?id=114128
blead currently raises a warning whenever such an unescaped "{" occurs,
and this creates problems which must be solved before 5.18 is released.
I thought it was time to work on that, and went back over the issues to
refresh my memory about the nuances. This post summarizes what I've
found and asks whether others agree.
The motivation for doing the deprecation is to eventually forbid an
unescaped literal left brace, and the motivation for doing that is
really threefold:
a) It is too easy to make a typo in a quantifier, and have it silently
be turned into a literal. An example is the programmer thinking that /x
allows spaces in the quantifier, like "/foo{1, 3}/x". This matches the
characters literally. There are similar examples, like forgetting the
closing brace.
b) We are prevented from extending the quantifier syntax to, say, allow
such a space under /x, or to accept "foo{,7}", which now also silently
matches literally.
c) The left brace (or curly bracket) is the obvious candidate to use to
extend the language in various ways. For example, in Unicode there are
several different types of word boundaries that could reasonably be
desired. Perl only knows about its traditional one. It would be nice
to be able to specify a different one, like "\b{g}".
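To make point (a) concrete, here is a minimal sketch of a checker for
braces that are silently taken literally. "suspicious_braces" is a
hypothetical helper, not anything in perl itself. It assumes the rules
described above (only "{n}", "{n,}" and "{n,m}" without spaces count as
quantifiers) and further assumes that a brace directly after a
backslashed letter, as in "\x{...}" or "\N{...}", belongs to that escape.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only: return the offset of every "{"
# in $pattern that is neither backslash-escaped, nor part of an escape such
# as \x{...}, nor the start of a well-formed {n}, {n,} or {n,m} quantifier.
sub suspicious_braces {
    my ($pattern) = @_;
    my @offsets;
    pos($pattern) = 0;
    while (pos($pattern) < length $pattern) {
        next if $pattern =~ /\G\\[A-Za-z]\{[^}]*\}/gc;      # \x{41}, \N{DIGIT ONE}
        next if $pattern =~ /\G\\./gcs;                     # any escaped char, e.g. \{
        next if $pattern =~ /\G\{[0-9]+(?:,[0-9]*)?\}/gc;   # well-formed quantifier
        if ($pattern =~ /\G\{/gc) {                         # silently literal brace
            push @offsets, pos($pattern) - 1;
            next;
        }
        $pattern =~ /\G./gcs;                               # ordinary character
    }
    return @offsets;
}

for my $pattern ('foo{1,3}', 'foo{1, 3}', 'foo{,7}', 'foo{1,3') {
    my @at = suspicious_braces($pattern);
    printf "%-10s %s\n", $pattern, @at ? "literal brace at offset @at" : "ok";
}
```

Under those assumptions only "foo{1,3}" is reported as ok. The other
three patterns each contain a brace at offset 3 that would be matched
literally without any diagnostic.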
The problem with blead is that if the delimiters of the pattern are
braces, the lexer/tokenizer strips off any backslash that the
programmer added before it hands the pattern to the regex parser, so it
is futile for the programmer to add a backslash. There is no
work-around, except to not use braces as the delimiters. Braces are
used very commonly as delimiters.
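The stripping can be modelled in a few lines of Perl. This is only a
sketch of the behavior described above, not the code in perl's
tokenizer; the name "strip_delimiter_escapes" and its arguments are made
up for the illustration.

```perl
use strict;
use warnings;

# Toy model of the lexer step, not perl's real implementation: remove the
# backslash from every backslash-delimiter pair and leave all other escapes
# (including "\\") untouched.
sub strip_delimiter_escapes {
    my ($body, $open, $close) = @_;
    my $delims = quotemeta($open) . quotemeta($close);
    (my $seen_by_regex = $body) =~ s/\\([$delims])|(\\.)/defined $1 ? $1 : $2/ges;
    return $seen_by_regex;
}

# What the programmer wrote between "m{" and "}", and what the regex
# parser would be handed under this model.
my $written = '^a\{1,2\}$';
print "written: $written\n";
print "parsed:  ", strip_delimiter_escapes($written, '{', '}'), "\n";
```

With "{" and "}" as the delimiters the model turns "^a\{1,2\}$" into
"^a{1,2}$", which is why the backslashes cannot make the braces literal.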
A solution to that is to change the lexer/tokenizer so it retains the
added backslash. Several people in the thread thought that this is
really how things should have worked all along, and the existing
behavior is a bug. But there is a problem with making this change: it
silently changes the behavior of existing programs, as Dave Mitchell
pointed out:
blead -e 'print "matched\n" if "aa" =~ m{^a\{1,2\}$};'
matched
The backslashes are stripped off by the lexer/tokenizer, and the braces
become metacharacters. It turns out, though, that such code is
exceedingly rare, at least on CPAN. I didn't find any such occurrences
involving braces. (In a post at the time of the earlier discussion, I
indicated that if we applied this change not just to braces but to all
paired delimiters, there was one case that would be a problem, involving
a left bracket "[", but the code involved has changed in the meantime,
so it no longer would be a problem.)
Note that in the example those backslashes are extraneous; that is, it
is equivalent to:
blead -e 'print "matched\n" if "aa" =~ m{^a{1,2}$};'
matched
That presents a way out of the dilemma. We revert the patch that
instituted the current warning. Then we deprecate adding extraneous
backslashes, so that "m{^a\{1,2\}$}" would generate a warning. In 5.22,
we would change the lexer/tokenizer to not strip off the preceding
backslashes of paired regex delimiters, and reinstitute the warning
about unescaped left braces. We announced in 5.16's perldelta that
unescaped left braces would change behavior in 5.20. We should update
that announcement in the deltas for 5.18 through 5.22 to say that the
release for this change is 5.24.
The first deprecation message would only be for unnecessary backslashes.
In "m{ \{1,2 }", the backslash is necessary to indicate that the left
brace within the pattern is to be taken literally. No deprecation
message would be raised.
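One possible reading of "unnecessary" can be sketched as follows. The
check is an assumption made for illustration, not the proposed
implementation: it treats a backslashed brace pair as extraneous only
when, once the backslashes are stripped, the pair would form a "{n}",
"{n,}" or "{n,m}" quantifier. The sub name is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical check, for illustration only. $body is the raw text between
# "m{" and "}". Return every \{...\} whose contents would be a quantifier
# once the backslashes are stripped, i.e. where they change nothing.
sub extraneous_brace_escapes {
    my ($body) = @_;
    my @found;
    pos($body) = 0;
    while (pos($body) < length $body) {
        if ($body =~ /\G(\\\{[0-9]+(?:,[0-9]*)?\\\})/gc) {
            push @found, $1;
            next;
        }
        $body =~ /\G(?:\\.|.)/gcs;    # skip one escaped or ordinary character
    }
    return @found;
}

for my $body ('^a\{1,2\}$', ' \{1,2 ') {
    my @useless = extraneous_brace_escapes($body);
    print "m{$body}: ", (@useless ? "would warn about @useless" : "no warning"), "\n";
}
```

Under this reading the first body would draw the deprecation message and
the second would not, because its lone "\{" has no matching "\}".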
Although the left brace is the only character that presents this kind of
problem in our current plans, to avoid a special case, we would apply
the non-stripping to all paired delimiters. However, we would have to
continue to strip the preceding backslash for non-paired delimiters.
For example, in
m?^xy\?$?
the "\?" is necessary to make the "?" not appear as one of the
delimiters, and it has the effect of making the question mark be a
metacharacter. It is impossible AFAIK to insert a literal question mark
into the pattern here, unless you use something like \N{QUESTION MARK}.
In this case, no warning should be raised, and the lexer/tokenizer
must continue to strip off the backslash, or else there would be no way
for someone to use a non-paired metacharacter (one of: "\ | ^ $ * + ?
.") as a delimiter and still be able to specify it as a metacharacter
inside the pattern. Note again that you can't insert one of these as a
literal.
Thus there is an existing asymmetry in that it is possible to insert a
literal paired metacharacter as long as it doesn't have a mate, but not
to insert a literal non-paired metacharacter.
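The non-paired case can be tried directly. This sketch uses "|" rather
than "?" as the delimiter, because m?...? additionally has match-once
semantics that would get in the way; the candidate strings are
arbitrary. According to the description above, the lexer strips the
backslash, so "\|" should act as alternation rather than match a literal
vertical bar.

```perl
use strict;
use warnings;

# "|" is both the delimiter and a regex metacharacter here. If the lexer
# removes the backslash from "\|" as described, the regex engine is handed
# \A(?:a|b)\z, an alternation between "a" and "b".
my @candidates = ('a', 'b', 'a|b');
my @matched    = grep { m|\A(?:a\|b)\z| } @candidates;

print "matched: @matched\n";
```

If the description holds for the perl being used, the string "a|b" is
not among the matches, which illustrates that a non-paired metacharacter
delimiter cannot be written as a literal this way.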
This proposal has the disadvantage that it replaces the current
asymmetry between paired and non-paired delimiters with a different
one. Preceding backslashes would no longer be stripped off paired
delimiters, but would continue to be stripped off non-paired ones.
I'm coming down on the side that this change is worth it. Too much time
has been wasted, and will continue to be wasted, on debugging typos in
quantifiers that are not warned about, until we do something about it.