Front page | perl.perl6.language.regex |
Postings from December 2000
Re: Perl 5's "non-greedy" matching can be TOO greedy!
From: Tom Christiansen
December 14, 2000 13:24
Re: Perl 5's "non-greedy" matching can be TOO greedy!
Message ID: 15700.976829056@chthon
>Does anyone disagree with the premise, and believe that "bbbbccccd" is the
>CORRECT match for the non-greedy regexp above?
Yes. The Camel's regex chapter reads:
You might say that eagerness holds priority over greed (or thrift).
>For what it's worth, here's a quote from a Perl 5.005_03 "perlre" manpage:
> By default, a quantified subpattern is "greedy", that is, it will
> match as many times as possible (given a particular starting
> location) while still allowing the rest of the pattern to match.
> If you want it to match the minimum number of times possible,
> follow the quantifier with a "?". Note that the meanings don't
> change, just the "greediness":
>I don't believe that ".*?" matching "bbbcccc" above qualifies as "to match
>the minimum number of times possible", when it is possible only to match
>the "cccc" and still match the full regexp. Since the documentation makes
>no mention of earliest-match in this paragraph, I can only assume this is
>unintended behavior, but I'm asking to check my assumptions. Any devil's
>advocates out there who want to argue for the current behavior?
The simple story is this:
Rule 1: Given two matches at *different* starting points, the
one that occurs earlier wins.
Rule 2: Given two matches at the *same* starting points, the
one that is longer wins.
Or, more lengthly:
Given the opportunity to match something a variable number of
times, maximal quantifiers will elect to maximize the repeat
count. So when we say "as many times as you'd like", the greedy
quantifier interprets this to mean "as many times as you can
possibly get away with", constrained only by the requirement
that this not cause specifications later in the match to fail.
If a pattern contains two open-ended quantifiers, then obviously
both cannot consume the entire string: characters used by one
part of the match are no longer available to a later part. Each
quantifier is greedy at the expense of those that follow it,
reading the pattern left to right.
That's the traditional behavior of quantifiers in regular
expressions. However, Perl permits you to reform the behavior
of its quantifiers: by placing a C<?> after that quantifier,
you change it from maximal to minimal. That doesn't mean that
a minimal quantifier will always match the smallest number of
repetitions allowed by its range, any more than a maximal
quantifier must always match the greatest number allowed in its
range. The overall match must still succeed, and the minimal
match will take as much as it needs to succeed, and no more.
(Minimal quantifiers value contentment over greed.)
For example, in the match:
"exasperate" =~ /e(.*)e/ # $1 now "xasperat"
the C<.*> matches "C<xasperat>", the longest possible way for
it to match. (It also stores that value in C<$1>, as described
below under "Capturing and Clustering".) Although there was a
shorter match available, a greedy match doesn't care. Given
two choices at the same starting point, it always returns the
I<longer> of the two.
Contrast this with this:
"exasperate" =~ /e(.*?)e/ # $1 now "xasp"
Here, the minimal matching version, C<.*?>, is used. Adding
the C<?> to C<*> makes C<*?> take on the opposite behavior: Now
given two choices at the same starting point, it always returns
the I<shorter> of the two.
Although you could read C<*?> as saying to match zero or more
of something but preferring zero, that doesn't mean it will
always match zero characters. If it did so here, for example,
and left C<$1> set to C<"">, then the second "C<e>" wouldn't
be found, since it doesn't immediately follow the first one.
You might also wonder why, in minimally matching C</e(.*?)e/>,
Perl didn't stick "C<rat>" into C<$1>. After all, "C<rat>"
also falls between two C<e>'s, and is shorter than "C<xasp>".
In Perl, the minimal/maximal choice applies only when selecting
the shortest or longest from among several matches that all
have the same starting point. If two possible matches exist,
but these start at different offsets in the string, then their
lengths don't matter--and neither does whether you've used a
minimal quantifier or a maximal one. The earliest of several
valid matches always wins out over all latecomers. It's only
when multiple possible matches start at the same point that you
use minimal or maximal matching to break the tie. If the
starting points differ, there's no tie to break. Perl's matching
is normally I<leftmost longest>; with minimal matching, it
becomes I<leftmost shortest>. But the "leftmost" part never
varies, and is the dominant criterion.<footnote> Not all regex
engines work this way. Some believe in overall greed, in which
the longest match always wins, even if it shows up later. Perl
isn't that way. You might say that eagerness holds priority
over greed (or thrift). For a more formal discussion of this
principle and many others, see the section on "The Little Engine
There are two ways to defeat the leftmost leanings of the pattern
matcher. First, you can use an earlier greedy quantifier
(typically C<.*>) to try to slurp earlier parts of the string.
In searching for a match for a greedy quantifier, it tries for
the longest match first, which effectively searches the rest
of the string right-to-left:
"exasperate" =~ /.*e(.*?)e/ # $1 now "rat"
But be careful with that, since the overall match now includes
the entire string up to that point.
The second way to defeat leftmostness is using positional
assertions, discussed in the next section.
You should probably read the whole chapter.