develooper Front page | perl.perl6.language.regex | Postings from December 2000

Re: Perl 5's "non-greedy" matching can be TOO greedy!

From:
Deven T. Corzine
Date:
December 14, 2000 14:35
Subject:
Re: Perl 5's "non-greedy" matching can be TOO greedy!
Message ID:
Pine.LNX.4.10.10012141637440.22729-100000@escher.ties.org

On Thu, 14 Dec 2000, Tom Christiansen wrote:

> >Does anyone disagree with the premise, and believe that "bbbbccccd" is the
> >CORRECT match for the non-greedy regexp above?
> 
> Yes.  The Camel's regex chapter reads:
> 
>     You might say that eagerness holds priority over greed (or thrift).

No question that's how it's been implemented.  But WHY would anyone want
such behavior?  When is it beneficial?

> >For what it's worth, here's a quote from a Perl 5.005_03 "perlre" manpage:
> 
> >     By default, a quantified subpattern is "greedy", that is, it will
> >     match as many times as possible (given a particular starting
> >     location) while still allowing the rest of the pattern to match.
> >     If you want it to match the minimum number of times possible,
> >     follow the quantifier with a "?".  Note that the meanings don't 
> >     change, just the "greediness":
> 
> >I don't believe that ".*?" matching "bbbcccc" above qualifies as "to match
> >the minimum number of times possible", when it is possible only to match
> >the "cccc" and still match the full regexp.  Since the documentation makes
> >no mention of earliest-match in this paragraph, I can only assume this is
> >unintended behavior, but I'm asking to check my assumptions.  Any devil's
> >advocates out there who want to argue for the current behavior?
> 
> The simple story is this:
> 
>     Rule 1: Given two matches at *different* starting points, the
> 	    one that occurs earlier wins.
> 
>     *OTHERWISE*
> 
>     Rule 2: Given two matches at the *same* starting points, the
> 	    one that is longer wins.
>
> Or, more lengthly:
> 
[...]
> 	"exasperate" =~ /e(.*?)e/   #  $1 now "xasp"
[...]

I didn't need the long-winded explanation, and I don't need help with
understanding how that regexp matches what it does.  I understand it
perfectly well already.  I'm no neophyte with regular expressions, even if
Perl 5 does offer some regexp features I've never bothered to exploit...

My point is that the current behavior, while reasonable, isn't quite right.
It's very close, but not quite.

I thought the point of Perl 6 was to improve the language through feedback
and involvement from the Perl community?  That's what I'm trying to do,
offer feedback on what I consider to be a subtle (and yes, fairly minor)
design flaw in Perl 5's regular expressions.  I was hoping we could have a
rational debate about it, rather than propagating the same design flaw into
Perl 6 through inertia or for legacy reasons.  (Is there some reason that
the Perl 5 regular expressions should be exempt from refinement?)

The example I gave has 16 possible matches, "bbbbccccdddd", "bbbbccccddd",
"bbbbccccdd", "bbbbccccd", "bbbccccdddd", "bbbccccddd", "bbbccccdd",
"bbbccccd", "bbccccdddd", "bbccccddd", "bbccccdd", "bbccccd", "bccccdddd",
"bccccddd", "bccccdd", "bccccd".  _All_ of these possible matches are
OVERLAPPING matches.  Obviously greedy matching should (and does) match the
longest match, "bbbbccccdddd".  My contention is that non-greedy matching
should match the shortest match, "bccccd".

Now, the "exasperate" example quoted above is a good one.  It's also an
example of overlapping matches.  And I AGREE that "xasp" is what $1 should
match in the quoted example, not the shorter, later "rat" match.

Basically, I agree that eagerness SHOULD take priority, but not at the
expense of making a "non-greedy" gratuitously more greedy than necessary.

> [...] It's only when multiple possible matches start at the same point
> that you use minimal or maximal matching to break the tie.  If the
> starting points differ, there's no tie to break.

This is the point I take issue with.  In my original example, "bbbbccccd"
had an earlier starting point than "bccccd", and I believe that WAS a tie
to break with non-greedy matching, to prefer "bccccd" over "bbbbccccd".
After considering the interesting example of matching "exasperate", I've
got a more precise formulation to propose:  "tie-breaking" should be done
when potential matches have the same starting OR ending points.

Since the "rat" match in "exasperate" would require pushing the ending
point back further, I'm in agreement that "xasp" is the correct match.
However, "bccccd" had the SAME ending point as "bbbbccccd", so considering
"bbbbccccd" a better match is really VERY arbitrary, and I believe flawed.

I might also note that this is probably a subtle effect of the fact that
Perl and other regular expression systems are based on English, therefore
making leftmost matching seem more "natural".  This is a cultural bias,
which would be different if English were a right-to-left language like
Hebrew or Arabic.  If it were, you can bet that we'd be arguing over
whether "bccccd" or "bccccdddd" was the proper match in my example!

In fact, this brings to mind another possibility to consider for the Perl 6
regular expression engine.  It would probably be helpful for international
purposes to support rightmost matching (as a modifier), since it would
probably be helpful when processing right-to-left languages...  (This would
have to work with the /g modifier, and would make the "rat" in "exasperate"
the preferred match, etc.)

> You should probably read the whole chapter.

That was uncalled for!  Just because I disagree with a particular detail of
the design doesn't mean that I didn't _understand_ it...

Deven




nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About