develooper Front page | perl.perl6.language.regex | Postings from December 2000

Perl 5's "non-greedy" matching can be TOO greedy!

Deven T. Corzine
December 14, 2000 13:10
Perl 5's "non-greedy" matching can be TOO greedy!
Message ID:
Non-greedy matching is a very valuable Perl 5 regular expression feature
that simplifies many regular expressions.  However, early on I discovered
what seems to be a failure of the mechanism -- matches were MORE greedy
than necessary.  I'm not sure if other people noticed this, or just failed
to care about it.  Maybe they would disagree with my interpretation or want
to maintain backward-compatibility now that it's implemented in Perl 5 this
way.  I always meant to raise it for discussion long ago, but I never found
the time to participate in the p5p list...

The crux of the problem is that non-greedy qualifiers don't affect the
"earliest match" behavior, which makes the matches more greedy than they
really ought to be.

Here is a simple example: (tested with perl 5.005_03)

     $_ = "aaaabbbbccccddddeeee";
     ($greedy) = /(b.*d)/;              # "bbbbccccdddd" (correct)
     ($non_greedy) = /(b.*?d)/;         # "bbbbccccd" (should be "bccccd"!)

I'm sure this will complicate the NFA for the regexp, but it seems like
this really ought to be fixed, at least in Perl 6.  (There's a good case to
be made for fixing it in Perl 5, but people have ignored/missed it for this
long already...)

Does anyone disagree with the premise, and believe that "bbbbccccd" is the
CORRECT match for the non-greedy regexp above?

For what it's worth, here's a quote from a Perl 5.005_03 "perlre" manpage:

     By default, a quantified subpattern is "greedy", that is, it will
     match as many times as possible (given a particular starting
     location) while still allowing the rest of the pattern to match.
     If you want it to match the minimum number of times possible,
     follow the quantifier with a "?".  Note that the meanings don't 
     change, just the "greediness":

I don't believe that ".*?" matching "bbbcccc" above qualifies as "to match
the minimum number of times possible", when it is possible only to match
the "cccc" and still match the full regexp.  Since the documentation makes
no mention of earliest-match in this paragraph, I can only assume this is
unintended behavior, but I'm asking to check my assumptions.  Any devil's
advocates out there who want to argue for the current behavior?


P.S. Meta question: Is this the right way to propose this, or should I
write it up as an RFC and subit it first?  (I'm not clear if the RFC is
intended to be submitted out of the blue, or after initial discussion...) Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About