develooper Front page | perl.perl6.language.regex | Postings from December 2000

Re: Perl 5's "non-greedy" matching can be TOO greedy!

Thread Previous | Thread Next
From:
Deven T. Corzine
Date:
December 14, 2000 14:52
Subject:
Re: Perl 5's "non-greedy" matching can be TOO greedy!
Message ID:
Pine.LNX.4.10.10012141736140.22729-100000@escher.ties.org

On Thu, 14 Dec 2000, Jeff Pinyan wrote:

> On Dec 14, Deven T. Corzine said:
> 
> >The crux of the problem is that non-greedy qualifiers don't affect the
> >"earliest match" behavior, which makes the matches more greedy than they
> >really ought to be.
> 
> That's because "greediness" is just a measure of crawl vs. backtrack.  The
> regex /a.*b/ will match 'a', and as many non-\n characters as possible,
> and then look for a 'b'.  Upon failing, it will back up one character.  On
> the other hand, /a.*?b/ matches an 'a', and then 0 characters, and then
> tries to match a 'b', and upon failing matches another character, etc.

Well, you're discussing implementation details.  I don't know if it can be
implemented easily or not, but there seems to be dispute over which are the
right semantics, which is critical to resolve before worrying about how to
actually implement it...

> >     $_ = "aaaabbbbccccddddeeee";
> >     ($greedy) = /(b.*d)/;              # "bbbbccccdddd" (correct)
> >     ($non_greedy) = /(b.*?d)/;         # "bbbbccccd" (should be "bccccd"!)
> >
> >Does anyone disagree with the premise, and believe that "bbbbccccd" is the
> >CORRECT match for the non-greedy regexp above?
> 
> >     match as many times as possible (given a particular starting
> >     location) while still allowing the rest of the pattern to match.
> 
> The starting location is the first 'b' it matches.  Greediness has nothing
> to do with the 'b' in your regex -- it has to do with the '.'.  The engine
> matches a 'b', and then starts working on 0 or more of anything.
> 
> You're asking for something like
> 
>   /(?<!b)(b.*?d)/
> 
> which is an "optimization" you'll have to incorporate on your own.

Thanks for the example.  Unfortunately, your attempted workaround doesn't
even work for the example string; the "a" preceding "bbbbccccd" isn't a
"b", so the regexp engine is still perfectly happy with the same match.
Even if it worked as you intended, it would have failed with something like
"bbbabbbccccdddd", since the ".*?" would happily match "bbabbbcccc"...

It just goes to show that it's not always as simple to workaroud the flaw
in the regexp design as one might hope.  The other obvious workaround,
/.*(b.*d)/ would work for the example string, but fail if you append "bd"
to the end of the string.  The most viable workaround is to make a second
pass at the match to try to pare it down further:

     $_ = "aaaabbbbccccddddeeee";
     ($non_greedy) = /(b.*?d)/;         # "bbbbccccd"
     $non_greedy =~ s/^.*b/b/;          # "bccccd"

This workaround DOES work, but it wouldn't be necessary in the first place
if the design of the regular expression didn't have this subtle flaw to
begin with.  (And not all regexps will be so trivially patched up, either.)

I'm not pushing for this to be fixed in Perl 5; it's been out there long
enough, and there's no point worrying about it in that version.  But that
doesn't mean that the same design flaw should be kept in Perl 6, when some
minor incompatibilities are to be expected...

Deven


Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About