develooper Front page | perl.perl6.language.regex | Postings from December 2000

Re: Perl 5's "non-greedy" matching can be TOO greedy!

From:
Deven T. Corzine
Date:
December 15, 2000 12:23
Subject:
Re: Perl 5's "non-greedy" matching can be TOO greedy!
Message ID:
Pine.LNX.4.10.10012151459090.31056-100000@escher.ties.org

On 15 Dec 2000, Randal L. Schwartz wrote:

> >>>>> "Deven" == Deven T Corzine <deven@ties.org> writes:
> 
> Deven> What surprised me was how vigorously people would defend the
> Deven> status quo, and insist on the correctness of the current
> Deven> behavior without thinking it through.
> 
> No, I thought it through quite completely.  As have others.

You and others spent plenty of time thinking this through a long time ago.
I wish I had been involved with that process, to raise these points then.

Have you thought it through NOW, on a purely semantic level (in isolation
from implementation issues and historical precedent), or are you rejecting
the idea out-of-hand because you believe you've already given the matter
enough consideration?

> Deven> Given how invested people are in the exact current behavior, I
> Deven> now believe it was a poor choice of words to describe it as a
> Deven> "flaw", simply because it presumed an implied consensus on the
> Deven> higher-level semantics that obviously wasn't there.
> 
> Quite the opposite.  You seem to be one of the very few who expects it
> to act other than as documented.

Really?  I haven't taken a survey, but I did ask one co-worker for his
first impression of what the regexp (from my example) would match.  Not
being an experienced Perl programmer, but being familiar with regular
expressions, he believed he understood the idea of non-greedy matching.
His expectation?  That would match "bccccd", not "bbbbccccd".

Is it that few people expect this, or that they get smacked down so fast
when they suggest it that they never dare raise the question again?

> Deven> It seems to have been interpreted as a value judgement on my
> Deven> part, which it wasn't.  It merely occurred to me that Perl 6
> Deven> might provide an opportunity to eliminate a minor quirk in the
> Deven> regular expression system.  I didn't mean to imply that the
> Deven> current behavior is BAD, simply that it's not quite right (at
> Deven> least in my mind) -- since there's serious disagreement about
> Deven> this, I'd like to make a shift in terminology and start
> Deven> referring to this behavior as a "semantic anomaly" rather than
> Deven> a "flaw" or a "bug", and hope that will be a more neutral term.
> 
> It's not an anomoly at all.  It comes out completely accurate with
> some very simple rules for how regex works.

It's an anomaly in my mind.  Whether it's a real anomaly or an imaginary
one, I left that one up in the air for the moment.  Personally, I believe
the anomaly is real, but quite dependent on the perspective from which you
view the situation.

Yes, the existing rules are accurate, predictable and simple.  Are they the
BEST possible rules, or just what made the most sense at the time?

> Admit it... it bit you, and you are just refusing to believe that you
> don't have a complete and accurate model for how regex work.  Please,
> admit this, and we can MOVE ON.

Actually, it's never bit me.  I only discovered it by experimenting with
contrived examples to see exactly how the new (at that time) non-greedy
matching feature worked.  I thought the behavior was a little odd, but the
explanation made enough sense that it didn't seem worth arguing about.

Now that Perl 6 is on the horizon, it seems time to consider the idea.

I have yet to run into a real-life situation where this anomaly really
matters much.  Which might be an argument not to worry about it.  But the
fact that I was already extremely familiar with greedy regexps, and (at
first) surprised by this behavior, makes me wonder what's the best way.

> Deven> Hopefully, we can have a rational discussion about whether this
> Deven> semantic anomaly is real or imagined, what impact "fixing" it
> Deven> would have on the implementation (if it's deemed real), and
> Deven> whether it's worth "fixing".
> 
> You can't fix what isn't broken.

I put "fixing" in quotes because it depends on whether it's broken or not.

I did the same thing that everyone else is doing, skipping the first part
of the discussion on the assumption that the answer was OBVIOUS.  Well, it
clearly isn't as obvious as we all believe, since what seems obvious to me
and what seems obvious to you are in conflict.  Everyone keeps trying to
frame it as my presumed "misunderstanding" of how it should work, when it's
actually a question of perspective.

Since nobody seems willing to contemplate this from the perspective that
I'm presenting, it's difficult to conclude whether or not it's actually
broken.  Everyone simply assumes I'm wrong and doesn't care beyond that.

> Deven> If the final decision is not to change the current behavior,
> Deven> for whichever reason, I'd like to see this documented in an RFC
> Deven> that says "here's what was requested and why it isn't going to
> Deven> be done".  I'll volunteer to help with that (even if I remain
> Deven> in the minority), whether by summarizing or cutting and pasting
> Deven> arguments made in this discussion...
> 
> Changing the regex to do what you wish would make regex in Perl
> entirely unlike the regex in every other language.  Not gonna happen.

I've NEVER seen non-greedy matching in any other regular expression system
before Perl 5 introduced it.  Ever since then, I've only seen it in other
systems that imitate (or use outright) Perl's regex system.  The very idea
of non-greedy matching is "entirely unlike" all the regex systems that came
before it, including the Perl 4 one.

> Deven> The pattern in question is "b.*?d".  Obviously, this matches
> Deven> "b", followed by something else, followed by "d".  What
> Deven> "something else" should be is the issue at hand.  That portion
> Deven> of the regexp is just ".*?" -- the "." matches any character
> Deven> (except newlines, depending on the mode), the "*" modifies the
> Deven> "." to match "zero or more" of "any character", and the "?"
> Deven> modifies the ".*" to match "zero or more" of "any character",
> Deven> but "matching the minimum number of times possible".
> 
> No.  This is where you are off.  .* and .*? match the same types
> of things.  Just that when given the choice, .*? leans towards
> the shorter version, and .* leans toward the longer.  All the ? does
> is change the *bias* in the face of *choices*.
> 
> But the overriding rules of a regex match are "left most first".
> So the first b will match at the first possible b.  And then
> we run out as many "." matches as we can.  No wait, it's ?, so we
> run out as few "." matches as we can, until we can match a "d".
> Bingo, we got a match!

My proposition is that "leftmost first" should be overriding, but not THAT
overriding.  It's a good idea taken just a little too far for consistency.

> That's the rules.  They're very easy to grasp.  The leftmost match is
> found with the required semantics.  You don't keep going on looking
> for a shorter match.

Yes, the rule is simple.  I don't have an equally simple formulation of how
to _describe_ the rules I'm proposing, but I believe that they ARE more
consistent with programmer intuitions.  (Expectations based on being told
"that's how it works" aren't intuitions.)

> Deven>   Hence,
> Deven> the ".*?" can be summarized as "match anything, but keep the
> Deven> match as short as possible".
> 
> No, that's an incorrect description.  No wonder you are confused.

I'm not confused, I'm in dissent.  That's the description of what that
regular expression would most naturally mean.  The TRUE description of that
it means in Perl 5 is more complicated.  An unnecessary complication, which
is why I'm objecting to it.

> Deven> Am I really the only one who views it this way?  Must I stand
> Deven> alone?
> 
> Yes.  Go stand in the corner. :)

Oh, yes.  Nothing like censure to keep the heathens in line.  :-)

> Deven> If we lived in that ideal world, what behavior would be
> Deven> expected and preferred?
> 
> The current one.  If you muck with "leftmost match wins", not only
> will you break most existing programs, you will SLOW EVERYTHING DOWN,
> because we have to keep going on even after we already have a match!

In that ideal world, why are you assuming that "leftmost match wins" would
be necessarily overriding?

I disagree that the engine would have to continue searching after finding a
match.  There's no reason that it should need to look any further in the
string being matched than it currently does.

The speed question is an interesting one, and I do NOT believe it is a
foregone conclusion that my proposed behavior would be slower.  Yes, it
would be more complex, which means that the NFA would probably take more
memory and slightly longer to compile.  That doesn't guarantee that it
will take longer to execute.  (Maybe it will, but it's not a given.)

I don't want to delve too deeply into implementation efficiency questions
while we still can't agree on the basic premise.  Let me just say that I'm
not (yet) convinced that it will be "too" slow.  Maybe it will be, but that
is not an assumption I intend to make without looking into it further.

Deven




nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About