perl.perl6.language.regex | Postings from December 2000

Re: Perl 5's "non-greedy" matching can be TOO greedy!

From: merlyn
Date: December 15, 2000 11:37
Subject: Re: Perl 5's "non-greedy" matching can be TOO greedy!
Message ID: m1g0jpiigk.fsf@halfdome.holdit.com

>>>>> "Deven" == Deven T Corzine <deven@ties.org> writes:

Deven> What surprised me was how vigorously people would defend the
Deven> status quo, and insist on the correctness of the current
Deven> behavior without thinking it through.

No, I thought it through quite completely.  As have others.

Deven> Given how invested people are in the exact current behavior, I
Deven> now believe it was a poor choice of words to describe it as a
Deven> "flaw", simply because it presumed an implied consensus on the
Deven> higher-level semantics that obviously wasn't there.

Quite the opposite.  You seem to be one of the very few who expects it
to act other than as documented.

Deven> It seems to have been interpreted as a value judgement on my
Deven> part, which it wasn't.  It merely occurred to me that Perl 6
Deven> might provide an opportunity to eliminate a minor quirk in the
Deven> regular expression system.  I didn't mean to imply that the
Deven> current behavior is BAD, simply that it's not quite right (at
Deven> least in my mind) -- since there's serious disagreement about
Deven> this, I'd like to make a shift in terminology and start
Deven> referring to this behavior as a "semantic anomaly" rather than
Deven> a "flaw" or a "bug", and hope that will be a more neutral term.

It's not an anomaly at all.  It falls out exactly from some very
simple rules for how regexes work.

Admit it... it bit you, and you are just refusing to believe that you
don't have a complete and accurate model for how regexes work.  Please,
admit this, and we can MOVE ON.

Deven> Hopefully, we can have a rational discussion about whether this
Deven> semantic anomaly is real or imagined, what impact "fixing" it
Deven> would have on the implementation (if it's deemed real), and
Deven> whether it's worth "fixing".

You can't fix what isn't broken.

Deven> If the final decision is not to change the current behavior,
Deven> for whichever reason, I'd like to see this documented in an RFC
Deven> that says "here's what was requested and why it isn't going to
Deven> be done".  I'll volunteer to help with that (even if I remain
Deven> in the minority), whether by summarizing or cutting and pasting
Deven> arguments made in this discussion...

Changing the regex to do what you wish would make regex in Perl
entirely unlike the regex in every other language.  Not gonna happen.

Deven> The pattern in question is "b.*?d".  Obviously, this matches
Deven> "b", followed by something else, followed by "d".  What
Deven> "something else" should be is the issue at hand.  That portion
Deven> of the regexp is just ".*?" -- the "." matches any character
Deven> (except newlines, depending on the mode), the "*" modifies the
Deven> "." to match "zero or more" of "any character", and the "?"
Deven> modifies the ".*" to match "zero or more" of "any character",
Deven> but "matching the minimum number of times possible".

No.  This is where you are off.  .* and .*? match the same types
of things.  Just that when given the choice, .*? leans towards
the shorter version, and .* leans toward the longer.  All the ? does
is change the *bias* in the face of *choices*.
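
To make the "bias" point concrete, here is a minimal sketch.  It uses
Python's re module rather than Perl, on the assumption that the two
engines treat these two patterns the same way; the input string is
made up for illustration.  Both patterns start at the same "b"; they
differ only in which "d" they settle for.

```python
import re

# Hypothetical input: one "b", then two candidate "d" characters.
text = "xbyydzzd"

greedy = re.search(r"b.*d", text)   # .* prefers the longer alternative
lazy = re.search(r"b.*?d", text)    # .*? prefers the shorter alternative

print(greedy.group())  # byydzzd
print(lazy.group())    # byyd

# Same starting position either way; only the end differs.
print(greedy.start(), lazy.start())  # 1 1
```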

But the overriding rule of a regex match is "leftmost first".
So the first b will match at the first possible b.  And then
we run out as many "." matches as we can.  No wait, it's ?, so we
run out as few "." matches as we can, until we can match a "d".
Bingo, we got a match!

Those are the rules.  They're very easy to grasp.  The leftmost match is
found with the required semantics.  You don't keep going on looking
for a shorter match.
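
A small sketch of the leftmost rule, again in Python's re on the
assumption that it behaves like Perl 5 for this pattern.  The input
string is an illustrative guess at the kind of case that surprises
people, not a quote from the thread.  The match begins at the first
"b", even though a shorter "b...d" exists further right.  The second
pattern shows one common way to get the shorter result: say what the
middle may not contain.

```python
import re

# Hypothetical input with several "b" characters before the "d".
text = "bbbbccccd"

m = re.search(r"b.*?d", text)
print(m.group(), m.start())   # bbbbccccd 0

# Excluding "b" from the middle forces the match to begin at the
# last "b" before the "d"; leftmost-first still applies, but the
# earlier starting points now fail outright.
tight = re.search(r"b[^b]*?d", text)
print(tight.group(), tight.start())   # bccccd 3
```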

Deven>   Hence,
Deven> the ".*?" can be summarized as "match anything, but keep the
Deven> match as short as possible".

No, that's an incorrect description.  No wonder you are confused.

Deven> Am I really the only one who views it this way?  Must I stand
Deven> alone?

Yes.  Go stand in the corner. :)

Deven> If we lived in that ideal world, what behavior would be
Deven> expected and preferred?

The current one.  If you muck with "leftmost match wins", not only
will you break most existing programs, you will SLOW EVERYTHING DOWN,
because we have to keep going on even after we already have a match!
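
To see where the extra cost would come from, here is a rough sketch
of a "shortest match anywhere" search built on top of an ordinary
engine.  The function shortest_match is hypothetical, written in
Python for illustration; it is not how any real engine works.  It has
to try every starting position, even after a match has been found,
which is exactly the work "leftmost match wins" avoids.

```python
import re

def shortest_match(pattern, text):
    """Try every start position and keep the shortest match found.

    Hypothetical helper: compares the engine's preferred match at
    each start, so it assumes the pattern is already non-greedy.
    """
    compiled = re.compile(pattern)
    best = None
    for start in range(len(text) + 1):
        # match() with a position anchors the attempt at that offset.
        m = compiled.match(text, start)
        if m and (best is None or len(m.group()) < len(best.group())):
            best = m
    return best

text = "bbbbccccd"
print(re.search(r"b.*?d", text).group())        # bbbbccccd
print(shortest_match(r"b.*?d", text).group())   # bccccd
```

The ordinary search stops at offset 0; the sketch above keeps going
through every remaining offset before it can answer.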

-- 
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<merlyn@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!
