develooper Front page | perl.perl6.language.regex | Postings from December 2000

Re: Perl 5's "non-greedy" matching can be TOO greedy!

Deven T. Corzine
December 15, 2000 11:14
Re: Perl 5's "non-greedy" matching can be TOO greedy!
Message ID:

On Fri, 15 Dec 2000, brian d foy wrote:

> On Thu, 14 Dec 2000, Deven T. Corzine wrote:
> > Here is a simple example: (tested with perl 5.005_03)
> > 
> >      $_ = "aaaabbbbccccddddeeee";
> >      ($greedy) = /(b.*d)/;              # "bbbbccccdddd" (correct)
> >      ($non_greedy) = /(b.*?d)/;         # "bbbbccccd" (should be "bccccd"!)
> > Does anyone disagree with the premise, and believe that "bbbbccccd" is the
> > CORRECT match for the non-greedy regexp above?

I made a mistake in phrasing it this way, because it seemed to suggest that
I thought it was an implementation bug that it returns "bbbbccccd" instead
of "bccccd".  I didn't make it clear that I was trying to approach this as
a purely SEMANTIC question, considered in isolation from the implementation
of the system.  The question is, "what interpretation makes the most sense,
at a high level", not "why does the current behavior make sense".

It's not that there aren't justifications for the current behavior.  It's a
question of perspective -- from one perspective (mine), "bccccd" makes more
sense semantically.  I believe it it more intuitive, at the highest level.
From a different (more implementation-oriented) perspective, the current
behavior seems to make more sense.  I assumed the current behavior was an
oversight that was later explained away, or a design tradeoff made in favor
of speed over correctness, because it affects relatively few regexps.

What surprised me was how vigorously people would defend the status quo,
and insist on the correctness of the current behavior without thinking it
through.  Maybe I should have expected this -- it's probably an old debate
people are tired of having.  But if there's ever a time to revive any old
debates, isn't this the time?  Perl 5 was an opportune time to change some
entrenched historical behaviors of the language; surely Perl 6 is as well?

Given how invested people are in the exact current behavior, I now believe
it was a poor choice of words to describe it as a "flaw", simply because it
presumed an implied consensus on the higher-level semantics that obviously
wasn't there.

It seems to have been interpreted as a value judgement on my part, which it
wasn't.  It merely occurred to me that Perl 6 might provide an opportunity
to eliminate a minor quirk in the regular expression system.  I didn't mean
to imply that the current behavior is BAD, simply that it's not quite right
(at least in my mind) -- since there's serious disagreement about this, I'd
like to make a shift in terminology and start referring to this behavior as
a "semantic anomaly" rather than a "flaw" or a "bug", and hope that will be
a more neutral term.

Hopefully, we can have a rational discussion about whether this semantic
anomaly is real or imagined, what impact "fixing" it would have on the
implementation (if it's deemed real), and whether it's worth "fixing".

I apologize for jumping the gun on this; I didn't expect such controversy
over the first part (determining whether the anomaly is real or imagined),
and I realize now that it's critical to have that part of the discussion...

> the quantifier applies to the .* portion. that part finds its shortest
> possible match.  the regex engine, however, choose the first "b" it
> encountered as it's start point.  this is unrelated to the behaviour of ?.
> you don't like that it finds the earliest match, instead of, say, the
> longest match or the shortest match, or even all the possible matches
> (hmmm... module idea ;).

Here's where I see the disconnect happening.  I'm approaching this from a
semantic perspective, asking myself "what should this match (ideally)?"
Everyone else is considering what the regex engine will actually do.  Yes,
it's important to consider implementation details, but the ideal semantics
certainly shouldn't be driven by implementation considerations.  Maybe the
semantic anomaly I see is real, yet not worth fixing for implementation
reasons.  That's a plausible outcome, but if that's the case, it would be
best to document that the anomaly exists for practical reasons.  (If the
anomaly is deemed imaginary, a clear explanation (possibly coupled with a
dissenting viewpoint, if any) would good to document.

If the final decision is not to change the current behavior, for whichever
reason, I'd like to see this documented in an RFC that says "here's what
was requested and why it isn't going to be done".  I'll volunteer to help
with that (even if I remain in the minority), whether by summarizing or
cutting and pasting arguments made in this discussion...

Let me explain my perspective on this, from the highest semantic level, and
why this feels like an anomaly to me.  PLEASE, no responses based on NFA's
or the implementation of the regular expression engine, or anything lower
level than the concept of "pattern matching".  I'll concede that there are
some lower-level perspectives from which the current behavior makes more
sense than my suggested behavior.  I'm concerned with the behavior at the
high level that the programmer is (hopefully) operating at.  I'd welcome
any attempt to refuse this argument AT THAT SEMANTIC LEVEL.  Nobody has
tried to refute it yet, without resorting to lower-level issues...

The pattern in question is "b.*?d".  Obviously, this matches "b", followed
by something else, followed by "d".  What "something else" should be is the
issue at hand.  That portion of the regexp is just ".*?" -- the "." matches
any character (except newlines, depending on the mode), the "*" modifies
the "." to match "zero or more" of "any character", and the "?" modifies
the ".*" to match "zero or more" of "any character", but "matching the
minimum number of times possible".  Hence, the ".*?" can be summarized as
"match anything, but keep the match as short as possible".

TAKEN IN ISOLATION, the most natural interpretation for the programmer to
make, at this high semantic level, is that it will match "bccccd" rather
than "bbbbccccd", because the ".*?" is expected to match as little as
possible (though as much as necessary), and "cccc" is as little as it can
possibly match while allowing the entire regexp to match.  "bbbbccccd" is a
longer match, and counterintuitive compared to the semantic description of
what "b.*?d" should match.

In fact, if there's something greedy (such as ".*") earlier in the regexp
that CAN eat up the extra characters, then the ".*?" won't match the extra
"b" characters after all.  Given that regular expressions are normally just
that (regular) and decomposable into its constituent subexpressions, to
have the matching behavior of ".*?" vary depending on factors OUTSIDE that
subexpression, this certainly seems to be an anomaly to me.

Am I really the only one who views it this way?  Must I stand alone?

All the explanations and justifications presented for this have revolved
around implementation issues and historical precedent.  Now, either or both
of those could turn out to be compelling reasons to retain the current
behavior, but they shouldn't preclude consensus on the ideal behavior in an
idealized world where there is no historical precedent to consider and the
implementation is magically just as fast and simple.  If we lived in that
ideal world, what behavior would be expected and preferred?

> perhaps, however, Perl 6 wil allow you to easily use your own custom regex
> engine and then you can make it do whatever you like. ;)

Now, this is a pretty cool idea.  But really a separate topic.  (And one
worth discussing, too.)  Actually, I'm hoping that Perl 6 can expose the
NFA machinery used by the regexp engine in some fashion, such that it could
be used for situations in which an NFA is needed for a non-regexp purpose.

A discussion for another thread, methinks.

Deven Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About