develooper Front page | perl.perl6.language.regex | Postings from December 2000

Re: Perl 5's "non-greedy" matching can be TOO greedy!

From:
Deven T. Corzine
Date:
December 15, 2000 15:04
Subject:
Re: Perl 5's "non-greedy" matching can be TOO greedy!
Message ID:
Pine.LNX.4.10.10012151728430.31056-100000@escher.ties.org
[I delayed responding to this message because it was the longest.]

On Thu, 14 Dec 2000, Tom Christiansen wrote:

> >No question that's how it's been implemented.  But WHY would anyone want
> >such behavior?  When is it beneficial?
> 
> It is beneficial because this is how it's always been, because it
> is faster, because it is more expressive, because it is more powerful,
> because it is more intuitive, and because it is more perlian.

Nice summary, but I'm not buying what you're selling in the elaboration.

> In elaboration:
> 
> 0) All NFAs before POSIX acted this way.  It is historically 
>    consistent and perfectly expected.

First, all those older regular expression systems were inherently greedy
matching algorithms.  There is absolutely no conflict between earliest
match and greedy matching.  The conflict only arises when non-greedy
matching is introduced into the equation.

Second, by referring to NFA's, you're already dropping down to a lower
level, and as I said in a prior message, you've lost the gestalt by then.
Once you drop to that level (and stop referencing the semantics of the
whole), my point is moot, because it no longer makes sense at that level.

Finally, historical precedent is a always a justification not to change
anything.  Of course, that argued against the wonderful regex extensions
that Perl 5 added, since they went against the way "it's always been"...

> 1) It is obviously faster to come to an answer earlier on in the
>    execution than it would be to come to an answer later.  It's
>    like an expression whose evaluation short-circuits.  Also, when
>    the matching sematics permit back tracking and back references,
>    the combinatoric possibilities can easily explode into virtual
>    unsolvability as the 2**N algorithm loses its race to the heat
>    death of the universe.  Yes, if Perl did overall-longest or
>    overall-shorted, this would produce a more predictable time;
>    however, as we see with DFAs and POSIX NFAs, this prediction
>    plays out as guaranteed *WORST-CASE* time.  It is not acceptable
>    to make everyone pay the worst-case time.  Never penalize
>    the whole world for the needs or desires or the few.

I don't think it should be implemented unless the cost is small or applied
only to those situations where it's wanted despite the cost.

However, I don't believe it would NECESSARILY be slower to execute -- yes,
there's more complexity.  The price for that complexity must be paid, but
it may be payable in memory by making a larger (but equally fast) NFA for
the regular expression.  Or maybe it can't.  I don't know, because I have
not investigated it.  It's premature to simply assume it MUST be slower.

> 2) Consider the simple case, /A|B/.  In your overall longest/shortest,
>    guaranteed worst-case time, both submatch A and submatch B must
>    be calculated, and then the lengths of their matches both be compared.
>    Perl, fortunately, does not do that.  Rather, the first one in that
>    sequence wins.  That means that under the current scheme, the 
>    patterns /A|B/ and /B|A/ have different semantics.  Under your 
>    worst-case scheme, they do not.  Because /A|B/ and /B|A/ mean
>    something different, more expressivity is provided.  This is the
>    same scenario, albeit expressed slightly differently, as your
>    situation.  The issues manifest in both are equivalent.

Then that is another semantic anomaly, because the alternation is supposed
to mean that one or the other pattern matches.  Logically, "or" should be
communitive.  If they're not quite, that's another disconnect between the
high-level model and the implementation.  Maybe /A|B/ and /B|A/ _do_ mean
different things to the engine, but they do NOT mean different things in
the high-level semantics -- except when you force the high-level semantics
to change by fiat, to match the implementation.

> 3) This leads to increased power.  It's like the difference between
>    a short-circuiting "or" and one that blindly plods ahead trying
>    to figure something out even when all is for naught.  Compare A&&B
>    with A&B, for example.  If A is 0, then B need not be computed, 
>    yet in the second version, one runs subexpression B nevertheless.
>    If according to the rules of one particular system, patX and
>    patY mean different things, whereas in a second system, they are
>    completely interchangeable, then the first system can express
>    nuances that the second one cannot.  When you have more nuances,
>    more expressivity, then you have more power, because you can say
>    things you could not otherwise say.  Why do C and its derivatives
>    such as Perl have short-circuiting Boolean operators?  Because
>    in older languages, such as Fortran and Pascal, where you did
>    not have them, one quickly found that this was cumbersome and
>    annoying.

It's a question of speed vs. correctness.  Correctness is important, but
occasionally a little incorrectness is worthwhile for increased speed.

Short-circuiting in C is a good example.  It's *almost* correct -- the
right boolean answer to the final expression WILL be returned, as will the
correct matching string for /A|B/ or /B|A/.  But it's not quite perfect.
In C, side-effects of executing the short-circuited expressions (such as
function calls) won't take effect.  This is a worthwhile tradeoff, since it
can save a LOT of execution time, and the programmer can work around it
when it matters.  But that semantic anomaly in C is well documented and
accepted, so people get by with it.

If the cost of "fixing" this non-greedy/leftmost semantic anomaly in Perl's
regular expression engine is prohibitive, it's certainly acceptable to say
it's a worthwhile tradeoff and document why it's the way it is.  But to
deny the validity of the alternative approach is a bit more extreme.

The current behavior is "correct enough".  It's not important to change it.
But, if we can fix it and make it completely correct, it's at least worth
considering -- but maybe not (in the end) worth actually implementing.

> 4) It is more intuitive to the reader and the writer to minimize
>    strange action at a distance.  It's more to remember; or, perhaps
>    better phrased, more to forget.  That's why we don't like 
>    variables set in one place magically affecting innocent code
>    elsewhere.  Maybe it's more applicable here to say that that's
>    why having mixed precedences and associativities confuses people.
>    If in an expression like A->B->C->D, you had to know a prior when
>    evaluating A that D was going to be coming up, it would require 
>    greater look-ahead, more mental storage.  Even if a computer could
>    do it, people would find it harder.  That's why we don't write
> 
>        &{&{$fnctbl{expr}}(arg1)}(arg2)
> 
>     when we can simply write
> 
>        $fnctbl{expr}->(arg1)->(arg2)
> 
>     It is not intuitive to people to have to do too much look-ahead, 
>     or too much storage.  Having distance items interact with one
>     another is confusing, and we've already got that situation with
>     backreferences, as in /(\w+)(\w+)\s+\2(\w+)/, which depending on 
>     how you start weighting those +'s into +?'s, can really move
>     matters around.  Let's not exacerbate the counterintuitiveness.

Ah, but the current behavior *is* strange action at a distance.  Instead of
looking at the pattern and the part of the string you expect it to match,
you have to consider at-a-distance factors such as the global priority of
leftmost matching as a strictly-overriding factor, or whether or not some
greedy subexpression earlier in the regular expression will eat the
characters or not.  I believe the current behavior is counterintuitive,
which is exactly why I suggested changing it.

> 5)  It is more Perlian because of the principle that things that look 
>     different should actually *be* different.  /A|B/ and /B|A/ look
>     quite different.  Thus, they should likewise *be* different.

Really?  So, the semantics of "if (!(expr)) {...}" should be different from
"unless (expr) {...}" just because they look different?  An optimizer would
have a hell of a time getting ANYTHING done if it weren't for the semantic
equivalence of different code that can accomplish the same thing.

If the high-level semantics are logically equivalent, constraining those
semantics for the sake of the implementation is unnecessarily complex.

> >I didn't need the long-winded explanation, and I don't need help with
> >understanding how that regexp matches what it does.  I understand it
> >perfectly well already.  I'm no neophyte with regular expressions, even if
> >Perl 5 does offer some regexp features I've never bothered to exploit...
> 
> All NFAs prior to POSIX behaved in the fashion that Perl's continue
> to behave in.  I am surprised that over the long course of your 
> experiences with regexes, that you never noticed this fundamental
> principle before.

Not all NFAs are used for regexps.  Moreover, it's already too low a level;
once you start viewing the regexp as a series of instructions for how to
move through the states of an NFA, you've already turned it into a program
and lost the semantics I'm trying to preserve from the original pattern.

And yes, I've used regular expressions heavily for MANY years now.  It's
not relevant to this discussion, except to say that I'm not ignorant of
the way regular expressions work.

> >My point is that the current behavior, while reasonable, isn't quite right.
> 
> You're wrong.  Don't call it "not right".  It's perfectly correct
> and consistent.  It follows directly from historical behavior of
> these things, and quite simply, it's in the rules.  It's preferable
> for the many reasons I outlined for you above.

Historical precedent does have bearing on whether to implement it, but not
on what the ideal semantics should be.  The codified rules are related to
the implementation, and again shouldn't have any bearing on the idealized
semantics.  As for being preferable, I found your arguments unconvincing;
on the contrary, I consider the current behavior counterintuitive, which is
the only incentive I see for changing it.

> You just want something different.  That does not make this wrong.
> If I want ice cream and you want hamburgers, this does not make me
> wrong.  However, for you to tell me I really want hamburgers when

I want the regular expressions to be intuitive and consistent across ALL
levels, not just at the procedural NFA implementation level.  What's wrong
with that desire?

> >design flaw in Perl 5's regular expressions.  I was hoping we could have a
> >rational debate about it, rather than propagating the same design flaw into
> >Perl 6 through inertia or for legacy reasons. 
> 
> It is not a design "flaw".  See above.

My mistake in terminology.  It's too loaded to call it a "flaw", which is
why I've taken to calling it an "anomaly", which is more value-neutral.

> >> You should probably read the whole chapter.
> 
> >That was uncalled for!  Just because I disagree with a particular detail of
> >the design doesn't mean that I didn't _understand_ it...
> 
> It was hardly uncalled for.  You did not prove that you understood
> it.  You have not yet done so, as you did not cover any of the
> ground which I did above.  Perhaps if I had said "Read the source
> code", that might have been uncalled for.  But I did not say that,
> and your retort was gratuitously combative.

Combative?  I don't believe so.  I simply objected to the assumption that
disagreeing with the design constitutes ignorance of it.

I could prove that I understand regexps, but what would be the point?

> You see, it just so happens that there's much more in that chapter,
> including the referenced Little Engine section, that when read as
> a whole, might well shed better understanding than mere extracts.
> You should also check out _Mastering Regular Expressions_ says about
> that.  I won't quote for you.  I'm tired of typing when you could
> be reading.

I've read the 2nd edition of the camel book, but I haven't seen the new
material in the 3rd edition yet.  (I've postponed buying it, to leave it as
a potential Xmas present.  I'll buy it immediately after Xmas if I don't
get it as a present.)  I glanced over the section in the bookstore.  Looks
like a good overview of the specific algorithm implemented, but it's not
really relevent to the question of idealized semantics.

> What is completely uncalled for, though, is to call something a bug
> just because you either don't understand it or because you don't
> like it.  Lamentably, such behavior is remarkably frequent.  You
> are hardly the first.

I apologize for seeming to call it a bug, and for calling it a "flaw" --
those sound too much like value judgements, and encourage people to
entrench their positions rather than contemplating alternatives.

Deven




nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About