develooper Front page | perl.perl6.language.regex | Postings from December 2000

Re: Perl 5's "non-greedy" matching can be TOO greedy!

Thread Previous | Thread Next
From:
Tom Christiansen
Date:
December 14, 2000 15:36
Subject:
Re: Perl 5's "non-greedy" matching can be TOO greedy!
Message ID:
26316.976836998@chthon
>No question that's how it's been implemented.  But WHY would anyone want
>such behavior?  When is it beneficial?

It is beneficial because this is how it's always been, because it
is faster, because it is more expressive, because it is more powerful,
because it is more intuitive, and because it is more perlian.

In elaboration:

0) All NFAs before POSIX acted this way.  It is historically 
   consistent and perfectly expected.

1) It is obviously faster to come to an answer earlier on in the
   execution than it would be to come to an answer later.  It's
   like an expression whose evaluation short-circuits.  Also, when
   the matching sematics permit back tracking and back references,
   the combinatoric possibilities can easily explode into virtual
   unsolvability as the 2**N algorithm loses its race to the heat
   death of the universe.  Yes, if Perl did overall-longest or
   overall-shorted, this would produce a more predictable time;
   however, as we see with DFAs and POSIX NFAs, this prediction
   plays out as guaranteed *WORST-CASE* time.  It is not acceptable
   to make everyone pay the worst-case time.  Never penalize
   the whole world for the needs or desires or the few.

2) Consider the simple case, /A|B/.  In your overall longest/shortest,
   guaranteed worst-case time, both submatch A and submatch B must
   be calculated, and then the lengths of their matches both be compared.
   Perl, fortunately, does not do that.  Rather, the first one in that
   sequence wins.  That means that under the current scheme, the 
   patterns /A|B/ and /B|A/ have different semantics.  Under your 
   worst-case scheme, they do not.  Because /A|B/ and /B|A/ mean
   something different, more expressivity is provided.  This is the
   same scenario, albeit expressed slightly differently, as your
   situation.  The issues manifest in both are equivalent.

3) This leads to increased power.  It's like the difference between
   a short-circuiting "or" and one that blindly plods ahead trying
   to figure something out even when all is for naught.  Compare A&&B
   with A&B, for example.  If A is 0, then B need not be computed, 
   yet in the second version, one runs subexpression B nevertheless.
   If according to the rules of one particular system, patX and
   patY mean different things, whereas in a second system, they are
   completely interchangeable, then the first system can express
   nuances that the second one cannot.  When you have more nuances,
   more expressivity, then you have more power, because you can say
   things you could not otherwise say.  Why do C and its derivatives
   such as Perl have short-circuiting Boolean operators?  Because
   in older languages, such as Fortran and Pascal, where you did
   not have them, one quickly found that this was cumbersome and
   annoying.

4) It is more intuitive to the reader and the writer to minimize
   strange action at a distance.  It's more to remember; or, perhaps
   better phrased, more to forget.  That's why we don't like 
   variables set in one place magically affecting innocent code
   elsewhere.  Maybe it's more applicable here to say that that's
   why having mixed precedences and associativities confuses people.
   If in an expression like A->B->C->D, you had to know a prior when
   evaluating A that D was going to be coming up, it would require 
   greater look-ahead, more mental storage.  Even if a computer could
   do it, people would find it harder.  That's why we don't write

       &{&{$fnctbl{expr}}(arg1)}(arg2)

    when we can simply write

       $fnctbl{expr}->(arg1)->(arg2)

    It is not intuitive to people to have to do too much look-ahead, 
    or too much storage.  Having distance items interact with one
    another is confusing, and we've already got that situation with
    backreferences, as in /(\w+)(\w+)\s+\2(\w+)/, which depending on 
    how you start weighting those +'s into +?'s, can really move
    matters around.  Let's not exacerbate the counterintuitiveness.

5)  It is more Perlian because of the principle that things that look 
    different should actually *be* different.  /A|B/ and /B|A/ look
    quite different.  Thus, they should likewise *be* different.

>I didn't need the long-winded explanation, and I don't need help with
>understanding how that regexp matches what it does.  I understand it
>perfectly well already.  I'm no neophyte with regular expressions, even if
>Perl 5 does offer some regexp features I've never bothered to exploit...

All NFAs prior to POSIX behaved in the fashion that Perl's continue
to behave in.  I am surprised that over the long course of your 
experiences with regexes, that you never noticed this fundamental
principle before.

>My point is that the current behavior, while reasonable, isn't quite right.

You're wrong.  Don't call it "not right".  It's perfectly correct
and consistent.  It follows directly from historical behavior of
these things, and quite simply, it's in the rules.  It's preferable
for the many reasons I outlined for you above.

You just want something different.  That does not make this wrong.
If I want ice cream and you want hamburgers, this does not make me
wrong.  However, for you to tell me I really want hamburgers when

>design flaw in Perl 5's regular expressions.  I was hoping we could have a
>rational debate about it, rather than propagating the same design flaw into
>Perl 6 through inertia or for legacy reasons. 

It is not a design "flaw".  See above.

>> You should probably read the whole chapter.

>That was uncalled for!  Just because I disagree with a particular detail of
>the design doesn't mean that I didn't _understand_ it...

It was hardly uncalled for.  You did not prove that you understood
it.  You have not yet done so, as you did not cover any of the
ground which I did above.  Perhaps if I had said "Read the source
code", that might have been uncalled for.  But I did not say that,
and your retort was gratuitously combative.

You see, it just so happens that there's much more in that chapter,
including the referenced Little Engine section, that when read as
a whole, might well shed better understanding than mere extracts.
You should also check out _Mastering Regular Expressions_ says about
that.  I won't quote for you.  I'm tired of typing when you could
be reading.

What is completely uncalled for, though, is to call something a bug
just because you either don't understand it or because you don't
like it.  Lamentably, such behavior is remarkably frequent.  You
are hardly the first.

--tom

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About