Front page | perl.perl6.language.regex |
Postings from December 2000
Re: Perl 5's "non-greedy" matching can be TOO greedy!
From: Tom Christiansen
December 14, 2000 15:36
Re: Perl 5's "non-greedy" matching can be TOO greedy!
Message ID: 26316.976836998@chthon
>No question that's how it's been implemented. But WHY would anyone want
>such behavior? When is it beneficial?
It is beneficial because this is how it's always been, because it
is faster, because it is more expressive, because it is more powerful,
because it is more intuitive, and because it is more perlian.
0) All NFAs before POSIX acted this way. It is historically
consistent and perfectly expected.
1) It is obviously faster to come to an answer earlier on in the
execution than it would be to come to an answer later. It's
like an expression whose evaluation short-circuits. Also, when
the matching sematics permit back tracking and back references,
the combinatoric possibilities can easily explode into virtual
unsolvability as the 2**N algorithm loses its race to the heat
death of the universe. Yes, if Perl did overall-longest or
overall-shorted, this would produce a more predictable time;
however, as we see with DFAs and POSIX NFAs, this prediction
plays out as guaranteed *WORST-CASE* time. It is not acceptable
to make everyone pay the worst-case time. Never penalize
the whole world for the needs or desires or the few.
2) Consider the simple case, /A|B/. In your overall longest/shortest,
guaranteed worst-case time, both submatch A and submatch B must
be calculated, and then the lengths of their matches both be compared.
Perl, fortunately, does not do that. Rather, the first one in that
sequence wins. That means that under the current scheme, the
patterns /A|B/ and /B|A/ have different semantics. Under your
worst-case scheme, they do not. Because /A|B/ and /B|A/ mean
something different, more expressivity is provided. This is the
same scenario, albeit expressed slightly differently, as your
situation. The issues manifest in both are equivalent.
3) This leads to increased power. It's like the difference between
a short-circuiting "or" and one that blindly plods ahead trying
to figure something out even when all is for naught. Compare A&&B
with A&B, for example. If A is 0, then B need not be computed,
yet in the second version, one runs subexpression B nevertheless.
If according to the rules of one particular system, patX and
patY mean different things, whereas in a second system, they are
completely interchangeable, then the first system can express
nuances that the second one cannot. When you have more nuances,
more expressivity, then you have more power, because you can say
things you could not otherwise say. Why do C and its derivatives
such as Perl have short-circuiting Boolean operators? Because
in older languages, such as Fortran and Pascal, where you did
not have them, one quickly found that this was cumbersome and
4) It is more intuitive to the reader and the writer to minimize
strange action at a distance. It's more to remember; or, perhaps
better phrased, more to forget. That's why we don't like
variables set in one place magically affecting innocent code
elsewhere. Maybe it's more applicable here to say that that's
why having mixed precedences and associativities confuses people.
If in an expression like A->B->C->D, you had to know a prior when
evaluating A that D was going to be coming up, it would require
greater look-ahead, more mental storage. Even if a computer could
do it, people would find it harder. That's why we don't write
when we can simply write
It is not intuitive to people to have to do too much look-ahead,
or too much storage. Having distance items interact with one
another is confusing, and we've already got that situation with
backreferences, as in /(\w+)(\w+)\s+\2(\w+)/, which depending on
how you start weighting those +'s into +?'s, can really move
matters around. Let's not exacerbate the counterintuitiveness.
5) It is more Perlian because of the principle that things that look
different should actually *be* different. /A|B/ and /B|A/ look
quite different. Thus, they should likewise *be* different.
>I didn't need the long-winded explanation, and I don't need help with
>understanding how that regexp matches what it does. I understand it
>perfectly well already. I'm no neophyte with regular expressions, even if
>Perl 5 does offer some regexp features I've never bothered to exploit...
All NFAs prior to POSIX behaved in the fashion that Perl's continue
to behave in. I am surprised that over the long course of your
experiences with regexes, that you never noticed this fundamental
>My point is that the current behavior, while reasonable, isn't quite right.
You're wrong. Don't call it "not right". It's perfectly correct
and consistent. It follows directly from historical behavior of
these things, and quite simply, it's in the rules. It's preferable
for the many reasons I outlined for you above.
You just want something different. That does not make this wrong.
If I want ice cream and you want hamburgers, this does not make me
wrong. However, for you to tell me I really want hamburgers when
>design flaw in Perl 5's regular expressions. I was hoping we could have a
>rational debate about it, rather than propagating the same design flaw into
>Perl 6 through inertia or for legacy reasons.
It is not a design "flaw". See above.
>> You should probably read the whole chapter.
>That was uncalled for! Just because I disagree with a particular detail of
>the design doesn't mean that I didn't _understand_ it...
It was hardly uncalled for. You did not prove that you understood
it. You have not yet done so, as you did not cover any of the
ground which I did above. Perhaps if I had said "Read the source
code", that might have been uncalled for. But I did not say that,
and your retort was gratuitously combative.
You see, it just so happens that there's much more in that chapter,
including the referenced Little Engine section, that when read as
a whole, might well shed better understanding than mere extracts.
You should also check out _Mastering Regular Expressions_ says about
that. I won't quote for you. I'm tired of typing when you could
What is completely uncalled for, though, is to call something a bug
just because you either don't understand it or because you don't
like it. Lamentably, such behavior is remarkably frequent. You
are hardly the first.