develooper Front page | perl.perl6.language.regex | Postings from December 2000

Re: Perl 5's "non-greedy" matching can be TOO greedy!

From:
Deven T. Corzine
Date:
December 15, 2000 14:28
Subject:
Re: Perl 5's "non-greedy" matching can be TOO greedy!
Message ID:
Pine.LNX.4.10.10012151720520.31056-100000@escher.ties.org

On Fri, 15 Dec 2000, Tom Christiansen wrote:

> >At worst, this should take no more than double the amount of time that the
> >single pass did, probably less.  Hardly a cause to concern ourselves with
> >the heat death of the universe.
> 
> Oh really?  We have shown that for the kind of global overall
> analysis that you are asking for, that in the general case, all
> possible paths much be taken.  You cannot short-circuit, because
> you must first consider all possibilities and then weigh each valid
> result against each other valid result.
> 
> Consider something like /.*/ or /.*?/.  For a string a length N,
> there are
> 
>     (N+1) (N+2)
>     -----------
> 	 2
> 
> substrings that that matches.  That means that an 80-byte string
> has some 3321 possible substrings, all of which must be considered.
> 
> In the short-circuiting version, the Engine need consider but one
> single solitary case for each of those.  3321 is not the double of 1.
> 
> Consider now something like /(.*)(.*)/ or /(.*?)(.*?)/ or /(.*)(.*?)/
> or /(.*)(.*?)/.  You now have
> 
> 
> 		   2
>     ( (N+1) (N+2) )
>     ---------------
> 	   4
> 
> cases to consider, or, in the case of an 80-byte string, some
> 11,029,041 possible choices.  

Where does the combinatorial math have any relevance?  I'm not suggesting
checking every possibility.

> And with the current, normal, standard, short-circuiting system, 
> the Engine has to consider, hm... could it be just one possibility?
> And that's just with two wildcards.  People are often writing more
> than that.

You can still short-circuit.  I'm not suggesting examining any further into
the text to be searched than it already does.  For the second pass, you can
scan backwards for a rightmost match and short-circuit that, if you want.
(But that would leave some small chance of a shorter match in the middle.)

> Can you now see why this would be a problem?  And how even in the
> cases where it didn't actually break old programs (many of which it
> would!) that it would cause many many them to apparently hang, 
> racing for electron death?

It would be a problem if you actually had a combinatorial algorithm.

At worst, you're re-testing the regexp N times, where N is the length of
the matched string from the first pass, NOT the length of the original
string to be searched.

Oh, and where are these "many" old programs it would break?  I'd still love
to hear a real-world example that would actually break...

Deven




nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About