develooper Front page | perl.perl5.porters | Postings from September 2003

Re: optimizing /\s*,\s*/

Thread Previous | Thread Next
From:
david nicol
Date:
September 8, 2003 17:54
Subject:
Re: optimizing /\s*,\s*/
Message ID:
1063068819.1375.61.camel@plaza.davidnicol.com

I know the regex engine only from outside it.  But it seems
to me that this optimization would be a good thing and I am surprised
that we did not already have it.  The general case of the optimization
is use of the star -- zero-or-more -- counting modifier.  

In case of zero-or-more, we want to skip the current test and look ahead
to the things we are supposed to have at least one of, to guard against
failure.

In cases where the starred class includes the required piece, we do
not want to apply the optimization because it would break greediness.

	# LOOKAHEAD BREAKS GREEDINESS
	/(\w*)ou(\w*)/; # match words with "ou" in them

	# NON-GREEDY LOOKAHEAD DOES NOT BREAKS GREEDINESS
	/(\w*?)ou(\w*)/; # match words with "ou" in them 




So here's what I interpret the wish as being for:

 When the regex engine sees a star counting-modifier, it should
 check to see if the star is followed by something that we are going to
 have to see once.  When it is, the first thing the finished regex
 should do is to look for the required piece -- the EXACT from
 the debug expansion -- and only then, back up and see if what we just
 skipped to get to the EXACT matches the star or not.

This sequence rewriting would affect all regexen where either (1)
something starred is followed by something exact and the exact
thing is not matched by the starred thing or (2) something starred
and non-greedied is followed by something exact.

In such a case, we start by looking for the exact thing, and if we
find it, then we go backwards and check what we skipped to get there.

which would mean that
               '  b   ,  ' =~ /\s*,\s*/;
               #123456789

would start by looking for the comma, taking it to position 7,
and then start backtracking to see how many \s can be found,
and would then know that the match begins at position 4.



What is being proposed is something heavier than what regex
dilletantes expect from the engine, but lighter than what the
engine apparently does now.  What I expected from the
engine, which it apparently does not do, is to do in-order-matching
but remember the end points of star-sequences and skip them as
whole units.  I expected the engine to match positions one and
two against the first whitepsace gobbler, but then on finding a non-comma,
proceed directly to position 4 instead of trying again from position 2.

That might be a simple patch, as it would entail a simple change in
where the current match-begin pointer goes, in a simple case of cleaning
up after a failed star-match.  The problem with this approach is that
it would break multiple inconsistent star-matches? I am not sure
and I am really thinking aloud (or on-list) here.  

Redefining the wishlist entry as a bug, and the bug being that
"the regex engine retries matches from within stretches that have
matched in their continuity a repeating sub-expression instead of
skipping directly to the end of the matched portion and retrying there"
does that make sense?  Is that a fair paraphrase?

Can anyone come up with some data and a regex that would make retrying
at the end of star-expressions give an incorrect result?







Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About