develooper Front page | perl.perl5.porters | Postings from August 2002

Re: Ideas for 5.10

Thread Previous | Thread Next
From:
Nicholas Clark
Date:
August 16, 2002 14:49
Subject:
Re: Ideas for 5.10
Message ID:
20020816204733.GE460@Bagpuss.unfortu.net
On Fri, Aug 09, 2002 at 10:57:14PM -0400, Benjamin Goldberg wrote:
> Nicholas Clark wrote:

> > It's probably not unreasonable to limit $/ to non-greedy regexps,
> > because most of the time they're more likely to do what you want.
> > (fast).
> 
> It's a bit difficult to identify a regex as being greedy or non-greedy;
> especially, consider when parts of it are greedy, but are limited by
> something, eg: qr/<[^>]*>/ is not really greedy, even though it might
> look like it is.  Or: qr/BEGIN(?:(?!END).)*END/, which also isn't
> greedy.
> 
> It would be easier to simply tell users, "don't use greedy regexen." 
> Or, "If you *must* do that, stick a \u at the beginning."

I think we may be able to get away with working out which regexps can
be matched  correctly without having to change the regexp engine or read
the entire file into memory with what I hope is the majority of regexps.

[explanation paragraph, for those not following the thread closely]

I believe the only danger comes from regexps that can match both a long
string within the file and also match a shorter substring. Because they
are greedy, by the rules of perl regexp matching the longer match should
be found first and returned. The problem with $/ comes if only the first
part of the long string that should match is in the PerlIO buffer passed
to the regexp engine, so the match on the long string fails. Normally with
this $/ as regexp idea this wouldn't be a problem, as a failed match would
simply cause more file to be read into the buffer and the match re-attempted.
However, if the greedy regexp can also match the shorter string which *is*
fully in the buffer then the regexp engine will return that shorter match,
which isn't correct for this IO case.

I believe that most regexps that $/ is set to won't use any characters that
could match "\n", except possibly at the end of the regexp. The main class
of regexp that won't are negated character ranges such as your [^>]*> above.
(For these we may want a regexp flag saying "non-greedy - honest" as a hint)
For regexps such as /^__[A-Z]+__$/ that we know can't match a newline in the
middle, we simply look to see if there is a newline in the PerlIO buffer
between the file pointer and the end of buffer. If there is, we call the
regexp engine, saying that the end of where it can match to is the last
newline we know of in the PerlIO buffer.
If there isn't a newline in the PerlIO buffer (or the regexp engine fails to
find a match) we read more data from disk (or whatever) until we find a
newline. (I'm envisaging reading a disk block, then scanning for the last
newline, rather than some sort of fgets())

For regexps that we can't figure out whether they are greedy and could
backtrack we fall back to reading until EOF and presenting the lot to the
regexp engine. And a flag the knowledgeable could add to their regexp saying
"I promise this regexp can't backtrack in a greedy way" or whatever is most
useful to the PerlIO/regexp system would be useful.

Nicholas Clark
-- 
Even better than the real thing:	http://nms-cgi.sourceforge.net/

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About