develooper Front page | perl.perl6.language.regex | Postings from January 2001

Re: Exposing regexp engine & compiled regexp's

Damian Conway
January 8, 2001 16:01
Branden wrote:

   > I read your RFC 93. It mentions using a sub to read from the
   > string. I just think it uses the sub in two conflicting ways, one
   > for requesting more data from the stream and other for telling
   > there was a match.

It's really using the sub as an interface to whatever source of data it's
trying to match.

   > I thought, too, that requesting it to return
   > _exactly_ the number of characters that was requested goes against
   > most unix syscalls convention (like read...), where it's requested
   > to read at most that number of characters.

Err. The RFC never says the request is for exactly a certain number of
characters; just that the subroutine will be told how many characters
are *known* to be needed in order for the regex to continue matching.
The RFC specifically mentions the possibility of returning fewer
characters than requested.
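
To illustrate the "at most N characters" behaviour, here is a rough sketch
in present-day Perl of a data-source sub that is told how many characters
are wanted and may return fewer. The name make_source and the chunk-size
parameter are hypothetical and not part of the RFC; the sketch only
exercises the callback itself, not the proposed sub =~ /pat/ syntax.

```perl
use strict;
use warnings;

# Hypothetical data source: hands out characters from an in-memory
# buffer, never more than $chunk characters per call.
sub make_source {
    my ($data, $chunk) = @_;
    my $pos = 0;
    return sub {
        my ($wanted) = @_;
        # Return at most $wanted characters, possibly fewer.
        my $n   = $wanted < $chunk ? $wanted : $chunk;
        my $got = substr($data, $pos, $n);
        $pos += length $got;
        return $got;    # empty string once the data is exhausted
    };
}

my $source = make_source("abcdefg", 3);

# Ask for 5 characters each time; the source answers with 3 or fewer.
my @reads = map { $source->(5) } 1 .. 4;
print join("|", @reads), "\n";
```

A caller that wants more simply asks again, much as one would loop on
read().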

   > What I think is that it could be handled by a OO module. Suppose
   > there's how to hook into the regexp engine guts, getting responses
   > as the ones you mentioned above. One could write a OO module, with
   > methods for reading more data, checking end of data, and
   > acknowledging a failed or succeeded match. Then, it could overload
   > the =~ operator, making the regexp engine call the module's methods
   > instead of its own's.
   > Then, what you proposed in RFC 93 through 
   >     sub { ... } =~ m/.../; 
   > could be handled by 
   >     my $mymatch = MyClassForMatchingFromFileHandles->new($myhandle); 
   >     $mymatch =~ m/.../; 

This is an interesting alternative. The main problem is that matching
against a blessed object already has a useful meaning in Perl: stringify
the object (calling its overloaded stringification operator if possible)
and match against the resulting string.
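
For anyone who hasn't seen that behaviour, a small sketch follows. The
Greeting class is made up purely for illustration; the overload pragma
and its '""' key are standard Perl.

```perl
use strict;
use warnings;

package Greeting;

# Overload stringification: the object turns into "hello <name>".
use overload '""' => sub { "hello " . $_[0]{name} };

sub new {
    my ($class, $name) = @_;
    return bless { name => $name }, $class;
}

package main;

my $obj = Greeting->new("world");

# The object is stringified first, then the pattern is applied to the
# resulting string.
my $matched = ($obj =~ /^hello (\w+)$/) ? $1 : undef;
print defined $matched ? "matched: $matched\n" : "no match\n";
```

Any new meaning given to matching against an object would have to
coexist with this one.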

My other problem with this approach is that it's relatively heavy. Let's
take the example in the RFC and implement it both ways:

	# As the RFC proposes:

		sub from_STDIN {
			$_[1] ? $fh->pushback($_[0]) : $fh->getn($_[0])
		}

		\&from_STDIN =~ /pat/;

	# As Branden proposes:

		package From_STDIN;

		sub new       { bless $_[1], $_[0] }

		sub MORE_DATA { $_[0]->getn($_[1]) }
		sub ON_FAIL   { $_[0]->pushback($_[1]) }

		use overload "=~" => 1;

		package main;

		From_STDIN->new($fh) =~ /pat/;

Hmmmm. Potentially more flexible, but also much more ponderous.

   > BTW, if you have a C++-based regexp engine with a clean design,
   > couldn't we use it as a base to a new regexp engine that supports
   > current (or new) perl's regexp syntax and features and has its guts
   > exposed?

It was very basic, and *very* slow. It was also DFA-based and hence
unable to implement full Perl regex semantics.
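
Backreferences are one example of what a pure DFA cannot do. This is a
small illustration of the general point, not anything taken from the old
C++ engine:

```perl
use strict;
use warnings;

# \1 must repeat exactly what (a+) captured, so the pattern accepts
# only strings of the form a^n b a^n. That language is not regular,
# so no finite automaton can recognize it.
my $re = qr/^(a+)b\1$/;

my @inputs  = qw(aba aabaa aabaaa aaba);
my @results = map { $_ =~ $re ? 1 : 0 } @inputs;

print "$inputs[$_]: $results[$_]\n" for 0 .. $#inputs;
```

Only the first two inputs match, because the run of a's after the b has
to be the same length as the run before it.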

Furthermore, the regex engine is (and should be) one of the most heavily
optimized parts of Perl: probably not the place for clean, modular design. :-)

However, for what it's worth, I have no objection to making the code
available for everyone's amusement. Bear in mind that this was written
by a much earlier version of me (about 0.27), way back in the last
millennium, before C++ was standardized and before there was an STL.
Surprisingly, it still compiles and runs under g++ 2.8.1.

Grab it from:

