Front page | perl.perl5.porters |
Postings from September 2000
sexeger
Thread Next
From:
Jeff Pinyan
Date:
September 14, 2000 23:20
Subject:
sexeger
Message ID:
Pine.GSO.4.21.0009150054430.12100-100000@crusoe.crusoe.net
No, this isn't a cleverly disguised porn spam, but rather a mini essay on
reversed regular expressions.
I heard on the P5P list (I'm quite sure) something about a regex modifier
/r, that would start matching at the END of a string. Now, I'm not about
to say I know how to implement this on the source code side, but it would
be a matter of reversing the regex nodes (and reversing some of their
meanings). in the simplest cases. In more complex cases, as shown next,
entire sections can be optimized out:
#!/usr/bin/perl
use Benchmark 'timethese';
$rsmall = reverse ($small = 'aaaa' x 100);
$rbig = reverse ($big = 'aaaa' x 1_000);
timethese(-3, {
'EOS small' => q{ $small =~ /aaaa(?!.*aaaa)/ },
'EOS big' => q{ $big =~ /aaaa(?!.*aaaa)/ },
'BOS small' => q{ (reverse $small) =~ /aaaa/ },
'BOS big' => q{ (reverse $big) =~ /aaaa/ },
'cached BOS small' => q{ $rsmall =~ /aaaa/ },
'cached BOS big' => q{ $rbig =~ /aaaa/ },
});
__END__
Benchmark: running
BOS big, BOS small, EOS big, EOS small,
cached BOS big, cached BOS small,
each for at least 3 CPU seconds...
BOS big: 3 wallclock secs
( 3.00 usr + 0.00 sys = 3.00 CPU) @ 1095.67/s (n=3287)
BOS small: 4 wallclock secs
( 3.02 usr + 0.00 sys = 3.02 CPU) @ 12560.93/s (n=37934)
EOS big: 4 wallclock secs
( 4.69 usr + 0.00 sys = 4.69 CPU) @ 1.28/s (n=6)
EOS small: 4 wallclock secs
( 3.01 usr + 0.00 sys = 3.01 CPU) @ 89.70/s (n=270)
cached BOS big: 3 wallclock secs
( 3.05 usr + 0.01 sys = 3.06 CPU) @ 194393.79/s (n=594845)
cached BOS small: 5 wallclock secs
( 3.04 usr + 0.00 sys = 3.04 CPU) @ 166147.04/s (n=505087)
In this simple example, we're matching the LAST occurrence of 'aaaa' in a
string. In the standard approach, we'd use something like:
/aaaa(?!.*aaaa)/
This could be optimized to
/aaaa(?:[^a]+a{1,3})*$/
making the results:
Benchmark: running
EOS big, EOS small,
each for at least 3 CPU seconds...
EOS big: 4 wallclock secs
( 3.50 usr + -0.01 sys = 3.49 CPU) @ 20.34/s (n=71)
EOS small: 3 wallclock secs
( 3.29 usr + 0.00 sys = 3.29 CPU) @ 207.90/s (n=684)
But in the case of matching a string from the end, we can simply match
/aaaa/
because the LAST occurrence of 'aaaa' in a string is the FIRST occurrence
in the reverse of the string. This is ok for string literals; and if we
use rindex() instead, we get good speeds, but not as fast the cached
beginning-of-string method:
rindex big: 3 wallclock secs
( 2.99 usr + 0.01 sys = 3.00 CPU) @ 124290.00/s (n=372870)
rindex small: 3 wallclock secs
( 3.67 usr + 0.00 sys = 3.67 CPU) @ 127551.23/s (n=468113)
So what do you do for complex regular expressions? Let's take the simple
grammar:
$comment = qr{ ![^!]*! }x;
$string = qr{ <[^\\<>](?:\\.[^\\<>]*)*> }x;
$other = qr{ [^<>!]+ }x;
A language written as such could be matched by the following regex:
$CODE =~ /\A(?: $comment | $string | $other )+\z/xo;
To match the LAST comment in the code, we'd have to do
$CODE =~ /($comment)(?: | $string | $other )*\z/xo;
Here's a sample block of code:
========================================================================
! this is a sample program ! int x = 10;
! what a silly grammar ! str y = <cool " beans>;
if (len(y) GREATER_THAN x) { ! can't use < and >... heehee !
! empty comment coming up !
print y, < is longer than >, x, < characters>;
!!
! and more stuff to come soon !
chop(y,x);
print <I sliced 'y' down to >, x, < characters for you>;
}
========================================================================
So the last comment is "! and more stuff to come soon !". (I tried this
code, and it was taking a very long time to match. Is there a fallacy in
my regex somewhere?)
Let's use the reversal tactic.
$REVcomment = qr{ ![^!]*! }x;
$REVstring = qr{ > (?: [^<>\\]* .\\ )* [^<>\\]* < }x;
$REVother = qr{ [^<>!]* }x;
reverse $CODE =~ /\A (?: $REVstring | $REVother )* ($REVcomment) /xo;
# you have to reverse $1, of course
When I ran my code through a Benchmark...
Benchmark: running
back+cache, backwards,
each for at least 5 CPU seconds...
back+cache: 15 wallclock secs
( 5.36 usr + 0.00 sys = 5.36 CPU) @ 3183.77/s (n=17065)
backwards: 10 wallclock secs
( 5.02 usr + 0.00 sys = 5.02 CPU) @ 2487.25/s (n=12486)
As you can see, the caching has a slight speed increase (of course this is
expected -- it takes time to reverse a string). As I said, the forward
approach was taking FAR too long for me to sit and wait for an answer.
SO.
Where is this headed? I'm not 100% sure. It's not a totally simple task
to reverse the nodes of a regular expression. Especially if you run into
the trap that (?=...) can match variable-width, while (?<=...) cannot:
/(?=A(?:_\d)+Z)(...)/
reversed would be
/(?<=Z(?:\d_)+)(\d_A)/
which isn't allowed currently. The regex would have to be crafted in a
different fashion (which might be incompatible with complex regexes):
/Z(?:\d_)+(\d_A)/
A problem arises.
So is regex reversal possible? Not for all cases. But it's definitely
something to look into, especially if look-behind can be modified to allow
variable-width patterns.
--
Jeff "japhy" Pinyan japhy@pobox.com http://www.pobox.com/~japhy/
PerlMonth - An Online Perl Magazine http://www.perlmonth.com/
The Perl Archive - Articles, Forums, etc. http://www.perlarchive.com/
CPAN - #1 Perl Resource (my id: PINYAN) http://search.cpan.org/
Thread Next