develooper Front page | perl.perl5.porters | Postings from July 2001

regexes: more end-of-string optimizations

Thread Next
Jeff 'japhy/Marillion' Pinyan
July 29, 2001 00:01
regexes: more end-of-string optimizations
Message ID:
In my crusade to beef the optimizations of the regex engine, I realized
where the "match from the back" idea is particularly useful:

  $_ = "a b c d e f g h i";

This is currently takes far too long to fail.  If it is written as


it fails immediately.  However, it also doesn't succeed like we had wanted
it to originally.

This leads me to my optimization proposal:  unless (??{ ... }) interferes,
any regex with a $ or \Z or \z end-of-string anchor should be able to fail
immediately upon inspection of the end of the string.

  end of regex		what Perl knows is at the end of the string
  ------------		-------------------------------------------
  [a-z]$		[a-z]\n?
  [a-z]+$		[a-z]\n?
  [0-9][a-z]*$		[0-9a-z]\n?
  ([a-z])_\1$           [a-z]\n?

Now, we can extrapolate this:


Perl knows a matching string must have a string of one or more lowercase
vowels at the end of the string, and preceding it must be a string of one
or more lowercase letters.

BUT!  Those two sets are not mutually exclusive.  If the engine just does
the match in reverse (due to it matching as it looks for failures) then
the [aeiou]+ might match too much.

  "faeiou" =~ /([a-z]+)([aeiou]+)$/;
  # $1 = 'faeio'
  # $2 = 'u'

If this were matched as the optimization executed, then we'd get $1 being
'f' and $2 being 'aeiou'.  This MIGHT NOT BE PREFERRED.  Perhaps, if this
were preferred, the user would have to supply the /r flag to the regex.

Let's assume this is not preferred, so it makes it hard on us (me).  Then
we need to do more than just ensure that the overlapped characters are
matched in the proper capturing set -- we need to ensure that the most
amount possible are.

Sure, we could turn the quantifier on [aeiou] to the non-greedy +? but
that is not scalable to large strings, nor to the problem in general:

  "rstaemniou" =~ /([b-df-hj-np-tv-z]+)([a-z]+)([aeiou]+)$/;
  # $1 = 'rst'
  # $2 = 'aemnio'
  # $3 = 'u'

The non-fix approach of reversing that sets and $1 to 'r', $2 to 'staemn',
and $3 to 'iou'.  Not good.  Now, if do the reversing and put the
non-greedy quantifiers on the last two +'s, we get bad values.  $1 ends up
as 'mn', $2 is 'io', and $3 is 'u'.  Now, if THIS is the preferred
behavior, ok.  But let's assume it's NOT.  Let's assume we want the SAME
matches, just to match them as we're optimizing this end-of-the-string

Argh.  It seems ugly, no?  It's not so ugly if we allow modification of
the regex as we perform this optimization.  (But maybe that in itself is
ugly... ;) ).

Let's reverse the string and regex to make this simpler to see.

  u  oinmea tsr
  $1   $2   $3


As it stands, we don't get the values shown above.  So how do we get
them?  We could employ some weird (I admit) logic.  When going from
subset to superset, nongreediness should work; when going from superset to
subset, a look-behind will suffice:


But that means lots of backtracking.  Blegh.  So we unroll the loop:


Nifty, eh?  And it can be automated (or at least, I can automate it).

I'm sorry if I'm boring the shit out of some of you with my continuous
regex laments/etc., but I'm getting deeper into this stuff, and (sure as
I've been told) the brain-sucking is starting to happen.

Thanks for your time.

Jeff "japhy" Pinyan
RPI Acacia brother #734
** Look for "Regular Expressions in Perl" published by Manning, in 2002 **

Thread Next Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About