develooper Front page | perl.perl5.porters | Postings from May 2003

Regexp::Optimizer 0.01 released

From:
Dan Kogai
Date:
May 31, 2003 04:25
Subject:
Regexp::Optimizer 0.01 released
Message ID:
87FE3FF4-935A-11D7-9E89-000393AE4244@dan.co.jp
Porters,

perltodo says
> Factoring out common suffices/prefices in regexps (trie optimization)
>
>        Currently, the user has to optimize "foo|far" and "foo|goo" into
>        "f(?:oo|ar)" and "[fg]oo" by hand; this could be done 
> automatically.

I have written a module that attempts to do this as Regexp::Optimizer 
as follows

http://www.dan.co.jp/~dankogai/Regexp-Optimizer-0.01.tar.gz

Also uploaded to CPAN.

It comes with two modules.  Regexp::Optimizer is a frontend which 
actually takes regular expression and optimize it (even with nested 
group).  Which is implemented as a inherited class of Regex::List which 
offers list2re() that converts list of words into a trie-optimized 
regular expression.

Though Regexp::Optimizer is still rather rudimentary, Regex::List is 
pretty robust.  I will attach pod2text'ed Regexp::List right after my 
signature.

* Unlike Regex::Presuf, it applies prefix aggregation only to leaf 
nodes of the trie so it is much faster -- fast enough to turn every 
word in /usr/share/dict/words into an optimized regexp (2 minutes on my 
FreeBSD box).

* Another difference from Regex::Presuf is that the order is 
"respected".  Order is in fact preserved unless there is no "?" and 
character class aggregation.  I made it that way because I needed to 
use it for a list of regular expressions, not just words.  When it 
comes to group alterations, order does matter ( qr/fooo|fo/ is 
DIFFERENT from qr/fo|fooo/).

Comments very welcome.

Enjoy!

Dan the Perl5 Porter
====================

NAME
     Regexp::List - builds regular expressions out of a list of words

SYNOPSIS
       use Regexp::List;
       my $l  = Regexp::List->new;
       my $re = $l->list2re(qw/foobar fooxar foozap/);
       # $re is now qr/foo(?:[bx]ar|zap)/

ABSTRACT
     This module offers "list2re" method that turns a list of words into 
an
     optimized regular expression which matches all words therein.

     The optimized regular expression is much more efficient than a
     simple-minded '|'-concatenation thereof.

DESCRIPTION
     This module use Object-Oriented approach so you can use this module 
as a
     base and tweak its features. This module is a base class of
     Regexp::Optimizer.

   EXPORT
     Since this is an OO module there is no symbol exported.

METHODS
     This module offers methods below;

     $l = Regexp::List->new(*key=>value, ...*)
         Constructor. When arguments are fed in *key => value,* manner, 
it
         sets attributes. See "$l->set" for details

     $re = $l->list2re(*list of words ...*)
         Does the job. Takes a list of words and turn it into an optimal
         regular expresson. See "IMPLEMENTATION" to find out how it is
         achieved. If you want to know the underlying black magic even
         further, see the source.

     $l->set(*key=>value, ...*)
         Sets attributes. There are many attributes supported but let me
         mention just a few that you may be interested.

         lookahead
             Whether you prepend a lookahead assertion or not. Default 
value
             is 1. This module is smart enough to omit the assertion 
when you
             don't need one.

               $re = $l->list2re(qw/1 2 3 infinity/);
               # qr/(?=[123i])(?:[123]|infinity)/
               $re = $l->set(lookahead=>0)->list2re(qw/1 2 3 infinity/);
               # qr/(?:[123]|infinity)/

         quotemeta
             Whether you quote metacharacters or not. Default is 1. If 
you
             really need this feature try Regexp::Optimizer instead.

               @list = qw/3 3.14 3.14159265358979/;
               $re = $l->list2re(@list);
               # qr/3(?:\.14(?:159265358979)?)?)/
               $re = $l->set(lookahead=>0)->list2re(@list);
               # qr/3(?:.14(?:159265358979)?)?)/
               # which does match 3.14 but also "11+3=14"

         modifies
             Currently it accepts 'i', 'm', 's', and 'x', the same as 
regular
             expression modifiers.

               @list = qw/Perl perl BASIC basic/;
               $re = $l->list2re(@list);
               # qr/(?=[BPbp])(?:[Pp]erl|BASIC|basic)/
               $re = $l->set(modifiers => 'i')->list2re(@list);
               # qr/(?=[bp])(?:perl|basic)/i
               print $l->set(modifiers => 'x')->list2re(@list);
               # Try for yourself;  Isn't it cute ?

     $l->expand($re);
         Utility methods to expand a regular expression. Handy when you 
want
         to check the complex regexes.

     $l->unexpand($re);
         Utility methods to unexpand a regular expression.

IMPLEMENTATION
     This module optimizes the regular expression as follows. Let's see 
what
     happens when qw/foobar fooxar foozap fooza/ is fed

     trie building (common prefix aggregation)
         first the corresponding *trie* structure is built

                +- bar
           foo -+- xar
                +- za -+- p
                       +- ''

     common suffix aggregation
         it checks if any leaf node can be optimized for common suffix. 
In
         this case we can do so to "bar" and "xar".

                +- b -+-ar
           foo -+- x -+
                +- za -+- p
                       +- ''

     character class conversion
         If a branch contains more than two single characters, it turns 
it
         into a character class.

           foo -+- [bx] --- ar
                +- za -+-p
                       +- ''

     empty leaf to "?"
         Empty leaf is converted to a '?' quantifier

           foo -+- [bx] --- ar
                +- za -+-p?

     join all leafs into a group
         And the final result is reached.

           foo(?:[bx]ar|zap?)

BENCHMARKS
     This module is faily robust. You can practically use this module to 
find
     a regular expression that matches all words in a dictionary. Here 
is a
     result by on perl 5.8.0, FreeBSD 4-Stable, Pentium III 800 Mhz with 
512
     MB RAM.

      # Sat May 31 09:11:06 2003 (     0.000000 s) Reading 
/usr/share/dict/words
      # Sat May 31 09:11:07 2003 (     0.847797 s) 235881 lines read.
      # Sat May 31 09:11:07 2003 (     0.000000 s) Making regexp.
      # Sat May 31 09:13:09 2003 (   121.596928 s) Done.
      # Sat May 31 09:13:09 2003 (     0.000000 s) Saving to t/words.rx
      # Sat May 31 09:13:09 2003 (     0.000000 s) Reading t/words.rx
      # Sat May 31 09:13:13 2003 (     3.679176 s) Done.
      # Sat May 31 09:13:13 2003 (     0.000000 s) Opening 
/usr/share/dict/words for comparison.
      # Sat May 31 09:13:13 2003 (     0.255222 s) 
/usr/share/dict/words:235881 lines found.
      # Sat May 31 09:13:13 2003 (     0.000000 s) Showtime!
      # 235881/235881
      # Sat May 31 10:44:17 2003 (  5464.370409 s) Done.
      # Sat May 31 10:44:17 2003 (  5464.370624 s) 43.167 matches/s

     The result of optimization is obvious as the number of alteration
     increases. Here is a result of a benchmark which matches randomly 
picked
     words against "/usr/share/dict/words".

       ====        2 words
               Rate naive optim
       naive 1.79/s    --  -28%
       optim 2.49/s   39%    --

       ====      256 words
             s/iter naive optim
       naive   31.7    --  -81%
       optim   5.95  433%    --

SEE ALSO
     Regexp::Optimizer -- uses this module as its base

     "eg/" directory in this package contains example scripts.

     Perl standard documents
          L<perltodo>, L<perlre>

     CPAN Modules
         Regexp::Presuf, Text::Trie

     Books
         Mastering Regular Expressions
         <http://www.oreilly.com/catalog/regex2/>

AUTHOR
     Dan Kogai <dankogai@dan.co.jp>

COPYRIGHT AND LICENSE
     Copyright 2003 by Dan Kogai, All Rights Reserved.

     This library is free software; you can redistribute it and/or 
modify it
     under the same terms as Perl itself.

dankogai@dan-attic[4705]:~/work/Regexp-Optimizer> 




nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About