perl.perl5.porters

RegexParser-0.02 available (fwd)

Jeff Pinyan
October 30, 2000 13:22
---------- Forwarded message ----------
From: Jeff Pinyan <>
Subject: RegexParser-0.02 available

RegexParser, a Perl regular expression parser, is available for public
scrutiny.  It's NOT on CPAN yet -- I plan to wait until I can nail any
bugs down.

It will also be undergoing some root canal surgery, since I plan to
rethink how regex_to_string() works.

The most recent version is 0.02.  It can be downloaded at the following

Following is the documentation.

    RegexParser - module for breaking apart simple Perl regular
    expressions

      use RegexParser 'regex_to_string';
      my $filename = regex_to_string qr/\w{10}-\d{3}\.txt/;
      # something like "jk3429jds2-014.txt"

      use RegexParser 'reverse_regex';
      my $last_num = reverse_regex qr/(\d+)\D*$/;
      # (?-ismx:^\D*(\d+))

      use RegexParser 'reverse_match';
      my $numbers = "123 456 678 012";
      my ($match) = reverse_match $numbers => qr/(\d+)\D*$/;
      # 012
      my @matches = reverse_match $numbers => qr/(\d+)\D*(\d+)\D*$/;
      # (678,012)

    This module can break a regular expression down into "nodes" for two
    major uses. The first is for creating a string that matches a regular
    expression. The second is for reversing the regular expression, so
    that you can match from the end of a string more efficiently.
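
    As a rough illustration of the second use, matching from the end can
    be pictured as reversing the string, applying a reversed pattern
    anchored at the start, and then un-reversing the captures. The sketch
    below assumes the patterns have already been reversed by hand;
    `match_from_end' is a hypothetical helper and is not RegexParser's own
    code.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of RegexParser): match at the end of a
# string by reversing the string and applying a hand-reversed pattern.
# Call it in list context.
sub match_from_end {
    my ($string, $reversed_re) = @_;
    my @caps = scalar(reverse $string) =~ $reversed_re
        or return;
    # Each capture comes back reversed, and the captures come back in
    # reverse order; undo both.
    return reverse map { scalar reverse $_ } @caps;
}

my $numbers = "123 456 678 012";

# qr/(\d+)\D*$/ reversed by hand
my ($match) = match_from_end($numbers, qr/^\D*(\d+)/);

# qr/(\d+)\D*(\d+)\D*$/ reversed by hand
my @matches = match_from_end($numbers, qr/^\D*(\d+)\D*(\d+)/);

print "$match\n";      # 012
print "@matches\n";    # 678 012
```

    Because the captures are numbered in the opposite order in the
    reversed pattern, the helper flips the capture list as well as each
    captured string.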

    Because there are several areas where this module can be improved and
    changed (if the need for it arises), the heuristics by which strings
    are matched are not fixed.
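
    To make "creating a string that matches" concrete, here is a minimal
    sketch of forming a string from a list of nodes, following the
    quantifier rule this documentation describes (a random count between
    *m* and *n*, or *m* when there is no *n*). The node layout and
    `string_from_nodes' are assumptions made for illustration; the nodes
    are written by hand rather than parsed, and none of this is
    RegexParser's actual code.

```perl
use strict;
use warnings;

# Hypothetical node layout: either a literal string, or a set of
# candidate characters with a repeat range (min, optional max).
sub string_from_nodes {
    my @nodes = @_;
    my $out = '';
    for my $node (@nodes) {
        if (defined $node->{literal}) {
            $out .= $node->{literal};
            next;
        }
        my ($min, $max) = ($node->{min}, $node->{max});
        $max = $min unless defined $max;    # no upper bound: repeat min times
        my $count = $min + int rand($max - $min + 1);
        my @chars = @{ $node->{chars} };
        $out .= $chars[ rand @chars ] for 1 .. $count;
    }
    return $out;
}

# Hand-built stand-in for qr/\w{10}-\d{3}\.txt/
my @nodes = (
    { chars => [ 'a' .. 'z', 'A' .. 'Z', '0' .. '9', '_' ], min => 10, max => 10 },
    { literal => '-' },
    { chars => [ '0' .. '9' ], min => 3, max => 3 },
    { literal => '.txt' },
);

my $filename = string_from_nodes(@nodes);
print "$filename\n";    # something like "jk3429jds2-014.txt"
```

    A node for `*', `+', or `?' would use a min and max of 1 under the
    rule described in this documentation, which is one reason the
    heuristics may need to change.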

    This is in place of a "bugs" list. If there appears to be an error in
    one of the areas that is stated as supported, email the author
    (information below).

    * backreferences
        It can handle backreferences, including nested ones. Because the
        engine uses `\\[0-3][0-7][0-7]' to match octals and `\\[1-9]\d*'
        to match backreferences, it can technically handle up to 99
        backreferences without getting shaky. However, `?' and `*'
        quantifiers on backreferences complicate reversal; see the
        "conditionals" item below.

    * grouping parentheses
        The `(?:...)' grouping parentheses are supported. Modifiers
        related to this structure are listed next.

    * regex modifiers
        Of the four modifiers, 'i', 's', 'm', and 'x', the engine
        currently only "supports" the 'x' modifier. 'i' and 's' have no
        "need" to be supported, since any string that matches without the
        'i' modifier also matches with it, and the engine never produces
        `\n' for `.' anyway. The 'm' modifier might be supported in the
        future.

        Modifiers in the form of `(?i)' are currently not supported, but
        probably will be very soon.

    * anchors
        Currently, the engine doesn't pay attention to the beginning-of-line
        or end-of-line anchors when forming a string, since they can be
        implied by the fact that there's nothing else in the string. They
        are supported (and are properly reversed, as best as can be) when
        reversing a regular expression. In addition, the `\b' and `\B'
        anchors (word boundaries) are not supported when forming a string.
        Support for these anchors at the beginning and end of a string
        might come along soon (and be achieved by prepending or appending
        the proper characters if needed); support for internal placement
        might come later.

    * escapes
        The engine supports octal, hexadecimal, and control-sequence
        escapes.

    * alternation
        The engine supports alternation.

    * look-ahead and look-behind
        These are not supported. One reason is that they can create an
        infinite loop (such as `/(?!foo)foo/', which can never match).
        Look-behind is not supported for similar reasons. If these were
        ever supported, the engine could technically allow variable-width
        look-behinds by employing regular expression reversal (though this
        could get into an ugly loop).

    * cut
        The cut expression is not supported. It too can make patterns that
        can never match (see 'perlre' for an example).

    * interpolation
        The engine only handles regular expression elements, not things
        that should have been interpolated beforehand. Sending it
        `$foo|$bar' will thus match either `foo' or `bar', since it won't
        recognize the variables and does not enforce any context around
        the `$' anchor. If it did, this would be another pattern that
        could never match.

    * evaluation
        The `(?{CODE})' and `(??{LATER})' expressions are not supported,
        for rather obvious reasons -- they're difficult to parse and might
        be dangerous to allow.

    * conditionals
        The `(?(COND)...|...)' expression is currently not supported, but
        it might have to be in the near future, due to the nature of some
        regular expressions when they become reversed:

          /(\w)\d\1*/  ==>  /(?:(\w)\1*)?\d(?(1)\1|\w)/

        The explanation for that cruft is that the regular expression
        matches strings like "a9", "a9a", "a9aa", and so on. Upon reversal,
        it matches strings like "9a", "a9a", "aa9a", and so on. The
        beginning sequence is now optional, and this can cause problems.
        The reversed regular expression optionally matches what was the
        ending part. If that part matched, the final element must match
        the backreference. If it did not, the final element matches some
        arbitrary `\w'.

    * inline comments
        These (`(?#...)') are supported. Remember that they end at the
        first `)'.

    * quantifiers
        Quantifiers of the form `*', `*?', `+', `+?', `?', and `??' match
        once when forming a string. Those of the form `{m,n}' match a
        random number of times between *m* and *n*. If there is no *n*, it
        will match *m* times. This may change in future versions, because
        there may be a need to match differently in different cases (an
        example is `/\w*(^\d+)/', which matches only when the `\w*' node
        matches 0 times).

    * character classes
        These are supported. There is currently no error checking as far
        as ranges are concerned. Negated classes are also supported.
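
    The reversal shown under "conditionals" above can be checked in plain
    Perl, independently of RegexParser. The sketch below anchors both
    patterns with `\A' and `\z' (an addition made here so that only whole
    strings count) and confirms that each string matched by the forward
    pattern has a mirror image matched by the reversed one. `mirrors_ok'
    is a hypothetical helper name.

```perl
use strict;
use warnings;

# Forward pattern and its reversed form, both taken from the
# "conditionals" item; \A and \z are added so partial matches
# do not count.
my $forward  = qr/\A(\w)\d\1*\z/;
my $reversed = qr/\A(?:(\w)\1*)?\d(?(1)\1|\w)\z/;

# Returns 1 when $string matches the forward pattern and its mirror
# image matches the reversed pattern, 0 otherwise.
sub mirrors_ok {
    my ($string) = @_;
    my $mirror = reverse $string;
    return ($string =~ $forward && $mirror =~ $reversed) ? 1 : 0;
}

for my $string (qw(a9 a9a a9aa)) {
    printf "%-5s %s\n", $string, mirrors_ok($string) ? "ok" : "MISMATCH";
}
```

    The `(?(1)\1|\w)' conditional is what lets the final element depend
    on whether the optional leading group took part in the match.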

    In the event of a bug in the code, email the author at
    Please use an intelligible subject, such as "RegexParser vX.XX bug:
    'blah'". Give as much output as possible. For debugging output, set
    the $RegexParser::DEBUG variable to a true value.

    * add anchor support (at least `BOL' and `EOL')
    * modify `regex_to_string()' matching heuristics


  0.02 -- Rel. Oct 30, 2000

    Fixed a bug in the `(?:...)' support.
    Added ability to return backreferences in `regex_to_string()'.
    Added `reverse_match()' function.
    Added regex comment support via the `/x' modifier and `(?#...)'.

  0.01 -- Rel. Oct 27, 2000

    Original release.

SEE ALSO, which is standard and shows debugging output about regexes. And
    it wouldn't hurt to look at the regex man page (perlre).

    Copyright (C) 2000, Jeff `japhy' Pinyan. All rights reserved.

Jeff "japhy" Pinyan
PerlMonth - An Online Perl Magazine  
The Perl Archive - Articles, Forums, etc.
CPAN - #1 Perl Resource  (my id:  PINYAN)