RegexParser-0.02 available (fwd)
From: Jeff Pinyan
Date: October 30, 2000 13:22
Subject: RegexParser-0.02 available (fwd)
Message ID: Pine.GSO.4.21.0010301622070.25313-100000@crusoe.crusoe.net
---------- Forwarded message ----------
From: Jeff Pinyan <jeffp@crusoe.net>
Subject: RegexParser-0.02 available
RegexParser.pm, a Perl regular expression parser, is available for public
scrutiny. It's NOT on CPAN yet -- I plan to wait until I can nail any
bugs down.
It will also be undergoing some root canal surgery, since I plan to
rethink the way regex_to_string() works.
The most recent version is 0.02. It can be downloaded at the following
URL:
http://www.pobox.com/~japhy/regexes/RegexParser-stable.tar.gz
Following is the documentation.
NAME
RegexParser - module for breaking apart simple Perl regular
expressions
SYNOPSIS
use RegexParser 'regex_to_string';
my $filename = regex_to_string qr/\w{10}-\d{3}\.txt/;
# something like "jk3429jds2-014.txt"
use RegexParser 'reverse_regex';
my $last_num = reverse_regex qr/(\d+)\D*$/;
# (?-ismx:^\D*(\d+))
use RegexParser 'reverse_match';
$numbers = "123 456 678 012";
($match) = reverse_match $numbers => qr/(\d+)\D*$/;
# 012
@matches = reverse_match $numbers => qr/(\d+)\D*(\d+)\D*$/;
# (678,012)
DESCRIPTION
This module can break a regular expression down into "nodes" for two
major uses. The first is for creating a string that matches a regular
expression. The second is for reversing the regular expression, so
that you can match from the end of a string more efficiently.
Because there are several areas where this module can be improved and
changed (if the need for it arises), the heuristics by which strings
are matched are not fixed and may change between versions.
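As a rough illustration of the second use, matching from the end can be imitated in core Perl by reversing the string and applying a reversed pattern by hand. This is only a sketch under stated assumptions, not how reverse_match() is implemented: match_from_end() is a hypothetical helper, and the reversed patterns are written out manually rather than produced by reverse_regex().

```perl
use strict;
use warnings;

# Hypothetical helper (not part of RegexParser): apply an already
# reversed pattern to the reversed string, then restore both the text
# of each capture and the order of the captures.
sub match_from_end {
    my ($string, $reversed_re) = @_;
    my @caps = scalar(reverse $string) =~ $reversed_re or return;
    return reverse map { scalar reverse $_ } @caps;
}

# Hand-reversed forms of the patterns shown in the SYNOPSIS.
my $numbers = "123 456 678 012";
my ($last)  = match_from_end($numbers, qr/^\D*(\d+)/);          # 012
my @last2   = match_from_end($numbers, qr/^\D*(\d+)\D*(\d+)/);  # (678, 012)
```

The helper is meant to be called in list context, as above.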
FUNCTIONALITY
This is in place of a "bugs" list. If there appears to be an error in
one of the areas that is stated as supported, email the author
(information below).
* backreferences
It can handle backreferences, including nested backreferences. Since
the engine uses `\\[0-3][0-7][0-7]' to match octals and `\\[1-9]\d*'
to match backreferences, it can technically handle up to 99
backreferences without getting shaky. However, if you use the `?'
or `*' quantifiers on backreferences, read the "conditionals" item
below.
* grouping parentheses
The `(?:...)' grouping parentheses are supported. Modifiers
related to this structure are listed next.
* regex modifiers
Of the four modifiers, 'i', 's', 'm', and 'x', the engine
currently only "supports" the 'x' modifier. 'i' and 's' have no
"need" to be supported, since any string that matches without the
'i' modifier can match with it, and the engine will not match `\n'
by `.' anyway. The 'm' modifier might be supported in the future.
Modifiers in the form of `(?i)' are currently not supported, but
probably will be very soon.
* anchors
Currently, the engine doesn't pay attention to the beginning-of-line
or end-of-line anchors when forming a string, since they can be
implied by the fact that there's nothing else in the string. They
are supported (and are properly reversed, as best as can be) when
reversing a regular expression. In addition, the `\b' and `\B'
anchors (word boundaries) are not supported when forming a string.
Support for these anchors at the beginning and end of a string
might come along soon (and be achieved by prepending or appending
the proper characters if needed); support for internal placement
might come later.
* escapes
The engine supports octal, hexadecimal, and control-sequence
escapes.
* alternation
The engine supports alternation.
* look-ahead and look-behind
These are not supported. One reason is that they can create an
infinite loop (for example, `/(?!foo)foo/' can never match).
Look-behind is not supported for similar reasons. If these were
ever to be supported, this engine could technically allow
variable-width look-behinds, by employing regular expression
reversal (this could get into an ugly loop).
* cut
The cut expression is not supported. It too can make patterns that
can never match (see 'perlre' for an example).
* interpolation
The engine only handles regular expression elements, not things
that should have been interpolated beforehand. Sending it
`$foo|$bar' will thus match either `foo' or `bar', since it won't
recognize the variables, and it does not enforce any context
around the `$' anchor. If it did, this pattern could never match.
* evaluation
The `(?{CODE})' and `(??{LATER})' expressions are not supported,
for rather obvious reasons -- they're difficult to parse and might
be dangerous to allow.
* conditionals
The `(?(COND)...|...)' expression is currently not supported, but
it might have to be in the near future, due to the nature of some
regular expressions when they become reversed:
/(\w)\d\1*/ ==> /(?:(\w)\1*)?\d(?(1)\1|\w)/
The explanation for that cruft is that the original regular
expression matches strings like "a9", "a9a", "a9aa", and so on.
Reversed, those strings are "9a", "a9a", "aa9a", and so on, so the
sequence before the digit is optional. This can cause problems.
The reversed regular expression therefore matches that leading
sequence optionally. If it matched, the final character must be
the backreference. If it did not, the final character is some
arbitrary `\w'.
* inline comments
These (`(?#...)') are supported. Remember that they end at the
first `)'.
* quantifiers
Quantifiers of the form `*', `*?', `+', `+?', `?', and `??' match
once when forming a string. Those of the form `{m,n}' match a
random number of times, between *m* and *n*. If there is no *n*, it
will match *m* times. This may change in future versions, because
there may be the need to match differently in different cases (an
example is `/\w*(^\d+)/' which matches only when the `\w*' node
matches 0 times).
* character classes
These are supported. There is currently no error checking as far
as ranges are concerned. Negated classes are also supported.
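The reversal shown under "conditionals" above can be checked with core Perl, which already supports `(?(1)...|...)'. The following is a minimal sketch that does not use RegexParser at all. The `^' and `$' anchors are an addition made here so that each test covers the whole string, and the sub names are hypothetical.

```perl
use strict;
use warnings;

# The pattern from the "conditionals" item and its hand-written
# reversal. Anchors are added here (an assumption, not in the
# original example) so that only whole-string matches count.
my $forward  = qr/^(\w)\d\1*$/;
my $backward = qr/^(?:(\w)\1*)?\d(?(1)\1|\w)$/;

# Hypothetical helpers: test a string in its normal direction, and
# test the same string reversed against the reversed pattern.
sub matches_forward  { my ($s) = @_; return $s =~ $forward ? 1 : 0 }
sub matches_reversed { my ($s) = @_; return scalar(reverse $s) =~ $backward ? 1 : 0 }

for my $s (qw(a9 a9a a9aa a9b)) {
    printf "%-5s forward=%d reversed=%d\n",
        $s, matches_forward($s), matches_reversed($s);
}
```

For "a9", the reversed string is "9a": the optional group has to be skipped, so group 1 is unset and the conditional falls through to `\w'.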
In the event of a bug in the code, email the author at japhy@pobox.com.
Please use an intelligible subject, such as "RegexParser vX.XX bug:
'blah'". Give as much output as possible. For debugging output, set
the $RegexParser::DEBUG variable to a true value.
TO DO LIST
* add anchor support (at least `BOL' and `EOL')
* modify `regex_to_string()' matching heuristics
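The heuristics in question are the ones listed under "quantifiers" above. A minimal sketch of those rules as described, assuming the `{m,n}' range is inclusive at both ends. repeat_count() is a hypothetical name, not a function the module exports, and the real code may work quite differently.

```perl
use strict;
use warnings;

# Hypothetical helper: how many times a quantified node is repeated
# when forming a sample string, following the documented rules.
sub repeat_count {
    my ($quant) = @_;

    # *, *?, +, +?, ? and ?? all match once.
    return 1 if $quant =~ /\A[*+?]\??\z/;

    # {m}, {m,} and {m,n}, optionally followed by a non-greedy '?'.
    if ($quant =~ /\A\{(\d+)(,(\d*))?\}\??\z/) {
        my ($m, $comma, $n) = ($1, $2, $3);
        return $m if !defined $comma || $n eq '';   # no n: match m times
        return $m + int rand($n - $m + 1);          # random count in m..n
    }

    die "unrecognized quantifier: $quant";
}

print repeat_count('{2,5}'), "\n";   # some value from 2 to 5
```

No check is made that *n* is at least *m*.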
HISTORY
0.02 -- Rel. Oct 30, 2000
Fixed a bug in the `(?:...)' support.
Added ability to return backreferences in `regex_to_string()'.
Added `reverse_match()' function.
Added regex comment support via the `/x' modifier and `(?#...)'.
0.01 -- Rel. Oct 27, 2000
Original release.
SEE ALSO
re.pm, which is standard and shows debugging output about regexes. It
also wouldn't hurt to look at the regex man page (perlre).
AUTHOR
Copyright (C) 2000, Jeff `japhy' Pinyan. All rights reserved.
--
Jeff "japhy" Pinyan japhy@pobox.com http://www.pobox.com/~japhy/
PerlMonth - An Online Perl Magazine http://www.perlmonth.com/
The Perl Archive - Articles, Forums, etc. http://www.perlarchive.com/
CPAN - #1 Perl Resource (my id: PINYAN) http://search.cpan.org/