RFC 144 (v2) Behavior of empty regex should be simple
From:
Perl6 RFC Librarian
Date:
August 27, 2000 12:26
Subject:
RFC 144 (v2) Behavior of empty regex should be simple
Message ID:
20000827192621.29442.qmail@tmtowtdi.perl.org
This and other RFCs are available on the web at
http://dev.perl.org/rfc/
=head1 TITLE
Behavior of empty regex should be simple
=head1 VERSION
Maintainer: Mark Dominus <mjd@plover.com>
Date: 24 August 2000
Last Modified: 27 August 2000
Version: 2
Mailing List: perl6-language-regex@perl.org
Number: 144
=head1 ABSTRACT
=head2 Standard Documentation
According to L<perlop>:
=over 4
=item m/PATTERN/cgimosx
=item /PATTERN/cgimosx
If the PATTERN evaluates to the empty string, the last successfully
matched regular expression is used instead.
=back
This behavior should be changed. If the PATTERN is empty, Perl should
look for the empty string. (That is, if the PATTERN is empty, it
should always match.)
=head1 DESCRIPTION
Literal empty patterns, such as:
    $s =~ // ;
are not the problem here. The real problem is that the special case
is invoked for interpolated patterns also. For example,
    chomp($pat = <STDIN>);
    $s =~ /\Q$pat\E/;
looks to see if $pat is a substring of $s, unless $pat is empty, in
which case it matches $s against the last regex that was matched
successfully. That regex might be far away, in some other module.
If the far-away regex happens to contain capturing groups, the
backreference variables ($1, $2, and so on) will be set accordingly.
To make this safe in Perl 5, the programmer has to write something
peculiar like
    $s =~ /(?=)\Q$pat\E/;
to ensure that the regex, after interpolation, is never empty.
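The trap and the workaround described above can be reproduced with a
short script. This is only a sketch: the sub name C<demo_empty_pattern>,
the sample strings, and the earlier C</(\d+)/> match are made up for
illustration, and the results assume the Perl 5 semantics quoted from
L<perlop>.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the RFC). Returns three values:
#   1. whether the naive match /\Q$pat\E/ succeeded,
#   2. the value of $1 after a successful naive match (undef otherwise),
#   3. whether the guarded match /(?=)\Q$pat\E/ succeeded.
sub demo_empty_pattern {
    my ($s, $pat) = @_;

    # Stand-in for an unrelated regex that matched successfully earlier.
    "order 66" =~ /(\d+)/ or die "setup match unexpectedly failed";

    # If $pat is empty, Perl 5 silently reuses /(\d+)/ here.
    my $naive    = ($s =~ /\Q$pat\E/) ? 1 : 0;
    my $captured = $naive ? $1 : undef;

    # (?=) keeps the interpolated pattern from ever being empty.
    my $guarded  = ($s =~ /(?=)\Q$pat\E/) ? 1 : 0;

    return ($naive, $captured, $guarded);
}
```

With an empty C<$pat>, the naive match against C<'hello'> fails even
though the empty string is a substring of every string, and the naive
match against C<'x42y'> succeeds and leaves C<42> in C<$1>. The guarded
match succeeds in both cases.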
I propose that this 'last successful match' behavior be discarded
entirely, and that an empty pattern always match the empty string.
=head1 RATIONALE
=head2 The Feature Was Not Useful, I
The special behavior for empty patterns has never been particularly
useful. For example, you could imagine code like this:
    for $pat (@patterns) {
        if ($a =~ /$pat/ && $b =~ //) {
            # do something
        }
    }
This would be more efficient than the equivalent
    for $pat (@patterns) {
        if ($a =~ /$pat/ && $b =~ /$pat/) {
            # do something
        }
    }
because $pat would be compiled only once per iteration instead of twice.
It is now more straightforward and efficient to do this sort of thing
explicitly with the qr// operator:
    @patterns = map qr/$_/, @patterns;
    for $pat (@patterns) {
        if ($a =~ /$pat/ && $b =~ /$pat/) {
            # do something
        }
    }
=head2 The Feature Was Not Useful, II
People sometimes propose the following use for the empty pattern
special case: They have a pattern, and many strings, and they want to
see if every string matches the pattern. This code works, but is
inefficient:
    sub match_all {
        my $pat = shift;
        for (@_) {
            return 0 unless /$pat/;
        }
        return 1;
    }
This is because C</$pat/> must be recompiled for each string, or checked
to see whether recompilation is necessary.
This code does not work:
    sub match_all {
        my $pat = shift;
        for (@_) {
            return 0 unless /$pat/o;
        }
        return 1;
    }
because C</o> compiles the pattern only once, the first time the match
is executed, while C<$pat> changes with each call.
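The failure is easy to observe by calling the sub twice with different
patterns. In this sketch the name C<match_all_o> and the sample data are
illustrative only; the sub body is the C</o> version shown above.

```perl
use strict;
use warnings;

# The /o version from the text, renamed (hypothetically) to match_all_o.
sub match_all_o {
    my $pat = shift;
    for (@_) {
        # /o: the pattern is compiled the first time this line runs
        # and never again, whatever $pat holds on later calls.
        return 0 unless /$pat/o;
    }
    return 1;
}

my $first  = match_all_o('^a', 'apple', 'avocado');  # tests against /^a/
my $second = match_all_o('^b', 'banana', 'bread');   # still tests against /^a/
```

C<$first> is 1, as intended, but C<$second> is 0: the second call still
matches against C</^a/>, so C<'banana'> is rejected.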
One solution is to use 'eval' here to generate the pattern matching
code (with C</o>) at run time.
People have sometimes tried to use C<//> here, but usually without
success. The idea is:
    sub match_all {
        my $pat = shift;
        # load $pat into 'last successfully matched' space
        for (@_) {
            return 0 unless //;
        }
        return 1;
    }
The problem here is that there is no way to designate $pat as the last
successfully matched regex without actually finding a string that
matches it. In the past people attempting this strategy have appeared
in C<comp.lang.perl.misc> asking how to find a string that matches a
given regex. As far as I know, no useful solutions have been offered.
(In fact, there may not be any such string. Consider the pattern
C</a\bx/>, for example: C<a> and C<x> are both word characters, so
C<\b> can never match between them.)
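To make the difficulty concrete, here is a sketch of what the strategy
requires in practice. The sub name C<match_all_seeded> and its C<$seed>
argument are hypothetical: the caller must supply a string already known
to match C<$pat>, which is exactly what is hard to come by in general.

```perl
use strict;
use warnings;

# Hypothetical variant of match_all. $seed must be a string that
# matches $pat; the sub dies if it does not.
sub match_all_seeded {
    my ($pat, $seed, @strings) = @_;

    # Load $pat into the 'last successfully matched' slot.
    $seed =~ /$pat/ or die "seed '$seed' does not match /$pat/";

    for (@strings) {
        # Empty pattern: reuses the regex that just matched $seed.
        return 0 unless //;
    }
    return 1;
}
```

For a simple pattern such as C<'^a'> a seed like C<'a'> is easy to
write by hand, but for an arbitrary pattern supplied at run time there
is no general way to produce one.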
A better, simpler solution to this problem is to use the C<qr> operator:
    sub match_all {
        my $pat = shift;
        $pat = qr($pat);
        for (@_) {
            return 0 unless /$pat/;
        }
        return 1;
    }
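As a usage sketch, the C<qr> version can be called repeatedly with
different patterns and gives the answer for the pattern actually passed
in. The sample patterns and strings are illustrative; C<qr/$pat/> below
is the same operator as C<qr($pat)>, written with different delimiters.

```perl
use strict;
use warnings;

sub match_all {
    my $pat = shift;
    $pat = qr/$pat/;          # compile once per call
    for (@_) {
        return 0 unless /$pat/;
    }
    return 1;
}

my @results = (
    match_all('^a', 'apple', 'avocado'),   # every string starts with a
    match_all('^b', 'banana', 'bread'),    # every string starts with b
    match_all('^b', 'banana', 'apple'),    # 'apple' does not start with b
);
```

Here C<@results> is C<(1, 1, 0)>: each call is judged against its own
pattern.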
=head2 This feature has resulted in bugs
Any code that contains the innocent-looking
    if (/\Q$string\E/) {
        ...
    }
is potentially booby-trapped. Such code is common. An example of
this type appears in L<perlfaq6>.
=head1 ALTERNATIVES
Rather than eliminating the special case entirely, alternative changes
are sometimes proposed.
=head2 Empty pattern to mean 'last match' instead of 'last successful match'
This behavior would be more useful than the current behavior.
For example, the application discussed in the section 'The feature was
not useful, II' above would be feasible if the empty pattern matched
the last-matched pattern, because it would no longer be necessary to
manufacture a matching string. But it is simpler and more
straightforward to solve the problem with C<qr//>.
Moreover, this alternative behavior retains all the drawbacks of the
current behavior: It yields subtle and intermittent bugs and
introduces strange action-at-a-distance effects.
=head2 Retain special behavior for literal empty pattern?
In the past it has been proposed that the special behavior be
discarded for patterns with interpolated variables, but retained for
empty patterns that appear literally in the source. C</\Q$string\E/>
would look for $string, regardless of whether $string was empty,
because C</\Q$string\E/> is not a I<literal> empty pattern. But
    $s =~ // ;
would still retain the special behavior and match $s against whatever
was the last pattern to be successfully matched.
Since the feature is of marginal usefulness (especially now that
C<qr//> is available) it should be eliminated anyway to reduce code
and documentation bloat. However, see C<Translation Issues> below.
=head1 IMPLEMENTATION
No special implementation is necessary.
=head1 TRANSLATION ISSUES
Old Perl 5 scripts that depend on this feature may be hard to
translate unless Perl 6 also implements it.
If the literal empty pattern were to retain the special behavior, then
Perl 5 code like this:
    if (/$FOO/) {
        ...
    }
could be translated to Perl 6 code like this:
    if (do { my $_tmp = $FOO; ($_tmp eq '') ? // : /$_tmp/ }) {
        ...
    }
However, it's not clear that such translation is actually desirable.
=head1 REFERENCES
L<perlop> discussion of empty patterns, quoted above.