[PATCH 5.005_58] REx documentation
From: Ilya Zakharevich
Date: August 27, 1999 16:04
Subject: [PATCH 5.005_58] REx documentation
Message ID: 19990827190218.A19561@monk.mps.ohio-state.edu
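
A minimal sketch, not part of the patch, of what the reworded \G entry
describes: \G anchors at pos(), so a scanner can try several patterns in
turn at the same position. The sample string, the token labels, and the
use of the /c modifier (which keeps pos() after a failed match) are
assumptions made for illustration.

```perl
use strict;
use warnings;

my $input = "12 + 34";    # hypothetical sample input
my @tokens;

pos($input) = 0;          # \G anchors wherever pos() currently points
while (pos($input) < length $input) {
    # /g advances pos() on success; /c leaves pos() alone on failure,
    # so the next alternative is tried at the same position.
    if    ($input =~ /\G\s+/gc)       { next }
    elsif ($input =~ /\G(\d+)/gc)     { push @tokens, "NUM:$1" }
    elsif ($input =~ /\G([-+*\/])/gc) { push @tokens, "OP:$1" }
    else  { die "unexpected input at offset " . pos($input) . "\n" }
}

print join(" ", @tokens), "\n";
```

Each branch either consumes input exactly at pos() or leaves pos()
untouched, which is the behaviour the new wording of the entry spells out.
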
--- ./pod/perlre.pod~~ Mon Aug 2 16:20:36 1999
+++ ./pod/perlre.pod Fri Aug 27 15:53:30 1999
@@ -289,7 +289,8 @@ Perl defines the following zero-width as
\A Match only at beginning of string
\Z Match only at end of string, or before newline at the end
\z Match only at end of string
- \G Match only where previous m//g left off (works only with /g)
+ \G Match only at pos() (e.g. at the end-of-match position
+ of the previous m//g)
A word boundary (C<\b>) is a spot between two characters
that has a C<\w> on one side of it and a C<\W> on the other side
@@ -383,7 +384,13 @@ Today it is more common to use the quote
metaquoting escape sequence to disable all metacharacters' special
meanings like this:
- /$unquoted\Q$quoted\E$unquoted/
+ /$unquoted\Q$quoted\E$unquoted/;
+
+Beware that if you put I<literal> backslashes (those not inside
+interpolated variables) between C<\Q> and C<\E>, double-quotish
+backslash interpolation may lead to confusing results. If you
+I<need> to use literal backslashes within the scope of C<\Q>,
+consult L<perlop/"Gory details of parsing quoted constructs">.
=head2 Extended Patterns
@@ -394,7 +401,7 @@ the parentheses. The character after th
the extension.
The stability of these extensions varies widely. Some have been
-part of the core language for many years. Others are experimental
+part of the core language for many years. Others are still experimental
and may change without warning or be completely removed. Check
the documentation on an individual feature to verify its current
status.
@@ -502,8 +509,8 @@ only for fixed-width look-behind.
=item C<(?{ code })>
-B<WARNING>: This extended regular expression feature is considered
-highly experimental, and may be changed or deleted without notice.
+B<WARNING>: This extended regular expression feature is still
+experimental.
This zero-width assertion evaluate any embedded Perl code. It
always succeeds, and its C<code> is not interpolated. Currently,
@@ -564,8 +571,9 @@ module. See L<perlsec> for details abou
=item C<(?p{ code })>
-B<WARNING>: This extended regular expression feature is considered
-highly experimental, and may be changed or deleted without notice.
+B<WARNING>: This extended regular expression feature is still
+highly experimental. While the semantics are pretty much settled,
+a simplified version of the syntax should still be designed.
This is a "postponed" regular subexpression. The C<code> is evaluated
at run time, at the moment this subexpression may match. The result
@@ -589,14 +597,13 @@ The following pattern matches a parenthe
=item C<(?E<gt>pattern)>
-B<WARNING>: This extended regular expression feature is considered
-highly experimental, and may be changed or deleted without notice.
-
An "independent" subexpression, one which matches the substring
that a I<standalone> C<pattern> would match if anchored at the given
-position--but it matches no more than this substring. This
+position, and it matches I<nothing other> than this substring. This
construct is useful for optimizations of what would otherwise be
-"eternal" matches, because it will not backtrack (see L<"Backtracking">).
+"eternal" matches, because it will not backtrack (see L<"Backtracking">),
+as well as in the many places where "grab all you can, and do not give
+anything back" semantics are desirable.
For example: C<^(?E<gt>a*)ab> will never match, since C<(?E<gt>a*)>
(anchored at the beginning of string, as above) will match I<all>
@@ -619,7 +626,7 @@ Consider this pattern:
m{ \(
(
- [^()]+
+ [^()]+ # x+ inside (group)+
|
\( [^()]* \)
)+
@@ -639,7 +646,7 @@ hung. However, a tiny change to this pa
m{ \(
(
- (?> [^()]+ )
+ (?> [^()]+ ) # Change x+ to (?> x+ )
|
\( [^()]* \)
)+
@@ -656,12 +663,30 @@ On simple groups, such as the pattern C<
effect may be achieved by negative look-ahead, as in C<[^()]+ (?! [^()] )>.
This was only 4 times slower on a string with 1000000 C<a>s.
+The "grab all you can, and do not give anything back" semantics are desirable
+in many situations where at first sight C<()*> looks like the correct
+solution. Suppose we parse text with comments delimited by
+C<#> followed by optional (horizontal) whitespace. Contrary to
+its appearance, C<#[ \t]*> I<is not> a correct subexpression to match
+the comment delimiter, since it may give whitespace back. The answer is one of
+
+ (?>#[ \t]*)
+ #[ \t]*(?![ \t])
+
+For example, to grab non-empty comments into $1, one should use one of
+
+ / (?> \# [ \t]* ) ( .+ ) /x;
+ / \# [ \t]* ( [^ \t] .* ) /x;
+
+It is a judgement call which one of these expressions better reflects
+the above specification of comments.
+
=item C<(?(condition)yes-pattern|no-pattern)>
=item C<(?(condition)yes-pattern)>
-B<WARNING>: This extended regular expression feature is considered
-highly experimental, and may be changed or deleted without notice.
+B<WARNING>: This extended regular expression feature is still
+experimental.
Conditional expression. C<(condition)> should be either an integer in
parentheses (which is valid if the corresponding pair of parentheses
@@ -684,7 +709,10 @@ themselves.
A fundamental feature of regular expression matching involves the
notion called I<backtracking>, which is currently used (when needed)
by all regular expression quantifiers, namely C<*>, C<*?>, C<+>,
-C<+?>, C<{n,m}>, and C<{n,m}?>.
+C<+?>, C<{n,m}>, and C<{n,m}?>. Though internally the regular expression
+engine may use mechanisms other than backtracking to find the match,
+for humans the model of backtracking gives a convenient way
+to predict how the engine will behave.
For a regular expression to match, the I<entire> regular expression must
match, not just part of it. So if the beginning of a pattern containing a
@@ -857,10 +885,11 @@ is not a zero-width assertion, but a one
B<WARNING>: particularly complicated regular expressions can take
exponential time to solve because of the immense number of possible
-ways they can use backtracking to try match. For example, this will
+ways they can use backtracking to try to match. For example, without
+internal optimizations done by the regular expression engine, this would
take a painfully long time to run
- /((a{0,5}){0,5}){0,5}/
+ 'aaaaaaaaaaaa' =~ /((a{0,5}){0,5}){0,5}[c]/
And if you used C<*>'s instead of limiting it to 0 through 5 matches,
then it would take forever--or until you ran out of stack space.
@@ -1003,7 +1032,7 @@ may match zero-length substrings. Here'
@chars = split //, $string; # // is not magic in split
($whitewashed = $string) =~ s/()/ /g; # parens avoid magic s// /
-Thus Perl allows the C</()/> construct, which I<forcefully breaks
+Thus Perl allows such constructs by I<forcefully breaking
the infinite loop>. The rules for this are different for lower-level
loops given by the greedy modifiers C<*+{}>, and for higher-level
ones like the C</g> modifier or split() operator.
@@ -1043,6 +1072,8 @@ position one notch further in the string
The additional state of being I<matched with zero-length> is associated with
the matched string, and is reset by each assignment to pos().
+Zero-length matches at the end of the previous match are ignored
+during C<split>.
=head2 Creating custom RE engines
@@ -1093,8 +1124,8 @@ part of this regular expression needs to
=head1 BUGS
-This manpage is varies from difficult to understand to completely
-and utterly opaque.
+As with many other parts of the Perl documentation, this manpage would benefit
+from separating "User manual"-style sections from "Reference manual"-style ones.
=head1 SEE ALSO
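
A minimal sketch, not part of the patch, of the comment-delimiter case
from the C<(?E<gt>pattern)> hunk above. It compares the naive
pattern with the two forms the patch recommends. The helper
comment_text() and the sample lines are hypothetical names chosen for
illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the captured comment text, or undef
# when the pattern does not match.
sub comment_text {
    my ($re, $line) = @_;
    return $line =~ $re ? $1 : undef;
}

my $naive  = qr/ \# [ \t]* ( .+ ) /x;          # delimiter can give spaces back
my $atomic = qr/ (?> \# [ \t]* ) ( .+ ) /x;    # delimiter keeps what it grabbed
my $class  = qr/ \# [ \t]* ( [^ \t] .* ) /x;   # text must start with non-blank

my $blank = "#  ";        # delimiter followed by two spaces only
my $full  = "# set x";    # delimiter followed by real text

for my $case ([naive => $naive], [atomic => $atomic], [class => $class]) {
    my ($name, $re) = @$case;
    for my $line ($blank, $full) {
        my $text = comment_text($re, $line);
        printf "%-6s %-10s => %s\n", $name, "'$line'",
            defined $text ? "'$text'" : "no match";
    }
}
```

On the whitespace-only line the naive pattern backtracks, hands the last
space to C<.+>, and reports a "non-empty" comment consisting of one
space. Both recommended forms refuse to match there, and all three agree
on the line with real text.
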