PATCH: perlre.pod (against 5.6.0)
From: Tom Christiansen
Date: April 29, 2000 10:16
Subject: PATCH: perlre.pod (against 5.6.0)
Message ID: 14962.957028538@chthon
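
The quantifier hunk in the patch below adds a paragraph on leftmost matching. As a minimal sketch of that rule (not part of the patch itself), the following assumes a hypothetical helper, first_match, that reports where a pattern first matches and what text it matched, using the @- and @+ offset arrays available as of 5.6.0.

```perl
use strict;
use warnings;

# Hypothetical helper: return (offset, matched text) for the first match
# of $regex in $string, or an empty list if there is no match.
sub first_match {
    my ($string, $regex) = @_;
    return unless $string =~ $regex;
    # @- and @+ hold the start and end offsets of the last successful match.
    return ($-[0], substr($string, $-[0], $+[0] - $-[0]));
}

# /a*/ and /a*?/ can both match the empty string at offset 0, so both
# stop there: the leftmost match wins before length is considered.
# /a+/ and /a+?/ cannot match before offset 2; only then does greedy
# versus minimal decide between 'aaa' and 'a'.
for my $pattern ('a*', 'a*?', 'a+', 'a+?') {
    my ($offset, $text) = first_match("xxaaa", qr/$pattern/);
    print "/$pattern/ matched '$text' at offset $offset\n";
}
```

The helper uses substr with the offsets rather than C<$&>, so it avoids the program-wide penalty for the match variables that the patch also describes.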
*** perlre56.pod Sat Apr 29 09:42:35 2000
--- perlre.pod Sat Apr 29 11:13:51 2000
***************
*** 4,109 ****
=head1 DESCRIPTION
! This page describes the syntax of regular expressions in Perl. For a
! description of how to I<use> regular expressions in matching
! operations, plus various examples of the same, see discussions
! of C<m//>, C<s///>, C<qr//> and C<??> in L<perlop/"Regexp Quote-Like Operators">.
!
! Matching operations can have various modifiers. Modifiers
! that relate to the interpretation of the regular expression inside
! are listed below. Modifiers that alter the way a regular expression
! is used by Perl are detailed in L<perlop/"Regexp Quote-Like Operators"> and
! L<perlop/"Gory details of parsing quoted constructs">.
=over 4
=item i
! Do case-insensitive pattern matching.
If C<use locale> is in effect, the case map is taken from the current
locale. See L<perllocale>.
=item m
! Treat string as multiple lines. That is, change "^" and "$" from matching
! the start or end of the string to matching the start or end of any
! line anywhere within the string.
=item s
! Treat string as single line. That is, change "." to match any character
! whatsoever, even a newline, which normally it would not match.
The C</s> and C</m> modifiers both override the C<$*> setting. That
is, no matter what C<$*> contains, C</s> without C</m> will force
! "^" to match only at the beginning of the string and "$" to match
only at the end (or just before a newline at the end) of the string.
! Together, as /ms, they let the "." match any character whatsoever,
! while yet allowing "^" and "$" to match, respectively, just after
and just before newlines within the string.
=item x
! Extend your pattern's legibility by permitting whitespace and comments.
=back
! These are usually written as "the C</x> modifier", even though the delimiter
! in question might not really be a slash. Any of these
! modifiers may also be embedded within the regular expression itself using
! the C<(?...)> construct. See below.
The C</x> modifier itself needs a little more explanation. It tells
! the regular expression parser to ignore whitespace that is neither
! backslashed nor within a character class. You can use this to break up
! your regular expression into (slightly) more readable parts. The C<#>
! character is also treated as a metacharacter introducing a comment,
! just as in ordinary Perl code. This also means that if you want real
whitespace or C<#> characters in the pattern (outside a character
! class, where they are unaffected by C</x>), that you'll either have to
! escape them or encode them using octal or hex escapes. Taken together,
! these features go a long way towards making Perl's regular expressions
more readable. Note that you have to be careful not to include the
! pattern delimiter in the comment--perl has no way of knowing you did
! not intend to close the pattern early. See the C-comment deletion code
! in L<perlop>.
=head2 Regular Expressions
! The patterns used in Perl pattern matching derive from supplied in
! the Version 8 regex routines. (The routines are derived
! (distantly) from Henry Spencer's freely redistributable reimplementation
! of the V8 routines.) See L<Version 8 Regular Expressions> for
! details.
! In particular the following metacharacters have their standard I<egrep>-ish
meanings:
\ Quote the next metacharacter
! ^ Match the beginning of the line
. Match any character (except newline)
! $ Match the end of the line (or before newline at the end)
| Alternation
() Grouping
[] Character class
! By default, the "^" character is guaranteed to match only the
! beginning of the string, the "$" character only the end (or before the
! newline at the end), and Perl does certain optimizations with the
! assumption that the string contains only one line. Embedded newlines
! will not be matched by "^" or "$". You may, however, wish to treat a
! string as a multi-line buffer, such that the "^" will match after any
! newline within the string, and "$" will match before any newline. At the
! cost of a little more overhead, you can do this by using the /m modifier
! on the pattern match operator. (Older programs did this by setting C<$*>,
! but this practice is now deprecated.)
!
! To simplify multi-line substitutions, the "." character never matches a
! newline unless you use the C</s> modifier, which in effect tells Perl to pretend
! the string is a single line--even if it isn't. The C</s> modifier also
! overrides the setting of C<$*>, in case you have some (badly behaved) older
! code that sets it in another module.
The following standard quantifiers are recognized:
--- 4,116 ----
=head1 DESCRIPTION
! This page describes the syntax and semantics of Perl's regular
! expression engine. For a description of how to actually I<use>
! regular expressions in matching operations, plus various examples
! of the same, see the descriptions of C<m//>, C<s///>, C<qr//>
! and C<??> in L<perlop/"Regexp Quote-Like Operators">. (These are
! typically called regexes for short, or more accurately, patterns,
! since Perl's "regular expressions" aren't properly regular in the
! special compsci sense of that word.)
!
! Matching operations can have various modifiers. Modifiers that
! relate to the interpretation of the regex are listed below. Modifiers
! that alter the way Perl uses a pattern are detailed in L<perlop/"Regexp
! Quote-Like Operators"> and L<perlop/"Gory details of parsing quoted
! constructs">.
=over 4
=item i
! Do case-insensitive pattern matching, including when matching
! backreferences.
If C<use locale> is in effect, the case map is taken from the current
locale. See L<perllocale>.
=item m
! Change C<^> and C<$> from matching the start or before the optional
! newline at the end of the string to matching the start or end of
! any line anywhere within the string.
=item s
! Change C<.> to match any character whatsoever, even a newline, which
! normally it would not match, even if the deprecated C<$*> variable
! were set.
The C</s> and C</m> modifiers both override the C<$*> setting. That
is, no matter what C<$*> contains, C</s> without C</m> will force
! C<^> to match only at the beginning of the string and C<$> to match
only at the end (or just before a newline at the end) of the string.
! Together, as C</ms>, they let the C<.> match any character whatsoever,
! while yet allowing C<^> and C<$> to match, respectively, just after
and just before newlines within the string.
=item x
! Permit whitespace and comments in the pattern, enhancing
! (well, enabling) legibility. It's also more expressive.
=back
! These are usually written as "the C</x> modifier", even though the
! delimiter in question might not really be a slash. Any of these
! modifiers may also be embedded within the regex itself using the
! C<(?I<flags>...)> construct. See below.
The C</x> modifier itself needs a little more explanation. It tells
! the regex parser to ignore whitespace that is neither backslashed
! nor within a character class. You can use this to break up your
! pattern into (slightly) more readable parts. The C<#> character
! is also treated as a metacharacter introducing a comment, just as
! in ordinary Perl code. This also means that if you want real
whitespace or C<#> characters in the pattern (outside a character
! class, where they are unaffected by C</x>), that you'll either have
! to escape them or encode them using octal or hex escapes. Taken
! together, these features go a long way towards making Perl's patterns
more readable. Note that you have to be careful not to include the
! pattern delimiter in the comment--perl has no way of knowing you
! did not intend to close the pattern early. See the C-comment
! deletion code in L<perlop>.
=head2 Regular Expressions
! The patterns used in Perl pattern matching derive from the standard
! Version 8 Unix regex routines. (The routines are derived (distantly)
! from Henry Spencer's freely redistributable reimplementation of the
! V8 routines.) See L<Version 8 Regular Expressions> for details.
! In particular the following metacharacters have their standard B<egrep>-ish
meanings:
\ Quote the next metacharacter
! ^ Match the beginning of the string
. Match any character (except newline)
! $ Match before the optional newline at the end of the string
| Alternation
() Grouping
[] Character class
! By default, the C<^> metacharacter matches only at the beginning
! of the string, the C<$> metacharacter only before an optional
! trailing newline at the end, so Perl does certain optimizations
! with the assumption that the string contains only one line. Embedded
! newlines will not normally be noticed by C<^> or C<$>. You may,
! however, wish to treat a string as a multi-line buffer, such that
! the C<^> will match after any newline within the string, and C<$>
! will match before any newline. (These don't actually match the
! newlines, though, so C</foo^bar/> can never match, for example.)
! At the cost of a little more overhead, you can do this by using the
! C</m> modifier on the pattern match operator. (Older programs did
! this by setting C<$*>, but this practice is now deprecated.)
!
! For historical reasons and to simplify substitutions, the C<.>
! character never matches a newline unless you use the C</s> modifier.
! The C</s> modifier also overrides the setting of C<$*>, in case you
! have some (badly behaved) older code that sets it in another module.
The following standard quantifiers are recognized:
***************
*** 114,141 ****
{n,} Match at least n times
{n,m} Match at least n but not more than m times
! (If a curly bracket occurs in any other context, it is treated
! as a regular character.) The "*" modifier is equivalent to C<{0,}>, the "+"
! modifier to C<{1,}>, and the "?" modifier to C<{0,1}>. n and m are limited
! to integral values less than a preset limit defined when perl is built.
! This is usually 32766 on the most common platforms. The actual limit can
! be seen in the error message generated by code such as this:
$_ **= $_ , / {$_} / for 2 .. 42;
! By default, a quantified subpattern is "greedy", that is, it will match as
! many times as possible (given a particular starting location) while still
! allowing the rest of the pattern to match. If you want it to match the
! minimum number of times possible, follow the quantifier with a "?". Note
! that the meanings don't change, just the "greediness":
*? Match 0 or more times
+? Match 1 or more times
?? Match 0 or 1 time
- {n}? Match exactly n times
{n,}? Match at least n times
{n,m}? Match at least n but not more than m times
Because patterns are processed as double quoted strings, the following
also work:
--- 121,165 ----
{n,} Match at least n times
{n,m} Match at least n but not more than m times
! (If a brace occurs in any other context, it is treated as a regular
! character.) The C<*> modifier is equivalent to C<{0,}>, the C<+>
! modifier to C<{1,}>, and the C<?> modifier to C<{0,1}>. I<n> and
! I<m> are limited to integral values less than a preset limit defined
! when perl is built. This is usually 32766 on the most common
! platforms. The actual limit can be seen in the error message
! generated by code such as this:
$_ **= $_ , / {$_} / for 2 .. 42;
! By default, quantifiers are "greedy", that is, they will match as
! many times as possible (given a particular starting location) while
! still allowing the rest of the pattern to match. If you want
! to match the minimum number of times possible, follow the quantifier
! with a C<?>. Note that the meanings don't change, just the
! "greediness":
*? Match 0 or more times
+? Match 1 or more times
?? Match 0 or 1 time
{n,}? Match at least n times
{n,m}? Match at least n but not more than m times
+ Perl matches patterns "eagerly", that is, as soon as possible. Even
+ with minimal matches, Perl still finds the leftmost possible match.
+ The question mark only changes the sense from leftmost-longest to
+ leftmost-shortest, and only for that quantifier. Unlike regex
+ languages with overall greed, in Perl, the leftmost aspect is still
+ more important than longest/shortest. Only when two matches start
+ at the same point are their lengths considered. Otherwise, the
+ lefter one always wins. That's why both C</a*/> and C</a*?/> match
+ all possible strings irrespective of content, and at the earliest
+ possible point--right at the beginning of the string.
+
+ A question mark was chosen for this and for the extended-pattern
+ syntax because question marks are rare in older regular expressions,
+ and because whenever you see one, you should stop and "question"
+ exactly what is going on. That's psychology...
+
Because patterns are processed as double quoted strings, the following
also work:
***************
*** 145,157 ****
\f form feed (FF)
\a alarm (bell) (BEL)
\e escape (think troff) (ESC)
! \033 octal char (think of a PDP-11)
\x1B hex char
\x{263a} wide hex char (Unicode SMILEY)
\c[ control char
! \N{name} named char
\l lowercase next char (think vi)
! \u uppercase next char (think vi)
\L lowercase till \E (think vi)
\U uppercase till \E (think vi)
\E end case modification (think vi)
--- 169,182 ----
\f form feed (FF)
\a alarm (bell) (BEL)
\e escape (think troff) (ESC)
! \033 octal char (think of a PDP-11); 0 is optional except on
! single digits
\x1B hex char
\x{263a} wide hex char (Unicode SMILEY)
\c[ control char
! \N{name} named char (requires use charnames)
\l lowercase next char (think vi)
! \u titlecase next char (think vi)
\L lowercase till \E (think vi)
\U uppercase till \E (think vi)
\E end case modification (think vi)
***************
*** 181,193 ****
\C Match a single C char (octet) even under utf8.
A C<\w> matches a single alphanumeric character, not a whole word.
! Use C<\w+> to match a string of Perl-identifier characters (which isn't
! the same as matching an English word). If C<use locale> is in effect, the
! list of alphabetic characters generated by C<\w> is taken from the
! current locale. See L<perllocale>. You may use C<\w>, C<\W>, C<\s>, C<\S>,
! C<\d>, and C<\D> within character classes, but if you try to use them
! as endpoints of a range, that's not a range, the "-" is understood literally.
! See L<utf8> for details about C<\pP>, C<\PP>, and C<\X>.
The POSIX character class syntax
--- 206,219 ----
\C Match a single C char (octet) even under utf8.
A C<\w> matches a single alphanumeric character, not a whole word.
! Use C<\w+> to match a string of Perl-identifier characters (which
! isn't the same as matching an English word). If C<use locale> is
! in effect, the list of alphabetic characters generated by C<\w> is
! taken from the current locale. See L<perllocale>. You may use
! C<\w>, C<\W>, C<\s>, C<\S>, C<\d>, and C<\D> within character
! classes, but if you try to use them as endpoints of a range, that's
! not a range, the C<-> is understood literally. See L<utf8> for
! details about C<\pP>, C<\PP>, and C<\X>.
The POSIX character class syntax
***************
*** 211,225 ****
xdigit
For example use C<[:upper:]> to match all the uppercase characters.
! Note that the C<[]> are part of the C<[::]> construct, not part of the whole
! character class. For example:
[01[:alpha:]%]
matches one, zero, any alphabetic character, and the percentage sign.
If the C<utf8> pragma is used, the following equivalences to Unicode
! \p{} constructs hold:
alpha IsAlpha
alnum IsAlnum
--- 237,251 ----
xdigit
For example use C<[:upper:]> to match all the uppercase characters.
! Note that the C<[]> are part of the C<[::]> construct, not part of
! the whole character class. For example:
[01[:alpha:]%]
matches one, zero, any alphabetic character, and the percentage sign.
If the C<utf8> pragma is used, the following equivalences to Unicode
! C<\p{}> constructs hold:
alpha IsAlpha
alnum IsAlnum
***************
*** 238,245 ****
For example C<[:lower:]> and C<\p{IsLower}> are equivalent.
If the C<utf8> pragma is not used but the C<locale> pragma is, the
! classes correlate with the isalpha(3) interface (except for `word',
! which is a Perl extension, mirroring C<\w>).
The assumedly non-obviously named classes are:
--- 264,271 ----
For example C<[:lower:]> and C<\p{IsLower}> are equivalent.
If the C<utf8> pragma is not used but the C<locale> pragma is, the
! classes correlate with the standard isalpha(3) interface (except
! for C<word>, which is a Perl extension, mirroring C<\w>).
The assumedly non-obviously named classes are:
***************
*** 250,256 ****
Any control character. Usually characters that don't produce output as
such but instead control the terminal somehow: for example newline and
backspace are control characters. All characters with ord() less than
! 32 are most often classified as control characters.
=item graph
--- 276,282 ----
Any control character. Usually characters that don't produce output as
such but instead control the terminal somehow: for example newline and
backspace are control characters. All characters with ord() less than
! decimal 32 are most often classified as control characters.
=item graph
***************
*** 266,278 ****
=item xdigit
! Any hexadecimal digit. Though this may feel silly (/0-9a-f/i would
work just fine) it is included for completeness.
=back
! You can negate the [::] character classes by prefixing the class name
! with a '^'. This is a Perl extension. For example:
POSIX trad. Perl utf8 Perl
--- 292,304 ----
=item xdigit
! Any hexadecimal digit. Though this may feel silly (C</0-9a-f/i> would
work just fine) it is included for completeness.
=back
! You can negate the C<[::]> character classes by prefixing the class name
! with a C<^>. This is a Perl extension. For example:
POSIX trad. Perl utf8 Perl
***************
*** 280,334 ****
[:^space:] \S \P{IsSpace}
[:^word:] \W \P{IsWord}
! The POSIX character classes [.cc.] and [=cc=] are recognized but
! B<not> supported and trying to use them will cause an error.
Perl defines the following zero-width assertions:
\b Match a word boundary
\B Match a non-(word boundary)
\A Match only at beginning of string
! \Z Match only at end of string, or before newline at the end
! \z Match only at end of string
\G Match only at pos() (e.g. at the end-of-match position
of prior m//g)
! A word boundary (C<\b>) is a spot between two characters
! that has a C<\w> on one side of it and a C<\W> on the other side
! of it (in either order), counting the imaginary characters off the
! beginning and end of the string as matching a C<\W>. (Within
! character classes C<\b> represents backspace rather than a word
! boundary, just as it normally does in any double-quoted string.)
! The C<\A> and C<\Z> are just like "^" and "$", except that they
! won't match multiple times when the C</m> modifier is used, while
! "^" and "$" will match at every internal line boundary. To match
! the actual end of the string and not ignore an optional trailing
! newline, use C<\z>.
The C<\G> assertion can be used to chain global matches (using
C<m//g>), as described in L<perlop/"Regexp Quote-Like Operators">.
! It is also useful when writing C<lex>-like scanners, when you have
several patterns that you want to match against consequent substrings
of your string, see the previous reference. The actual location
! where C<\G> will match can also be influenced by using C<pos()> as
an lvalue. See L<perlfunc/pos>.
The bracketing construct C<( ... )> creates capture buffers. To
! refer to the digit'th buffer use \<digit> within the
! match. Outside the match use "$" instead of "\". (The
! \<digit> notation works in certain circumstances outside
! the match. See the warning below about \1 vs $1 for details.)
! Referring back to another part of the match is called a
! I<backreference>.
There is no limit to the number of captured substrings that you may
! use. However Perl also uses \10, \11, etc. as aliases for \010,
! \011, etc. (Recall that 0 means octal, so \011 is the 9'th ASCII
! character, a tab.) Perl resolves this ambiguity by interpreting
! \10 as a backreference only if at least 10 left parentheses have
! opened before it. Likewise \11 is a backreference only if at least
! 11 left parentheses have opened before it. And so on. \1 through
! \9 are always interpreted as backreferences."
Examples:
--- 306,359 ----
[:^space:] \S \P{IsSpace}
[:^word:] \W \P{IsWord}
! The POSIX character classes C<[.cc.]> and C<[=cc=]> are recognized but
! I<not> supported and trying to use them will cause an error.
Perl defines the following zero-width assertions:
\b Match a word boundary
\B Match a non-(word boundary)
\A Match only at beginning of string
! \Z Match before optional newline at end of string
! \z Match at end of string (not in front of the newline)
\G Match only at pos() (e.g. at the end-of-match position
of prior m//g)
! A word boundary (C<\b>) is a spot between two characters that has
! a C<\w> on one side of it and a C<\W> on the other side of it (in
! either order), counting the imaginary characters off the beginning
! and end of the string as matching a C<\W>. (Within character classes
! C<\b> represents backspace rather than a word boundary, just as it
! normally does in any double-quoted string.) The C<\A> and C<\Z> are
! just like C<^> and C<$>, except that they won't match internally
! when the C</m> modifier is used, whereas C<^> and C<$> can match
! next to any internal newline. To match the actual end of the string
! and not ignore an optional trailing newline, use C<\z>.
The C<\G> assertion can be used to chain global matches (using
C<m//g>), as described in L<perlop/"Regexp Quote-Like Operators">.
! It is also useful when writing B<lex>-like scanners, when you have
several patterns that you want to match against consequent substrings
of your string, see the previous reference. The actual location
! where C<\G> will match can also be influenced by using pos() as
an lvalue. See L<perlfunc/pos>.
The bracketing construct C<( ... )> creates capture buffers. To
! refer to the digit'th buffer use \<digit> within the match. Outside
! the match use C<$> to access the numbered variables, instead of
! C<\> to access backreferences. (The \<digit> notation works in
! certain circumstances outside the match. See the warning below
! about \1 vs $1 for details.) Referring back to another part of the
! match is called a I<backreference>.
There is no limit to the number of captured substrings that you may
! use. However Perl also uses C<\10>, C<\11>, etc. as aliases for
! C<\010>, C<\011>, etc. (Recall that 0 means octal, so C<\011> is
! the 9'th ASCII character, a tab.) Perl resolves this ambiguity by
! interpreting C<\10> as a backreference only if at least 10 left
! parentheses have opened before it. Likewise C<\11> is a backreference
! only if at least 11 left parentheses have opened before it. And
! so on. C<\1> through C<\9> are always interpreted as backreferences.
Examples:
***************
*** 345,383 ****
}
Several special variables also refer back to portions of the previous
! match. C<$+> returns whatever the last bracket match matched.
! C<$&> returns the entire matched string. (At one point C<$0> did
! also, but now it returns the name of the program.) C<$`> returns
! everything before the matched string. And C<$'> returns everything
! after the matched string.
The numbered variables ($1, $2, $3, etc.) and the related punctuation
! set (C<<$+>, C<$&>, C<$`>, and C<$'>) are all dynamically scoped
! until the end of the enclosing block or until the next successful
! match, whichever comes first. (See L<perlsyn/"Compound Statements">.)
!
! B<WARNING>: Once Perl sees that you need one of C<$&>, C<$`>, or
! C<$'> anywhere in the program, it has to provide them for every
! pattern match. This may substantially slow your program. Perl
! uses the same mechanism to produce $1, $2, etc, so you also pay a
! price for each pattern that contains capturing parentheses. (To
! avoid this cost while retaining the grouping behaviour, use the
! extended regular expression C<(?: ... )> instead.) But if you never
! use C<$&>, C<$`> or C<$'>, then patterns I<without> capturing
! parentheses will not be penalized. So avoid C<$&>, C<$'>, and C<$`>
! if you can, but if you can't (and some algorithms really appreciate
! them), once you've used them once, use them at will, because you've
! already paid the price. As of 5.005, C<$&> is not so costly as the
! other two.
! Backslashed metacharacters in Perl are alphanumeric, such as C<\b>,
! C<\w>, C<\n>. Unlike some other regular expression languages, there
are no backslashed symbols that aren't alphanumeric. So anything
! that looks like \\, \(, \), \<, \>, \{, or \} is always
! interpreted as a literal character, not a metacharacter. This was
! once used in a common idiom to disable or quote the special meanings
! of regular expression metacharacters in a string that you want to
! use for a pattern. Simply quote all non-alphanumeric characters:
$pattern =~ s/(\W)/\\$1/g;
--- 370,411 ----
}
Several special variables also refer back to portions of the previous
! match. C<$+> returns whatever the last bracket match matched. The
! C<$`>-C<$&>-C<$'> trio are mnemonically named to correspond to the
! pieces in a `match'. C<$`> returns everything before the matched
! string. C<$&> returns the entire matched string. And C<$'> returns
! everything after the matched string.
The numbered variables ($1, $2, $3, etc.) and the related punctuation
! set (C<$+>, C<$`>, C<$&>, and C<$'>) are all automatically localized
! to the enclosing dynamic scope. Their values are therefore ephemeral
! and best copied into more enduring variables. (See L<perlsyn/"Compound
! Statements">.)
!
! Once Perl sees that you need one of C<$&>, C<$`>, or C<$'> anywhere
! in the program, it has to provide them for every pattern match.
! This will slow down pattern matches a bit, and if most of your
! program is spent matching patterns, you may notice this. Perl uses
! the same mechanism to produce $1, $2, etc, so you also pay a price
! for each pattern that contains capturing parentheses. (To avoid
! this cost while retaining the grouping behaviour, use the extended
! regular expression C<(?:...)> instead.) But if you never use
! C<$`>, C<$&>, or C<$'>, then patterns I<without> capturing parentheses
! will not be penalized. So avoid C<$'>, C<$&>, and C<$`> if you
! can, but if you can't (and some algorithms really appreciate them),
! once you've used them once, use them at will, because you've already
! paid the price. As of 5.005, C<$&> is not so costly as the other
! two.
! Backslashed alphanumerics in Perl are often special, such as C<\b>,
! C<\w>, C<\n>. Unlike some other regex languages, there
are no backslashed symbols that aren't alphanumeric. So anything
! that looks like C<\\>, C<\(>, C<\)>, C<\<>, C<< \> >>, C<\{>, or
! C<\}> is always interpreted as a literal character, not a metacharacter.
! This was once used in a common idiom to disable or quote the special
! meanings of regex metacharacters in a string that you
! want to use for a pattern. Simply quote all non-alphanumeric
! characters:
$pattern =~ s/(\W)/\\$1/g;
***************
*** 398,425 ****
Perl also defines a consistent extension syntax for features not
found in standard tools like B<awk> and B<lex>. The syntax is a
pair of parentheses with a question mark as the first thing within
! the parentheses. The character after the question mark indicates
! the extension.
! The stability of these extensions varies widely. Some have been
! part of the core language for many years. Others are experimental
! and may change without warning or be completely removed. Check
! the documentation on an individual feature to verify its current
! status.
!
! A question mark was chosen for this and for the minimal-matching
! construct because 1) question marks are rare in older regular
! expressions, and 2) whenever you see one, you should stop and
! "question" exactly what is going on. That's psychology...
=over 10
=item C<(?#text)>
A comment. The text is ignored. If the C</x> modifier enables
! whitespace formatting, a simple C<#> will suffice. Note that Perl closes
! the comment as soon as it sees a C<)>, so there is no way to put a literal
! C<)> in the comment.
=item C<(?imsx-imsx)>
--- 426,447 ----
Perl also defines a consistent extension syntax for features not
found in standard tools like B<awk> and B<lex>. The syntax is a
pair of parentheses with a question mark as the first thing within
! the parentheses, such as C<(?I<X>...)>. The value of I<X> after the
! question mark determines which extension is selected.
! Stability of these extensions varies widely. Some have been part
! of the core language for many years. Others are experimental and
! may change without warning or be completely removed. Check the
! documentation on an individual feature to verify its current status.
=over 10
=item C<(?#text)>
A comment. The text is ignored. If the C</x> modifier enables
! whitespace formatting, a simple C<#> will suffice. Note that Perl
! closes the comment as soon as it sees a C<)>, so there is no way
! to put a literal C<)> in the comment.
=item C<(?imsx-imsx)>
***************
*** 431,442 ****
C<(?i)> at the front of the pattern. For example:
$pattern = "foobar";
! if ( /$pattern/i ) { }
# more flexible:
$pattern = "(?i)foobar";
! if ( /$pattern/ ) { }
Letters after a C<-> turn those modifiers off. These modifiers are
localized inside an enclosing group (if any). For example,
--- 453,464 ----
C<(?i)> at the front of the pattern. For example:
$pattern = "foobar";
! if ( /$pattern/i ) { }
# more flexible:
$pattern = "(?i)foobar";
! if ( /$pattern/ ) { }
Letters after a C<-> turn those modifiers off. These modifiers are
localized inside an enclosing group (if any). For example,
***************
*** 452,458 ****
=item C<(?imsx-imsx:pattern)>
This is for clustering, not capturing; it groups subexpressions like
! "()", but doesn't make backreferences as "()" does. So
@fields = split(/\b(?:a|b|c)\b/)
--- 474,480 ----
=item C<(?imsx-imsx:pattern)>
This is for clustering, not capturing; it groups subexpressions like
! C<()>, but doesn't make backreferences as C<()> does. So
@fields = split(/\b(?:a|b|c)\b/)
***************
*** 464,470 ****
characters if you don't need to.
Any letters between C<?> and C<:> act as flags modifiers as with
! C<(?imsx-imsx)>. For example,
/(?s-i:more.*than).*million/i
--- 486,492 ----
characters if you don't need to.
Any letters between C<?> and C<:> act as flags modifiers as with
! C<(?imsx-imsx)>. For example,
/(?s-i:more.*than).*million/i
***************
*** 481,494 ****
A zero-width negative look-ahead assertion. For example C</foo(?!bar)/>
matches any occurrence of "foo" that isn't followed by "bar". Note
! however that look-ahead and look-behind are NOT the same thing. You cannot
! use this for look-behind.
! If you are looking for a "bar" that isn't preceded by a "foo", C</(?!foo)bar/>
! will not do what you want. That's because the C<(?!foo)> is just saying that
! the next thing cannot be "foo"--and it's not, it's a "bar", so "foobar" will
! match. You would have to do something like C</(?!foo)...bar/> for that. We
! say "like" because there's the case of your "bar" not having three characters
before it. You could cover that this way: C</(?:(?!foo)...|^.{0,2})bar/>.
Sometimes it's still easier just to say:
--- 503,517 ----
A zero-width negative look-ahead assertion. For example C</foo(?!bar)/>
matches any occurrence of "foo" that isn't followed by "bar". Note
! however that look-ahead and look-behind are I<not> the same thing.
! You cannot use this for look-behind.
! If you are looking for a "bar" that isn't preceded by a "foo",
! C</(?!foo)bar/> will not do what you want. That's because the
! C<(?!foo)> is just saying that the next thing cannot be "foo"--and
! it's not, it's a "bar", so "foobar" will match. You would have to
! do something like C</(?!foo)...bar/> for that. We say "like"
! because there's the case of your "bar" not having three characters
before it. You could cover that this way: C</(?:(?!foo)...|^.{0,2})bar/>.
Sometimes it's still easier just to say:
***************
*** 513,559 ****
B<WARNING>: This extended regular expression feature is considered
highly experimental, and may be changed or deleted without notice.
! This zero-width assertion evaluate any embedded Perl code. It
! always succeeds, and its C<code> is not interpolated. Currently,
! the rules to determine where the C<code> ends are somewhat convoluted.
The C<code> is properly scoped in the following sense: If the assertion
is backtracked (compare L<"Backtracking">), all changes introduced after
C<local>ization are undone, so that
$_ = 'a' x 8;
! m<
(?{ $cnt = 0 }) # Initialize $cnt.
(
! a
(?{
local $cnt = $cnt + 1; # Update $cnt, backtracking-safe.
})
! )*
aaaa
(?{ $res = $cnt }) # On success copy to non-localized
# location.
>x;
! will set C<$res = 4>. Note that after the match, $cnt returns to the globally
! introduced value, because the scopes that restrict C<local> operators
! are unwound.
! This assertion may be used as a C<(?(condition)yes-pattern|no-pattern)>
switch. If I<not> used in this way, the result of evaluation of
C<code> is put into the special variable C<$^R>. This happens
immediately, so C<$^R> can be used from other C<(?{ code })> assertions
! inside the same regular expression.
The assignment to C<$^R> above is properly localized, so the old
value of C<$^R> is restored if the assertion is backtracked; compare
L<"Backtracking">.
! For reasons of security, this construct is forbidden if the regular
! expression involves run-time interpolation of variables, unless the
! perilous C<use re 'eval'> pragma has been used (see L<re>), or the
! variables contain results of C<qr//> operator (see
! L<perlop/"qr/STRING/imosx">).
This restriction is because of the wide-spread and remarkably convenient
custom of using run-time determined strings as patterns. For example:
--- 536,588 ----
B<WARNING>: This extended regular expression feature is considered
highly experimental, and may be changed or deleted without notice.
! This zero-width element evaluates any embedded Perl code. It is
! not an assertion, because it does not assert anything: the success
! of the match is unrelated to the code's return value. Currently,
! the rules to determine where the C<code> ends are somewhat
! convoluted.
The C<code> is properly scoped in the following sense: If the assertion
is backtracked (compare L<"Backtracking">), all changes introduced after
C<local>ization are undone, so that
$_ = 'a' x 8;
! m<
(?{ $cnt = 0 }) # Initialize $cnt.
(
! a
(?{
local $cnt = $cnt + 1; # Update $cnt, backtracking-safe.
})
! )*
aaaa
(?{ $res = $cnt }) # On success copy to non-localized
# location.
>x;
! will set C<$res = 4>. Note that after the match, $cnt returns to
! the globally introduced value, because the scopes that restrict
! C<local> operators are unwound.
! This construct may be used as a C<(?(condition)yes-pattern|no-pattern)>
switch. If I<not> used in this way, the result of evaluation of
C<code> is put into the special variable C<$^R>. This happens
immediately, so C<$^R> can be used from other C<(?{ code })> assertions
! inside the same pattern.
The assignment to C<$^R> above is properly localized, so the old
value of C<$^R> is restored if the assertion is backtracked; compare
L<"Backtracking">.
! For reasons of security, this construct is normally forbidden if
! the regex involves variable interpolation, unless the perilous C<use
! re 'eval'> pragma has been used (see L<re>), or the variables contain
! the results of the C<qr//> operator (see L<perlop/"qr/STRING/imosx">).
! Currently, no distinction is made between the interpolation of
! actual embedded code and the interpolation of simple variables in
! a pattern that merely happens to contain a code expression. This
! confusion is not to be considered a feature, and may be fixed in a
! future release.
This restriction is because of the wide-spread and remarkably convenient
custom of using run-time determined strings as patterns. For example:
***************
*** 577,589 ****
A simplified version of the syntax may be introduced for commonly
used idioms.
! This is a "postponed" regular subexpression. The C<code> is evaluated
! at run time, at the moment this subexpression may match. The result
! of evaluation is considered as a regular expression and matched as
! if it were inserted instead of this construct.
!
! The C<code> is not interpolated. As before, the rules to determine
! where the C<code> ends are currently somewhat convoluted.
The following pattern matches a parenthesized group:
--- 606,619 ----
A simplified version of the syntax may be introduced for commonly
used idioms.
! Execute I<code> and interpolate its result as more pattern. The
! C<code> is evaluated at run time, at the moment this subexpression
! may match. The result of evaluation is a regex and is matched just
! as though it had been used directly.
!
! As with the C<(?{ code })> construct (whose result is ignored), the
! rules to determine where the C<code> ends are currently somewhat
! convoluted.
The following pattern matches a parenthesized group:
***************
*** 602,614 ****
B<WARNING>: This extended regular expression feature is considered
highly experimental, and may be changed or deleted without notice.
! An "independent" subexpression, one which matches the substring
! that a I<standalone> C<pattern> would match if anchored at the given
! position, and it matches I<nothing other than this substring>. This
! construct is useful for optimizations of what would otherwise be
! "eternal" matches, because it will not backtrack (see L<"Backtracking">).
! It may also be useful in places where the "grab all you can, and do not
! give anything back" semantic is desirable.
For example: C<< ^(?>a*)ab >> will never match, since C<< (?>a*) >>
(anchored at the beginning of string, as above) will match I<all>
--- 632,645 ----
B<WARNING>: This extended regular expression feature is considered
highly experimental, and may be changed or deleted without notice.
! A non-backtracking subexpression, one that matches the substring
! that a "standalone" C<pattern> would match if anchored at the given
! position. It is somewhat reminiscent of a "cut" operator in logic
! programming languages. This is mostly useful as an efficiency hack
! to optimize what would otherwise be "eternal" matches, because it
! will not relinquish any characters eaten during backtracking (see
! L<"Backtracking">). It may also be useful in places where the "grab
! all you can, and do not give anything back" semantic is desirable.
For example: C<< ^(?>a*)ab >> will never match, since C<< (?>a*) >>
(anchored at the beginning of string, as above) will match I<all>
***************
*** 625,641 ****
makes a zero-length assertion into an analogue of C<< (?>...) >>.
(The difference between these two constructs is that the second one
uses a capturing group, thus shifting ordinals of backreferences
! in the rest of a regular expression.)
Consider this pattern:
m{ \(
! (
[^()]+ # x+
! |
\( [^()]* \)
)+
! \)
}x
That will efficiently match a nonempty group with matching parentheses
--- 656,672 ----
makes a zero-length assertion into an analogue of C<< (?>...) >>.
(The difference between these two constructs is that the second one
uses a capturing group, thus shifting ordinals of backreferences
! in the rest of a pattern.)
Consider this pattern:
m{ \(
! (
[^()]+ # x+
! |
\( [^()]* \)
)+
! \)
}x
That will efficiently match a nonempty group with matching parentheses
***************
*** 649,669 ****
exponential performance will make it appear that your program has
hung. However, a tiny change to this pattern
! m{ \(
! (
(?> [^()]+ ) # change x+ above to (?> x+ )
! |
\( [^()]* \)
)+
! \)
}x
! which uses C<< (?>...) >> matches exactly when the one above does (verifying
! this yourself would be a productive exercise), but finishes in a fourth
! the time when used on a similar string with 1000000 C<a>s. Be aware,
! however, that this pattern currently triggers a warning message under
! the C<use warnings> pragma or B<-w> switch saying it
! C<"matches the null string many times">):
On simple groups, such as the pattern C<< (?> [^()]+ ) >>, a comparable
effect may be achieved by negative look-ahead, as in C<[^()]+ (?! [^()] )>.
--- 680,700 ----
exponential performance will make it appear that your program has
hung. However, a tiny change to this pattern
! m{ \(
! (
(?> [^()]+ ) # change x+ above to (?> x+ )
! |
\( [^()]* \)
)+
! \)
}x
! which uses C<< (?>...) >> matches exactly when the one above does
! (verifying this yourself would be a productive exercise), but
! finishes in a fourth of the time when used on a similar string with
! 1000000 C<a>s. Be aware, however, that this pattern currently
! triggers a warning message under the C<use warnings> pragma or B<-w>
! switch saying it C<"matches the null string many times">.
On simple groups, such as the pattern C<< (?> [^()]+ ) >>, a comparable
effect may be achieved by negative look-ahead, as in C<[^()]+ (?! [^()] )>.
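To illustrate the behavior described above, here is a minimal sketch; the test strings and variable names are made up for the example and are not taken from perlre.pod.

```perl
use strict;
use warnings;

# (?>a*) swallows every "a" and never gives one back, so the "a" in the
# trailing "ab" has nothing left to match.
my $atomic_matches = ( 'aaab' =~ /^(?>a*)ab/ ) ? 1 : 0;

# A plain a* can back off by one "a", so the same string matches.
my $plain_matches = ( 'aaab' =~ /^a*ab/ ) ? 1 : 0;

# On a simple group, a negative look-ahead pins the run in the same way:
# the run may only end where no further non-parenthesis character follows.
my ($via_atomic)    = 'foo(bar)' =~ /((?>[^()]+))\(/;
my ($via_lookahead) = 'foo(bar)' =~ /([^()]+(?![^()]))\(/;

print "atomic: $atomic_matches, plain: $plain_matches\n";
print "atomic group: $via_atomic, look-ahead: $via_lookahead\n";
```

Both captures should come out as "foo"; the two forms differ only in how they stop the engine from retrying shorter runs.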
***************
*** 703,711 ****
For example:
! m{ ( \( )?
! [^()]+
! (?(1) \) )
}x
matches a chunk of non-parentheses, possibly included in parentheses
--- 734,742 ----
For example:
! m{ ( \( )?
! [^()]+
! (?(1) \) )
}x
matches a chunk of non-parentheses, possibly included in parentheses
***************
*** 715,732 ****
=head2 Backtracking
! NOTE: This section presents an abstract approximation of regular
! expression behavior. For a more rigorous (and complicated) view of
! the rules involved in selecting a match among possible alternatives,
! see L<Combining pieces together>.
!
! A fundamental feature of regular expression matching involves the
! notion called I<backtracking>, which is currently used (when needed)
! by all regular expression quantifiers, namely C<*>, C<*?>, C<+>,
! C<+?>, C<{n,m}>, and C<{n,m}?>. Backtracking is often optimized
! internally, but the general principle outlined here is valid.
! For a regular expression to match, the I<entire> regular expression must
match, not just part of it. So if the beginning of a pattern containing a
quantifier succeeds in a way that causes later parts in the pattern to
fail, the matching engine backs up and recalculates the beginning
--- 746,763 ----
=head2 Backtracking
! NOTE: This section presents an abstract approximation of how
! the regex engine behaves. For a somewhat more rigorous (and harder
! to understand) view of the rules involved in selecting a match among
! possible alternatives, see L<Combining pieces together>.
!
! A fundamental feature of pattern matching involves the notion called
! I<backtracking>, which is currently used (when needed) by all regex
! quantifiers, namely C<*>, C<*?>, C<+>, C<+?>, C<{n,m}>, and C<{n,m}?>.
! Backtracking is often optimized internally, but the general principle
! outlined here is valid.
! For a pattern to match, the I<entire> pattern must
match, not just part of it. So if the beginning of a pattern containing a
quantifier succeeds in a way that causes later parts in the pattern to
fail, the matching engine backs up and recalculates the beginning
***************
*** 740,752 ****
print "$2 follows $1.\n";
}
! When the match runs, the first part of the regular expression (C<\b(foo)>)
finds a possible match right at the beginning of the string, and loads up
$1 with "Foo". However, as soon as the matching engine sees that there's
no whitespace following the "Foo" that it had saved in $1, it realizes its
mistake and starts over again one character after where it had the
tentative match. This time it goes all the way until the next occurrence
! of "foo". The complete regular expression matches this time, and you get
the expected output of "table follows foo."
Sometimes minimal matching can help a lot. Imagine you'd like to match
--- 771,783 ----
print "$2 follows $1.\n";
}
! When the match runs, the first part of the pattern (C<\b(foo)>)
finds a possible match right at the beginning of the string, and loads up
$1 with "Foo". However, as soon as the matching engine sees that there's
no whitespace following the "Foo" that it had saved in $1, it realizes its
mistake and starts over again one character after where it had the
tentative match. This time it goes all the way until the next occurrence
! of "foo". The complete pattern matches this time, and you get
the expected output of "table follows foo."
Sometimes minimal matching can help a lot. Imagine you'd like to match
***************
*** 781,787 ****
That won't work at all, because C<.*> was greedy and gobbled up the
whole string. As C<\d*> can match on an empty string the complete
! regular expression matched successfully.
Beginning is <I have 2 numbers: 53147>, number is <>.
--- 812,818 ----
That won't work at all, because C<.*> was greedy and gobbled up the
whole string. As C<\d*> can match on an empty string the complete
! pattern matched successfully.
Beginning is <I have 2 numbers: 53147>, number is <>.
***************
*** 865,878 ****
The search engine will initially match C<\D*> with "ABC". Then it will
try to match C<(?!123> with "123", which fails. But because
! a quantifier (C<\D*>) has been used in the regular expression, the
search engine can backtrack and retry the match differently
! in the hope of matching the complete regular expression.
The pattern really, I<really> wants to succeed, so it uses the
! standard pattern back-off-and-retry and lets C<\D*> expand to just "AB" this
! time. Now there's indeed something following "AB" that is not
! "123". It's "C123", which suffices.
We can deal with this by using both an assertion and a negation.
We'll say that the first part in $1 must be followed both by a digit
--- 896,909 ----
The search engine will initially match C<\D*> with "ABC". Then it will
try to match C<(?!123> with "123", which fails. But because
! a quantifier (C<\D*>) has been used in the pattern, the
search engine can backtrack and retry the match differently
! in the hope of matching the complete pattern.
The pattern really, I<really> wants to succeed, so it uses the
! standard pattern back-off-and-retry and lets C<\D*> expand to just
! "AB" this time. Now there's indeed something following "AB" that
! is not "123". It's "C123", which suffices.
We can deal with this by using both an assertion and a negation.
We'll say that the first part in $1 must be followed both by a digit
***************
*** 886,904 ****
6: got ABC
! In other words, the two zero-width assertions next to each other work as though
! they're ANDed together, just as you'd use any built-in assertions: C</^$/>
! matches only if you're at the beginning of the line AND the end of the
! line simultaneously. The deeper underlying truth is that juxtaposition in
! regular expressions always means AND, except when you write an explicit OR
! using the vertical bar. C</ab/> means match "a" AND (then) match "b",
! although the attempted matches are made at different positions because "a"
! is not a zero-width assertion, but a one-width assertion.
B<WARNING>: particularly complicated regular expressions can take
exponential time to solve because of the immense number of possible
ways they can use backtracking to try match. For example, without
! internal optimizations done by the regular expression engine, this will
take a painfully long time to run:
'aaaaaaaaaaaa' =~ /((a{0,5}){0,5}){0,5}[c]/
--- 917,936 ----
6: got ABC
! In other words, the two zero-width assertions next to each other
! work as though they're ANDed together, just as you'd use any built-in
! assertions: C</^$/> matches only if you're at the beginning of the
! line AND the end of the line simultaneously. The deeper underlying
! truth is that juxtaposition in regexes always means AND, except
! when you write an explicit OR using the vertical bar. C</ab/> means
! match "a" AND (then) match "b", although the attempted matches are
! made at different positions because "a" is not a zero-width assertion,
! but a one-width assertion.
B<WARNING>: particularly complicated regular expressions can take
exponential time to solve because of the immense number of possible
ways they can use backtracking to try match. For example, without
! internal optimizations done by the regex engine, this will
take a painfully long time to run:
'aaaaaaaaaaaa' =~ /((a{0,5}){0,5}){0,5}[c]/
***************
*** 906,948 ****
And if you used C<*>'s instead of limiting it to 0 through 5 matches,
then it would take forever--or until you ran out of stack space.
! A powerful tool for optimizing such beasts is what is known as an
! "independent group",
! which does not backtrack (see L<C<< (?>pattern) >>>). Note also that
! zero-length look-ahead/look-behind assertions will not backtrack to make
! the tail match, since they are in "logical" context: only
! whether they match is considered relevant. For an example
! where side-effects of look-ahead I<might> have influenced the
! following match, see L<C<< (?>pattern) >>>.
=head2 Version 8 Regular Expressions
! In case you're not familiar with the "regular" Version 8 regex
routines, here are the pattern-matching rules not described above.
Any single character matches itself, unless it is a I<metacharacter>
with a special meaning described here or above. You can cause
characters that normally function as metacharacters to be interpreted
! literally by prefixing them with a "\" (e.g., "\." matches a ".", not any
! character; "\\" matches a "\"). A series of characters matches that
series of characters in the target string, so the pattern C<blurfl>
would match "blurfl" in the target string.
You can specify a character class, by enclosing a list of characters
in C<[]>, which will match any one character from the list. If the
! first character after the "[" is "^", the class matches any character not
! in the list. Within a list, the "-" character specifies a
! range, so that C<a-z> represents all characters between "a" and "z",
! inclusive. If you want either "-" or "]" itself to be a member of a
! class, put it at the start of the list (possibly after a "^"), or
! escape it with a backslash. "-" is also taken literally when it is
! at the end of the list, just before the closing "]". (The
! following all specify the same class of three characters: C<[-az]>,
! C<[az-]>, and C<[a\-z]>. All are different from C<[a-z]>, which
! specifies a class containing twenty-six characters.)
! Also, if you try to use the character classes C<\w>, C<\W>, C<\s>,
! C<\S>, C<\d>, or C<\D> as endpoints of a range, that's not a range,
! the "-" is understood literally.
Note also that the whole range idea is rather unportable between
character sets--and even within character sets they may cause results
--- 938,979 ----
And if you used C<*>'s instead of limiting it to 0 through 5 matches,
then it would take forever--or until you ran out of stack space.
! A powerful tool for optimizing such beasts is the non-backtracking
! subexpression (see L<C<< (?>pattern) >>>). Note also that
! zero-length look-ahead/look-behind assertions will not backtrack
! to make the tail match, since they are in "logical" context: only
! whether they match is considered relevant. For an example where
! side-effects of look-ahead I<might> have influenced the following
! match, see L<C<< (?>pattern) >>>.
=head2 Version 8 Regular Expressions
! In case you're not familiar with the standard Version 8 regex
routines, here are the pattern-matching rules not described above.
Any single character matches itself, unless it is a I<metacharacter>
with a special meaning described here or above. You can cause
characters that normally function as metacharacters to be interpreted
! literally by prefixing them with a C<\> (e.g., C<\.> matches a ".", not any
! character; C<\\> matches a "\"). A series of characters matches that
series of characters in the target string, so the pattern C<blurfl>
would match "blurfl" in the target string.
You can specify a character class, by enclosing a list of characters
in C<[]>, which will match any one character from the list. If the
! first character after the C<[> is C<^>, the class matches any
! character not in the list. Within a list, the C<-> character
! specifies a range, so that C<a-z> represents all characters between
! "a" and "z", inclusive. If you want either C<-> or C<]> itself to
! be a member of a class, put it at the start of the list (possibly
! after a C<^>), or escape it with a backslash. C<-> is also taken
! literally when it is at the end of the list, just before the closing
! C<]>. (The following all specify the same class of three characters:
! C<[-az]>, C<[az-]>, and C<[a\-z]>. All are different from C<[a-z]>,
! which specifies a class containing twenty-six characters.) Also,
! if you try to use the character classes C<\w>, C<\W>, C<\s>, C<\S>,
! C<\d>, or C<\D> as endpoints of a range, that's not a range, the
! C<-> is understood literally.
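As a quick check of the hyphen rules above, the following sketch tries each spelling against a few sample characters. The candidate list and variable names are assumptions chosen for illustration.

```perl
use strict;
use warnings;

my @candidates = ( 'a', '-', 'z', 'm' );

# Three spellings of the same three-character class: "a", "z", and "-".
my @spellings = ( qr/[-az]/, qr/[az-]/, qr/[a\-z]/ );
my @matched = map {
    my $re = $_;
    join '', grep { $_ =~ $re } @candidates;
} @spellings;

# A real range: "-" between two endpoints covers "a" through "z",
# so "m" matches but the hyphen itself does not.
my $range = join '', grep { /[a-z]/ } @candidates;

print "literal hyphen: @matched\n";
print "range: $range\n";
```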
Note also that the whole range idea is rather unportable between
character sets--and even within character sets they may cause results
***************
*** 955,985 ****
used in C: "\n" matches a newline, "\t" a tab, "\r" a carriage return,
"\f" a form feed, etc. More generally, \I<nnn>, where I<nnn> is a string
of octal digits, matches the character whose ASCII value is I<nnn>.
! Similarly, \xI<nn>, where I<nn> are hexadecimal digits, matches the
character whose ASCII value is I<nn>. The expression \cI<x> matches the
! ASCII character control-I<x>. Finally, the "." metacharacter matches any
character except "\n" (unless you use C</s>).
! You can specify a series of alternatives for a pattern using "|" to
! separate them, so that C<fee|fie|foe> will match any of "fee", "fie",
! or "foe" in the target string (as would C<f(e|i|o)e>). The
first alternative includes everything from the last pattern delimiter
! ("(", "[", or the beginning of the pattern) up to the first "|", and
! the last alternative contains everything from the last "|" to the next
! pattern delimiter. That's why it's common practice to include
! alternatives in parentheses: to minimize confusion about where they
! start and end.
!
! Alternatives are tried from left to right, so the first
! alternative found for which the entire expression matches, is the one that
! is chosen. This means that alternatives are not necessarily greedy. For
! example: when matching C<foo|foot> against "barefoot", only the "foo"
! part will match, as that is the first alternative tried, and it successfully
! matches the target string. (This might not seem important, but it is
! important when you are capturing matched text using parentheses.)
!
! Also remember that "|" is interpreted as a literal within square brackets,
! so if you write C<[fee|fie|foe]> you're really only matching C<[feio|]>.
Within a pattern, you may designate subpatterns for later reference
by enclosing them in parentheses, and you may refer back to the
--- 986,1018 ----
used in C: "\n" matches a newline, "\t" a tab, "\r" a carriage return,
"\f" a form feed, etc. More generally, \I<nnn>, where I<nnn> is a string
of octal digits, matches the character whose ASCII value is I<nnn>.
! Similarly, C<\xI<nn>>, where I<nn> are hexadecimal digits, matches the
character whose ASCII value is I<nn>. The expression \cI<x> matches the
! ASCII character control-I<x>. Finally, the C<.> metacharacter matches any
character except "\n" (unless you use C</s>).
! You can specify a series of alternatives for a pattern using C<|>
! to separate them, so that C<fee|fie|foe> will match any of "fee",
! "fie", or "foe" in the target string (as would C<f(e|i|o)e>). The
first alternative includes everything from the last pattern delimiter
! (C<(>, C<[>, or the beginning of the pattern) up to the first C<|>,
! and the last alternative contains everything from the last C<|> to
! the next pattern delimiter. That's why it's common practice to
! include alternatives in parentheses: to minimize confusion about
! where they start and end.
!
! Alternatives are tried from left to right, so the first alternative
! found for which the entire expression matches, is the one that is
! chosen. This means that alternatives are not necessarily greedy.
! For example: when matching C<foo|foot> against "barefoot", only the
! "foo" part will match, as that is the first alternative tried, and
! it successfully matches the target string. (This might not seem
! important, but it is important when you are capturing matched text
! using parentheses.)
!
! Also remember that C<|> is interpreted as a literal within square
! brackets, so if you write C<[fee|fie|foe]> you're really only
! matching C<[feio|]>.
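The two points above (left-to-right choice, and C<|> being literal inside a class) can be seen in a small sketch. The variable names and the extra "x" test character are illustrative assumptions.

```perl
use strict;
use warnings;

# The first alternative that lets the whole pattern succeed is chosen,
# even though a longer one would also have matched.
my ($leftmost)  = 'barefoot' =~ /(foo|foot)/;

# Listing the longer alternative first changes what is captured.
my ($reordered) = 'barefoot' =~ /(foot|foo)/;

# Inside square brackets "|" is just another member of the class.
my $class_hits = join '',
    grep { /[fee|fie|foe]/ } ( 'f', 'e', 'i', 'o', '|', 'x' );

print "leftmost: $leftmost, reordered: $reordered\n";
print "class matches: $class_hits\n";
```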
Within a pattern, you may designate subpatterns for later reference
by enclosing them in parentheses, and you may refer back to the
***************
*** 998,1032 ****
$pattern =~ s/(\W)/\\\1/g;
! This is grandfathered for the RHS of a substitute to avoid shocking the
! B<sed> addicts, but it's a dirty habit to get into. That's because in
! PerlThink, the righthand side of a C<s///> is a double-quoted string. C<\1> in
! the usual double-quoted string means a control-A. The customary Unix
! meaning of C<\1> is kludged in for C<s///>. However, if you get into the habit
! of doing that, you get yourself into trouble if you then add an C</e>
! modifier.
! s/(\d+)/ \1 + 1 /eg; # causes warning under -w
Or if you try to do
s/(\d+)/\1000/;
! You can't disambiguate that by saying C<\{1}000>, whereas you can fix it with
! C<${1}000>. The operation of interpolation should not be confused
! with the operation of matching a backreference. Certainly they mean two
! different things on the I<left> side of the C<s///>.
=head2 Repeated patterns matching zero-length substring
! B<WARNING>: Difficult material (and prose) ahead. This section needs a rewrite.
Regular expressions provide a terse and powerful programming language. As
with most other power tools, power comes together with the ability
to wreak havoc.
A common abuse of this power stems from the ability to make infinite
! loops using regular expressions, with something as innocuous as:
'foo' =~ m{ ( o? )* }x;
--- 1031,1066 ----
$pattern =~ s/(\W)/\\\1/g;
! This is grandfathered for the RHS of a substitute to avoid shocking
! the B<sed> addicts, but it's a dirty habit to get into. That's
! because in PerlThink, the righthand side of a C<s///> is a double-quoted
! string. C<\1> in the usual double-quoted string means a control-A.
! The customary Unix meaning of C<\1> is kludged in for C<s///>.
! However, if you get into the habit of doing that, you get yourself
! into trouble if you then add an C</e> modifier.
! s/(\d+)/ \1 + 1 /eg; # triggers optional warnings
Or if you try to do
s/(\d+)/\1000/;
! You can't disambiguate that by saying C<\{1}000>, whereas you can
! fix it with C<${1}000>. The operation of interpolation should not
! be confused with the operation of matching a backreference. Certainly
! they mean two different things on the I<left> side of the C<s///>.
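A short sketch of the recommended spellings follows; the sample string and variable names are assumptions for the example. It shows only the C<${1}> form and the C</e> form with C<$1>, not the ambiguous C<\1000> case.

```perl
use strict;
use warnings;

my $text = 'cost: 42';

# ${1}000 delimits the variable name, so "000" is appended to group 1.
( my $scaled = $text ) =~ s/(\d+)/${1}000/;

# With /e the right-hand side is code, where $1 is an ordinary variable.
( my $bumped = $text ) =~ s/(\d+)/ $1 + 1 /eg;

print "$scaled\n";
print "$bumped\n";
```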
=head2 Repeated patterns matching zero-length substring
! B<WARNING>: Difficult material (and prose) ahead. This section
! needs a rewrite.
Regular expressions provide a terse and powerful programming language. As
with most other power tools, power comes together with the ability
to wreak havoc.
A common abuse of this power stems from the ability to make infinite
! loops using regexes, with something as innocuous as:
'foo' =~ m{ ( o? )* }x;
***************
*** 1061,1077 ****
m{ (?: NON_ZERO_LENGTH | ZERO_LENGTH )* }x;
! is made equivalent to
! m{ (?: NON_ZERO_LENGTH )*
! |
! (?: ZERO_LENGTH )?
}x;
The higher level-loops preserve an additional state between iterations:
! whether the last match was zero-length. To break the loop, the following
match after a zero-length match is prohibited to have a length of zero.
! This prohibition interacts with backtracking (see L<"Backtracking">),
and so the I<second best> match is chosen if the I<best> match is of
zero length.
--- 1095,1111 ----
m{ (?: NON_ZERO_LENGTH | ZERO_LENGTH )* }x;
! is made equivalent to
! m{ (?: NON_ZERO_LENGTH )*
! |
! (?: ZERO_LENGTH )?
}x;
The higher level-loops preserve an additional state between iterations:
! whether the last match was zero-length. To break the loop, the following
match after a zero-length match is prohibited to have a length of zero.
! This prohibition interacts with backtracking (see L<"Backtracking">),
and so the I<second best> match is chosen if the I<best> match is of
zero length.
***************
*** 1080,1132 ****
$_ = 'bar';
s/\w??/<$&>/g;
! results in C<"<><b><><a><><r><>">. At each position of the string the best
! match given by non-greedy C<??> is the zero-length match, and the I<second
! best> match is what is matched by C<\w>. Thus zero-length matches
! alternate with one-character-long matches.
!
! Similarly, for repeated C<m/()/g> the second-best match is the match at the
! position one notch further in the string.
!
! The additional state of being I<matched with zero-length> is associated with
! the matched string, and is reset by each assignment to pos().
! Zero-length matches at the end of the previous match are ignored
! during C<split>.
=head2 Combining pieces together
! Each of the elementary pieces of regular expressions which were described
before (such as C<ab> or C<\Z>) could match at most one substring
! at the given position of the input string. However, in a typical regular
! expression these elementary pieces are combined into more complicated
! patterns using combining operators C<ST>, C<S|T>, C<S*> etc
! (in these examples C<S> and C<T> are regular subexpressions).
Such combinations can include alternatives, leading to a problem of choice:
! if we match a regular expression C<a|ab> against C<"abc">, will it match
substring C<"a"> or C<"ab">? One way to describe which substring is
actually matched is the concept of backtracking (see L<"Backtracking">).
However, this description is too low-level and makes you think
in terms of a particular implementation.
! Another description starts with notions of "better"/"worse". All the
! substrings which may be matched by the given regular expression can be
! sorted from the "best" match to the "worst" match, and it is the "best"
! match which is chosen. This substitutes the question of "what is chosen?"
! by the question of "which matches are better, and which are worse?".
Again, for elementary pieces there is no such question, since at most
one match at a given position is possible. This section describes the
notion of better/worse for combining operators. In the description
! below C<S> and C<T> are regular subexpressions.
=over
=item C<ST>
Consider two possible matches, C<AB> and C<A'B'>, C<A> and C<A'> are
! substrings which can be matched by C<S>, C<B> and C<B'> are substrings
! which can be matched by C<T>.
If C<A> is better match for C<S> than C<A'>, C<AB> is a better
match than C<A'B'>.
--- 1114,1170 ----
$_ = 'bar';
s/\w??/<$&>/g;
! results in C<"<><b><><a><><r><>">. At each position of the string
! the best match given by non-greedy C<??> is the zero-length match,
! and the I<second best> match is what is matched by C<\w>. Thus
! zero-length matches alternate with one-character-long matches.
!
! Similarly, for repeated C<m/()/g> the second-best match is the match
! at the position one notch further in the string.
!
! The additional state of being I<matched with zero-length> is
! associated with the matched string, and is reset by each assignment
! to pos(). Zero-length matches at the end of the previous match are
! ignored during C<split>.
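The behavior described above can be observed directly. This is a minimal sketch with made-up variable names; the first result is the one quoted in the text.

```perl
use strict;
use warnings;

# Zero-length and one-character matches alternate across the string.
( my $marked = 'bar' ) =~ s/\w??/<$&>/g;

# Repeated m/()/g finds one empty match per position, including the end.
my $empty_matches = () = 'bar' =~ /()/g;

# split with an empty pattern yields the individual characters.
my @chars = split //, 'bar';

print "$marked\n";
print "empty matches: $empty_matches\n";
print "split: @chars\n";
```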
=head2 Combining pieces together
! B<WARNING>: Difficult material (and prose) ahead. This section
! needs a rewrite.
!
! Each of the elementary pieces of regular expressions described
before (such as C<ab> or C<\Z>) could match at most one substring
! at the given position of the input string. However, in a typical
! regex, these elementary pieces are combined into more complicated
! patterns using combining operators C<ST>, C<S|T>, C<S*> etc. (in
! these examples C<S> and C<T> are regular subexpressions).
Such combinations can include alternatives, leading to a problem of choice:
! if we match a pattern C<a|ab> against C<"abc">, will it match
substring C<"a"> or C<"ab">? One way to describe which substring is
actually matched is the concept of backtracking (see L<"Backtracking">).
However, this description is too low-level and makes you think
in terms of a particular implementation.
! Another description starts with notions of "better"/"worse". All
! the substrings that may be matched by the given pattern can be
! sorted from the "best" match to the "worst" match, and it is the
! "best" match that's chosen. This replaces the question of "what
! is chosen?" with the question of "which matches are better, and which
! are worse?"
Again, for elementary pieces there is no such question, since at most
one match at a given position is possible. This section describes the
notion of better/worse for combining operators. In the description
! below, C<S> and C<T> are regular subexpressions.
=over
=item C<ST>
Consider two possible matches, C<AB> and C<A'B'>, C<A> and C<A'> are
! substrings that can be matched by C<S>; C<B> and C<B'> are substrings
! that can be matched by C<T>.
If C<A> is better match for C<S> than C<A'>, C<AB> is a better
match than C<A'B'>.
***************
*** 1169,1175 ****
Only the best match for C<S> is considered. (This is important only if
C<S> has capturing parentheses, and backreferences are used somewhere
! else in the whole regular expression.)
=item C<(?!S)>, C<(?<!S)>
--- 1207,1213 ----
Only the best match for C<S> is considered. (This is important only if
C<S> has capturing parentheses, and backreferences are used somewhere
! else in the whole pattern.)
=item C<(?!S)>, C<(?<!S)>
***************
*** 1178,1184 ****
=item C<(??{ EXPR })>
! The ordering is the same as for the regular expression which is
the result of EXPR.
=item C<(?(condition)yes-pattern|no-pattern)>
--- 1216,1222 ----
=item C<(??{ EXPR })>
! The ordering is the same as for the pattern that is
the result of EXPR.
=item C<(?(condition)yes-pattern|no-pattern)>
***************
*** 1191,1259 ****
The above recipes describe the ordering of matches I<at a given position>.
One more rule is needed to understand how a match is determined for the
! whole regular expression: a match at an earlier position is always better
than a match at a later position.
! =head2 Creating custom RE engines
Overloaded constants (see L<overload>) provide a simple way to extend
! the functionality of the RE engine.
! Suppose that we want to enable a new RE escape-sequence C<\Y|> which
matches at boundary between white-space characters and non-whitespace
characters. Note that C<(?=\S)(?<!\S)|(?!\S)(?<=\S)> matches exactly
at these positions, so we want to have each C<\Y|> in the place of the
! more complicated version. We can create a module C<customre> to do
! this:
! package customre;
use overload;
sub import {
! shift;
! die "No argument to customre::import allowed" if @_;
! overload::constant 'qr' => \&convert;
}
sub invalid { die "/$_[0]/: invalid escape '\\$_[1]'"}
! my %rules = ( '\\' => '\\',
'Y|' => qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/ );
sub convert {
! my $re = shift;
! $re =~ s{
! \\ ( \\ | Y . )
! }
! { $rules{$1} or invalid($re,$1) }sgex;
! return $re;
}
! Now C<use customre> enables the new escape in constant regular
! expressions, i.e., those without any runtime variable interpolations.
! As documented in L<overload>, this conversion will work only over
! literal parts of regular expressions. For C<\Y|$re\Y|> the variable
! part of this regular expression needs to be converted explicitly
! (but only if the special meaning of C<\Y|> should be enabled inside $re):
! use customre;
$re = <>;
chomp $re;
! $re = customre::convert $re;
/\Y|$re\Y|/;
=head1 BUGS
This document varies from difficult to understand to completely
and utterly opaque. The wandering prose riddled with jargon is
! hard to fathom in several places.
!
! This document needs a rewrite that separates the tutorial content
! from the reference content.
=head1 SEE ALSO
L<perlop/"Regexp Quote-Like Operators">.
L<perlop/"Gory details of parsing quoted constructs">.
L<perlfaq6>.
--- 1229,1300 ----
The above recipes describe the ordering of matches I<at a given position>.
One more rule is needed to understand how a match is determined for the
! whole pattern: a match at an earlier position is always better
than a match at a later position.
! =head2 Defining Your Own Backslash Escapes
Overloaded constants (see L<overload>) provide a simple way to extend
! the functionality of the regex engine.
! Suppose that we want to enable a new regex escape-sequence C<\Y|> that
matches at boundary between white-space characters and non-whitespace
characters. Note that C<(?=\S)(?<!\S)|(?!\S)(?<=\S)> matches exactly
at these positions, so we want to have each C<\Y|> in the place of the
! more complicated version. We can create a C<custom_re> module to do this:
! package custom_re;
use overload;
sub import {
! shift;
! die "No argument to custom_re::import allowed" if @_;
! overload::constant 'qr' => \&convert;
}
sub invalid { die "/$_[0]/: invalid escape '\\$_[1]'"}
! my %rules = ( '\\' => '\\\\',
'Y|' => qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/ );
+
sub convert {
! my $re = shift;
! $re =~ s{
! \\ ( \\ | Y . )
! } {
! $rules{$1} || invalid($re,$1)
! }sgex;
! return $re;
}
! Now C<use custom_re> enables the new escape in constant patterns,
! i.e., those without variable interpolation. As documented in
! L<overload>, this conversion will work only over literal parts of
! regexes. For C<\Y|$re\Y|> the variable part of this pattern needs
! to be converted explicitly (but only if the special meaning of
! C<\Y|> should be enabled inside C<$re>):
! use custom_re;
$re = <>;
chomp $re;
! $re = custom_re::convert $re;
/\Y|$re\Y|/;
=head1 BUGS
This document varies from difficult to understand to completely
and utterly opaque. The wandering prose riddled with jargon is
! hard to fathom in several places. The expert material
! should be extracted out into a I<perlreguts>(1) manpage.
=head1 SEE ALSO
L<perlop/"Regexp Quote-Like Operators">.
+ L<perlrequick>.
+
+ L<perlretut>.
+
L<perlop/"Gory details of parsing quoted constructs">.
L<perlfaq6>.
***************
*** 1261,1266 ****
--- 1302,1309 ----
L<perlfunc/pos>.
L<perllocale>.
+
+ L<perldebguts/"Debugger Internals">.
I<Mastering Regular Expressions> by Jeffrey Friedl, published
by O'Reilly and Associates.