perl.perl5.porters | Postings from April 2000

PATCH: perlre.pod (against 5.6.0)

From: Tom Christiansen
Date: April 29, 2000 10:16
Subject: PATCH: perlre.pod (against 5.6.0)
Message ID: 14962.957028538@chthon

*** perlre56.pod	Sat Apr 29 09:42:35 2000
--- perlre.pod	Sat Apr 29 11:13:51 2000
***************
*** 4,109 ****
  
  =head1 DESCRIPTION
  
! This page describes the syntax of regular expressions in Perl.  For a
! description of how to I<use> regular expressions in matching
! operations, plus various examples of the same, see discussions
! of C<m//>, C<s///>, C<qr//> and C<??> in L<perlop/"Regexp Quote-Like Operators">.
! 
! Matching operations can have various modifiers.  Modifiers
! that relate to the interpretation of the regular expression inside
! are listed below.  Modifiers that alter the way a regular expression
! is used by Perl are detailed in L<perlop/"Regexp Quote-Like Operators"> and 
! L<perlop/"Gory details of parsing quoted constructs">.
  
  =over 4
  
  =item i
  
! Do case-insensitive pattern matching.
  
  If C<use locale> is in effect, the case map is taken from the current
  locale.  See L<perllocale>.
  
  =item m
  
! Treat string as multiple lines.  That is, change "^" and "$" from matching
! the start or end of the string to matching the start or end of any
! line anywhere within the string.
  
  =item s
  
! Treat string as single line.  That is, change "." to match any character
! whatsoever, even a newline, which normally it would not match.
  
  The C</s> and C</m> modifiers both override the C<$*> setting.  That
  is, no matter what C<$*> contains, C</s> without C</m> will force
! "^" to match only at the beginning of the string and "$" to match
  only at the end (or just before a newline at the end) of the string.
! Together, as /ms, they let the "." match any character whatsoever,
! while yet allowing "^" and "$" to match, respectively, just after
  and just before newlines within the string.
  
  =item x
  
! Extend your pattern's legibility by permitting whitespace and comments.
  
  =back
  
! These are usually written as "the C</x> modifier", even though the delimiter
! in question might not really be a slash.  Any of these
! modifiers may also be embedded within the regular expression itself using
! the C<(?...)> construct.  See below.
  
  The C</x> modifier itself needs a little more explanation.  It tells
! the regular expression parser to ignore whitespace that is neither
! backslashed nor within a character class.  You can use this to break up
! your regular expression into (slightly) more readable parts.  The C<#>
! character is also treated as a metacharacter introducing a comment,
! just as in ordinary Perl code.  This also means that if you want real
  whitespace or C<#> characters in the pattern (outside a character
! class, where they are unaffected by C</x>), that you'll either have to 
! escape them or encode them using octal or hex escapes.  Taken together,
! these features go a long way towards making Perl's regular expressions
  more readable.  Note that you have to be careful not to include the
! pattern delimiter in the comment--perl has no way of knowing you did
! not intend to close the pattern early.  See the C-comment deletion code
! in L<perlop>.
  
  =head2 Regular Expressions
  
! The patterns used in Perl pattern matching derive from supplied in
! the Version 8 regex routines.  (The routines are derived
! (distantly) from Henry Spencer's freely redistributable reimplementation
! of the V8 routines.)  See L<Version 8 Regular Expressions> for
! details.
  
! In particular the following metacharacters have their standard I<egrep>-ish
  meanings:
  
      \	Quote the next metacharacter
!     ^	Match the beginning of the line
      .	Match any character (except newline)
!     $	Match the end of the line (or before newline at the end)
      |	Alternation
      ()	Grouping
      []	Character class
  
! By default, the "^" character is guaranteed to match only the
! beginning of the string, the "$" character only the end (or before the
! newline at the end), and Perl does certain optimizations with the
! assumption that the string contains only one line.  Embedded newlines
! will not be matched by "^" or "$".  You may, however, wish to treat a
! string as a multi-line buffer, such that the "^" will match after any
! newline within the string, and "$" will match before any newline.  At the
! cost of a little more overhead, you can do this by using the /m modifier
! on the pattern match operator.  (Older programs did this by setting C<$*>,
! but this practice is now deprecated.)
! 
! To simplify multi-line substitutions, the "." character never matches a
! newline unless you use the C</s> modifier, which in effect tells Perl to pretend
! the string is a single line--even if it isn't.  The C</s> modifier also
! overrides the setting of C<$*>, in case you have some (badly behaved) older
! code that sets it in another module.
  
  The following standard quantifiers are recognized:
  
--- 4,116 ----
  
  =head1 DESCRIPTION
  
! This page describes the syntax and semantics of Perl's regular
! expression engine.  For a description of how to actually I<use>
! regular expressions in matching operations, plus various examples
! of the same, see the descriptions of C<m//>, C<s///>, C<qr//>
! and C<??> in L<perlop/"Regexp Quote-Like Operators">.  (These are
! typically called regexes for short, or more accurately, patterns,
! since Perl's "regular expressions" aren't properly regular in the
! special compsci sense of that word.)
! 
! Matching operations can have various modifiers.  Modifiers that
! relate to the interpretation of the regex are listed below.  Modifiers
! that alter the way Perl uses a pattern are detailed in L<perlop/"Regexp
! Quote-Like Operators"> and L<perlop/"Gory details of parsing quoted
! constructs">.
  
  =over 4
  
  =item i
  
! Do case-insensitive pattern matching, including when matching
! backreferences.
  
  If C<use locale> is in effect, the case map is taken from the current
  locale.  See L<perllocale>.
  
  =item m
  
! Change C<^> and C<$> from matching only at the start of the string
! and just before the optional newline at its end to matching at the
! start and end of any line anywhere within the string.
  
  =item s
  
! Change C<.> to match any character whatsoever, even a newline, which
! normally it would not match, even if the deprecated C<$*> variable
! were set.
  
  The C</s> and C</m> modifiers both override the C<$*> setting.  That
  is, no matter what C<$*> contains, C</s> without C</m> will force
! C<^> to match only at the beginning of the string and C<$> to match
  only at the end (or just before a newline at the end) of the string.
! Together, as C</ms>, they let the C<.> match any character whatsoever,
! while yet allowing C<^> and C<$> to match, respectively, just after
  and just before newlines within the string.
  
  =item x
  
! Permit whitespace and comments in the pattern, enhancing
! (well, enabling) legibility.  It's also more expressive.
  
  =back
  
! These are usually written as "the C</x> modifier", even though the
! delimiter in question might not really be a slash.  Any of these
! modifiers may also be embedded within the regex itself using the
! C<(?I<flags>...)> construct.  See below.
  
  The C</x> modifier itself needs a little more explanation.  It tells
! the regex parser to ignore whitespace that is neither backslashed
! nor within a character class.  You can use this to break up your
! pattern into (slightly) more readable parts.  The C<#> character
! is also treated as a metacharacter introducing a comment, just as
! in ordinary Perl code.  This also means that if you want real
  whitespace or C<#> characters in the pattern (outside a character
! class, where they are unaffected by C</x>), you'll either have
! to escape them or encode them using octal or hex escapes.  Taken
! together, these features go a long way towards making Perl's patterns
  more readable.  Note that you have to be careful not to include the
! pattern delimiter in the comment--perl has no way of knowing you
! did not intend to close the pattern early.  See the C-comment
! deletion code in L<perlop>.
  
  =head2 Regular Expressions
  
! The patterns used in Perl pattern matching derive from the standard
! Version 8 Unix regex routines.  (The routines are derived (distantly)
! from Henry Spencer's freely redistributable reimplementation of the
! V8 routines.)  See L<Version 8 Regular Expressions> for details.
  
! In particular the following metacharacters have their standard B<egrep>-ish
  meanings:
  
      \	Quote the next metacharacter
!     ^	Match the beginning of the string
      .	Match any character (except newline)
!     $	Match before the optional newline at the end of the string
      |	Alternation
      ()	Grouping
      []	Character class
  
! By default, the C<^> metacharacter matches only the beginning
! of the string, the C<$> metacharacter only before an optional
! trailing newline at the end, so Perl does certain optimizations
! with the assumption that the string contains only one line.  Embedded
! newlines will not normally be noticed by C<^> or C<$>.  You may,
! however, wish to treat a string as a multi-line buffer, such that
! the C<^> will match after any newline within the string, and C<$>
! will match before any newline.  (These don't actually match the
! newlines, though, so C</foo^bar/> can never match, for example.)
! At the cost of a little more overhead, you can do this by using the
! C</m> modifier on the pattern match operator.  (Older programs did
! this by setting C<$*>, but this practice is now deprecated.)
! 
! For historical reasons and to simplify substitutions, the C<.>
! character never matches a newline unless you use the C</s> modifier.
! The C</s> modifier also overrides the setting of C<$*>, in case you
! have some (badly behaved) older code that sets it in another module.
  
  The following standard quantifiers are recognized:
  
***************
*** 114,141 ****
      {n,}   Match at least n times
      {n,m}  Match at least n but not more than m times
  
! (If a curly bracket occurs in any other context, it is treated
! as a regular character.)  The "*" modifier is equivalent to C<{0,}>, the "+"
! modifier to C<{1,}>, and the "?" modifier to C<{0,1}>.  n and m are limited
! to integral values less than a preset limit defined when perl is built.
! This is usually 32766 on the most common platforms.  The actual limit can
! be seen in the error message generated by code such as this:
  
      $_ **= $_ , / {$_} / for 2 .. 42;
  
! By default, a quantified subpattern is "greedy", that is, it will match as
! many times as possible (given a particular starting location) while still
! allowing the rest of the pattern to match.  If you want it to match the
! minimum number of times possible, follow the quantifier with a "?".  Note
! that the meanings don't change, just the "greediness":
  
      *?	   Match 0 or more times
      +?	   Match 1 or more times
      ??	   Match 0 or 1 time
-     {n}?   Match exactly n times
      {n,}?  Match at least n times
      {n,m}? Match at least n but not more than m times
  
  Because patterns are processed as double quoted strings, the following
  also work:
  
--- 121,165 ----
      {n,}   Match at least n times
      {n,m}  Match at least n but not more than m times
  
! (If a brace occurs in any other context, it is treated as a regular
! character.)  The C<*> modifier is equivalent to C<{0,}>, the C<+>
! modifier to C<{1,}>, and the C<?> modifier to C<{0,1}>.  I<n> and
! I<m> are limited to integral values less than a preset limit defined
! when perl is built.  This is usually 32766 on the most common
! platforms.  The actual limit can be seen in the error message
! generated by code such as this:
  
      $_ **= $_ , / {$_} / for 2 .. 42;
  
! By default, quantifiers are "greedy", that is, they will match as
! many times as possible (given a particular starting location) while
! still allowing the rest of the pattern to match.  If you want 
! to match the minimum number of times possible, follow the quantifier
! with a C<?>.  Note that the meanings don't change, just the
! "greediness":
  
      *?	   Match 0 or more times
      +?	   Match 1 or more times
      ??	   Match 0 or 1 time
      {n,}?  Match at least n times
      {n,m}? Match at least n but not more than m times
  
+ Perl matches patterns "eagerly", that is, as soon as possible.  Even
+ with minimal matches, Perl still finds the leftmost possible match.
+ The question mark only changes the sense from leftmost-longest to
+ leftmost-shortest, and only for that quantifier.  Unlike regex
+ languages with overall greed, in Perl, the leftmost aspect is still
+ more important than longest/shortest.  Only when two matches start
+ at the same point are their lengths considered.  Otherwise, the
+ lefter one always wins.  That's why both C</a*/> and C</a*?/> match
+ all possible strings irrespective of content, and at the earliest
+ possible point--right at the beginning of the string.
+ 
+ A question mark was chosen for the minimal-matching construct (and
+ for the C<(?...)> extension syntax) because question marks are rare
+ in older regular expressions, and because whenever you see one, you
+ should stop and "question" exactly what is going on.  That's psychology...
+ 
  Because patterns are processed as double quoted strings, the following
  also work:
  
***************
*** 145,157 ****
      \f		form feed             (FF)
      \a		alarm (bell)          (BEL)
      \e		escape (think troff)  (ESC)
!     \033	octal char (think of a PDP-11)
      \x1B	hex char
      \x{263a}	wide hex char         (Unicode SMILEY)
      \c[		control char
!     \N{name}	named char
      \l		lowercase next char (think vi)
!     \u		uppercase next char (think vi)
      \L		lowercase till \E (think vi)
      \U		uppercase till \E (think vi)
      \E		end case modification (think vi)
--- 169,182 ----
      \f		form feed             (FF)
      \a		alarm (bell)          (BEL)
      \e		escape (think troff)  (ESC)
!     \033	octal char (think of a PDP-11); 0 is optional except on
! 	        single digits
      \x1B	hex char
      \x{263a}	wide hex char         (Unicode SMILEY)
      \c[		control char
!     \N{name}	named char (requires use charnames)
      \l		lowercase next char (think vi)
!     \u		titlecase next char (think vi)
      \L		lowercase till \E (think vi)
      \U		uppercase till \E (think vi)
      \E		end case modification (think vi)
***************
*** 181,193 ****
      \C	Match a single C char (octet) even under utf8.
  
  A C<\w> matches a single alphanumeric character, not a whole word.
! Use C<\w+> to match a string of Perl-identifier characters (which isn't 
! the same as matching an English word).  If C<use locale> is in effect, the
! list of alphabetic characters generated by C<\w> is taken from the
! current locale.  See L<perllocale>.  You may use C<\w>, C<\W>, C<\s>, C<\S>,
! C<\d>, and C<\D> within character classes, but if you try to use them
! as endpoints of a range, that's not a range, the "-" is understood literally.
! See L<utf8> for details about C<\pP>, C<\PP>, and C<\X>.
  
  The POSIX character class syntax
  
--- 206,219 ----
      \C	Match a single C char (octet) even under utf8.
  
  A C<\w> matches a single alphanumeric character, not a whole word.
! Use C<\w+> to match a string of Perl-identifier characters (which
! isn't the same as matching an English word).  If C<use locale> is
! in effect, the list of alphabetic characters generated by C<\w> is
! taken from the current locale.  See L<perllocale>.  You may use
! C<\w>, C<\W>, C<\s>, C<\S>, C<\d>, and C<\D> within character
! classes, but if you try to use them as endpoints of a range, that's
! not a range; the C<-> is understood literally.  See L<utf8> for
! details about C<\pP>, C<\PP>, and C<\X>.
  
  The POSIX character class syntax
  
***************
*** 211,225 ****
      xdigit
  
  For example use C<[:upper:]> to match all the uppercase characters.
! Note that the C<[]> are part of the C<[::]> construct, not part of the whole
! character class.  For example:
  
      [01[:alpha:]%]
  
  matches one, zero, any alphabetic character, and the percentage sign.
  
  If the C<utf8> pragma is used, the following equivalences to Unicode
! \p{} constructs hold:
  
      alpha       IsAlpha
      alnum       IsAlnum
--- 237,251 ----
      xdigit
  
  For example use C<[:upper:]> to match all the uppercase characters.
! Note that the C<[]> are part of the C<[::]> construct, not part of
! the whole character class.  For example:
  
      [01[:alpha:]%]
  
  matches one, zero, any alphabetic character, and the percentage sign.
  
  If the C<utf8> pragma is used, the following equivalences to Unicode
! C<\p{}> constructs hold:
  
      alpha       IsAlpha
      alnum       IsAlnum
***************
*** 238,245 ****
  For example C<[:lower:]> and C<\p{IsLower}> are equivalent.
  
  If the C<utf8> pragma is not used but the C<locale> pragma is, the
! classes correlate with the isalpha(3) interface (except for `word',
! which is a Perl extension, mirroring C<\w>).
  
  The assumedly non-obviously named classes are:
  
--- 264,271 ----
  For example C<[:lower:]> and C<\p{IsLower}> are equivalent.
  
  If the C<utf8> pragma is not used but the C<locale> pragma is, the
! classes correlate with the standard isalpha(3) interface (except
! for C<word>, which is a Perl extension, mirroring C<\w>).
  
  The assumedly non-obviously named classes are:
  
***************
*** 250,256 ****
  Any control character.  Usually characters that don't produce output as
  such but instead control the terminal somehow: for example newline and
  backspace are control characters.  All characters with ord() less than
! 32 are most often classified as control characters.
  
  =item graph
  
--- 276,282 ----
  Any control character.  Usually characters that don't produce output as
  such but instead control the terminal somehow: for example newline and
  backspace are control characters.  All characters with ord() less than
! decimal 32 are most often classified as control characters.
  
  =item graph
  
***************
*** 266,278 ****
  
  =item xdigit
  
! Any hexadecimal digit.  Though this may feel silly (/0-9a-f/i would
  work just fine) it is included for completeness.
  
  =back
  
! You can negate the [::] character classes by prefixing the class name
! with a '^'. This is a Perl extension.  For example:
  
      POSIX	trad. Perl  utf8 Perl
  
--- 292,304 ----
  
  =item xdigit
  
! Any hexadecimal digit.  Though this may feel silly (C</[0-9a-f]/i> would
  work just fine) it is included for completeness.
  
  =back
  
! You can negate the C<[::]> character classes by prefixing the class name
! with a C<^>. This is a Perl extension.  For example:
  
      POSIX	trad. Perl  utf8 Perl
  
***************
*** 280,334 ****
      [:^space:]	    \S	    \P{IsSpace}
      [:^word:]	    \W	    \P{IsWord}
  
! The POSIX character classes [.cc.] and [=cc=] are recognized but
! B<not> supported and trying to use them will cause an error.
  
  Perl defines the following zero-width assertions:
  
      \b	Match a word boundary
      \B	Match a non-(word boundary)
      \A	Match only at beginning of string
!     \Z	Match only at end of string, or before newline at the end
!     \z	Match only at end of string
      \G	Match only at pos() (e.g. at the end-of-match position
          of prior m//g)
  
! A word boundary (C<\b>) is a spot between two characters
! that has a C<\w> on one side of it and a C<\W> on the other side
! of it (in either order), counting the imaginary characters off the
! beginning and end of the string as matching a C<\W>.  (Within
! character classes C<\b> represents backspace rather than a word
! boundary, just as it normally does in any double-quoted string.)
! The C<\A> and C<\Z> are just like "^" and "$", except that they
! won't match multiple times when the C</m> modifier is used, while
! "^" and "$" will match at every internal line boundary.  To match
! the actual end of the string and not ignore an optional trailing
! newline, use C<\z>.
  
  The C<\G> assertion can be used to chain global matches (using
  C<m//g>), as described in L<perlop/"Regexp Quote-Like Operators">.
! It is also useful when writing C<lex>-like scanners, when you have
  several patterns that you want to match against consequent substrings
  of your string, see the previous reference.  The actual location
! where C<\G> will match can also be influenced by using C<pos()> as
  an lvalue.  See L<perlfunc/pos>.
  
  The bracketing construct C<( ... )> creates capture buffers.  To
! refer to the digit'th buffer use \<digit> within the
! match.  Outside the match use "$" instead of "\".  (The
! \<digit> notation works in certain circumstances outside 
! the match.  See the warning below about \1 vs $1 for details.)
! Referring back to another part of the match is called a
! I<backreference>.
  
  There is no limit to the number of captured substrings that you may
! use.  However Perl also uses \10, \11, etc. as aliases for \010,
! \011, etc.  (Recall that 0 means octal, so \011 is the 9'th ASCII
! character, a tab.)  Perl resolves this ambiguity by interpreting
! \10 as a backreference only if at least 10 left parentheses have
! opened before it.  Likewise \11 is a backreference only if at least
! 11 left parentheses have opened before it.  And so on.  \1 through
! \9 are always interpreted as backreferences."
  
  Examples:
  
--- 306,359 ----
      [:^space:]	    \S	    \P{IsSpace}
      [:^word:]	    \W	    \P{IsWord}
  
! The POSIX character classes C<[.cc.]> and C<[=cc=]> are recognized but
! I<not> supported and trying to use them will cause an error.
  
  Perl defines the following zero-width assertions:
  
      \b	Match a word boundary
      \B	Match a non-(word boundary)
      \A	Match only at beginning of string
!     \Z	Match before optional newline at end of string
!     \z	Match at end of string (not in front of the newline)
      \G	Match only at pos() (e.g. at the end-of-match position
          of prior m//g)
  
! A word boundary (C<\b>) is a spot between two characters that has
! a C<\w> on one side of it and a C<\W> on the other side of it (in
! either order), counting the imaginary characters off the beginning
! and end of the string as matching a C<\W>.  (Within character classes
! C<\b> represents backspace rather than a word boundary, just as it
! normally does in any double-quoted string.) The C<\A> and C<\Z> are
! just like C<^> and C<$>, except that they won't match internally
! when the C</m> modifier is used, whereas C<^> and C<$> can match
! next to any internal newline.  To match the actual end of the string
! and not ignore an optional trailing newline, use C<\z>.
  
  The C<\G> assertion can be used to chain global matches (using
  C<m//g>), as described in L<perlop/"Regexp Quote-Like Operators">.
! It is also useful when writing B<lex>-like scanners, when you have
  several patterns that you want to match against consequent substrings
  of your string, see the previous reference.  The actual location
! where C<\G> will match can also be influenced by using pos() as
  an lvalue.  See L<perlfunc/pos>.
  
  The bracketing construct C<( ... )> creates capture buffers.  To
! refer to the digit'th buffer use \<digit> within the match.  Outside
! the match use C<$> to access the numbered variables, instead of
! C<\> to access backreferences.  (The \<digit> notation works in
! certain circumstances outside the match.  See the warning below
! about \1 vs $1 for details.)  Referring back to another part of the
! match is called a I<backreference>.
  
  There is no limit to the number of captured substrings that you may
! use.  However Perl also uses C<\10>, C<\11>, etc. as aliases for
! C<\010>, C<\011>, etc.  (Recall that 0 means octal, so C<\011> is
! the 9th ASCII character, a tab.)  Perl resolves this ambiguity by
! interpreting C<\10> as a backreference only if at least 10 left
! parentheses have opened before it.  Likewise C<\11> is a backreference
! only if at least 11 left parentheses have opened before it.  And
! so on.  C<\1> through C<\9> are always interpreted as backreferences.
  
  Examples:
  
***************
*** 345,383 ****
      }
  
  Several special variables also refer back to portions of the previous
! match.  C<$+> returns whatever the last bracket match matched.
! C<$&> returns the entire matched string.  (At one point C<$0> did
! also, but now it returns the name of the program.)  C<$`> returns
! everything before the matched string.  And C<$'> returns everything
! after the matched string.
  
  The numbered variables ($1, $2, $3, etc.) and the related punctuation
! set (C<<$+>, C<$&>, C<$`>, and C<$'>) are all dynamically scoped
! until the end of the enclosing block or until the next successful
! match, whichever comes first.  (See L<perlsyn/"Compound Statements">.)
! 
! B<WARNING>: Once Perl sees that you need one of C<$&>, C<$`>, or
! C<$'> anywhere in the program, it has to provide them for every
! pattern match.  This may substantially slow your program.  Perl
! uses the same mechanism to produce $1, $2, etc, so you also pay a
! price for each pattern that contains capturing parentheses.  (To
! avoid this cost while retaining the grouping behaviour, use the
! extended regular expression C<(?: ... )> instead.)  But if you never
! use C<$&>, C<$`> or C<$'>, then patterns I<without> capturing
! parentheses will not be penalized.  So avoid C<$&>, C<$'>, and C<$`>
! if you can, but if you can't (and some algorithms really appreciate
! them), once you've used them once, use them at will, because you've
! already paid the price.  As of 5.005, C<$&> is not so costly as the
! other two.
  
! Backslashed metacharacters in Perl are alphanumeric, such as C<\b>,
! C<\w>, C<\n>.  Unlike some other regular expression languages, there
  are no backslashed symbols that aren't alphanumeric.  So anything
! that looks like \\, \(, \), \<, \>, \{, or \} is always
! interpreted as a literal character, not a metacharacter.  This was
! once used in a common idiom to disable or quote the special meanings
! of regular expression metacharacters in a string that you want to
! use for a pattern. Simply quote all non-alphanumeric characters:
  
      $pattern =~ s/(\W)/\\$1/g;
  
--- 370,411 ----
      }
  
  Several special variables also refer back to portions of the previous
! match.  C<$+> returns whatever the last bracket match matched.  The
! C<$`>-C<$&>-C<$'> trio are mnemonically named to correspond to the
! pieces in a `match'.  C<$`> returns everything before the matched
! string.  C<$&> returns the entire matched string.  And C<$'> returns
! everything after the matched string.
  
  The numbered variables ($1, $2, $3, etc.) and the related punctuation
! set (C<$+>, C<$`>, C<$&>, and C<$'>) are all automatically localized
! to the enclosing dynamic scope.  Their values are therefore ephemeral
! and best copied into more enduring variables.  (See L<perlsyn/"Compound
! Statements">.)
! 
! Once Perl sees that you need one of C<$&>, C<$`>, or C<$'> anywhere
! in the program, it has to provide them for every pattern match.
! This will slow down pattern matches a bit, and if most of your
! program is spent matching patterns, you may notice this.  Perl uses
! the same mechanism to produce $1, $2, etc, so you also pay a price
! for each pattern that contains capturing parentheses.  (To avoid
! this cost while retaining the grouping behaviour, use the extended
! regular expression C<(?:I<X>...)> instead.)  But if you never use
! C<$`>, C<$&>, or C<$'>, then patterns I<without> capturing parentheses
! will not be penalized.  So avoid C<$'>, C<$&>, and C<$`> if you
! can, but if you can't (and some algorithms really appreciate them),
! once you've used them once, use them at will, because you've already
! paid the price.  As of 5.005, C<$&> is not so costly as the other
! two.
  
! Backslashed alphanumerics in Perl are often special, such as C<\b>,
! C<\w>, C<\n>.  Unlike some other regex languages, there
  are no backslashed symbols that aren't alphanumeric.  So anything
! that looks like C<\\>, C<\(>, C<\)>, C<\<>, C<< \> >>, C<\{>, or
! C<\}> is always interpreted as a literal character, not a metacharacter.
! This was once used in a common idiom to disable or quote the special
! meanings of regex metacharacters in a string that you
! want to use for a pattern.  Simply quote all non-alphanumeric
! characters:
  
      $pattern =~ s/(\W)/\\$1/g;
  
***************
*** 398,425 ****
  Perl also defines a consistent extension syntax for features not
  found in standard tools like B<awk> and B<lex>.  The syntax is a
  pair of parentheses with a question mark as the first thing within
! the parentheses.  The character after the question mark indicates
! the extension.
  
! The stability of these extensions varies widely.  Some have been
! part of the core language for many years.  Others are experimental
! and may change without warning or be completely removed.  Check
! the documentation on an individual feature to verify its current
! status.
! 
! A question mark was chosen for this and for the minimal-matching
! construct because 1) question marks are rare in older regular
! expressions, and 2) whenever you see one, you should stop and
! "question" exactly what is going on.  That's psychology...
  
  =over 10
  
  =item C<(?#text)>
  
  A comment.  The text is ignored.  If the C</x> modifier enables
! whitespace formatting, a simple C<#> will suffice.  Note that Perl closes
! the comment as soon as it sees a C<)>, so there is no way to put a literal
! C<)> in the comment.
  
  =item C<(?imsx-imsx)>
  
--- 426,447 ----
  Perl also defines a consistent extension syntax for features not
  found in standard tools like B<awk> and B<lex>.  The syntax is a
  pair of parentheses with a question mark as the first thing within
! the parentheses, such as C<(?I<X>...)>.  The value of I<X> after the
! question mark determines which extension is selected.
  
! Stability of these extensions varies widely.  Some have been part
! of the core language for many years.  Others are experimental and
! may change without warning or be completely removed.  Check the
! documentation on an individual feature to verify its current status.
  
  =over 10
  
  =item C<(?#text)>
  
  A comment.  The text is ignored.  If the C</x> modifier enables
! whitespace formatting, a simple C<#> will suffice.  Note that Perl
! closes the comment as soon as it sees a C<)>, so there is no way
! to put a literal C<)> in the comment.
  
  =item C<(?imsx-imsx)>
  
***************
*** 431,442 ****
  C<(?i)> at the front of the pattern.  For example:
  
      $pattern = "foobar";
!     if ( /$pattern/i ) { } 
  
      # more flexible:
  
      $pattern = "(?i)foobar";
!     if ( /$pattern/ ) { } 
  
  Letters after a C<-> turn those modifiers off.  These modifiers are
  localized inside an enclosing group (if any).  For example,
--- 453,464 ----
  C<(?i)> at the front of the pattern.  For example:
  
      $pattern = "foobar";
!     if ( /$pattern/i ) { }
  
      # more flexible:
  
      $pattern = "(?i)foobar";
!     if ( /$pattern/ ) { }
  
  Letters after a C<-> turn those modifiers off.  These modifiers are
  localized inside an enclosing group (if any).  For example,
***************
*** 452,458 ****
  =item C<(?imsx-imsx:pattern)>
  
  This is for clustering, not capturing; it groups subexpressions like
! "()", but doesn't make backreferences as "()" does.  So
  
      @fields = split(/\b(?:a|b|c)\b/)
  
--- 474,480 ----
  =item C<(?imsx-imsx:pattern)>
  
  This is for clustering, not capturing; it groups subexpressions like
! C<()>, but doesn't make backreferences as C<()> does.  So
  
      @fields = split(/\b(?:a|b|c)\b/)
  
***************
*** 464,470 ****
  characters if you don't need to.
  
  Any letters between C<?> and C<:> act as flags modifiers as with
! C<(?imsx-imsx)>.  For example, 
  
      /(?s-i:more.*than).*million/i
  
--- 486,492 ----
  characters if you don't need to.
  
  Any letters between C<?> and C<:> act as flags modifiers as with
! C<(?imsx-imsx)>.  For example,
  
      /(?s-i:more.*than).*million/i
  
***************
*** 481,494 ****
  
  A zero-width negative look-ahead assertion.  For example C</foo(?!bar)/>
  matches any occurrence of "foo" that isn't followed by "bar".  Note
! however that look-ahead and look-behind are NOT the same thing.  You cannot
! use this for look-behind.
  
! If you are looking for a "bar" that isn't preceded by a "foo", C</(?!foo)bar/>
! will not do what you want.  That's because the C<(?!foo)> is just saying that
! the next thing cannot be "foo"--and it's not, it's a "bar", so "foobar" will
! match.  You would have to do something like C</(?!foo)...bar/> for that.   We
! say "like" because there's the case of your "bar" not having three characters
  before it.  You could cover that this way: C</(?:(?!foo)...|^.{0,2})bar/>.
  Sometimes it's still easier just to say:
  
--- 503,517 ----
  
  A zero-width negative look-ahead assertion.  For example C</foo(?!bar)/>
  matches any occurrence of "foo" that isn't followed by "bar".  Note
! however that look-ahead and look-behind are I<not> the same thing.
! You cannot use this for look-behind.
  
! If you are looking for a "bar" that isn't preceded by a "foo",
! C</(?!foo)bar/> will not do what you want.  That's because the
! C<(?!foo)> is just saying that the next thing cannot be "foo"--and
! it's not, it's a "bar", so "foobar" will match.  You would have to
! do something like C</(?!foo)...bar/> for that.  We say "like"
! because there's the case of your "bar" not having three characters
  before it.  You could cover that this way: C</(?:(?!foo)...|^.{0,2})bar/>.
  Sometimes it's still easier just to say:
  
***************
*** 513,559 ****
  B<WARNING>: This extended regular expression feature is considered
  highly experimental, and may be changed or deleted without notice.
  
! This zero-width assertion evaluate any embedded Perl code.  It
! always succeeds, and its C<code> is not interpolated.  Currently,
! the rules to determine where the C<code> ends are somewhat convoluted.
  
  The C<code> is properly scoped in the following sense: If the assertion
  is backtracked (compare L<"Backtracking">), all changes introduced after
  C<local>ization are undone, so that
  
    $_ = 'a' x 8;
!   m< 
       (?{ $cnt = 0 })			# Initialize $cnt.
       (
!        a 
         (?{
             local $cnt = $cnt + 1;	# Update $cnt, backtracking-safe.
         })
!      )*  
       aaaa
       (?{ $res = $cnt })			# On success copy to non-localized
  					# location.
     >x;
  
! will set C<$res = 4>.  Note that after the match, $cnt returns to the globally
! introduced value, because the scopes that restrict C<local> operators
! are unwound.
  
! This assertion may be used as a C<(?(condition)yes-pattern|no-pattern)>
  switch.  If I<not> used in this way, the result of evaluation of
  C<code> is put into the special variable C<$^R>.  This happens
  immediately, so C<$^R> can be used from other C<(?{ code })> assertions
! inside the same regular expression.
  
  The assignment to C<$^R> above is properly localized, so the old
  value of C<$^R> is restored if the assertion is backtracked; compare
  L<"Backtracking">.
  
! For reasons of security, this construct is forbidden if the regular
! expression involves run-time interpolation of variables, unless the
! perilous C<use re 'eval'> pragma has been used (see L<re>), or the
! variables contain results of C<qr//> operator (see
! L<perlop/"qr/STRING/imosx">).  
  
  This restriction is because of the wide-spread and remarkably convenient
  custom of using run-time determined strings as patterns.  For example:
--- 536,588 ----
  B<WARNING>: This extended regular expression feature is considered
  highly experimental, and may be changed or deleted without notice.
  
! This zero-width element evaluates any embedded Perl code.
! Currently, the rules to determine where the C<code> ends are somewhat
! convoluted.  It is not an assertion, because it does not assert
! anything: the success of the match is unrelated to the code's return
! value.
  
  The C<code> is properly scoped in the following sense: If the assertion
  is backtracked (compare L<"Backtracking">), all changes introduced after
  C<local>ization are undone, so that
  
    $_ = 'a' x 8;
!   m<
       (?{ $cnt = 0 })			# Initialize $cnt.
       (
!        a
         (?{
             local $cnt = $cnt + 1;	# Update $cnt, backtracking-safe.
         })
!      )*
       aaaa
       (?{ $res = $cnt })			# On success copy to non-localized
  					# location.
     >x;
  
! will set C<$res = 4>.  Note that after the match, $cnt returns to
! the globally introduced value, because the scopes that restrict
! C<local> operators are unwound.
  
! This construct may be used as a C<(?(condition)yes-pattern|no-pattern)>
  switch.  If I<not> used in this way, the result of evaluation of
  C<code> is put into the special variable C<$^R>.  This happens
  immediately, so C<$^R> can be used from other C<(?{ code })> assertions
! inside the same pattern.
  
  The assignment to C<$^R> above is properly localized, so the old
  value of C<$^R> is restored if the assertion is backtracked; compare
  L<"Backtracking">.
  
! For reasons of security, this construct is normally forbidden if
! the regex involves variable interpolation, unless the perilous C<use
! re 'eval'> pragma has been used (see L<re>), or the variables contain
! results of C<qr//> operator (see L<perlop/"qr/STRING/imosx">).
! Currently, no distinction is made between the interpolation of
! actual embedded code and the interpolation of simple variables in
! a pattern that merely happens to contain a code expression.  This
! confusion is not to be considered a feature, and may be fixed in a
! future release.
  
  This restriction is because of the wide-spread and remarkably convenient
  custom of using run-time determined strings as patterns.  For example:
***************
*** 577,589 ****
  A simplified version of the syntax may be introduced for commonly
  used idioms.
  
! This is a "postponed" regular subexpression.  The C<code> is evaluated
! at run time, at the moment this subexpression may match.  The result
! of evaluation is considered as a regular expression and matched as
! if it were inserted instead of this construct.
! 
! The C<code> is not interpolated.  As before, the rules to determine
! where the C<code> ends are currently somewhat convoluted.
  
  The following pattern matches a parenthesized group:
  
--- 606,619 ----
  A simplified version of the syntax may be introduced for commonly
  used idioms.
  
! Execute I<code> and interpolate its result as more pattern.  The
! C<code> is evaluated at run time, at the moment this subexpression
! may match.  The result of evaluation is a regex and is matched just
! as though it had been used directly.
! 
! As with the C<(?{ code })> construct (whose result is ignored), the
! rules to determine where the C<code> ends are currently somewhat
! convoluted.
  
  The following pattern matches a parenthesized group:
  
***************
*** 602,614 ****
  B<WARNING>: This extended regular expression feature is considered
  highly experimental, and may be changed or deleted without notice.
  
! An "independent" subexpression, one which matches the substring
! that a I<standalone> C<pattern> would match if anchored at the given
! position, and it matches I<nothing other than this substring>.  This
! construct is useful for optimizations of what would otherwise be
! "eternal" matches, because it will not backtrack (see L<"Backtracking">).
! It may also be useful in places where the "grab all you can, and do not
! give anything back" semantic is desirable.
  
  For example: C<< ^(?>a*)ab >> will never match, since C<< (?>a*) >>
  (anchored at the beginning of string, as above) will match I<all>
--- 632,645 ----
  B<WARNING>: This extended regular expression feature is considered
  highly experimental, and may be changed or deleted without notice.
  
! A non-backtracking subexpression, one that matches the substring
! that a "standalone" C<pattern> would match if anchored at the given
! position.  It is somewhat reminiscent of a "cut" operator in logic
! programming languages.  This is mostly useful as an efficiency hack
! to optimize what would otherwise be "eternal" matches, because
! it will not relinquish any characters eaten during backtracking (see
! L<"Backtracking">).  It may also be useful in places where the "grab
! all you can, and do not give anything back" semantic is desirable.
  
  For example: C<< ^(?>a*)ab >> will never match, since C<< (?>a*) >>
  (anchored at the beginning of string, as above) will match I<all>
***************
*** 625,641 ****
  makes a zero-length assertion into an analogue of C<< (?>...) >>.
  (The difference between these two constructs is that the second one
  uses a capturing group, thus shifting ordinals of backreferences
! in the rest of a regular expression.)
  
  Consider this pattern:
  
      m{ \(
! 	  ( 
  	    [^()]+		# x+
!           | 
              \( [^()]* \)
            )+
!        \) 
       }x
  
  That will efficiently match a nonempty group with matching parentheses
--- 656,672 ----
  makes a zero-length assertion into an analogue of C<< (?>...) >>.
  (The difference between these two constructs is that the second one
  uses a capturing group, thus shifting ordinals of backreferences
! in the rest of a pattern.)
  
  Consider this pattern:
  
      m{ \(
! 	  (
  	    [^()]+		# x+
!           |
              \( [^()]* \)
            )+
!        \)
       }x
  
  That will efficiently match a nonempty group with matching parentheses
***************
*** 649,669 ****
  exponential performance will make it appear that your program has
  hung.  However, a tiny change to this pattern
  
!     m{ \( 
! 	  ( 
  	    (?> [^()]+ )	# change x+ above to (?> x+ )
!           | 
              \( [^()]* \)
            )+
!        \) 
       }x
  
! which uses C<< (?>...) >> matches exactly when the one above does (verifying
! this yourself would be a productive exercise), but finishes in a fourth
! the time when used on a similar string with 1000000 C<a>s.  Be aware,
! however, that this pattern currently triggers a warning message under
! the C<use warnings> pragma or B<-w> switch saying it
! C<"matches the null string many times">):
  
  On simple groups, such as the pattern C<< (?> [^()]+ ) >>, a comparable
  effect may be achieved by negative look-ahead, as in C<[^()]+ (?! [^()] )>.
--- 680,700 ----
  exponential performance will make it appear that your program has
  hung.  However, a tiny change to this pattern
  
!     m{ \(
! 	  (
  	    (?> [^()]+ )	# change x+ above to (?> x+ )
!           |
              \( [^()]* \)
            )+
!        \)
       }x
  
! which uses C<< (?>...) >> matches exactly when the one above does
! (verifying this yourself would be a productive exercise), but
! finishes in a fourth the time when used on a similar string with
! 1000000 C<a>s.  Be aware, however, that this pattern currently
! triggers a warning message under the C<use warnings> pragma or B<-w>
! switch saying it C<"matches the null string many times">.
  
  On simple groups, such as the pattern C<< (?> [^()]+ ) >>, a comparable
  effect may be achieved by negative look-ahead, as in C<[^()]+ (?! [^()] )>.
***************
*** 703,711 ****
  
  For example:
  
!     m{ ( \( )? 
!        [^()]+ 
!        (?(1) \) ) 
       }x
  
  matches a chunk of non-parentheses, possibly included in parentheses
--- 734,742 ----
  
  For example:
  
!     m{ ( \( )?
!        [^()]+
!        (?(1) \) )
       }x
  
  matches a chunk of non-parentheses, possibly included in parentheses
***************
*** 715,732 ****
  
  =head2 Backtracking
  
! NOTE: This section presents an abstract approximation of regular
! expression behavior.  For a more rigorous (and complicated) view of
! the rules involved in selecting a match among possible alternatives,
! see L<Combining pieces together>.
! 
! A fundamental feature of regular expression matching involves the
! notion called I<backtracking>, which is currently used (when needed)
! by all regular expression quantifiers, namely C<*>, C<*?>, C<+>,
! C<+?>, C<{n,m}>, and C<{n,m}?>.  Backtracking is often optimized
! internally, but the general principle outlined here is valid.
  
! For a regular expression to match, the I<entire> regular expression must
  match, not just part of it.  So if the beginning of a pattern containing a
  quantifier succeeds in a way that causes later parts in the pattern to
  fail, the matching engine backs up and recalculates the beginning
--- 746,763 ----
  
  =head2 Backtracking
  
! NOTE: This section presents an abstract approximation of how
! the regex engine behaves.  For a somewhat more rigorous (and harder
! to understand) view of the rules involved in selecting a match among
! possible alternatives, see L<Combining pieces together>.
! 
! A fundamental feature of pattern matching involves the notion called
! I<backtracking>, which is currently used (when needed) by all regex
! quantifiers, namely C<*>, C<*?>, C<+>, C<+?>, C<{n,m}>, and C<{n,m}?>.
! Backtracking is often optimized internally, but the general principle
! outlined here is valid.
  
! For a pattern to match, the I<entire> pattern must
  match, not just part of it.  So if the beginning of a pattern containing a
  quantifier succeeds in a way that causes later parts in the pattern to
  fail, the matching engine backs up and recalculates the beginning
***************
*** 740,752 ****
  	print "$2 follows $1.\n";
      }
  
! When the match runs, the first part of the regular expression (C<\b(foo)>)
  finds a possible match right at the beginning of the string, and loads up
  $1 with "Foo".  However, as soon as the matching engine sees that there's
  no whitespace following the "Foo" that it had saved in $1, it realizes its
  mistake and starts over again one character after where it had the
  tentative match.  This time it goes all the way until the next occurrence
! of "foo". The complete regular expression matches this time, and you get
  the expected output of "table follows foo."
  
  Sometimes minimal matching can help a lot.  Imagine you'd like to match
--- 771,783 ----
  	print "$2 follows $1.\n";
      }
  
! When the match runs, the first part of the pattern (C<\b(foo)>)
  finds a possible match right at the beginning of the string, and loads up
  $1 with "Foo".  However, as soon as the matching engine sees that there's
  no whitespace following the "Foo" that it had saved in $1, it realizes its
  mistake and starts over again one character after where it had the
  tentative match.  This time it goes all the way until the next occurrence
! of "foo". The complete pattern matches this time, and you get
  the expected output of "table follows foo."
  
  Sometimes minimal matching can help a lot.  Imagine you'd like to match
***************
*** 781,787 ****
  
  That won't work at all, because C<.*> was greedy and gobbled up the
  whole string. As C<\d*> can match on an empty string the complete
! regular expression matched successfully.
  
      Beginning is <I have 2 numbers: 53147>, number is <>.
  
--- 812,818 ----
  
  That won't work at all, because C<.*> was greedy and gobbled up the
  whole string. As C<\d*> can match on an empty string the complete
! pattern matched successfully.
  
      Beginning is <I have 2 numbers: 53147>, number is <>.
  
***************
*** 865,878 ****
  
  The search engine will initially match C<\D*> with "ABC".  Then it will
  try to match C<(?!123> with "123", which fails.  But because
! a quantifier (C<\D*>) has been used in the regular expression, the
  search engine can backtrack and retry the match differently
! in the hope of matching the complete regular expression.
  
  The pattern really, I<really> wants to succeed, so it uses the
! standard pattern back-off-and-retry and lets C<\D*> expand to just "AB" this
! time.  Now there's indeed something following "AB" that is not
! "123".  It's "C123", which suffices.
  
  We can deal with this by using both an assertion and a negation.
  We'll say that the first part in $1 must be followed both by a digit
--- 896,909 ----
  
  The search engine will initially match C<\D*> with "ABC".  Then it will
  try to match C<(?!123> with "123", which fails.  But because
! a quantifier (C<\D*>) has been used in the pattern, the
  search engine can backtrack and retry the match differently
! in the hope of matching the complete pattern.
  
  The pattern really, I<really> wants to succeed, so it uses the
! standard pattern back-off-and-retry and lets C<\D*> expand to just
! "AB" this time.  Now there's indeed something following "AB" that
! is not "123".  It's "C123", which suffices.
  
  We can deal with this by using both an assertion and a negation.
  We'll say that the first part in $1 must be followed both by a digit
***************
*** 886,904 ****
  
      6: got ABC
  
! In other words, the two zero-width assertions next to each other work as though
! they're ANDed together, just as you'd use any built-in assertions:  C</^$/>
! matches only if you're at the beginning of the line AND the end of the
! line simultaneously.  The deeper underlying truth is that juxtaposition in
! regular expressions always means AND, except when you write an explicit OR
! using the vertical bar.  C</ab/> means match "a" AND (then) match "b",
! although the attempted matches are made at different positions because "a"
! is not a zero-width assertion, but a one-width assertion.
  
  B<WARNING>: particularly complicated regular expressions can take
  exponential time to solve because of the immense number of possible
  ways they can use backtracking to try match.  For example, without
! internal optimizations done by the regular expression engine, this will
  take a painfully long time to run:
  
      'aaaaaaaaaaaa' =~ /((a{0,5}){0,5}){0,5}[c]/
--- 917,936 ----
  
      6: got ABC
  
! In other words, the two zero-width assertions next to each other
! work as though they're ANDed together, just as you'd use any built-in
! assertions:  C</^$/> matches only if you're at the beginning of the
! line AND the end of the line simultaneously.  The deeper underlying
! truth is that juxtaposition in regexes always means AND, except
! when you write an explicit OR using the vertical bar.  C</ab/> means
! match "a" AND (then) match "b", although the attempted matches are
! made at different positions because "a" is not a zero-width assertion,
! but a one-width assertion.
  
  B<WARNING>: particularly complicated regular expressions can take
  exponential time to solve because of the immense number of possible
  ways they can use backtracking to try match.  For example, without
! internal optimizations done by the regex engine, this will
  take a painfully long time to run:
  
      'aaaaaaaaaaaa' =~ /((a{0,5}){0,5}){0,5}[c]/
***************
*** 906,948 ****
  And if you used C<*>'s instead of limiting it to 0 through 5 matches,
  then it would take forever--or until you ran out of stack space.
  
! A powerful tool for optimizing such beasts is what is known as an
! "independent group",
! which does not backtrack (see L<C<< (?>pattern) >>>).  Note also that
! zero-length look-ahead/look-behind assertions will not backtrack to make
! the tail match, since they are in "logical" context: only 
! whether they match is considered relevant.  For an example
! where side-effects of look-ahead I<might> have influenced the
! following match, see L<C<< (?>pattern) >>>.
  
  =head2 Version 8 Regular Expressions
  
! In case you're not familiar with the "regular" Version 8 regex
  routines, here are the pattern-matching rules not described above.
  
  Any single character matches itself, unless it is a I<metacharacter>
  with a special meaning described here or above.  You can cause
  characters that normally function as metacharacters to be interpreted
! literally by prefixing them with a "\" (e.g., "\." matches a ".", not any
! character; "\\" matches a "\").  A series of characters matches that
  series of characters in the target string, so the pattern C<blurfl>
  would match "blurfl" in the target string.
  
  You can specify a character class, by enclosing a list of characters
  in C<[]>, which will match any one character from the list.  If the
! first character after the "[" is "^", the class matches any character not
! in the list.  Within a list, the "-" character specifies a
! range, so that C<a-z> represents all characters between "a" and "z",
! inclusive.  If you want either "-" or "]" itself to be a member of a
! class, put it at the start of the list (possibly after a "^"), or
! escape it with a backslash.  "-" is also taken literally when it is
! at the end of the list, just before the closing "]".  (The
! following all specify the same class of three characters: C<[-az]>,
! C<[az-]>, and C<[a\-z]>.  All are different from C<[a-z]>, which
! specifies a class containing twenty-six characters.)
! Also, if you try to use the character classes C<\w>, C<\W>, C<\s>,
! C<\S>, C<\d>, or C<\D> as endpoints of a range, that's not a range,
! the "-" is understood literally.
  
  Note also that the whole range idea is rather unportable between
  character sets--and even within character sets they may cause results
--- 938,979 ----
  And if you used C<*>'s instead of limiting it to 0 through 5 matches,
  then it would take forever--or until you ran out of stack space.
  
! A powerful tool for optimizing such beasts is the non-backtracking
! subexpression (see L<C<< (?>pattern) >>>).  Note also that
! zero-length look-ahead/look-behind assertions will not backtrack
! to make the tail match, since they are in "logical" context: only
! whether they match is considered relevant.  For an example where
! side-effects of look-ahead I<might> have influenced the following
! match, see L<C<< (?>pattern) >>>.
  
  =head2 Version 8 Regular Expressions
  
! In case you're not familiar with the standard Version 8 regex
  routines, here are the pattern-matching rules not described above.
  
  Any single character matches itself, unless it is a I<metacharacter>
  with a special meaning described here or above.  You can cause
  characters that normally function as metacharacters to be interpreted
! literally by prefixing them with a C<\> (e.g., C<\.> matches a ".", not any
! character; C<\\> matches a "\").  A series of characters matches that
  series of characters in the target string, so the pattern C<blurfl>
  would match "blurfl" in the target string.
  
  You can specify a character class, by enclosing a list of characters
  in C<[]>, which will match any one character from the list.  If the
! first character after the C<[> is C<^>, the class matches any
! character not in the list.  Within a list, the C<-> character
! specifies a range, so that C<a-z> represents all characters between
! "a" and "z", inclusive.  If you want either C<-> or C<]> itself to
! be a member of a class, put it at the start of the list (possibly
! after a C<^>), or escape it with a backslash.  C<-> is also taken
! literally when it is at the end of the list, just before the closing
! C<]>.  (The following all specify the same class of three characters:
! C<[-az]>, C<[az-]>, and C<[a\-z]>.  All are different from C<[a-z]>,
! which specifies a class containing twenty-six characters.) Also,
! if you try to use the character classes C<\w>, C<\W>, C<\s>, C<\S>,
! C<\d>, or C<\D> as endpoints of a range, that's not a range, the
! C<-> is understood literally.
  
  Note also that the whole range idea is rather unportable between
  character sets--and even within character sets they may cause results
***************
*** 955,985 ****
  used in C: "\n" matches a newline, "\t" a tab, "\r" a carriage return,
  "\f" a form feed, etc.  More generally, \I<nnn>, where I<nnn> is a string
  of octal digits, matches the character whose ASCII value is I<nnn>.
! Similarly, \xI<nn>, where I<nn> are hexadecimal digits, matches the
  character whose ASCII value is I<nn>. The expression \cI<x> matches the
! ASCII character control-I<x>.  Finally, the "." metacharacter matches any
  character except "\n" (unless you use C</s>).
  
! You can specify a series of alternatives for a pattern using "|" to
! separate them, so that C<fee|fie|foe> will match any of "fee", "fie",
! or "foe" in the target string (as would C<f(e|i|o)e>).  The
  first alternative includes everything from the last pattern delimiter
! ("(", "[", or the beginning of the pattern) up to the first "|", and
! the last alternative contains everything from the last "|" to the next
! pattern delimiter.  That's why it's common practice to include
! alternatives in parentheses: to minimize confusion about where they
! start and end.
! 
! Alternatives are tried from left to right, so the first
! alternative found for which the entire expression matches, is the one that
! is chosen. This means that alternatives are not necessarily greedy. For
! example: when matching C<foo|foot> against "barefoot", only the "foo"
! part will match, as that is the first alternative tried, and it successfully
! matches the target string. (This might not seem important, but it is
! important when you are capturing matched text using parentheses.)
! 
! Also remember that "|" is interpreted as a literal within square brackets,
! so if you write C<[fee|fie|foe]> you're really only matching C<[feio|]>.
  
  Within a pattern, you may designate subpatterns for later reference
  by enclosing them in parentheses, and you may refer back to the
--- 986,1018 ----
  used in C: "\n" matches a newline, "\t" a tab, "\r" a carriage return,
  "\f" a form feed, etc.  More generally, \I<nnn>, where I<nnn> is a string
  of octal digits, matches the character whose ASCII value is I<nnn>.
! Similarly, C<\xI<nn>>, where I<nn> are hexadecimal digits, matches the
  character whose ASCII value is I<nn>. The expression \cI<x> matches the
! ASCII character control-I<x>.  Finally, the C<.> metacharacter matches any
  character except "\n" (unless you use C</s>).
  
! You can specify a series of alternatives for a pattern using C<|>
! to separate them, so that C<fee|fie|foe> will match any of "fee",
! "fie", or "foe" in the target string (as would C<f(e|i|o)e>).  The
  first alternative includes everything from the last pattern delimiter
! (C<(>, C<[>, or the beginning of the pattern) up to the first C<|>,
! and the last alternative contains everything from the last C<|> to
! the next pattern delimiter.  That's why it's common practice to
! include alternatives in parentheses: to minimize confusion about
! where they start and end.
! 
! Alternatives are tried from left to right, so the first alternative
! found for which the entire expression matches, is the one that is
! chosen. This means that alternatives are not necessarily greedy.
! For example: when matching C<foo|foot> against "barefoot", only the
! "foo" part will match, as that is the first alternative tried, and
! it successfully matches the target string. (This might not seem
! important, but it is important when you are capturing matched text
! using parentheses.)
! 
! Also remember that C<|> is interpreted as a literal within square
! brackets, so if you write C<[fee|fie|foe]> you're really only
! matching C<[feio|]>.
  
  Within a pattern, you may designate subpatterns for later reference
  by enclosing them in parentheses, and you may refer back to the
***************
*** 998,1032 ****
  
      $pattern =~ s/(\W)/\\\1/g;
  
! This is grandfathered for the RHS of a substitute to avoid shocking the
! B<sed> addicts, but it's a dirty habit to get into.  That's because in
! PerlThink, the righthand side of a C<s///> is a double-quoted string.  C<\1> in
! the usual double-quoted string means a control-A.  The customary Unix
! meaning of C<\1> is kludged in for C<s///>.  However, if you get into the habit
! of doing that, you get yourself into trouble if you then add an C</e>
! modifier.
  
!     s/(\d+)/ \1 + 1 /eg;    	# causes warning under -w
  
  Or if you try to do
  
      s/(\d+)/\1000/;
  
! You can't disambiguate that by saying C<\{1}000>, whereas you can fix it with
! C<${1}000>.  The operation of interpolation should not be confused
! with the operation of matching a backreference.  Certainly they mean two
! different things on the I<left> side of the C<s///>.
  
  =head2 Repeated patterns matching zero-length substring
  
! B<WARNING>: Difficult material (and prose) ahead.  This section needs a rewrite.
  
  Regular expressions provide a terse and powerful programming language.  As
  with most other power tools, power comes together with the ability
  to wreak havoc.
  
  A common abuse of this power stems from the ability to make infinite
! loops using regular expressions, with something as innocuous as:
  
      'foo' =~ m{ ( o? )* }x;
  
--- 1031,1066 ----
  
      $pattern =~ s/(\W)/\\\1/g;
  
! This is grandfathered for the RHS of a substitute to avoid shocking
! the B<sed> addicts, but it's a dirty habit to get into.  That's
! because in PerlThink, the righthand side of a C<s///> is a double-quoted
! string.  C<\1> in the usual double-quoted string means a control-A.
! The customary Unix meaning of C<\1> is kludged in for C<s///>.
! However, if you get into the habit of doing that, you get yourself
! into trouble if you then add an C</e> modifier.
  
!     s/(\d+)/ \1 + 1 /eg;    	# triggers optional warnings
  
  Or if you try to do
  
      s/(\d+)/\1000/;
  
! You can't disambiguate that by saying C<\{1}000>, whereas you can
! fix it with C<${1}000>.  The operation of interpolation should not
! be confused with the operation of matching a backreference.  Certainly
! they mean two different things on the I<left> side of the C<s///>.
  
  =head2 Repeated patterns matching zero-length substring
  
! B<WARNING>: Difficult material (and prose) ahead.  This section
! needs a rewrite.
  
  Regular expressions provide a terse and powerful programming language.  As
  with most other power tools, power comes together with the ability
  to wreak havoc.
  
  A common abuse of this power stems from the ability to make infinite
! loops using regexes, with something as innocuous as:
  
      'foo' =~ m{ ( o? )* }x;
  
***************
*** 1061,1077 ****
  
     m{ (?: NON_ZERO_LENGTH | ZERO_LENGTH )* }x;
  
! is made equivalent to 
  
!    m{   (?: NON_ZERO_LENGTH )* 
!       | 
!         (?: ZERO_LENGTH )? 
      }x;
  
  The higher level-loops preserve an additional state between iterations:
! whether the last match was zero-length.  To break the loop, the following 
  match after a zero-length match is prohibited to have a length of zero.
! This prohibition interacts with backtracking (see L<"Backtracking">), 
  and so the I<second best> match is chosen if the I<best> match is of
  zero length.
  
--- 1095,1111 ----
  
     m{ (?: NON_ZERO_LENGTH | ZERO_LENGTH )* }x;
  
! is made equivalent to
  
!    m{   (?: NON_ZERO_LENGTH )*
!       |
!         (?: ZERO_LENGTH )?
      }x;
  
  The higher level-loops preserve an additional state between iterations:
! whether the last match was zero-length.  To break the loop, the following
  match after a zero-length match is prohibited to have a length of zero.
! This prohibition interacts with backtracking (see L<"Backtracking">),
  and so the I<second best> match is chosen if the I<best> match is of
  zero length.
  
***************
*** 1080,1132 ****
      $_ = 'bar';
      s/\w??/<$&>/g;
  
! results in C<"<><b><><a><><r><>">.  At each position of the string the best
! match given by non-greedy C<??> is the zero-length match, and the I<second 
! best> match is what is matched by C<\w>.  Thus zero-length matches
! alternate with one-character-long matches.
! 
! Similarly, for repeated C<m/()/g> the second-best match is the match at the 
! position one notch further in the string.
! 
! The additional state of being I<matched with zero-length> is associated with
! the matched string, and is reset by each assignment to pos().
! Zero-length matches at the end of the previous match are ignored
! during C<split>.
  
  =head2 Combining pieces together
  
! Each of the elementary pieces of regular expressions which were described
  before (such as C<ab> or C<\Z>) could match at most one substring
! at the given position of the input string.  However, in a typical regular
! expression these elementary pieces are combined into more complicated
! patterns using combining operators C<ST>, C<S|T>, C<S*> etc
! (in these examples C<S> and C<T> are regular subexpressions).
  
  Such combinations can include alternatives, leading to a problem of choice:
! if we match a regular expression C<a|ab> against C<"abc">, will it match
  substring C<"a"> or C<"ab">?  One way to describe which substring is
  actually matched is the concept of backtracking (see L<"Backtracking">).
  However, this description is too low-level and makes you think
  in terms of a particular implementation.
  
! Another description starts with notions of "better"/"worse".  All the
! substrings which may be matched by the given regular expression can be
! sorted from the "best" match to the "worst" match, and it is the "best"
! match which is chosen.  This substitutes the question of "what is chosen?"
! by the question of "which matches are better, and which are worse?".
  
  Again, for elementary pieces there is no such question, since at most
  one match at a given position is possible.  This section describes the
  notion of better/worse for combining operators.  In the description
! below C<S> and C<T> are regular subexpressions.
  
  =over
  
  =item C<ST>
  
  Consider two possible matches, C<AB> and C<A'B'>, C<A> and C<A'> are
! substrings which can be matched by C<S>, C<B> and C<B'> are substrings
! which can be matched by C<T>. 
  
  If C<A> is better match for C<S> than C<A'>, C<AB> is a better
  match than C<A'B'>.
--- 1114,1170 ----
      $_ = 'bar';
      s/\w??/<$&>/g;
  
! results in C<"<><b><><a><><r><>">.  At each position of the string
! the best match given by non-greedy C<??> is the zero-length match,
! and the I<second best> match is what is matched by C<\w>.  Thus
! zero-length matches alternate with one-character-long matches.
! 
! Similarly, for repeated C<m/()/g> the second-best match is the match
! at the position one notch further in the string.
! 
! The additional state of being I<matched with zero-length> is
! associated with the matched string, and is reset by each assignment
! to pos().  Zero-length matches at the end of the previous match are
! ignored during C<split>.
  
  =head2 Combining pieces together
  
! B<WARNING>: Difficult material (and prose) ahead.  This section
! needs a rewrite.
! 
! Each of the elementary pieces of regular expressions described
  before (such as C<ab> or C<\Z>) could match at most one substring
! at the given position of the input string.  However, in a typical
! regex, these elementary pieces are combined into more complicated
! patterns using combining operators C<ST>, C<S|T>, C<S*> etc (in
! these examples C<S> and C<T> are regular subexpressions).
  
  Such combinations can include alternatives, leading to a problem of choice:
! if we match a pattern C<a|ab> against C<"abc">, will it match
  substring C<"a"> or C<"ab">?  One way to describe which substring is
  actually matched is the concept of backtracking (see L<"Backtracking">).
  However, this description is too low-level and makes you think
  in terms of a particular implementation.
  
! Another description starts with notions of "better"/"worse".  All
! the substrings that may be matched by the given pattern can be
! sorted from the "best" match to the "worst" match, and it is the
! "best" match that's chosen.  This replaces the question of "what
! is chosen?" with the question of "which matches are better, and which
! are worse?"
  
  Again, for elementary pieces there is no such question, since at most
  one match at a given position is possible.  This section describes the
  notion of better/worse for combining operators.  In the description
! below, C<S> and C<T> are regular subexpressions.
  
  =over
  
  =item C<ST>
  
  Consider two possible matches, C<AB> and C<A'B'>, C<A> and C<A'> are
! substrings that can be matched by C<S>, C<B> and C<B'> are substrings
! that can be matched by C<T>.
  
  If C<A> is better match for C<S> than C<A'>, C<AB> is a better
  match than C<A'B'>.
***************
*** 1169,1175 ****
  
  Only the best match for C<S> is considered.  (This is important only if
  C<S> has capturing parentheses, and backreferences are used somewhere
! else in the whole regular expression.)
  
  =item C<(?!S)>, C<(?<!S)>
  
--- 1207,1213 ----
  
  Only the best match for C<S> is considered.  (This is important only if
  C<S> has capturing parentheses, and backreferences are used somewhere
! else in the whole pattern.)
  
  =item C<(?!S)>, C<(?<!S)>
  
***************
*** 1178,1184 ****
  
  =item C<(??{ EXPR })>
  
! The ordering is the same as for the regular expression which is
  the result of EXPR.
  
  =item C<(?(condition)yes-pattern|no-pattern)>
--- 1216,1222 ----
  
  =item C<(??{ EXPR })>
  
! The ordering is the same as for the pattern that is
  the result of EXPR.
  
  =item C<(?(condition)yes-pattern|no-pattern)>
***************
*** 1191,1259 ****
  
  The above recipes describe the ordering of matches I<at a given position>.
  One more rule is needed to understand how a match is determined for the
! whole regular expression: a match at an earlier position is always better
  than a match at a later position.
  
! =head2 Creating custom RE engines
  
  Overloaded constants (see L<overload>) provide a simple way to extend
! the functionality of the RE engine.
  
! Suppose that we want to enable a new RE escape-sequence C<\Y|> which
  matches at boundary between white-space characters and non-whitespace
  characters.  Note that C<(?=\S)(?<!\S)|(?!\S)(?<=\S)> matches exactly
  at these positions, so we want to have each C<\Y|> in the place of the
! more complicated version.  We can create a module C<customre> to do
! this:
  
!     package customre;
      use overload;
  
      sub import {
!       shift;
!       die "No argument to customre::import allowed" if @_;
!       overload::constant 'qr' => \&convert;
      }
  
      sub invalid { die "/$_[0]/: invalid escape '\\$_[1]'"}
  
!     my %rules = ( '\\' => '\\', 
  		  'Y|' => qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/ );
      sub convert {
!       my $re = shift;
!       $re =~ s{ 
!                 \\ ( \\ | Y . )
!               }
!               { $rules{$1} or invalid($re,$1) }sgex; 
!       return $re;
      }
  
! Now C<use customre> enables the new escape in constant regular
! expressions, i.e., those without any runtime variable interpolations.
! As documented in L<overload>, this conversion will work only over
! literal parts of regular expressions.  For C<\Y|$re\Y|> the variable
! part of this regular expression needs to be converted explicitly
! (but only if the special meaning of C<\Y|> should be enabled inside $re):
  
!     use customre;
      $re = <>;
      chomp $re;
!     $re = customre::convert $re;
      /\Y|$re\Y|/;
  
  =head1 BUGS
  
  This document varies from difficult to understand to completely
  and utterly opaque.  The wandering prose riddled with jargon is
! hard to fathom in several places.
! 
! This document needs a rewrite that separates the tutorial content
! from the reference content.
  
  =head1 SEE ALSO
  
  L<perlop/"Regexp Quote-Like Operators">.
  
  L<perlop/"Gory details of parsing quoted constructs">.
  
  L<perlfaq6>.
--- 1229,1300 ----
  
  The above recipes describe the ordering of matches I<at a given position>.
  One more rule is needed to understand how a match is determined for the
! whole pattern: a match at an earlier position is always better
  than a match at a later position.
  
! =head2 Defining Your Own Backslash Escapes
  
  Overloaded constants (see L<overload>) provide a simple way to extend
! the functionality of the regex engine.
  
! Suppose that we want to enable a new regex escape-sequence C<\Y|> that
  matches at boundary between white-space characters and non-whitespace
  characters.  Note that C<(?=\S)(?<!\S)|(?!\S)(?<=\S)> matches exactly
  at these positions, so we want to have each C<\Y|> in the place of the
! more complicated version.  We can create a C<custom_re> module to do this:
  
!     package custom_re;
      use overload;
  
      sub import {
! 	shift;
! 	die "No argument to custom_re::import allowed" if @_;
! 	overload::constant 'qr' => \&convert;
      }
  
      sub invalid { die "/$_[0]/: invalid escape '\\$_[1]'"}
  
!     my %rules = ( '\\' => '\\',
  		  'Y|' => qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/ );
+ 
      sub convert {
! 	my $re = shift;
! 	$re =~ s{
! 	    \\ ( \\ | Y . )
! 	} {
! 	    $rules{$1} || invalid($re,$1)
! 	}sgex;
! 	return $re;
      }
  
! Now C<use custom_re> enables the new escape in constant patterns,
! i.e., those without variable interpolation.  As documented in
! L<overload>, this conversion will work only over literal parts of
! regexes.  For C<\Y|$re\Y|> the variable part of this pattern needs
! to be converted explicitly (but only if the special meaning of
! C<\Y|> should be enabled inside $re):
  
!     use custom_re;
      $re = <>;
      chomp $re;
!     $re = custom_re::convert $re;
      /\Y|$re\Y|/;
  
  =head1 BUGS
  
  This document varies from difficult to understand to completely
  and utterly opaque.  The wandering prose riddled with jargon is
! hard to fathom in several places.  The expert material
! should be extracted into a I<perlreguts>(1) manpage.
  
  =head1 SEE ALSO
  
  L<perlop/"Regexp Quote-Like Operators">.
  
+ L<perlrequick>.
+ 
+ L<perlretut>.
+ 
  L<perlop/"Gory details of parsing quoted constructs">.
  
  L<perlfaq6>.
***************
*** 1261,1266 ****
--- 1302,1309 ----
  L<perlfunc/pos>.
  
  L<perllocale>.
+ 
+ L<perldebguts/"Debugger Internals">.
  
  I<Mastering Regular Expressions> by Jeffrey Friedl, published
  by O'Reilly and Associates.
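
Not part of the patch itself: a minimal sketch for anyone who wants to
check the behavior described in the paragraphs this patch touches. It
only exercises the examples already quoted in the pod (the C<${1}000>
fix, the zero-length loop, C<s/\w??/<$&>/g>, and C<a|ab> against
C<"abc">). The variable names and the 'x7y' test string are made up for
illustration.

```perl
use strict;
use warnings;

# ${1}000 keeps the backreference separate from the literal digits,
# which \1000 on the right-hand side of s/// cannot do.
(my $padded = 'x7y') =~ s/(\d+)/${1}000/;
print "$padded\n";                  # x7000y

# A zero-length subpattern under * does not loop forever: perl breaks
# the loop once an iteration matches the empty string.
my $terminated = 'foo' =~ m{ ( o? )* }x;
print "loop terminated\n" if $terminated;

# With non-greedy ??, the best match at each position is empty, and the
# second-best is one \w character, so the two kinds alternate.
(my $alternating = 'bar') =~ s/\w??/<$&>/g;
print "$alternating\n";             # <><b><><a><><r><>

# For S|T, a match of the earlier alternative is better, so a|ab
# picks "a" even though "ab" would also match at the same position.
my ($first) = 'abc' =~ /(a|ab)/;
print "$first\n";                   # a
```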
