[svn:perl6-synopsis] r8883 - doc/trunk/design/syn
From:
larry
Date:
April 20, 2006 02:09
Subject:
[svn:perl6-synopsis] r8883 - doc/trunk/design/syn
Message ID:
20060420090751.163BCCB9BC@x12.develooper.com
Author: larry
Date: Thu Apr 20 02:07:51 2006
New Revision: 8883
Modified:
doc/trunk/design/syn/S05.pod
Log:
Various clarifications.
Documented that null first alternative is ignored.
Removed colon separator after last modifier, now just use space.
Deleted the :once modifier. (A state variable suffices.)
A match object in boolean context isn't always forced to be eager.
Added :ratchet and :panic modifiers to limit backtracking in the parser.
Clarified when rules are allowed vs enforced in variable usage.
Added <%a|%b|%c> form for simple longest-token scoping.
Clarified that hash matches skip over key before value is matched.
Documented behavior of $<KEY>.
Added *+ ++ ?+ and :+ to force greed on specific atom.
Added token and parse rule variants for grammar productions.
Added <<<...>>> syntax.
Modified: doc/trunk/design/syn/S05.pod
==============================================================================
--- doc/trunk/design/syn/S05.pod (original)
+++ doc/trunk/design/syn/S05.pod Thu Apr 20 02:07:51 2006
@@ -11,11 +11,11 @@
=head1 VERSION
- Maintainer: Patrick Michaud <pmichaud@pobox.com>
+ Maintainer: Patrick Michaud <pmichaud@pobox.com> (& TimToady)
Date: 24 Jun 2002
- Last Modified: 6 Apr 2006
+ Last Modified: 20 Apr 2006
Number: 5
- Version: 15
+ Version: 16
This document summarizes Apocalypse 5, which is about the new regex
syntax. We now try to call them "rules" because they haven't been
@@ -30,8 +30,8 @@
it doesn't look like it. The individual capture variables (such as C<$0>,
C<$1>, etc.) are just elements of C<$/>.
-By the way, the numbered capture variables now start at C<$0>, C<$1>,
-C<$2>, etc. See below.
+By the way, the numbered capture variables now start at C<$0> rather than
+C<$1>. See below.
=head1 Unchanged syntactic features
@@ -68,6 +68,8 @@
=item *
The extended syntax (C</x>) is no longer required...it's the default.
+(In fact, it's pretty much mandatory--the only way to get back to
+the old syntax is with the C<:Perl5>/C<:P5> modifier.)
=item *
@@ -78,7 +80,11 @@
There is no C</e> evaluation modifier on substitutions; instead use:
- s/pattern/{ code() }/
+ s/pattern/{ doit() }/
+
+Instead of C</ee> say:
+
+ s/pattern/{ eval doit() }/
=item *
@@ -87,8 +93,9 @@
m:g:i/\s* (\w*) \s* ,?/;
Every modifier must start with its own colon. The delimiter must be
-separated from the final modifier by a colon or whitespace if it would
-be taken as an argument to the preceding modifier.
+separated from the final modifier by whitespace if it would be taken
+as an argument to the preceding modifier (which is true for any
+bracketing character).
=item *
@@ -127,19 +134,13 @@
is roughly equivalent to
- m:p/.*? pattern/
-
-=item *
-
-The new C<:once> modifier replaces the Perl 5 C<?...?> syntax:
+ m:p/.*? <( pattern )> /
- m:once/ pattern / # only matches first time
+Also note that any rule called as a subrule is implicitly anchored to the
+current position anyway.
=item *
-[Note: We're still not sure if :w is ultimately going to work exactly
-as described below. But this is how it works for now.]
-
The new C<:w> (C<:words>) modifier causes whitespace sequences to be
replaced by C<\s*> or C<\s+> subpattern as defined by the C<< <?ws> >> rule.
@@ -164,6 +165,9 @@
C<< <?ws> >> can't decide what to do until it sees the data. It still does
the right thing. If not, define your own C<< <?ws> >> and C<:w> will use that.
+In general you don't need to use C<:w> within grammars because
+the parse rules automatically handle whitespace policy for you.
+
=item *
New modifiers specify Unicode level:
@@ -177,9 +181,9 @@
=item *
-The new C<:perl5> modifier allows Perl 5 regex syntax to be used instead:
+The new C<:Perl5> modifier allows Perl 5 regex syntax to be used instead:
- m:perl5/(?mi)^[a-z]{1,2}(?=\s)/
+ m:Perl5/(?mi)^[a-z]{1,2}(?=\s)/
(It does not go so far as to allow you to put your modifiers at
the end.)
@@ -194,16 +198,16 @@
If followed by an C<x>, it means repetition. Use C<:x(4)> for the
general form. So
- s:4x { (<?ident>) = (\N+) $$}{$0 => $1};
+ s:4x [ (<?ident>) = (\N+) $$] [$0 => $1];
is the same as:
- s:x(4) { (<?ident>) = (\N+) $$}{$0 => $1};
+ s:x(4) [ (<?ident>) = (\N+) $$] [$0 => $1];
which is almost the same as:
$_.pos = 0;
- s:c{ (<?ident>) = (\N+) $$}{$0 => $1} for 1..4;
+ s:c [ (<?ident>) = (\N+) $$] [$0 => $1] for 1..4;
except that the string is unchanged unless all four matches are found.
However, ranges are allowed, so you can say C<:x(1..4)> to change anywhere
@@ -250,10 +254,15 @@
$str = "abracadabra";
if $str ~~ m:exhaustive/ a (.*) a / {
- @substrings = $/.matches(); # br brac bracad bracadabr
- # c cad cadabr d dabr br
+ say "@()"; # br brac bracad bracadabr c cad cadabr d dabr br
}
+Note that the C<~~> above can return as soon as the first match is found,
+and the rest of the matches may be performed lazily by C<@()>.
+
+[Conjecture: the C<:exhaustive> modifier should have an optional argument
+specifying how many seconds to run before giving up, since it's trivially
+easy to ask for the heat death of the universe to happen first.]
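
As a rough illustration of the lazy behaviour described above, here is a minimal Python sketch. It is not Perl 6 and not how any rule engine is implemented. The function name `exhaustive_matches` is hypothetical, and the pattern `a (.*) a` is hard-coded rather than compiled. A generator yields the captures one at a time, so a boolean test only has to find the first one:

```python
def exhaustive_matches(text):
    """Lazily yield every capture of (.*) for the pattern  a (.*) a.

    Hypothetical helper: tries every start position holding an 'a' and,
    for each one, every later 'a' as the closing delimiter.
    """
    for start, first in enumerate(text):
        if first != "a":
            continue
        for end in range(start + 1, len(text)):
            if text[end] == "a":
                yield text[start + 1:end]


matches = exhaustive_matches("abracadabra")
found = next(matches, None) is not None   # boolean answer after one match
remaining = list(matches)                 # the rest is computed only on demand
print(found, len(remaining))
```

Collecting the whole generator for "abracadabra" gives the ten substrings listed in the example above, in the same order.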
=item *
@@ -275,7 +284,24 @@
=item *
-The C<:i>, C<:w>, C<:perl5>, and Unicode-level modifiers can be
+The new C<:ratchet> modifier causes this rule to not backtrack by default.
+(Generally you do not use this modifier directly, since it's implied by
+C<token> and C<parse> declarations.) The effect of this modifier is
+to imply a C<:> after every construct that could backtrack, including
+bare C<*>, C<+>, and C<?> quantifiers, as well as alternations.
+
+=item *
+
+The new C<:panic> modifier causes this rule and all invoked subrules
+to try to backtrack on any rules that would otherwise default to
+not backtracking because they have C<:ratchet> set. Never panic
+unless you're desperate and want the pattern matcher to do a lot of
+unnecessary work. If you have an error in your grammar, it's almost
+certainly a bad idea to fix it by backtracking.
+
+=item *
+
+The C<:i>, C<:w>, C<:Perl5>, and Unicode-level modifiers can be
placed inside the rule (and are lexically scoped):
m/:w alignment = [:i left|right|cent[er|re]] /
@@ -297,7 +323,6 @@
To use parens or brackets for your delimiters you have to separate:
m:fuzzy (pattern);
- m:fuzzy:(pattern);
or you'll end up with:
@@ -346,7 +371,10 @@
=item *
-An unescaped C<#> now always introduces a comment.
+An unescaped C<#> now always introduces a comment. If followed
+by an opening bracket character (and if not in the first column),
+it introduces an embedded comment that terminates with the closing
+bracket. Otherwise the comment terminates at the newline.
=item *
@@ -438,7 +466,7 @@
so that the closure is never actually run in that case. But it's
a closure that must be run in the general case, so you can use
it to generate a range on the fly based on the earlier matching.
-(Of course, bear in mind the closure is run I<before> attempting to
+(Of course, bear in mind the closure must be run I<before> attempting to
match whatever it quantifies.)
=item *
@@ -473,7 +501,9 @@
/ \Q$var\E /
-(To get rule interpolation use an assertion - see below)
+However, if C<$var> contains a rule object, rather than attempting to
+convert it to a string, it is called as if you said C<< <$var> >>.
+See assertions below.
=item *
@@ -486,7 +516,8 @@
/ [ @cmds[0] | @cmds[1] | @cmds[2] | ... ] /
-As with a scalar variable, each element is matched as a literal.
+As with a scalar variable, each element is matched as a literal unless
+it happens to be a rule object, in which case it is matched as a subrule.
=item *
@@ -503,15 +534,23 @@
=item *
-If it is a string or rule object, it is executed as a subrule.
+If it is a string, it is matched literally, starting after where the
+key left off matching.
=item *
-If it has the value 1, nothing special happens beyond the match.
+If it is a rule object, it is executed as a subrule, with an initial
+position after the matched key.
=item *
-Any other value causes the match to fail.
+If it has the value 1, nothing special happens except that the key match
+succeeds.
+
+=item *
+
+Any other value causes the match to fail. In particular, shorter keys
+are not tried if a longer one matches and fails.
=back
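
The four cases for a hash value can be sketched in Python as follows. This is an illustration of the dispatch described in the list above, under stated assumptions. `match_hash` is a hypothetical name, a "rule object" is modelled as a plain callable taking `(text, pos)` and returning an end position or `None`, and the longest key is found by a simple scan rather than by any real tokenizer.

```python
def match_hash(table, text, pos=0):
    """Return the end position of a successful match, or None on failure."""
    candidates = [key for key in table if text.startswith(key, pos)]
    if not candidates:
        return None
    # Only the longest matching key is considered; if its value fails,
    # shorter keys are NOT tried.
    key = max(candidates, key=len)
    after_key = pos + len(key)
    value = table[key]
    if isinstance(value, str):
        # A string is matched literally, starting where the key left off.
        if text.startswith(value, after_key):
            return after_key + len(value)
        return None
    if callable(value):
        # A "rule object" runs as a subrule from the position after the key.
        return value(text, after_key)
    if value == 1:
        # The key match alone succeeds.
        return after_key
    return None  # any other value fails the match


def open_paren(text, pos):
    """Hypothetical subrule: succeeds on a single '('."""
    return pos + 1 if text.startswith("(", pos) else None


table = {"if": 1, "iffy": " business", "for": open_paren, "no": 0}
print(match_hash(table, "iffy stuff"))  # None: "iffy" fails, "if" is not retried
```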
@@ -547,6 +586,11 @@
tree and looking for things in the opposite order going to the left.
It is illegal to do lookbehind on a pattern that cannot be reversed.
+Note: the effect of a forward-scanning lookbehind at the top level
+can be achieved with:
+
+ / .*? prestuff <( mainpat >) /
+
=item *
A leading C<?> causes the assertion not to capture what it matches (see
@@ -556,28 +600,66 @@
/ <?ident> <ws> / # only $/<ws> captured
/ <?ident> <?ws> / # nothing captured
+The non-capturing behavior may be overridden with a C<:keepall>.
+
=item *
A leading C<$> indicates an indirect rule. The variable must contain
-either a hard reference to a rule, or a string containing the rule.
+either a rule object, or a string to be compiled as the rule. The
+string is never matched literally.
=item *
A leading C<::> indicates a symbolic indirect rule:
- / <::($somename)>
+ / <::($somename)> /
-The variable must contain the name of a rule.
+The variable must contain the name of a rule. By the rules of single method
+dispatch this is first searched for in the current grammar and its ancestors.
+If this search fails an attempt is made to dispatch via MMD, in which case
+it can find rules defined as multis rather than methods.
=item *
A leading C<@> matches like a bare array except that each element
-is treated as a rule (string or hard ref) rather than as a literal.
+is treated as a rule (string or rule object) rather than as a literal.
+That is, a string is forced to be compiled as a rule rather than matched
+literally. (There is no difference for a rule object.)
=item *
-A leading C<%> matches like a bare hash except that each key
-is treated as a rule (string or hard ref) rather than as a literal.
+A leading C<%> matches like a bare hash except that each value is always
+treated as a rule, even if it is a string that must be compiled to a rule
+at match time.
+
+With both bare hash and hash in angles, the key is always skipped
+over before calling any rule in the value. That rule may, however,
+magically access the key anyway as if the rule had started before the
+key and matched with C<< <KEY> >> assertion. That is, C<< $<KEY> >>
+will contain the keyword or token that this rule was looked up under,
+and that value will be returned by the current match object even if
+you do nothing special with it within the match. (This also works
+for the name of a macro as seen from an C<is parsed> rule, since
+internally that turns into a hash lookup.)
+
+As with bare hash, the longest key matches according to the longest token
+rule, but in addition, you may combine multiple hashes under the same
+longest-token consideration like this:
+
+ <%statement|%prefix|%term>
+
+This means that, despite being in a later hash, C<< %term<food> >>
+will be selected in preference to C<< %prefix<foo> >> because it's
+the longer token. However, if there is a tie, the earlier hash wins,
+so C<< %statement<if> >> hides any C<< %prefix<if> >> or C<< %term<if> >>.
+
+In contrast, if you say
+
+ [ <%prefix> | <%term> ]
+
+a C<< %prefix<foo> >> would be selected in preference to a C<< %term<food> >>.
+(Which is not what you usually want if your language is to do longest-token
+consistently.)
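
The difference between the combined form and a plain alternation of hashes can be sketched in Python. This is a minimal illustration, not the actual lookup mechanism. The names `longest_token` and `first_alternative` are hypothetical, and each hash is modelled as a `(name, dict)` pair whose values are ignored.

```python
def longest_token(tables, text, pos=0):
    """Model of <%a|%b|%c>: longest key across all tables wins,
    and on a tie the earlier table wins."""
    best = None  # (table_name, key)
    for name, table in tables:
        for key in table:
            # Strictly longer only, so an earlier table keeps a tie.
            if text.startswith(key, pos) and (best is None or len(key) > len(best[1])):
                best = (name, key)
    return best


def first_alternative(tables, text, pos=0):
    """Model of [ <%a> | <%b> ]: the first table with any match wins."""
    for entry in tables:
        result = longest_token([entry], text, pos)
        if result is not None:
            return result
    return None


TABLES = [
    ("statement", {"if": 1}),
    ("prefix", {"foo": 1, "if": 1}),
    ("term", {"food": 1, "if": 1}),
]

print(longest_token(TABLES, "food fight"))          # ('term', 'food')
print(longest_token(TABLES, "if x"))                # ('statement', 'if')
print(first_alternative(TABLES[1:], "food fight"))  # ('prefix', 'foo')
```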
=item *
@@ -592,7 +674,7 @@
rule closure binds the I<result object> for this match, ignores the
rest of the current rule, and reports success:
- / (\d) <{ return $0.sqrt }> NotReached /;
+ / (\d) <{ return $0.sqrt }> NotReached /;
This has the effect of capturing the square root of the numified string,
instead of the string. The C<NotReached> part is not reached.
@@ -654,14 +736,16 @@
/ <after foo> \d+ <before bar> /
except that the scan for "foo" can be done in the forward direction,
-while a lookbehind assertion would presumably scan for \d+ and then
-match "foo" backwards. The use of C<< <(...)> >> affects only the
+while a lookbehind assertion would presumably scan for C<\d+> and then
+match "C<foo>" backwards. The use of C<< <(...)> >> affects only the
meaning of the "result object" and the positions of the beginning and
ending of the match. That is, after the match above, C<$()> contains
only the digits matched, and C<.pos> is pointing to after the digits.
Other captures (named or numbered) are unaffected and may be accessed
through C<$/>.
+It is a syntax error to use an unbalanced C<< <( >> or C<< )> >>.
+
=item *
A leading C<[> or C<+> indicates an enumerated character class. Ranges
@@ -717,6 +801,17 @@
/ <!before _ > / # We aren't before an _
+Note that C<< <!alpha> >> is different from C<< <-alpha> >> because the
+latter matches C</./> when it is not an alpha.
+
+=item *
+
+Conjecture: Multiple opening angles are matched by a corresponding
+number of closing angles, and otherwise function as single angles.
+This can be used to visually isolate unmatched angles inside:
+
+ <<<Ccode: a >> 1>>>
+
=back
=head1 Backslash reform
@@ -904,6 +999,49 @@
causes it to produce a C<Code> or C<Rule> reference, which the switch
statement then selects upon.
+=item *
+
+Just as C<rx> has variants, so does the C<rule> declarator.
+In particular, there are two special variants for use in grammars:
+C<token> and C<parse>.
+
+A token declaration:
+
+ token ident { [ <alpha> | _ ] \w+ }
+
+never backtracks by default. That is, it likes to commit to whatever
+it has scanned so far. The above is equivalent to
+
+ rule ident { [ <alpha>: | _ ]: \w+: }
+
+but rather easier to read. The bare C<*>, C<+> and C<?> quantifiers
+never backtrack in a C<token> unless some outer rule has specified a
+C<:panic> option that applies. If you want to prevent even that, use
+C<*:>, C<+:> or C<?:> to prevent any backtracking into the quantifier.
+If you want to explicitly backtrack, append either a C<?> or a C<+>
+to the quantifier. The C<?> forces minimal matching as usual,
+while the C<+> forces greedy matching. The C<token> declarator is
+really just short for
+
+ rule :ratchet { ... }
+
+The other is the C<parse> declarator, for declaring non-terminal
+productions in a grammar. It also does not backtrack unless a
+C<:panic> is in effect or you explicitly specify a backtracking
+quantifier. In addition, a C<parse> rule also assumes C<:words>.
+A C<parse> is really short for:
+
+ rule :ratchet :words { ... }
+
+=item *
+
+The Perl 5 C<?...?> syntax ("match once") was rarely used and can be
+now emulated more cleanly with a state variable:
+
+ (state $x) ||= / pattern /; # only matches first time
+
+To reset the pattern, simply set C<$x = 0>.
+
=back
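
To make the effect of C<:ratchet> concrete, here is a small Python sketch that contrasts ordinary backtracking with a matcher that commits after every atom. It is an analogy only. `backtracking_match` and `ratchet_match` are hypothetical helpers, atoms are written as Python `re` fragments rather than Perl 6 rules, and the "ratchet" is modelled by matching each atom on its own and never revisiting it.

```python
import re


def backtracking_match(atoms, text):
    """Join the atoms into one pattern; the engine may backtrack freely."""
    pattern = "".join(f"(?:{atom})" for atom in atoms)
    return re.match(pattern, text) is not None


def ratchet_match(atoms, text):
    """Match each atom once, greedily, and commit to whatever it consumed."""
    pos = 0
    for atom in atoms:
        m = re.compile(atom).match(text, pos)
        if m is None:
            return False   # no earlier atom is retried with a shorter match
        pos = m.end()
    return True


atoms = [r"\w+", r"\d"]
print(backtracking_match(atoms, "abc1"))  # True: \w+ gives back the "1"
print(ratchet_match(atoms, "abc1"))       # False: \w+ keeps "abc1", \d has nothing left
```

When the first atom cannot swallow what the second one needs, for example `[a-z]+` followed by `\d`, both matchers agree. This is why a well-factored token definition rarely needs backtracking.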
=head1 Backtracking control
@@ -912,14 +1050,40 @@
=item *
+By default, backtracking is greedy in C<rx>, C<m>, C<s>, and the
+like. It's also greedy in ordinary rules. In C<parse> and C<token>
+declarations, backtracking must be explicit.
+
+=item *
+
+To force the preceding atom to do eager backtracking,
+append a C<:?> or C<?> to the atom. If the preceding token is
+a quantifier, the C<:> may be omitted, so C<*?> works just as
+in Perl 5.
+
+=item *
+
+To force the preceding atom to do greedy backtracking,
+append a C<:+> or C<+> to the atom. If the preceding token
+is a quantifier, the C<:> may be omitted. (Perl 5 has no
+corresponding construct because backtracking always defaults
+to greedy in Perl 5.)
+
+=item *
+
+To force the preceding atom to do no backtracking, use a single C<:>
+without a subsequent C<?> or C<+>.
Backtracking over a single colon causes the rule engine not to retry
the preceding atom:
- m:w/ \( <expr> [ , <expr> ]* : \) /
+ m:w/ \( <expr> [ , <expr> ]*: \) /
(i.e. there's no point trying fewer C<< <expr> >> matches, if there's
no closing parenthesis on the horizon)
+To force all the atoms in an expression not to backtrack by default,
+use C<:ratchet> or C<parse> or C<token>.
+
=item *
Backtracking over a double colon causes the surrounding group of
@@ -931,8 +1095,12 @@
]
/
-(i.e. there's no point trying to match a different keyword if one
-was already found but failed).
+(i.e. there's no point trying to match a different keyword if one was
+already found but failed). Note that you can still back into such an
+alternation, so you may also need to put C<:> after it if you also
+want to disable that. If an explicit or implicit C<:ratchet> has
+disabled backtracking, you need to put C<:+> after the alternation
+to enable backing into another alternative if the first pick fails.
=item *
@@ -993,9 +1161,10 @@
=item *
-...so too you can have anonymous rules and I<named> rules:
+...so too you can have anonymous rules and I<named> rules (and tokens,
+and parses):
- rule ident { [<alpha>|_] \w* }
+ token ident { [<alpha>|_] \w* }
# and later...
@@ -1007,11 +1176,11 @@
such as:
rule serial_number { <[A..Z]> \d**{8} }
- rule type { alpha | beta | production | deprecated | legacy }
+ token type { alpha | beta | production | deprecated | legacy }
in other rules as named assertions:
- rule identification { [soft|hard]ware <type> <serial_number> }
+ parse identification { [soft|hard]ware <type> <serial_number> }
=back
@@ -1049,6 +1218,10 @@
This makes it easier to catch errors like this:
+ /a|b|c|/
+
+As a special case, however, the first null alternative in a match like
+
m:w/ [
| if :: <expr> <block>
| for :: <list> <block>
@@ -1056,6 +1229,19 @@
]
/
+is simply ignored. Only the first alternative is special that way.
+If you write:
+
+ m:w/ [
+ if :: <expr> <block> |
+ for :: <list> <block> |
+ loop :: <loop_controls>? <block> |
+ ]
+ /
+
+
+it's still an error.
+
=item *
However, it's okay for a non-null syntactic construct to have a degenerate
@@ -1099,6 +1285,10 @@
# or:
/pattern/; if $/ {...}
+With C<:global> or C<:overlap> or C<:exhaustive> the boolean is
+allowed to return true on the first match. The C<Match> object can
+produce the rest of the results lazily if evaluated in list context.
+
=item *
In string context it evaluates to the stringified value of its
@@ -1121,7 +1311,7 @@
=item *
-When used as a scalar, a Match object evaluates to its underlying
+When used as a scalar, a C<Match> object evaluates to its underlying
result object. Usually this is just the entire match string, but
you can override that by calling C<return> inside a rule:
@@ -1146,7 +1336,7 @@
Additionally, the C<Match> object delegates its C<coerce> calls
(such as C<+$match> and C<~$match>) to its underlying result object.
The only exception is that C<Match> handles boolean coercion itself,
-which returns whether the match had succeeded.
+which returns whether the match had succeeded at least once.
This means that these two work the same:
@@ -1155,7 +1345,7 @@
=item *
-When used as an array, a Match object pretends to be an array of all
+When used as an array, a C<Match> object pretends to be an array of all
its positional captures. Hence
($key, $val) = m:w/ (\S+) => (\S+)/;
@@ -1179,11 +1369,13 @@
Note that, as a scalar variable, C<$/> doesn't automatically flatten
in list context. Use C<@()> as a shorthand for C<@($/)> to flatten
-the positional captures under list context.
+the positional captures under list context. Note that a C<Match> object
+is allowed to evaluate its match lazily in list context. Use C<**@()>
+to force an eager match.
=item *
-When used as a hash, a Match object pretends to be a hash of all its named
+When used as a hash, a C<Match> object pretends to be a hash of all its named
captures. The keys do not include any sigils, so if you capture to
variable C<< @<foo> >> its real name is C<$/{'foo'}> or C<< $/<foo> >>.
However, you may still refer to it as C<< @<foo> >> anywhere C<$/>
@@ -1192,7 +1384,8 @@
Note that, as a scalar variable, C<$/> doesn't automatically flatten
in list context. Use C<%()> as a shorthand for C<%($/)> to flatten as a
-hash, or bind it to a variable of the appropriate type.
+hash, or bind it to a variable of the appropriate type. As with C<@()>,
+it's possible for C<%()> to produce its pairs lazily in list context.
=item *
@@ -1240,7 +1433,7 @@
incomplete C<Match> object (which can be modified via the internal C<$/>).
For example:
- $str ~~ / foo # Match 'foo'
+ $str ~~ / foo # Match 'foo'
{ $/ = 'bar' } # But pretend we matched 'bar'
/;
say $/; # says 'bar'
@@ -1556,7 +1749,9 @@
=item *
-Any call to a named C<< <rule> >> within a pattern is known as a I<subrule>.
+Any call to a named C<< <rule> >> within a pattern is known as a
+I<subrule>, whether that rule is actually defined as a C<rule> or
+C<token> or C<parse> or even an ordinary C<method> or C<multi>.
=item *
@@ -1599,9 +1794,9 @@
=item *
The hash entries of a C<Match> object can be referred to using any of the
-standard hash access notations (C<$/{'foo'}>, C<< $/<bar> >>, C<$/�baz�>,
+standard hash access notations (C<$/{'foo'}>, C<< $/<bar> >>, C<$/«baz»>,
etc.), or else via corresponding lexically scoped aliases (C<< $<foo> >>,
-C<$�bar�>, C<< $<baz> >>, etc.) So the previous example also implies:
+C<$«bar»>, C<< $<baz> >>, etc.) So the previous example also implies:
# $<ident> $0<ident>
# __^__ __^__
@@ -2334,10 +2529,10 @@
so too a grammar can collect a set of named rules together:
grammar Identity {
- rule name :w { Name = (\N+) }
- rule age :w { Age = (\d+) }
- rule addr :w { Addr = (\N+) }
- rule desc {
+ parse name { Name = (\N+) }
+ parse age { Age = (\d+) }
+ parse addr { Addr = (\N+) }
+ parse desc {
<name> \n
<age> \n
<addr> \n
@@ -2351,22 +2546,22 @@
Like classes, grammars can inherit:
grammar Letter {
- rule text { <greet> <body> <close> }
+ parse text { <greet> <body> <close> }
- rule greet :w { [Hi|Hey|Yo] $<to>:=(\S+?) , $$}
+ parse greet { [Hi|Hey|Yo] $<to>:=(\S+?) , $$}
- rule body { <line>+ }
+ parse body { <line>+? }
- rule close :w { Later dude, $<from>:=(.+) }
+ parse close { Later dude, $<from>:=(.+) }
# etc.
}
grammar FormalLetter is Letter {
- rule greet :w { Dear $<to>:=(\S+?) , $$}
+ parse greet { Dear $<to>:=(\S+?) , $$}
- rule close :w { Yours sincerely, $<from>:=(.+) }
+ parse close { Yours sincerely, $<from>:=(.+) }
}
@@ -2382,14 +2577,15 @@
grammar Perl { # Perl's own grammar
- rule prog { <statement>* }
+ parse prog { <statement>* }
- rule statement { <decl>
+ parse statement {
+ | <decl>
| <loop>
| <label> [<cond>|<sideff>|;]
}
- rule decl { <sub> | <class> | <use> }
+ parse decl { <sub> | <class> | <use> }
# etc. etc. etc.
}
@@ -2439,7 +2635,7 @@
$str.trans( %mapping.pairs.sort );
-Use the .= form to do a translation in place:
+Use the C<.=> form to do a translation in place:
$str.=trans( %mapping.pairs.sort );