Re: comprehensive list of perl6 rule tokens
From: Patrick R. Michaud
Date: May 26, 2005 09:16
Subject: Re: comprehensive list of perl6 rule tokens
Message ID: 20050526161942.GC11520@pmichaud.com
On Tue, May 24, 2005 at 08:25:03PM -0400, Jeff 'japhy' Pinyan wrote:
> I have looked through the latest
> revisions of Apo05 and Syn05 (from Dec 2004) and come up with the
> following list:
>
> http://japhy.perlmonk.org/perl6/rules.txt
I'll review the list below, but it's also worthwhile to read
http://www.nntp.perl.org/group/perl.perl6.language/21120
which is Larry's latest missive on character classes, and
http://www.nntp.perl.org/group/perl.perl6.language/20985
which describes the capturing semantics (but be sure to note
the lengthy threads that follow concerning changes in the
indexing from $1, $2, ... to $0, $1, ... ).
Here are my comments on the table at http://japhy.perlmonk.org/perl6/rules.txt,
downloaded 26-May 1526 UTC:
    CHAR  EXAMPLE        IMPL  DESCRIPTION
    ===========================================
    &     a&b            N     conjunction
          &var           N     subroutine
I'm not sure that "&var" means subroutine anymore. A05 does mention
it, but S05 does not, and I think it invites way too much confusion
with conjunctions. Consider "a&var($x|$y)" versus "a & var ( $x | $y )".
But if we are allowing &var (and I hope we do not), then the parens are
required.
          x*             Y     previous atom 0 or more times
          x**{n..m}      N     previous atom n..m times
Keep in mind that the "n..m" can actually be any sort of closure
(although it's not implemented that way yet in PGE). The rules
engine will generally optimize parsing and handling of "n..m" when
it can (e.g., when "n" and "m" are both constants).
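
To illustrate the distinction between constant and closure bounds, here is a minimal sketch in Python (not PGE's code; the names match_repeat and Bounds are hypothetical, and the matcher is greedy with no backtracking). Constant bounds are used as given, while a closure has to be called at match time because its result may depend on runtime state.

```python
from typing import Callable, Tuple, Union

# Either a constant (n, m) pair or a closure that produces one.
Bounds = Union[Tuple[int, int], Callable[[], Tuple[int, int]]]


def match_repeat(text: str, pos: int, atom: str, bounds: Bounds) -> int:
    """Greedily match `atom` between n and m times starting at `pos`.

    Returns the end position of the match, or -1 if fewer than n
    repetitions were found. This is an illustrative model only.
    """
    # A tuple is not callable, so constants skip the call entirely.
    n, m = bounds() if callable(bounds) else bounds
    count = 0
    while count < m and text.startswith(atom, pos):
        pos += len(atom)
        count += 1
    return pos if count >= n else -1
```

With constant bounds, match_repeat("xxxxy", 0, "x", (2, 3)) stops after three repetitions; passing a lambda instead defers the choice of bounds until the match runs.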
    (     (x)            Y     capture 'x'
    )                    Y     must match opening '('
It may be worth noting that parens not only capture, they also
introduce a new scope for any nested subpattern and subrule captures.
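
As a rough picture of what "a new scope" means, the sketch below contrasts Python's flat group numbering with a nested capture structure. The Capture class and the hand-built result are hypothetical illustrations, not output from any Perl 6 implementation, and the zero-based indices assume the $0, $1, ... numbering discussed in the threads mentioned above.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Python's re numbers every group in one flat sequence.
flat = re.match(r"((a)(b))(c)", "abc").groups()


@dataclass
class Capture:
    """Hypothetical capture node: matched text plus its own nested captures."""
    text: str
    subs: List["Capture"] = field(default_factory=list)


# Hand-built model of the same match with scoped numbering: the outer
# parens hold their inner captures, and numbering restarts inside them.
scoped = [
    Capture("ab", [Capture("a"), Capture("b")]),  # first capture, with two nested
    Capture("c"),                                 # second capture
]
```

In the flat form "b" is the third group overall; in the scoped form it is the second capture inside the first one, reached as scoped[0].subs[1].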
          :ignorecase    N     case insensitivity :i
          :global        N     match globally :g
          :continue      N     start scanning after previous match :c
          ...etc
I'm not sure these are "tokens" in the sense of "single unit of purpose"
in your original message. I think these are all adverbs, and the "token"
is just the initial C<:> at the beginning of a group.
          :keepall       N     all rules and invoked rules remember everything
That's now ":parsetree" according to Damian's proposed capture rules.
          <commit>       N     backtracking fails completely
          <cut>          N     remove what matched up to this point from the string
          <after P>      N     we must be after the pattern P
          <!after P>     N     we must NOT be after the pattern P
          <before P>     N     we must be before the pattern P
          <!before P>    N     we must NOT be before the pattern P
As with ':words', etc., I'm not sure that these qualify as "tokens"
when parsing the regex -- the tokens are actually "<" or "<!" and
indicate a call to a subrule of some sort, and these are just predefined
rules. The rules parser and engine may indeed tokenize them for
optimization purposes, but I don't think the language defines them
as fundamental "tokens", and someone is free to override the predefined
rules with their own. (Perhaps <cut> and <commit> cannot be overridden.)
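
One way to picture "just predefined rules" is a lookup table keyed by rule name, where "<" and "<!" only decide whether the result is negated. The Python sketch below is a hypothetical model, not how any rules engine is implemented: RULES and call_subrule are invented names, and the pattern P is simplified to a literal string.

```python
from typing import Callable, Dict

# A rule takes (text, position, argument) and reports success.
Rule = Callable[[str, int, str], bool]

# Hypothetical registry of predefined rules; entries can be replaced.
RULES: Dict[str, Rule] = {
    # True if `lit` starts at `pos` (lookahead).
    "before": lambda text, pos, lit: text.startswith(lit, pos),
    # True if `lit` ends exactly at `pos` (lookbehind).
    "after": lambda text, pos, lit: text.endswith(lit, 0, pos),
}


def call_subrule(name: str, text: str, pos: int, arg: str,
                 negate: bool = False) -> bool:
    """Model <name arg> (negate=False) or <!name arg> (negate=True)."""
    return RULES[name](text, pos, arg) != negate
```

Overriding a predefined rule is then just an assignment such as RULES["before"] = my_rule, with no change to how "<" or "<!" are parsed.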
          <?ws>          N     match whitespace by :w rules
          <?sp>          N     match a space character (chr 32 ONLY)
Here the token is "<?", indicating a non-capturing subrule.
          <$rule>        N     indirect rule
          <::$rulename>  N     indirect symbolic rule
          <@rules>       N     like '@rules'
          <%rules>       N     like '%rules'
          <{ code }>     N     code produces a rule
          <&foo()>       N     subroutine returns rule
          <( code )>     N     code must return true or backtracking ensues
Here the leading tokens are actually "<$", "<::$", "<@", "<%", "<{", "<&",
and "<(", and I suspect we have "<?$", "<?::$", "<?@", and "<!$", "<!::$",
"<!@", etc. counterparts. Of course, one could claim that these are
really separated as in "<", "?", and "$" tokens, but PGE's parser currently
treats them as a unit to make it easier to jump directly into the correct
handler for what follows.
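
The idea of treating a multi-character lead-in as a single unit can be sketched as a longest-prefix dispatch table. This is an illustration in Python, not PGE's parser: the HANDLERS table, the labels in it, and the dispatch function are all hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical map from leading token to the kind of handler to enter.
HANDLERS: Dict[str, str] = {
    "<::$": "indirect symbolic rule",
    "<$": "indirect rule",
    "<@": "array of rules",
    "<%": "hash of rules",
    "<{": "code producing a rule",
    "<&": "subroutine returning a rule",
    "<(": "code assertion",
    "<?": "non-capturing subrule",
    "<!": "negated subrule",
    "<": "capturing subrule",
}

# Try longer tokens first so "<::$" wins over "<".
PREFIXES = sorted(HANDLERS, key=len, reverse=True)


def dispatch(pattern: str, pos: int = 0) -> Tuple[str, str]:
    """Return (handler kind, remaining text) for the token at `pos`."""
    for prefix in PREFIXES:
        if pattern.startswith(prefix, pos):
            return HANDLERS[prefix], pattern[pos + len(prefix):]
    raise ValueError(f"no subrule token at position {pos}")
```

Sorting by length is what makes the table behave as if "<::$" were one token; splitting into "<", "::", and "$" would need a second level of dispatch instead.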
          <[a-z]>        N     character class
          <+alpha>       N     character class
          <-[a-z]>       N     complemented character class
The tokens for character class manipulation are currently "<[", "<+",
and "<-", although that's not officially documented in A05 or S05 yet.
Also, ranges are now <[a..z]> -- an unescaped hyphen appearing in an
enumerated character class generates a warning.
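
A small Python sketch of the ".." range form follows. It is a model only: parse_enumerated_class is a hypothetical helper, and treating the unescaped hyphen as a literal after warning is an assumption, since the text above only says that a warning is generated.

```python
import warnings
from typing import Set


def parse_enumerated_class(body: str) -> Set[str]:
    """Expand the body of <[...]>, e.g. 'a..z_', into a set of characters."""
    chars: Set[str] = set()
    i = 0
    while i < len(body):
        c = body[i]
        if c == "\\" and i + 1 < len(body):
            # Backslash-escaped character: take it literally, no warning.
            chars.add(body[i + 1])
            i += 2
            continue
        if body.startswith("..", i + 1) and i + 3 < len(body):
            # Range written as lo..hi, inclusive on both ends.
            hi = body[i + 3]
            chars.update(chr(k) for k in range(ord(c), ord(hi) + 1))
            i += 4
            continue
        if c == "-":
            # Assumption: warn, then keep the hyphen as a literal.
            warnings.warn("unescaped hyphen in character class; use '..' for ranges")
        chars.add(c)
        i += 1
    return chars
```

Under these assumptions, "a..c" expands to three characters, while "a-c" yields the three literal characters plus a warning.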
          <+\w-[0-9]>    N     character class "arithmetic"
I'm not sure that it's been decided/documented that \w, \s, etc.
can appear in character class arithmetic (although it seems like it
should).
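
If \w were allowed there, the arithmetic would amount to set union and difference. The Python sketch below shows that reading of <+\w-[0-9]>; the combine function is hypothetical, and WORD is an ASCII-only stand-in for \w chosen to keep the example small.

```python
import string
from typing import Iterable, Set, Tuple

# ASCII-only stand-in for \w (an assumption made for brevity).
WORD: Set[str] = set(string.ascii_letters + string.digits + "_")
DIGITS: Set[str] = set(string.digits)


def combine(terms: Iterable[Tuple[str, Set[str]]]) -> Set[str]:
    """Apply ('+', set) and ('-', set) terms left to right."""
    result: Set[str] = set()
    for sign, members in terms:
        if sign == "+":
            result = result | members   # add a class
        else:
            result = result - members   # subtract a class
    return result


# Model of <+\w-[0-9]>: word characters with the digits removed.
word_no_digits = combine([("+", WORD), ("-", DIGITS)])
```

The result contains letters and underscore but no digits, which is what the table entry appears to intend.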
          <prop:X>       N     Unicode property match
          <-prop:X>      N     complemented Unicode property match
Here "prop" is just a subrule (or character class) similar to
<+alpha>, <+digit>, etc. Also, note that <prop:X> is a capturing
subrule, while <+prop:X> would be a character class match (and presumably
not capture).
          <rule>         N     match rule (and capture to $rule)
          <?rule>        N     match rule (don't capture)
          <<rule>>       N     match rule (don't capture)
Do we still have the <<rule>> syntax, or was that abandoned in
favor of <?rule> ? (I know there are still some remnants of <<...>>
in S05 and A05, but I'm not sure they're intentional.)
> Thanks for your help. Unless you're difficult.
"You're welcome" unless $Pm ~~ /<?difficult>/;
Pm