develooper Front page | perl.perl6.language | Postings from June 2006

Re: grammar: difference between rule, token and regex

Patrick R. Michaud
June 2, 2006 14:40
On Fri, Jun 02, 2006 at 01:56:55PM -0700, jerry gay wrote:
> On 6/2/06, Rene Hangstrup Møller <> wrote:
> >I am toying around with Parrot and the compiler tools. The documentation
> >of Perl 6 grammars that I have been able to find only describes rule. But
> >the grammars in Parrot 0.4.4 for punie and APL use rule, token and regex
> >elements.
> >
> >Can someone please clarify the difference between these three types, and
> >when you should use one or the other?
> i'm forwarding this to p6l, as it's a language question and probably
> best asked there. that said, the regex/token/rule change is a recent
> one, and is documented in S05
> (

Jerry is correct that S05 is the place to look for information
on this.  But to summarize an answer to your question:

   - a C<regex> is a "normal" regular expression

   - a C<token> is a regex with the :ratchet modifier set.  The
     :ratchet modifier disables backtracking by default, so that
     a plain quantifier such as '*' or '+' will greedily match whatever
     it can but won't backtrack if the remainder of the match fails.

   - a C<rule> is a regex with both the :ratchet and :sigspace
     modifiers set.  The :sigspace modifier indicates that whitespace
     in the rule should be replaced by an intertoken separator rule
     such as <?ws> (a whitespace matching rule).


For example,

    rule { a* c b+ }

is the same as

    token { <?ws> a* <?ws> c <?ws> b+ <?ws> }

is the same as

    regex { <?ws>: a*: <?ws>: c <?ws>: b+: <?ws> }
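
For readers who want to watch the effect of :ratchet, here is a
rough sketch using Python's re module as a stand-in (it is not PGE
or Perl 6, and the pattern names are made up for illustration).
The "token"-like patterns emulate a non-backtracking a* with the
usual lookahead-plus-backreference trick:

```python
import re

# "regex"-style: the greedy a* may give characters back (backtracking).
backtracking = re.compile(r"a*ab")

# "token"-style (assumed emulation of :ratchet): the lookahead captures
# the longest run of a's and the backreference consumes exactly that run,
# so the engine cannot retry with a shorter run.
ratcheted = re.compile(r"(?=(a*))\1ab")

# A ratcheted pattern still matches when no backtracking is needed.
ratcheted_ok = re.compile(r"(?=(a*))\1b")


def matches(pattern, text):
    """Return True if pattern matches the whole of text."""
    return pattern.fullmatch(text) is not None


print(matches(backtracking, "aaab"))  # True: a* gives back one 'a'
print(matches(ratcheted, "aaab"))     # False: a* keeps every 'a'
print(matches(ratcheted_ok, "aaab"))  # True: nothing needs to be given back
```

The second pattern can never succeed, which is the same reason
something like token { a* a b } would never match.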

To answer your other question, about when to use each, here are
some rules of thumb (sorry for the pun):

  - If the quantifiers in the regex need to do backtracking, use 'regex'

  - If backtracking isn't needed, use 'token'

  - If the components of the regex can have intertoken separators
    between them, use 'rule' (and perhaps define a custom <ws> rule
    that matches the language's idea of "intertoken separator").
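
As a rough illustration of the last point, the following Python
sketch mimics what :sigspace does with a custom separator.  The
sigspace() helper and the WS pattern are hypothetical names, and WS
(whitespace or '#' comments) is only an assumed example of a
language's "intertoken separator".  Unlike a real <ws>, it never
requires whitespace between two word characters.

```python
import re

# Assumed separator: any mix of whitespace and '#' comments to end of line.
WS = r"(?:\s|#[^\n]*)*"


def sigspace(template, ws=WS):
    """Replace each run of literal whitespace in template with ws."""
    # A function is used as the replacement so that backslashes in ws
    # are inserted verbatim rather than read as escape sequences.
    return re.sub(r"\s+", lambda _m: ws, template)


# The spaces in the template mark where separators may appear.
pair = re.compile(sigspace(r" \w+ = \d+ "))

print(pair.fullmatch("x=1") is not None)             # True
print(pair.fullmatch("x = 1  # set x") is not None)  # True
print(pair.fullmatch("x y = 1") is not None)         # False
```

Changing WS changes what counts as a separator everywhere at once,
which is the point of defining a custom <ws>.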

Here's a quick contrived example to illustrate the difference:

    token identifier { <alpha> \w* }

    token integer { \d+ }

    token value { <identifier> | <integer> }

    token operator { \+ | - | \* | / }

    rule expression { <value> [ <operator> <value> ]* }

    rule assignment { <identifier> \:= <expression> }

The "token" declarations all define regexes that do not match
any whitespace.  Thus, "abc" is a valid identifier but "   abc "
is not.

The rule declarations, however, allow for whitespace to occur
between each of the elements.  Thus, both of the following
are valid assignments in the above language, as the use of
"rule" tells us where whitespace is allowed in the match:

     b := 3 + a * 4
     b   :=3   +a*   4
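
To try these cases out, here is an approximate translation of the
grammar above into Python's re module.  It is only a sketch under
some assumptions: WS is a simplified \s* stand-in for <?ws>,
[A-Za-z_] is an ASCII approximation of <alpha>, and the uppercase
names are invented for the example.

```python
import re

WS = r"\s*"  # simplified stand-in for <?ws>

# token-like pieces: no whitespace allowed inside them
IDENTIFIER = r"[A-Za-z_]\w*"  # ASCII approximation of <alpha> \w*
INTEGER = r"\d+"
VALUE = rf"(?:{IDENTIFIER}|{INTEGER})"
OPERATOR = r"[-+*/]"

# rule-like pieces: WS goes wherever the rule had whitespace
EXPRESSION = rf"{VALUE}(?:{WS}{OPERATOR}{WS}{VALUE})*"
ASSIGNMENT = rf"{WS}{IDENTIFIER}{WS}:={WS}{EXPRESSION}{WS}"


def is_identifier(text):
    """True if text as a whole is an identifier token."""
    return re.fullmatch(IDENTIFIER, text) is not None


def is_assignment(text):
    """True if text as a whole is an assignment."""
    return re.fullmatch(ASSIGNMENT, text) is not None


print(is_identifier("abc"))                # True
print(is_identifier("   abc "))            # False
print(is_assignment("b := 3 + a * 4"))     # True
print(is_assignment("b   :=3   +a*   4"))  # True
```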

I can come up with more examples if desired, but those are the basics
behind each.

Hope this helps,

