
Re: Parsing indent-sensitive languages

From: Peri Hankey
Date: September 9, 2005 02:41
Subject: Re: Parsing indent-sensitive languages
Message ID: 432157E3.40806@thegreen.co.uk

Dave Whipp wrote:
> If I want to parse a language that is sensitive to whitespace 
> indentation (e.g. Python, Haskell), how do I do it using P6 rules/grammars?
> 
> The way I'd usually handle it is to have a lexer that examines leading 
> whitespace and converts it into "indent" and "unindent" tokens. The 
> grammar can then use these tokens in the same way that it would any 
> other block-delimiter.
> 
> This requires a stateful lexer, because to work out the number of 
> "unindent" tokens on a line, it needs to know what the indentation 
> positions are. How would I write a P6 rule that defines <indent> and 
> <unindent> tokens? Alternatively (if a different approach is needed) how 
> would I use P6 to parse such a language?
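
For readers who have not met the technique described in the question, the 
stateful lexer can be sketched as follows. This is only a minimal 
illustration in Python, not P6 rules and not lmn; the name tokenize and the 
token names are made up for the example, and it assumes that indentation 
uses spaces only.

```python
def tokenize(lines):
    """Yield ("INDENT",), ("UNINDENT",) and ("LINE", text) tokens."""
    # Columns of the blocks that are currently open. This stack is the
    # state the lexer has to carry from one line to the next.
    stack = [0]
    for raw in lines:
        text = raw.lstrip(" ")
        if not text.strip():
            continue  # blank lines do not change the block structure
        col = len(raw) - len(text)
        if col > stack[-1]:
            # Deeper than the current block: open exactly one new block.
            stack.append(col)
            yield ("INDENT",)
        while col < stack[-1]:
            # Shallower: close one block per indentation position passed.
            stack.pop()
            yield ("UNINDENT",)
        if col != stack[-1]:
            raise ValueError(f"inconsistent indentation: {raw!r}")
        yield ("LINE", text.rstrip())
    # Close whatever is still open at the end of the input.
    while len(stack) > 1:
        stack.pop()
        yield ("UNINDENT",)


if __name__ == "__main__":
    for token in tokenize(["a", "  b", "    c", "d"]):
        print(token)
```

A grammar can then treat INDENT and UNINDENT like any other pair of block 
delimiters, which is the arrangement the question describes.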

In this context, I thought readers of this list might be interested in 
the following extract from mediawiki.lmn, a ruleset for generating html 
pages from a subset of mediawiki markup. These rules are written in lmn, 
the metalanguage of the language machine, and the extract deals with 
unordered and ordered lists, where entries are prefixed by '*' and '#' 
characters, and repeated prefix characters indicate nesting.

NB the source text of lmn rules is written using a subset of the 
mediawiki markup, with preformatted text (lines that start with at least 
one space) treated as actual source with no markup and everything else 
treated as annotation:

----------------- start of extract from mediawiki.lmn ------------------
== bulleted and numbered lists ==
Unordered and ordered lists are a bit tricky - essentially they are like 
indented blocks in Python, but a little more complex because of the way 
ordered and unordered lists can be combined with each other. The 
solution is that at each level, the prefix pattern of '#' and '*' 
characters is known, and the level continues while that pattern is 
recognised. This can be done by matching the value of a variable which 
holds the pattern for the current level.

     '*'                                  <- unit - ulist :'*';
     '#'                                  <- unit - olist :'#';
     ulist :A item :X repeat more item :Y <- unit ul :{X each Y} eom;
     olist :A item :X repeat more item :Y <- unit ol :{X each Y} eom;

     '*'                                  <- item - ulist :{A'*'};
     '#'                                  <- item - olist :{A'#'};
     ulist :A item :X repeat more item :Y <- item :{ ul :{X each Y}};
     olist :A item :X repeat more item :Y <- item :{ ol :{X each Y}};
     - wikitext :X                        <- item :{ li :X };

The following rule permits a level to continue as long as the input 
matches the current prefix. We recurse for each level before getting 
here, so we will always try to match the innermost levels first - they 
have the longest prefix strings, and so there is no danger of a 
premature match.

     - A                                  <- more ;
-----------------  end of extract from mediawiki.lmn  ------------------
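
For comparison, the same idea (each level knows its prefix pattern, and the 
level continues while the input matches that pattern) can be sketched in 
Python. This is only a loose analogue under stated assumptions, not a 
translation of the lmn rules: the names parse_level and parse_list are 
invented for the example, the input is assumed to be a list of lines that 
all belong to one list, and nested lists are returned as plain tuples rather 
than html.

```python
def parse_level(lines, pos, prefix):
    """Parse the items of one list level; return (tree, next_pos).

    `prefix` plays the role of the variable A in the rules above: the
    level continues only while the next line still starts with it.
    """
    kind = "ul" if prefix[-1] == "*" else "ol"
    items = []
    while pos < len(lines) and lines[pos].startswith(prefix):
        rest = lines[pos][len(prefix):]
        if rest[:1] in ("*", "#"):
            # A longer marker opens a nested list. Recursing here means
            # the innermost level, which has the longest prefix, always
            # gets the first chance to match the following lines.
            sub, pos = parse_level(lines, pos, prefix + rest[0])
            items.append(sub)
        else:
            items.append(("li", rest.strip()))
            pos += 1
    return (kind, items), pos


def parse_list(lines):
    """Parse a run of list lines, starting from the first line's marker."""
    if not lines or lines[0][:1] not in ("*", "#"):
        raise ValueError("expected a line starting with '*' or '#'")
    tree, _ = parse_level(lines, 0, lines[0][0])
    return tree


if __name__ == "__main__":
    print(parse_list(["* a", "*# b", "*# c", "** d", "* e"]))
```

Because each level only tests its own prefix, a line such as "** d" ends the 
"*#" level and is picked up again by the enclosing "*" level, which then 
opens a new nested list for it.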

The complete ruleset can be seen at:
http://languagemachine.sourceforge.net/website.html    - summary
http://languagemachine.sourceforge.net/mediawiki.html  - markup
http://languagemachine.sourceforge.net/sitehtml.html   - wrappings

I have fairly recently published the language machine under the GNU GPL at 
SourceForge. It consists of a minimal main program, a shared library 
written in D using gdc (the D frontend for GCC), and several flavours of 
an lmn metalanguage compiler - these compilers are all written in lmn and 
share a common frontend.

The metalanguage compiler sources are on the website (with many other 
examples) as web pages that have been generated directly from lmn source 
text by applying the markup-to-html translation rules.

The language machine in previous incarnations has a long history, but it 
is not much like any other language toolkit that I know of. This is a 
page that relates it to received wisdom about language and language 
implementations:

http://languagemachine.sourceforge.net/grammar.html

There is an extremely useful diagram which shows what happens when 
unrestricted grammatical substitution rules are applied to an input 
stream - this is explained here in relation to a couple of trivially 
simple examples:

http://languagemachine.sourceforge.net/lm-diagram.html

My intention in creating this implementation has been to make something 
that can be combined with other free languages and toolchains, and I 
have recently asked the grants-secretary at the Perl Foundation for 
feedback on a draft proposal to create a language machine extension for 
Perl.

I would be very interested to hear what you think.

Regards
Peri Hankey

-- 
http://languagemachine.sourceforge.net - The language machine