perl.recdescent | Postings from July 2006

Speed issue w/ LARGE parsed file

From: david.weber
Date: July 17, 2006 12:48
Subject: Speed issue w/ LARGE parsed file
Message ID: 55B858CA6C995345AF972EF652665FB7C00A22@ARLEXCHVS01.lst.link.l-3com.com

Hey all,
	I'm a recdescent newbie, so please cut me some slack ;)

I've got a ~1.5 MB file that I'm parsing.  The grammar is pretty well
established, in that it comes from a formal paper that describes it in
EBNF notation.  I've looked at the EBNF, and done my best to simplify
it.  For example, where the EBNF says some number should be from
0-65535, I just specify /\d{1,5}/ to simplify & speed up the
processing.
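
A minimal sketch of that trade-off in plain Perl. The sub names are
hypothetical and not part of the grammar; the point is only that the
simplified pattern accepts some values the EBNF range would reject:

```perl
use strict;
use warnings;

# Loose check: the simplified pattern, one to five digits.
# It also accepts out-of-range values such as 99999.
sub looks_like_uint16 {
    my ($text) = @_;
    return $text =~ /\A\d{1,5}\z/ ? 1 : 0;
}

# Strict check: the same pattern plus the EBNF range limit of 0-65535.
sub is_uint16 {
    my ($text) = @_;
    return ( $text =~ /\A\d{1,5}\z/ && $text <= 65535 ) ? 1 : 0;
}

for my $candidate (qw(0 65535 65536 99999 123456)) {
    printf "%-6s loose=%d strict=%d\n",
        $candidate, looks_like_uint16($candidate), is_uint16($candidate);
}
```

If the range matters, it could presumably be checked after the parse
rather than in the pattern itself.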

The first working version of the grammar (tested using a subset of the
file) has about 85 separate rules.
I tried running it on the "full" file, but it just took too damn long.

So, I went about creating a much simpler parser (even dumber), so I
could do some pre-parsing, to speed things up.

The file looks like:

(foo bar)
(foo (bar baz))
(foo "bar")
(foo (bar "baz"))

And the data can be nested several levels deep.  E.g.:

(foo (bar baz)
(baz baz)
(baz (baz (baz(baz "bar")))))
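
To make "several levels deep" concrete, here is a rough sketch in plain
Perl that measures the nesting depth of text in this format. The sub
name max_depth is made up for illustration, and it assumes quoted
strings never contain escaped quotes:

```perl
use strict;
use warnings;

# Return the deepest nesting level in $text, or undef if the
# parentheses are unbalanced. Assumes quoted strings contain no
# escaped quotes (an assumption, not something the format guarantees).
sub max_depth {
    my ($text) = @_;
    my ( $depth, $max ) = ( 0, 0 );

    # Walk the text token by token: a quoted string or a single paren.
    while ( $text =~ /\G[^()"]*(?:("[^"]*")|([()]))/gc ) {
        next if defined $1;    # parens inside quotes do not count
        if ( $2 eq '(' ) {
            $depth++;
            $max = $depth if $depth > $max;
        }
        else {
            return undef if --$depth < 0;    # stray closing paren
        }
    }
    return $depth == 0 ? $max : undef;
}

my $sample = qq{(foo (bar baz)\n(baz baz)\n(baz (baz (baz(baz "bar")))))};
my $deepest = max_depth($sample);
print defined $deepest ? "max depth: $deepest\n" : "unbalanced\n";
```

This only counts parentheses; it says nothing about whether the labels
and data inside them are valid.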

So, I dumbed down my grammar (as can be seen below) but it still takes
longer than I have patience for ( > 10 minutes) to parse.

Am I SOL with parsing this file using RecDescent, or is something
glaringly bad w/ the below syntax?

TIA

--dw

############################################################
# The main file has a header, and one or more object models
File : Header Model(s)

# Define what the header is
Header: 
    "(" /Header[\s]v[\d]+\.[\d]+\.[\d]+\.[\d]+/ ")" 
    | <error: Invalid Header>

# Define what the object model is 
Model: 
    "(Model"
        Item(s)        
    ")"
    | <error: Invalid parse of the ObjectModel>

Item:
    "(" /\b[^\s]+\b/ /[^\(\)]*/ Item(s?) ")"  # Simply two tokens
    | "(" /\b[^\s]+\b/ "\"" /[^\"]*/ "\"" Item(s?) ")"
    | <error>


# These rules are left in for clarity's sake. They are functionally
# equivalent to Item above, which inlines them and is hopefully faster
OldItem:
    "(" Label Data Item(s?) ")"  # Simply two tokens
    | "(" Label QuotedData Item(s?) ")"
    | <error>

Label:
	/\b[^\s]+\b/

Data:
	/[^\(\)]*/

QuotedData:
	"\"" /[^\"]*/ "\""
############################################################
