
HTML::Parser: Patch to permit reentrant parsing

From: John Stracke
Date: March 9, 2001 13:09
Subject: HTML::Parser: Patch to permit reentrant parsing
Message ID: 3AA94805.AA20A7BE@ecal.com

I'm using HTML::Parser for a work project (thanks!), and I have a
need for doing reentrant parsing.  (In some cases, I'll spot a
tag and want to parse an include file.) My first workaround was
to construct a separate parser object and have it parse the
include file; that works, but it turns out to be somewhat inefficient.
So, instead, I've modified the package to permit reentrant
parsing.  I've got the patch (attached; it's not that long), and
I'm hoping I can get it accepted in the mainline source tree.  (I
do have my employer's permission to release it.)

Since I saw that the actual parsing code was too complex to
monkey with safely, I took a slightly cheating route.  First, I
renamed the original HTML::Parser::parse() function to
parseIntern(), and created a new parse() function, written in
Perl.

parse() and parse_file() now check to see whether we're already
parsing; if so, they call $self->setDeferred(), which sets a
"deferred" flag in the PSTATE, and then store the chunk or
filename they were asked to parse into a queue
($self->{deferredTasks}).  (Filenames get flagged as such; the
alternative would be to read the entire file and queue up its
contents, which would not be an efficiency win.  :-)
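
A minimal sketch of that queueing step, modeled in C only to keep it self-contained (the patch does this part in Perl). Every name here (parser_model, request_parse, TASK_CHUNK, TASK_FILE, MAX_TASKS) is hypothetical and is not taken from the patch; the fixed-size array merely stands in for $self->{deferredTasks}.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the bookkeeping described above; not the patch's code. */
enum task_kind { TASK_CHUNK, TASK_FILE };

struct task {
    enum task_kind kind;  /* filenames are flagged as such, not read eagerly */
    const char *text;     /* chunk contents or a filename (borrowed pointer) */
};

#define MAX_TASKS 16

struct parser_model {
    int parsing;                   /* nonzero while a parse is in progress */
    int deferred;                  /* stands in for the "deferred" flag in PSTATE */
    struct task queue[MAX_TASKS];  /* stands in for $self->{deferredTasks} */
    size_t queue_len;
};

/*
 * Returns 0 if no parse is running (the caller should parse right away),
 * 1 if the request was queued because a parse is already in progress,
 * and -1 if the model's fixed-size queue is full.
 */
int request_parse(struct parser_model *p, enum task_kind kind, const char *text)
{
    assert(p != NULL && text != NULL);
    if (!p->parsing)
        return 0;
    if (p->queue_len == MAX_TASKS)
        return -1;
    p->deferred = 1;                    /* what setDeferred() is described as doing */
    p->queue[p->queue_len].kind = kind;
    p->queue[p->queue_len].text = text;
    p->queue_len++;
    return 1;
}
```

Queueing the filename rather than the file's contents keeps the cost of a deferred parse_file() request to one small record, which is the point of the flag.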

Meanwhile, the parse() function in hparser.c checks the deferred
flag after every call to report_event(); if it's set, it
saves the rest of the string it's parsing into
p_state->after_deferred (a new field, a char*), and returns.  The
new HTML::Parser::parse() function calls $self->getDeferred(),
which returns the after_deferred field; if it's nonempty, that
field is added to the end of the queue.  parse() then calls
parseDeferred(), which copies the queue, clears it, calls
$self->setDeferred(0) (which clears the deferred and
after_deferred fields), and parses each entry in the queue via
parse() or parse_file(), as appropriate.
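
The whole round trip can be modeled in a few lines of C. This is a toy under stated assumptions, not the patch: each input character counts as one event, a '!' stands in for a tag whose handler asks for an include (the literal "XY") to be parsed, only chunks are queued (no filenames), and pointers are borrowed rather than copied. The names model, enqueue, parse_intern and parse_deferred are made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_TASKS 16
#define OUT_SIZE 128

struct model {
    int parsing;                   /* nonzero while parse_intern() is running */
    int deferred;                  /* the "deferred" flag */
    const char *after_deferred;    /* rest of the interrupted input, or NULL */
    const char *queue[MAX_TASKS];  /* pending chunks, in order */
    size_t queue_len;
    char out[OUT_SIZE];            /* events seen so far, for inspection */
    size_t out_len;
};

void parse(struct model *p, const char *text);

void enqueue(struct model *p, const char *text)
{
    assert(p->queue_len < MAX_TASKS);
    p->queue[p->queue_len++] = text;
}

/* Stand-in for a user handler: records the event; '!' requests an include. */
void report_event(struct model *p, char c)
{
    assert(p->out_len + 1 < OUT_SIZE);
    p->out[p->out_len++] = c;
    p->out[p->out_len] = '\0';
    if (c == '!')
        parse(p, "XY");            /* reentrant call: gets queued, not run */
}

/* Stand-in for the C-level parser: checks the flag after every event. */
void parse_intern(struct model *p, const char *s)
{
    for (; *s != '\0'; s++) {
        report_event(p, *s);
        if (p->deferred) {
            p->after_deferred = s + 1;  /* save the unparsed remainder */
            return;
        }
    }
}

/* Copy the queue, clear it and the flags, then parse each entry in order. */
void parse_deferred(struct model *p)
{
    const char *tasks[MAX_TASKS];
    size_t n = p->queue_len, i;

    memcpy(tasks, p->queue, n * sizeof tasks[0]);
    p->queue_len = 0;
    p->deferred = 0;
    p->after_deferred = NULL;
    for (i = 0; i < n; i++)
        parse(p, tasks[i]);
}

void parse(struct model *p, const char *text)
{
    if (p->parsing) {              /* already parsing: defer this request */
        p->deferred = 1;
        enqueue(p, text);
        return;
    }
    p->parsing = 1;
    parse_intern(p, text);
    p->parsing = 0;
    if (p->after_deferred != NULL && *p->after_deferred != '\0')
        enqueue(p, p->after_deferred);  /* remainder goes after the include */
    parse_deferred(p);
}
```

With a zero-initialized model, parse(&m, "a!b") leaves "a!XYb" in m.out: the include is handled at the point where it was requested, and the rest of the interrupted chunk follows it.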

It's a little roundabout, in part because it stores the queue in
the Perl side of the world, rather than the C side (because Perl
makes it easier).  It does work, though, and does its job without
having to mess with the core parsing logic.

The attached patch is against version 3.18.  I hope you'll find
it worthy.  :-)

--
/================================================================\
|John Stracke    | http://www.ecal.com |My opinions are my own.  |
|Chief Scientist |===============================================|
|eCal Corp.      |Almost no one has ever wanted a 1/4" drill bit;|
|francis@ecal.com|all they ever wanted was a 1/4" hole.          |
\================================================================/


