XML::ReParser (Re: Regexp-based XML parser (XML::Parser::Lite))

From: Paul Kulchenko
Date: August 23, 2001 15:58
Subject: XML::ReParser (Re: Regexp-based XML parser (XML::Parser::Lite))
Message ID: 20010823225835.7895.qmail@web13506.mail.yahoo.com

Hi, Jarkko!

> I suggest scrapping the use of ?{}: it's too fragile (as you found
> out, painfully).  See the attached quick hack (which I disavow)
> that uses more vanilla regexps and simple recursion.
Agreed. See the attached XML::ReParser module. It's purely regexp-based,
should work with (almost) any Perl, has Expat and SAX1 interfaces,
and supports streaming (non-blocking) parsing. It's still small, but
it already does a big part of what needs to be done. Examples and a
short doc are included. Any feedback is welcome.
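
To illustrate what "pure regexp-based" and "streaming" can mean in practice, here is a minimal sketch. It is not the code of the attached module: make_parser and the handler names Start, End and Char are hypothetical, chosen only to resemble Expat-style callbacks. It assumes a tiny XML subset (elements, quoted attributes, text) and uses \G with /gc instead of ?{}, so callbacks run between matches rather than inside the regexp engine. An incomplete tag at the end of a chunk simply stays in the buffer until the next chunk arrives.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration, not XML::ReParser's API):
# returns a closure that accepts text chunks and fires Expat-style callbacks.
sub make_parser {
    my (%h) = @_;
    my $buf = '';
    return sub {
        my ($chunk) = @_;
        $buf .= $chunk;
        pos($buf) = 0;
        while (1) {
            if ($buf =~ /\G<\/(\w+)\s*>/gc) {
                my $name = $1;    # copy: captures may change inside a callback
                $h{End}->($name) if $h{End};
            }
            elsif ($buf =~ /\G<(\w+)((?:\s+\w+\s*=\s*(?:"[^"<]*"|'[^'<]*'))*)\s*(\/?)>/gc) {
                my ($name, $attrs, $empty) = ($1, $2, $3);
                my %attr;
                while ($attrs =~ /(\w+)\s*=\s*(?:"([^"]*)"|'([^']*)')/g) {
                    $attr{$1} = defined $2 ? $2 : $3;
                }
                $h{Start}->($name, %attr) if $h{Start};
                $h{End}->($name) if $empty && $h{End};   # <c/> closes itself
            }
            elsif ($buf =~ /\G([^<]+)/gc) {
                my $text = $1;
                $h{Char}->($text) if $h{Char};
            }
            else {
                last;    # nothing complete at this position: wait for more input
            }
        }
        # Drop what was consumed; an unfinished tag remains buffered.
        substr($buf, 0, pos($buf) || 0, '');
    };
}

my @events;
my $feed = make_parser(
    Start => sub {
        my ($n, %a) = @_;
        push @events, "start $n" . join('', map { " $_=$a{$_}" } sort keys %a);
    },
    End  => sub { push @events, "end $_[0]" },
    Char => sub { push @events, "char $_[0]" },
);

# Feed the document in arbitrary pieces, splitting text and a closing tag.
$feed->($_) for "<a x='y'><b>foo", "bar</b><c y=\"foo bar\" z=\"zap\"/></", "a>";
print "$_\n" for @events;
```

Note that text split across chunks arrives as two Char events, and that malformed input (a stray '<' that never completes) would stay in the buffer forever; a real parser would need error handling, comments, CDATA, entities and so on.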

Best wishes, Paul.

--- Jarkko Hietaniemi <jhi@iki.fi> wrote:
> On Tue, Jul 31, 2001 at 06:22:07PM -0700, Paul Kulchenko wrote:
> > Hi, Jarkko!
> > 
> > Short summary. To implement a pure-Perl XML parser, several options
> > are available:
> > 1. shallow parser (regexp-based, but doesn't use ?{} or /e)
> > 2. regexp-based (XML::Parser::Lite or similar)
> > 3. grammar-based parser
> > 4. ?
> > 
> > A shallow parser returns a stream of tokens, but generating events
> > from this stream requires a second matching pass.
> > A grammar-based parser can be implemented with Parse::RecDescent,
> > but in the near future Parse::RecDescent won't be included in the
> > core.
> 
> Not just that.  Damian's opinion is that P::R should *never* be
> included in the core.  Not just because size or what-to-include
> concerns, but because Damian basically wants to rewrite the whole
> thing from scratch.
> 
> > A regexp-based parser that uses ?{} has problems when regexps are
> > invoked from inside the generated callbacks.
> 
> I suggest scrapping the use of ?{}: it's too fragile (as you found
> out, painfully).  See the attached quick hack (which I disavow)
> that uses more vanilla regexps and simple recursion.
> 
> > Here are the results of experiments with ?{}. Minimal code that
> > fails:
> > 
> > use re 'eval';
> > 1 while 121211222=~/(1)(?{callback($1)})(?:(3)(?{callback($2)}))?/g;
> > 
> > sub callback {
> >   my $c = $_[0];
> > #  ;             # 0 # default
> > #  $_[0] =~ /1/; # 1 # in-place manipulations
> > #  $c =~ s/1/3/; # 2 # successful s///
> > #  1 =~ /1/;     # 3 # successful match
> > #  1 =~ /(1)/;   # 4 # successful with $1, localization doesn't help
> >   print $c;
> > }
> > 
> > It does NOT fail if there is no '?' at the end of the regexp (or if
> > there is no second section). It also does NOT fail if the internal
> > regexp doesn't match.
> > 
> > Results are below. What I especially don't like is the endless output
> > of '1's in tests 1 and 3 under 5.7.2 (both run correctly on all
> > previous versions except 5.005) and the coredumps of 5.7.x in tests
> > 2 and 4 (the results of the other versions are also incorrect, but
> > at least they don't coredump). All experiments were done on Linux
> > and Windows; see if you can reproduce this in your environment.
> > 
> > Results:
> > 
> > # 0
> > 
> > perl5.00503
> > 111
> > perl5.00503 (Windows, ActiveState)
> > 111
> > perl5.6.0
> > 1111
> > perl5.6.0 (Windows)
> > 1111
> > perl5.6.1
> > 1111
> > perl5.6.1 (Windows, ActiveState)
> > 1111
> > perl5.7.1
> > 1111
> > perl5.7.2
> > 1111
> > 
> > # 1
> > 
> > perl5.00503
> > 111
> > perl5.00503 (Windows, ActiveState)
> > 111
> > perl5.6.0
> > 1111
> > perl5.6.0 (Windows)
> > 1111
> > perl5.6.1
> > 1111
> > perl5.6.1 (Windows, ActiveState)
> > 1111
> > perl5.7.1
> > 1111
> > perl5.7.2
> > *endless output of '1'
> > 
> > # 2
> > 
> > perl5.00503
> > 333
> > perl5.00503 (Windows, ActiveState)
> > 333
> > perl5.6.0
> > *coredump
> > perl5.6.0 (Windows)
> > 3*coredump
> > perl5.6.1
> > 3
> > perl5.6.1 (Windows, ActiveState)
> > 3
> > perl5.7.1
> > *coredump
> > perl5.7.2
> > *coredump
> > 
> > # 3
> > 
> > perl5.00503
> > 111
> > perl5.00503 (Windows, ActiveState)
> > 111
> > perl5.6.0
> > 1111
> > perl5.6.0 (Windows)
> > 1111
> > perl5.6.1
> > 1111
> > perl5.6.1 (Windows, ActiveState)
> > 1111
> > perl5.7.1
> > 1111
> > perl5.7.2
> > *endless output of '1'
> > 
> > # 4
> > 
> > perl5.00503
> > 111
> > perl5.00503 (Windows, ActiveState)
> > *nothing
> > perl5.6.0
> > 1
> > perl5.6.0 (Windows)
> > 1*coredump
> > perl5.6.1
> > 1
> > perl5.6.1 (Windows, ActiveState)
> > 1
> > perl5.7.1
> > *coredump
> > perl5.7.2
> > *coredump
> > 
> > Best wishes, Paul.
> 
> -- 
> $jhi++; # http://www.iki.fi/jhi/
>         # There is this special biologist word we use for 'stable'.
>         # It is 'dead'. -- Jack Cohen
> $s = "<a x='y'><b>foobar</b><c y=\"foo bar\" z=\"zap\"/></a>";
> 
> my $a0 = q!(?:\w+\s*=\s*(?:"[^"<]+?"|'[^'<]+?'))!;
> my $a1 = qq!(?:$a0(?:\\s+$a0)*)!;
> 
> sub _r {
>     while ($_[0] ne '') {
>         if ($_[0] =~ s!^<(\w+)(?:\s+($a1))?(/?)>!!s) {
>             push @{$_[2]}, $1 unless $3;
>             $_[1]->{startElement}->($1) if $_[1]->{startElement};
>             if (defined $2) {
>                 my $a = $2;
>                 my %a = map { /^([^=\s]+)\s*=\s*["'](.*)['"]/ }
>                             ($a =~ /($a0)\s*/g);
>                 for my $a (sort keys %a) {
>                     print "\t$a = \"$a{$a}\"\n";
>                 }
>             }
>             _r($_[0], $_[1], $_[2]);
>         } elsif ($_[0] =~ s!^</$_[2][-1]>!!s) {
>             my $e = pop @{$_[2]};
>             $_[1]->{endElement}->($e) if $_[1]->{endElement};
>         } elsif ($_[0] =~ s!^(<[^>]+?>)!!s || $_[0] =~ s!^(<)!!s) {
>             warn "Unexpected: $1\n";
>             return;
>         } else {
>             $_[0] =~ s!^([^<]*)!!s;
>             $_[1]->{characters}->($1) if $_[1]->{characters};
>         }
>     }
> }
> 
> sub r {
>     $_[1]->{startDocument}->() if $_[1]->{startDocument};
>     _r($_[0], $_[1]);
>     $_[1]->{endDocument}->()   if $_[1]->{endDocument};
> }
> 
> r($s, {
>        startDocument => sub { print "startDocument\n" },
>        startElement  => sub { print "startElement $_[0]\n" },
>        characters    => sub { print "characters: \"$_[0]\"\n" },
>        endElement    => sub { print "endElement $_[0]\n" },
>        endDocument   => sub { print "endDocument\n" },
>       });
> 
> 
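
For comparison with the failing ?{} test case quoted above, here is a minimal sketch of the same calls made from ordinary code between matches. This is an assumption about one possible workaround, not code from either attachment. The callback here performs a successful s/// (case 2 in the quoted test) and collects its results in @seen instead of printing directly. One difference in semantics is worth noting: with ?{} the callback fires while the engine is still matching (including on paths that later backtrack), whereas here it fires only after a complete match.

```perl
use strict;
use warnings;

my @seen;

sub callback {
    my $c = $_[0];    # copy first, as in the quoted test
    $c =~ s/1/3/;     # a regexp operation inside the callback (case 2)
    push @seen, $c;
}

my $input = '121211222';

# Scalar-context //g keeps its position in pos($input) between iterations,
# so regexps run inside callback() on other variables do not disturb it.
while ($input =~ /(1)(3)?/g) {
    my ($one, $three) = ($1, $2);    # copy captures before calling out
    callback($one);
    callback($three) if defined $three;
}

print join('', @seen), "\n";
```

Copying $1 and $2 into lexicals before calling out avoids passing aliases to capture variables into code that runs its own matches.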

