develooper Front page | perl.beginners | Postings from April 2012

Re: XML::Mini question

From: Manfred Lotz
Date: April 18, 2012 22:43
Subject: Re: XML::Mini question
Message ID: 20120419073942.5e1328bd@arcor.com
On Wed, 18 Apr 2012 22:23:37 +0200
Manfred Lotz <manfred.lotz@arcor.de> wrote:

> On Thu, 19 Apr 2012 06:15:47 +1000
> "Owen" <rcook@pcug.org.au> wrote:
> 
> > 
> > > Hi there,
> > > I've got a question about XML::Mini.
> > >
> > > When parsing an XML document I want to preserve white space, but
> > > it doesn't really work.
> > >
> > > Minimal example:
> > >
> > > #!/usr/bin/perl
> > >
> > >
> > > use strict;
> > > use warnings;
> > > use Data::Dumper;
> > > use XML::Mini::Document;
> > >
> > > my $XMLString = "<book>  Learning Perl </book>";
> > >
> > > my $xmlDoc = XML::Mini::Document->new();
> > >
> > > $XML::Mini::IgnoreWhitespaces = 0;
> > >
> > > # init the doc from an XML string
> > > $xmlDoc->parse($XMLString);
> > >
> > > my $xmlHash = $xmlDoc->toHash();
> > >
> > > print Dumper($xmlHash);
> > >
> > >
> > > I get the following output:
> > > $VAR1 = {
> > >           'book' => 'Learning Perl '
> > >         };
> > >
> > >
> > > I would have expected to get
> > >    'book' => '  Learning Perl '
> > >
> > > instead.
> > >
> > >
> > > Any idea what's going wrong?
> > 
> > 
> > What happens if you set $XML::Mini::IgnoreWhitespaces = 1?
> > 
> > Seems to me that 1 = yes
> > 
> 
> This is true.
> 
> > What does the documentation say?
> > 
> 
> If I set it to 1 then I get
>   'book' => 'Learning Perl'
> 
> which is even worse. Please note that I don't want white space to be
> ignored.
> 
> 

Hm, I had no other idea than to look at the source code. I think I
found what happens.

 if ($XMLString =~ 
   m/^\s*(<\s*([^\s>]+)([^>]+)\/\s*>|	# <unary \/>
          <\?\s*([^\s>]+)\s*([^>]*)\?>|	# <? headers ?>
          <!--(.+?)-->| # <!-- comments -->
          <!\[CDATA\s*\[(.*?)\]\]\s*>\s*| 	# CDATA
          <!DOCTYPE\s*([^\[>]*)(\[.*?\])?\s*>\s*| # DOCTYPE
          <!ENTITY\s*([^"'>]+)\s*(["'])([^\11]+)\11\s*>\s*| # ENTITY
          ([^<]+))(.*)/xogsmi) # plain text      

IMHO, here is the bug. The ^\s* in front of the alternation deletes
leading white space in every case, which is ok for markup but not for
plain text.

I changed it like this:
if ($XMLString =~ 
   m/(^\s*<\s*([^\s>]+)([^>]+)\/\s*>|	#<unary \/> 
      ^\s*<\?\s*([^\s>]+)\s*([^>]*)\?>|	# <? headers ?>
      ^\s*<!--(.+?)-->| # <!-- comments -->
      ^\s*<!\[CDATA\s*\[(.*?)\]\]\s*>\s*| 	# CDATA
      ^\s*<!DOCTYPE\s*([^\[>]*)(\[.*?\])?\s*>\s*| # DOCTYPE
      ^\s*<!ENTITY\s*([^"'>]+)\s*(["'])([^\11]+)\11\s*>\s*| # ENTITY
      ([^<]+))(.*)/xogsmi) # plain text     


Now leading white space is deleted in all cases except plain text, and
the example above gives:


$VAR1 = {
          'book' => '  Learning Perl '
        };
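
To see the effect of moving ^\s* in isolation, here is a minimal
sketch. It is not the XML::Mini source: the two patterns are cut down to
a single tag alternative plus the plain-text alternative, and the names
first_token_outer and first_token_inner are made up for this
illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical, simplified stand-ins for the two patterns discussed
# above: one tag alternative and the plain-text alternative only.

# Original style: ^\s* sits in front of the alternation, so leading
# white space is consumed before any alternative is tried.
sub first_token_outer {
    my ($s) = @_;
    return $s =~ m/^\s*(<[^>]+>|[^<]+)(.*)/s ? $1 : undef;
}

# Patched style: ^\s* is part of the tag alternative only, so plain
# text keeps its leading white space.
sub first_token_inner {
    my ($s) = @_;
    return $s =~ m/(^\s*<[^>]+>|[^<]+)(.*)/s ? $1 : undef;
}

my $text = "  Learning Perl </book>";
printf "outer: '%s'\n", first_token_outer($text);  # outer: 'Learning Perl '
printf "inner: '%s'\n", first_token_inner($text);  # inner: '  Learning Perl '
```

In the patched style the plain-text alternative is only reached after
the tag alternative has failed at the same position, so text that
starts with white space is captured whole.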



-- 
Manfred




