develooper Front page | perl.libwww | Postings from August 2016

Re: Facing problem with HTML::Parser

From:
Paul Bijnens
Date:
August 16, 2016 17:04
Subject:
Re: Facing problem with HTML::Parser
Message ID:
85e40b10-962c-d74f-aaee-e6688703af83@xplanation.com
See below:


On 2016-08-11 07:44, Shivani Palle wrote:
> Hi,
>
>
> I am facing one issue while using HTML::Parser. Please help me.
>
> /*Issue:*/
>
> I am using HTML::Parser to parse all the HTML files through out the 
> directories to get hard coded strings from the html files(text between 
> the tags).
>
> the code is like this:
>
>  #!/usr/bin/perl -w
> package Example;
> require HTML::Parser;
> @Example::ISA = qw(HTML::Parser);
> use File::Find;
> use File::Basename;
>
> #my @files = glob("*.thtml");
> find({ wanted => \&process_file, no_chdir => 1 }, 
> "/mnt/src/xxx/git/xxx-ive-rdv/");
>
> #foreach $file (@files){
> sub process_file {
>    if (-f $_) {
>        if ($_ =~ m/(.thtml)$/i) {
>    #my($file, $dir, $ext) = fileparse($_);
>    my $file = $_;
>     #step1: Parsing the html file and storing the parsed content in another file
>     my $parser = Example->new;
>     $parser->ignore_elements(qw(script)); #ignoring script elements
>     $parser->parse_file($file);
>     print  $parser->{TEXT};
>
>     sub text
>     {
>         my ($self,$text) = @_;
>         $self->{TEXT} .= $text."\n";
>     }
>     open(my $fh, '>', 'parserOutput.txt');
>     print $fh  $parser->{TEXT};
>     close $fh;
>    }
>   }
> }
>
>
>
> */Failing case/*:
>
> It is breaking some lines in to two lines.
> For example, I have the following line.
>
> *Before Parsing:*
> <label for="chkInstallAgent">Install Agent for this role</label>
>
> *After Parsing*:
> Install Agent for this
> role
>
> There is no tag in "Install Agent for this role". But still it is 
> breaking in to two lines.
> Can you please help me with it.
>

HTML::Parser has a configuration option that avoids this breaking:

From the manual page of HTML::Parser:

     $p->unbroken_text
     $p->unbroken_text( $bool )
         By default, blocks of text are given to the text handler as soon as
         possible (but the parser takes care always to break text at a boundary
         between whitespace and non-whitespace so single words and entities can
         always be decoded safely). This might create breaks that make it hard
         to do transformations on the text. When this attribute is enabled,
         blocks of text are always reported in one piece. This will delay the
         text event until the following (non-text) event has been recognized by
         the parser.
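
In your script that should amount to calling $parser->unbroken_text(1) before $parser->parse_file($file). A minimal standalone sketch of the effect follows. It assumes the version 3 handler API instead of the subclassing in your script, and the two parse() calls are only my way of simulating a read-buffer boundary falling in the middle of the label text; the @chunks array is a name I made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();

# Every call to the text handler appends one entry here, so the number
# of entries shows how many pieces the text was delivered in.
my @chunks;

my $p = HTML::Parser->new(
    api_version => 3,
    text_h      => [ sub { push @chunks, shift }, 'dtext' ],
);

# Hold text back until the next non-text event, so that each run of
# text between two tags is reported in one piece.
$p->unbroken_text(1);

# Assumption for illustration: feed the document in two pieces that
# split the label text, roughly as a block boundary in a file might.
$p->parse('<label for="chkInstallAgent">Install Agent for this ');
$p->parse('role</label>');
$p->eof;

print "$_\n" for @chunks;
```

With the option enabled, the label text should reach the handler as a single string, so appending "\n" after each piece no longer splits it.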


(Most of the other comments, e.g. those from Shlomi Fish, apply as well if
you want a much cleaner program, of course.)




