develooper Front page | perl.libwww | Postings from August 2016

Re: Facing problem with HTML::Parser

From: Shivani Palle
Date: August 18, 2016 01:31
Subject: Re: Facing problem with HTML::Parser
Message ID: CAH0Myt_CUhP3Tpj=JRjuvUMZOw5s5DypdRvz5KzwYHp-m+u_Bg@mail.gmail.com

Hi All,


Thank you very much for the extra help you gave me. It's working fine now.
I know how busy you are, so I really appreciate the time you spent
helping me.

Thanks,
Shivani

On Tue, Aug 16, 2016 at 10:33 PM, Paul Bijnens <paul.bijnens@xplanation.com>
wrote:

> See below:
>
> On 2016-08-11 07:44, Shivani Palle wrote:
>
> Hi,
>
>
> I am facing one issue while using HTML::Parser. Please help me.
>
> *Issue:*
>
> I am using HTML::Parser to parse all the HTML files throughout the
> directories and extract the hard-coded strings from them (the text
> between the tags).
>
> The code is like this:
>
> #!/usr/bin/perl
> package Example;
> use strict;
> use warnings;
> use HTML::Parser;
> use File::Find;
> our @ISA = qw(HTML::Parser);
>
> # Walk the source tree; no_chdir keeps $_ as the full path of each entry.
> find({ wanted => \&process_file, no_chdir => 1 },
>      "/mnt/src/xxx/git/xxx-ive-rdv/");
>
> # Text handler (version 2 subclass API): store each text chunk on its
> # own line.
> sub text {
>     my ($self, $text) = @_;
>     $self->{TEXT} .= $text . "\n";
> }
>
> sub process_file {
>     my $file = $_;
>     return unless -f $file && $file =~ /\.thtml$/i;
>
>     # Step 1: parse the HTML file and store the parsed content in
>     # another file.
>     my $parser = Example->new;
>     $parser->ignore_elements(qw(script));    # ignore script elements
>     $parser->parse_file($file);
>     print $parser->{TEXT};
>
>     # Note: '>' truncates parserOutput.txt on every call.
>     open(my $fh, '>', 'parserOutput.txt')
>         or die "Cannot open parserOutput.txt: $!";
>     print $fh $parser->{TEXT};
>     close $fh;
> }
>
>
>
> *Failing case*:
>
> It is breaking some lines into two lines.
> For example, I have the following line:
>
> *Before Parsing:*
> <label for="chkInstallAgent">Install Agent for this role</label>
>
> *After Parsing*:
> Install Agent for this
> role
>
> There is no tag inside "Install Agent for this role", but the text is
> still broken into two lines.
> Can you please help me with it?
>
>
> There is a configuration option in HTML::Parser to avoid the breaking:
>
> From the manual page of HTML::Parser:
>
>     $p->unbroken_text
>     $p->unbroken_text( $bool )
>       By default, blocks of text are given to the text handler as soon as
>       possible (but the parser takes care always to break text at a
>       boundary between whitespace and non-whitespace so single words and
>       entities can always be decoded safely). This might create breaks
>       that make it hard to do transformations on the text. When this
>       attribute is enabled, blocks of text are always reported in one
>       piece. This will delay the text event until the following
>       (non-text) event has been recognized by the parser.
>
>
> (And most of the other comments, e.g. those from Shlomi Fish, apply as
> well if you want a much cleaner program, of course.)
>
>
>
>
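
The effect of the option quoted above can be seen in a small standalone
sketch. It is not the code from the thread: the helper name text_chunks and
the point where the input is split are made up for illustration, and the
input is fed in two parse() calls to imitate a read boundary falling in the
middle of the label text.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: parse the given pieces one after another and
# return the text chunks that the text handler received.
sub text_chunks {
    my ($unbroken, @pieces) = @_;
    my @chunks;
    my $p = HTML::Parser->new(
        api_version => 3,
        text_h      => [ sub { push @chunks, shift }, 'dtext' ],
    );
    $p->unbroken_text($unbroken);
    $p->parse($_) for @pieces;
    $p->eof;    # signal end of input so any pending text is flushed
    return @chunks;
}

# Assumed split point: the input arrives in two reads, mid-sentence.
my @input = (
    '<label for="chkInstallAgent">Install Agent for this ',
    'role</label>',
);

my @default  = text_chunks(0, @input);
my @unbroken = text_chunks(1, @input);

print scalar(@default),  " chunk(s) by default\n";
print scalar(@unbroken), " chunk(s) with unbroken_text: '$unbroken[0]'\n";
```

In the script from the original question, the corresponding change would
presumably be a single call, $parser->unbroken_text(1), placed before
$parser->parse_file($file).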


