Re: HTML::Parser question

From: Reinier Post
Date: October 30, 2001 14:19
Subject: Re: HTML::Parser question
Message ID: 20011030231852.A28497@win.tue.nl

On Mon, Oct 29, 2001 at 10:00:23PM -0600, ADJE WebMail Technical Support Team wrote:
> Question: How do I extract the plain text from an HTML file, or, put
> another way, how do I remove the HTML markup, leaving just the plain
> text?  I have looked at the example provided in HTML::Parser, in
> particular
> 
> HTML-Parser-3.25/eg/htext
> 
> which comes close to what I need. However, I would like to store the
> plain text in a variable, as opposed to having it go to STDOUT (standard
> output). Any ideas?

Try

  perl -MLWP::Simple -MHTML::TreeBuilder \
    -e 'my $text = HTML::TreeBuilder->new' \
    -e '->parse(LWP::Simple::get("http://www/"))->eof->as_text;' \
    -e 'print $text'

You probably want to improve on it.
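
If you would rather stay with HTML::Parser itself, as eg/htext does, a
minimal sketch could look like the following. It collects the text
events into a variable instead of printing them. The helper name
html_to_text and the sample HTML string are made up for illustration,
and skipping <script> and <style> content is my assumption about what
"plain text" should mean here.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper (not part of HTML::Parser): returns the plain
# text of an HTML string, with script and style content left out.
sub html_to_text {
    my ($html) = @_;
    my $text = '';
    my $skip = 0;    # greater than 0 while inside <script> or <style>

    my $p = HTML::Parser->new(
        api_version => 3,
        # Count entry into and exit from elements whose content is not prose.
        start_h => [ sub { $skip++ if $_[0] =~ /^(?:script|style)$/ },
                     'tagname' ],
        end_h   => [ sub { $skip-- if $skip && $_[0] =~ /^(?:script|style)$/ },
                     'tagname' ],
        # "dtext" delivers text with entities such as &amp; already decoded.
        text_h  => [ sub { $text .= $_[0] unless $skip }, 'dtext' ],
    );

    $p->parse($html);
    $p->eof;         # flush anything the parser is still buffering
    return $text;
}

# Sample input, invented for this example.
my $html = '<html><body><h1>Hello</h1><p>plain <b>text</b> &amp; more</p>'
         . '<script>var x = 1;</script></body></html>';

my $text = html_to_text($html);
print "$text\n";
```

Note that nothing is inserted between block elements, so the text of
adjacent headings and paragraphs runs together. Appending a newline in
the end handler for tags such as p or h1 is one way to deal with that.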

-- 
Reinier
