perl.beginners | Postings from May 2008

Link parsing (was: Getting error...)

From: Gunnar Hjalmarsson
Date: May 1, 2008 11:03
Subject: Link parsing (was: Getting error...)

hotkitty wrote:
> I ultimately want to go to cnn.com/politics, follow all links under 
> the "Election Coverage" headline and, w/in those links, save all the 
> links under the "Don't Miss" sections that appear in those stories. 
> However, after many hours and trial & error I've yet to complete the 
> task. I know mechanize can do this somehow but I've yet to figure out 
> how to put it all together.

It's not so much about putting it together; it's more like writing Perl 
code step by step...

> Here's the script I have so far, which gets me to only step one:

http://www.mail-archive.com/beginners%40perl.org/msg93769.html

Actually, I'm not sure that the code you have even gets you to step one.

As a parsing exercise, I wrote the code below. I chose to make use of 
LWP::Simple and HTML::TokeParser. Please study the docs for the latter: 
http://search.cpan.org/perldoc?HTML::TokeParser
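
To make the token handling easier to follow, here is a small self-contained sketch of the same technique run against an inline HTML string instead of a fetched page. The markup is made up for illustration and is only an assumption about the page structure; the real CNN pages may well look different.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TokeParser;

# Made-up HTML, for illustration only (an assumption, not the real CNN markup)
my $html = <<'HTML';
<div>Election coverage</div>
<ul>
  <li><a href="/2008/first.html">First story</a></li>
  <li><a href="/2008/second.html">Second story</a></li>
</ul>
<a href="/ignored.html">Outside the list</a>
HTML

my $p = HTML::TokeParser->new(\$html);

# get_tag() skips ahead to the next <div> start tag;
# get_text() returns the text between that tag and the next tag
my $found;
while ( $p->get_tag('div') ) {
    if ( $p->get_text eq 'Election coverage' ) {
        $found = 1;
        last;
    }
}
die "Start position not found" unless $found;

# get_token() returns start tags as [ 'S', $tag, \%attr, ... ]
# and end tags as [ 'E', $tag, ... ]
my @links;
while ( my $token = $p->get_token ) {
    if ( $token->[0] eq 'S' and $token->[1] eq 'a' ) {
        push @links, $token->[2]{href} if defined $token->[2]{href};
    }
    last if $token->[0] eq 'E' and $token->[1] eq 'ul';
}

print "$_\n" foreach @links;
```

With this sample input, only the two hrefs inside the list get printed; the link after the closing </ul> is never reached because the loop stops there.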


#!/usr/bin/perl
use strict;
use warnings;

use LWP::Simple;
use HTML::TokeParser;

my $domain = 'http://edition.cnn.com';
my $uri = $domain . '/POLITICS/';

my $html = get($uri) or die "Fetching $uri failed";
my $p = HTML::TokeParser->new(\$html);

# go to start position in the document
my $found;
while ( $p->get_tag('div') ) {
     if ( $p->get_text eq 'Election coverage' ) {
         $found = 1;
         last;
     }
}
die "Didn't find section \"Election coverage\"" unless $found;

# extract links
my @links;
while ( my $token = $p->get_token ) {
     if ( $token->[0] eq 'S' and $token->[1] eq 'a' ) {
         # skip anchors without an href attribute
         push @links, $token->[2]{href} if defined $token->[2]{href};
     }
     last if $token->[0] eq 'E' and $token->[1] eq 'ul';
}

foreach my $uri ( map $domain . $_, @links ) {
     my $html = get($uri) or warn "Fetching $uri failed" and next;
     my $p = HTML::TokeParser->new(\$html);

     # go to start position in the document
     $p->get_tag('h4');
     unless ( $p->get_text eq "Don't Miss" ) {
         warn "Didn't find section \"Don't Miss\"";
         next;
     }

     print "$uri\n";

     # extract links
     while ( my $token = $p->get_token ) {
         if ( $token->[0] eq 'S' and $token->[1] eq 'a' ) {
             print '  ', $p->get_text, "\n";
             my $uri = substr($token->[2]{href}, 0, 4) eq 'http' ?
               $token->[2]{href} : $domain . $token->[2]{href};
             print "  $uri\n\n";
         }
         last if $token->[0] eq 'E' and $token->[1] eq 'ul';
     }
}
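
A possible refinement, not part of the script above: the substr() test only distinguishes links that start with 'http' from everything else, and simply prepends the domain otherwise. If the pages also contain links relative to the current directory, the URI module (normally installed along with LWP) can resolve them against a base URL. A rough sketch, where absolute_link() is just a hypothetical helper name and the sample hrefs are invented:

```perl
#!/usr/bin/perl
use strict;
use warnings;

use URI;

# Hypothetical helper: resolve $href against $base and return an absolute URL string
sub absolute_link {
    my ( $href, $base ) = @_;
    return URI->new_abs( $href, $base )->as_string;
}

my $base = 'http://edition.cnn.com/POLITICS/';

# Invented sample hrefs: root-relative, already absolute, directory-relative
foreach my $href ( '/2008/POLITICS/story.html',
                   'http://www.example.com/page.html',
                   'other.html' ) {
    print absolute_link( $href, $base ), "\n";
}
```

Here the root-relative link is resolved against the host, the absolute link is left untouched, and 'other.html' ends up under /POLITICS/.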

-- 
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl


