
HTTP::Cookies

From: Mike Clark
Date: April 3, 2003 09:27
Subject: HTTP::Cookies
Message ID: 05f501c2fa06$72d0a7a0$0100a8c0@nicky

I see from researching the archives of this list that people have
succeeded in getting HTTP::Cookies to work with a login, along with
HTML::Form.

Maybe someone can suggest some methods to me here.
I will be grateful for any help I can get.

Thanks in advance,

Mike Clark
nuts@coconutisland.com
Toll Free 888 999 2181

Here is the project:

We have developed a simple Perl spider which runs from a command prompt.
It accepts a list of URLs, then downloads the web pages to a directory.

I want to adapt it to spider password-protected ASP pages.  We have
purchased a membership to this site, but the download is too slow.

When I log in manually with a browser, the site sets a cookie. After that,
any URL entered separately in the location field during that browser session
will work -- that is, it does not require a specific referer page, it only
requires the cookie.  When I disable cookies in Netscape, the login does not
work.

When I have a browser session open in Internet Explorer, any URL pasted into
the location field will work, but if I open a new browser window, the pasted
URL will not work in the new window.  However, when I log in with
user/password in a second browser window (either Netscape or Explorer), then
URLs pasted into the second browser window do work.

Conclusion:  the site is reading a cookie that belongs to the specific
browser session.
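
A minimal offline sketch of that behavior using HTTP::Cookies. The host name, cookie name, and cookie value below are placeholders, not the real site's; the cookie is planted by hand to stand in for whatever the login actually sets.

```perl
use strict;
use warnings;
use HTTP::Cookies;
use HTTP::Request;

# Hypothetical host; the real site's name is not given in this post.
my $host = 'www.example.com';

# In-memory jar: it lives only as long as the script, like a browser session.
my $jar = HTTP::Cookies->new;

# Simulate the session cookie a successful login would set (name and value
# are assumptions). Arguments: version, key, value, path, domain, port,
# path_spec, secure, maxage, discard.
$jar->set_cookie(0, 'SESSIONID', 'abc123', '/', $host, undef, 1, 0, undef, 1);

# Any later request to the same host picks up the cookie, whichever page it
# asks for, with no Referer header involved.
my $req = HTTP::Request->new(GET => "http://$host/members/page42.asp");
$jar->add_cookie_header($req);
my $with_cookie = $req->header('Cookie');

# A brand-new jar (like a brand-new browser window) has nothing to send.
my $fresh = HTTP::Request->new(GET => "http://$host/members/page42.asp");
HTTP::Cookies->new->add_cookie_header($fresh);
my $without_cookie = $fresh->header('Cookie');

print "Same session: ", (defined $with_cookie    ? $with_cookie    : '(none)'), "\n";
print "New session:  ", (defined $without_cookie ? $without_cookie : '(none)'), "\n";
```

If this matches what the site does, the spider only has to keep one cookie jar alive across the login and all later requests.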

Project:  first the script has to log in so that the cookie is set, then it
has to download a list of URLs from the site.
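
One way those two steps might be arranged, as a sketch only. The helper names make_agent and fetch_with_session are made up for illustration, and the login request is assumed to be built elsewhere. Nothing here touches the network until fetch_with_session is called.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;
use HTTP::Request::Common qw(GET);

# Build one user agent whose cookie jar is shared by every request it makes.
sub make_agent {
    my $ua = LWP::UserAgent->new(agent => 'OurBot/1.0');
    $ua->cookie_jar(HTTP::Cookies->new);    # in-memory jar, one per run
    return $ua;
}

# Send the login request first, then fetch each URL with the same agent so
# the cookie set by the login goes out with every later request.
# Returns a hash of url => HTTP::Response.
sub fetch_with_session {
    my ($ua, $login_request, @urls) = @_;

    my $login = $ua->request($login_request);
    # Some sites answer a login POST with a redirect, so accept 3xx too.
    die "Login failed: ", $login->status_line, "\n"
        unless $login->is_success || $login->is_redirect;

    my %pages;
    for my $url (@urls) {
        $pages{$url} = $ua->request(GET $url);
    }
    return %pages;
}

my $ua = make_agent();
```

By default LWP::UserAgent follows redirects only for GET and HEAD, so a redirect after the login POST comes back as a 3xx response; the cookie jar still records any cookies that response sets.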

This is the login form:

<form name="login" method="post" action="main.asp"  onSubmit="validate();" >
Enter Email ID <input type="text" name="email" size="25" maxlength="50"><br>
Enter Password <input type="password" name="password" size="25"
maxlength="30" ><br>
<input type ="hidden" name ="Browser" value ="">
<input type="hidden" name="submitted" value="Y">
<input type="submit" name="Login" value="Submit">
<SCRIPT language="JavaScript">
if (document.location.search == "?message=Y"){
document.write("ID/Password not found. Please register or try again.");
}
</script>
</form>
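
A sketch of how HTML::Form could turn this form into a login request without sending anything. The base URL and the credentials are placeholders; the base URL is needed only to resolve the relative action "main.asp". The inline script is left out of the copy below because HTML::Form does not run JavaScript.

```perl
use strict;
use warnings;
use HTML::Form;

# The login form quoted above, minus the inline script and onSubmit handler.
my $html = <<'HTML';
<form name="login" method="post" action="main.asp">
<input type="text" name="email" size="25" maxlength="50">
<input type="password" name="password" size="25" maxlength="30">
<input type="hidden" name="Browser" value="">
<input type="hidden" name="submitted" value="Y">
<input type="submit" name="Login" value="Submit">
</form>
HTML

# Hypothetical URL of the page that carries the form.
my $base = 'http://www.example.com/login.asp';

# In scalar context parse() returns the first form on the page.
my $form = HTML::Form->parse($html, $base);
die "No form found\n" unless $form;

# Placeholder credentials.
$form->value(email    => 'user@example.com');
$form->value(password => 'secret');

# click() returns an HTTP::Request; nothing is sent yet.
my $login_request = $form->click('Login');

print $login_request->method, ' ', $login_request->uri, "\n";
print $login_request->content, "\n";
```

The resulting request can be handed to the request method of an LWP::UserAgent that has a cookie jar. Whatever the page's validate() function does on submit, possibly including filling in the hidden Browser field, will not happen here, so that value may have to be set by hand if the server checks it.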


Here is the existing script:



#!/usr/bin/perl

require LWP::UserAgent;
require HTTP::Request;
require HTTP::Response;
use HTTP::Request::Common;

# Expect exactly two arguments: the file of URLs and the output directory.
die "Usage: $0 inputfile outdir\n" unless @ARGV == 2;
($inputfile, $outdir) = @ARGV;

print "Welcome\n";

print "Opening inputfile... ";
open (LINKFILE,"$inputfile") or die "Couldn't open the inputfile, $!";
@links = <LINKFILE>;
close(LINKFILE);
print "Success!\n";

# unless (-e $outdir){
#       print "Directory doesn't exist... Creating\n";
#       mkdir "$outdir", 755 or die "Couldn't make directory, $!";
# }

if (-d $outdir) {
        print "Output directory exists!\n";
}
else {
        # Permissions must be octal: 0755, not decimal 755.
        mkdir($outdir, 0755) or die "Couldn't make directory, $!";
        print "Output directory created!\n";
}

print "Changing directory... ";
chdir "$outdir" or die "Couldn't change directory, $!";
print "Success!\n";

# Check to see if we hung up last time
# this doesn't resume, just warns you that it stopped somewhere
# in earlier versions of the program i had problems with the
# program hanging, but I don't know why.
if (-e "spiderlog.txt"){
        open (LOG,"spiderlog.txt");
        @spiderlog = reverse <LOG>;
        close(LOG);

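        # chomp returns the number of characters removed, not the string,
        # so copy the line first and chomp the copy.
        chomp($lastline = $spiderlog[0]);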

        if ($lastline ne "Done"){
                print "Spider not finished... Last line in log says: $lastline\n";
        }
}

$filenum = 1;

$ua = LWP::UserAgent->new;

$ua->agent('OurBot/1.0');

print "Start spidering process...\n\n";
$total = @links;

$start = time();

open (LOG,">>spiderlog.txt") or die "Couldn't open spiderlog.txt, $!";
print LOG "Started at: $start\n\n";

foreach $line (@links){

        print "Getting $line";
        $response = $ua->request(GET $line);

        if ($response->is_success) {

                $content = $response->content;

                # Zero-pad the file number to four digits.
                $filenum = sprintf("%04d", $filenum);

                open (NEWPAGE,">$filenum.html") or die "Couldn't write $filenum.html, $!";
                print NEWPAGE $response->content;
                close (NEWPAGE);
                print "$filenum.html generated\n\n";

                print LOG "$filenum - $line";

                $filenum++;
        } else { print $response->error_as_HTML; }
}

$end = time();
$parse = $end - $start;
$parse = 1 unless($parse);
$lps  = int($total/$parse);
print "$total lines in $parse seconds ($lps lines/sec)\n";

print LOG "$total lines in $parse seconds\nFinished at $end\nDone\n";
close (LOG);

print "clumping files... \n";
system "cat *.html > masterfile.htm";
print "Done!\n";

