Re: Parse and save to file

Sean M. Burke
December 17, 2000 12:20
At 11:03 AM 2000-12-17 -0800, Randal L. Schwartz wrote:
>Just make the callback also write the file:
>    my $response = do {
>      my $handle = IO::File->new(">tvschedule.html");
>      $ua->request(HTTP::Request->new(GET => $url), sub {
>        print $handle $_[0];
>        $p->parse($_[0]);
>      });
>    };

First off, don't forget to call $p->eof() at the end of the parse.  Most
people forget to do that with HTML::Parser (et al) objects.  Forgetting to
call eof() with HTML::TreeBuilder objects is particularly bothersome.
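
To make the eof() point concrete, here is a minimal sketch, assuming
HTML::Parser's version-3 handler API.  The @text collector and the sample
chunks are made up for illustration; they stand in for whatever your
callback receives.

```perl
use strict;
use warnings;
use HTML::Parser;

my @text;   # hypothetical collector for the decoded text we see
my $p = HTML::Parser->new(
  api_version => 3,
  text_h      => [ sub { push @text, shift }, 'dtext' ],
);

# Feed the document in arbitrary chunks, much as a callback would.
my @chunks = ('<p>Hello wor', 'ld');
$p->parse($_) for @chunks;

# Tell the parser there is no more input, so that it flushes
# whatever text it is still holding on to.
$p->eof;

my $joined = join '', @text;
print "$joined\n";
```

Without the eof() call, trailing text may still be sitting in the parser's
buffer when you go looking for it.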

And second off, I have a dim recollection that I was using code like this
once, and pointed out to Gisle that the callback didn't get called for
response data that comes across with a non-success status.  I.e., if your
request got an "OK, here it is..." status, then the callback would be
called for chunks of what was returned, as it was returned -- but if it
said "404 Not Found", the object that came along with it (usually an HTML
page that explained the error) wouldn't be sent to the callback.  If I
remember correctly, Gisle thought about this and judged that it would be
nice to have an option stipulating that the callback should be called even
on the object received in an unsuccessful response.  I don't remember how
long ago this was (two weeks? a year?), and I don't know whether or how
he's implemented it.

There's all sorts of messy possibilities here.  Say you turn on bad-status
callbacking: what should that do in the case of a request that gets a
redirection response that has content, and that redirects to a successful
response that also has content?  What about the case where the redirection
response has content, but redirects to a request that can't be fulfilled?
Maybe redirections should always be an exception, never causing a callback
to get called.

BTW, back in the realm of the likely:  setting a callback that blindly
feeds $p->parse is okay, assuming you /know/ the media type of the
object returned.  If, God help us, what was returned was an .au file, then
$p's parse tree would be very scary indeed.  Less perversely, if it replied
with a text/plain object (as some simple 404 handlers often emit), or an
XHTML object, then you wouldn't want to be sending either to an HTML parser
object.  Moral of the story: have your callback check the MIME type the
first time it's called.  Something like:

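  my($mime_type, $mime_type_is_good);

  $ua->request(HTTP::Request->new(GET => $url), sub {
    # LWP calls this with (data chunk, response object, protocol object)
    my($chunk, $response) = @_;
    unless(defined($mime_type)) {
      # First call: note the media type and decide whether to go on
      $mime_type = $response->content_type;
      if(lc($mime_type) eq 'text/html') {
        # or whatever would be the acceptable type(s)
        $mime_type_is_good = 1;
      } else {
        $mime_type_is_good = 0;
      }
    }
    if($mime_type_is_good) {
      print $handle $chunk;
      $p->parse($chunk);
    }
    # otherwise quietly skip the chunk
  });
  $p->eof();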

But unless you absolutely need to feed a callback, you might as well just
save to a file and then parse from the file after considering status and
content-type (and possibly other sanity-checks, like file size not being
too big or too small).
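
The save-first approach might look something like the sketch below.  The
names looks_parseable() and fetch_and_parse() and the size limits are
hypothetical, made up for illustration.  It assumes that passing a filename
as the second argument to $ua->request() saves the content to that file,
and that $p is an HTML::Parser (or subclass) object.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;
use HTML::Parser;

# Hypothetical limits; pick values that suit the pages you fetch.
my($min_size, $max_size) = (1, 1_000_000);

# True if a saved response looks safe to hand to an HTML parser.
sub looks_parseable {
  my($response, $file) = @_;
  return 0 unless $response->is_success;
  return 0 unless lc($response->content_type) eq 'text/html';
  my $size = -s $file;
  return 0 unless defined $size;    # no such file
  return(($size >= $min_size && $size <= $max_size) ? 1 : 0);
}

# Save $url to $file, then parse the file only if it passes the checks.
sub fetch_and_parse {
  my($ua, $url, $file, $p) = @_;
  my $response = $ua->request(HTTP::Request->new(GET => $url), $file);
  return 0 unless looks_parseable($response, $file);
  $p->parse_file($file);    # reads the whole file and calls eof() itself
  return 1;
}
```

You'd call it as fetch_and_parse($ua, $url, "tvschedule.html", $p) and check
the return value before trusting whatever $p built.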

Sean M. Burke
