develooper Front page | perl.libwww | Postings from April 2003

$response->base use of Content-Location: header

From: Bill Moseley
Date: April 20, 2003 08:53
Subject: $response->base use of Content-Location: header
Message ID: Pine.LNX.4.10.10304200807050.957-100000@mardy.hank.org

I'm using HTML::LinkExtor and I want absolute links extracted, plus I want
the links to be adjusted if there's a <base href=..> tag.  Makes sense,
right?

So I might create the object like:

 my $p = HTML::LinkExtor->new(\&cb, $response->base);

Now $response->base will get the base from the Content-Base: or
Content-Location: header if there's no <base> tag.
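For anyone following along, here is a minimal sketch of that setup. The base URL is a hypothetical stand-in for whatever $response->base returns, and the HTML snippet and the @links array are made up for illustration. The cb() callback is written the way the HTML::LinkExtor docs describe it (tag name, then attribute/value pairs).

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Assumption: a literal URL standing in for $response->base.
my $base = 'http://www.example.com/docs/foo.html';

my @links;

# Callback invoked once per tag that carries link attributes.
sub cb {
    my ($tag, %attr) = @_;
    # Because a base was supplied, the values arrive as absolute URI objects;
    # stringify them before storing.
    push @links, map { "$_" } values %attr;
}

my $p = HTML::LinkExtor->new(\&cb, $base);
$p->parse('<a href="bar.html">bar</a> <img src="/img/logo.png">');
$p->eof;

print "$_\n" for @links;
```

Whatever is passed as the second argument to new() decides how every relative link is resolved, which is why the choice of base matters so much here.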

The problem I'm having is if I fetch foo.html, but Apache actually returns
content from foo.html.en, then Content-Location is foo.html.en.

So, if I'm extracting links from foo.html, but Apache offers a
Content-Location header of foo.html.en, then a fragment link like
href="#jump" resolves to foo.html.en#jump, which is a link back to the same
content.  My spider ends up fetching the page twice under two different URLs.
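The effect can be reproduced with the URI module alone. This is only a sketch: the host name is made up, and without_fragment() is a hypothetical helper, not part of LWP. It shows the same "#jump" link resolved against the two candidate bases, plus one way a spider might drop fragments before comparing URLs.

```perl
use strict;
use warnings;
use URI;

# Assumption: example URLs for the requested page and the negotiated variant.
my $requested        = 'http://www.example.com/foo.html';
my $content_location = 'http://www.example.com/foo.html.en';

# The same fragment-only link resolved against each base.
my $via_location = URI->new_abs('#jump', $content_location);
my $via_request  = URI->new_abs('#jump', $requested);

# Hypothetical helper: return the URL as a string with any fragment removed.
sub without_fragment {
    my $u = URI->new("$_[0]");   # work on a fresh copy
    $u->fragment(undef);         # undef removes the fragment
    return $u->as_string;
}

print "$via_location\n";
print "$via_request\n";
print without_fragment($via_request), "\n";
```

With the request URL as the base, stripping the fragment gives back the URL already fetched, so the spider can skip it. With the Content-Location value as the base, the result is a URL the spider has not seen before.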

Looking at http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.14
I'm thinking it's incorrect to use Content-Location as the base.

But, I often find the RFCs make things less clear...

[The value of Content-Location also defines the base URI for the entity]

OK, that supports its use.  But then it goes on to say:

[The Content-Location value is not a replacement for the original
requested URI; it is only a statement of the location of the resource
corresponding to this particular entity at the time of the request. Future
requests MAY specify the Content-Location URI as the request-URI if the
desire is to identify the source of that particular entity.]

Regardless, it's breaking my spider.  So,

1) is my use of $response->base in LinkExtor correct?

2) is HTTP::Response wrong in using Content-Location?

3) What's the real world use of Content-Location?
Frankly, if I request foo.html, I (as the client) don't really need to
know the content came from foo.html.en, do I?

4) anyone else find RFCs confusing?


One place where Content-Location would be useful is when I'm spidering and
I have links for both:

   /path/to/dir/
and
   /path/to/dir/index.html

I end up indexing both as separate docs.  (I use MD5 checksums to catch
that, though).  So it would be helpful in the case of / to know the real
URI.
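The MD5 check mentioned above could look roughly like this. It is a sketch only: duplicate_of() and the %seen hash are hypothetical names, and the content strings are placeholders for the fetched page bodies.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

my %seen;   # digest => first URL seen with that content

# Hypothetical helper: returns the URL that was indexed earlier with the same
# content, or undef if this content is new (and records it).
# Note: md5_hex() expects bytes, so pass undecoded content.
sub duplicate_of {
    my ($url, $content) = @_;
    my $digest = md5_hex($content);
    return $seen{$digest} if exists $seen{$digest};
    $seen{$digest} = $url;
    return;
}

my $first  = duplicate_of('/path/to/dir/',           '<html>index page</html>');
my $second = duplicate_of('/path/to/dir/index.html', '<html>index page</html>');

print "duplicate of $second\n" if defined $second;
```

This catches the duplicate only after the second fetch has already happened, which is why knowing the real URI up front would still help.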


-- 
Bill Moseley moseley@hank.org

