perl.beginners | Postings from November 2019

Re: reading data from a web site

November 21, 2019 12:27
On Thursday, November 21, 2019 8:34:09 AM CET Olivier wrote:
> hw <> wrote:
> > On Wednesday, November 20, 2019 3:29:00 AM CET Olivier wrote:
> > > hw <> writes:
> > > > Hi,
> > > > 
> > > > how can I read data from a web site which is using multiple frames and
> > > > some javascript?
> > > 
> > > Provided that the web site does not change too often and that they don't
> > > implement stupid "security" features, this should not be too complicated.
> > > 
> > > Each frame is a web page, with its own URL. So you can examine the source
> > > code of the web page to find the URL of the first frame and second frame.
> > > 
> > > Then you can use any Perl library you like to load that URL and parse it
> > > for what you are looking for.
> > > 
> > > Then use that data to load the second frame with a URL modified to
> > > include the type of data you have selected.
> > > 
> > > Being frames makes it much easier, you should not have to care about the
> > > javascript too much.
> > 
> > The web site seems to be created by a program running on the server, i.e.
> > there is not really a web site.  When I access it with lynx or with
> > WWW::Mechanize, the answer from the server says that neither frames, nor
> > javascript is supported, and it is not possible to log in.
> Of course lynx cannot process frames. But that is not what I meant to
> tell you.
> Open the web page with your browser, Firefox, Chromium, whatever, then press
> CTRL-U to display the source. In that source, you should see some tags,
> <frame or maybe <iframe, which contain a URL.
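
The step Olivier describes, pulling the URLs out of the <frame and <iframe tags, can also be done in a script once the page source is available. The following is only a minimal sketch: the function name frame_urls and the sample frameset (including its URLs and the sid parameter) are made up for illustration, and a regex stands in for a real HTML parser.

```perl
use strict;
use warnings;

# Return the src attribute of every <frame> and <iframe> tag in $html.
# A regex is good enough for a sketch; a real HTML parser is more robust.
sub frame_urls {
    my ($html) = @_;
    my @urls;
    while ($html =~ /<i?frame\b[^>]*?\bsrc\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/gi) {
        # Exactly one of the three groups matched (double, single, no quotes).
        push @urls, defined $1 ? $1 : defined $2 ? $2 : $3;
    }
    return @urls;
}

# Hypothetical frameset; the markup and URLs of the real site are not known.
my $html = <<'HTML';
<FRAMESET COLS="20%,80%">
  <FRAME NAME="menu" SRC="/cgi-bin/app?sid=abc123&page=menu">
  <FRAME NAME="data" SRC='/cgi-bin/app?sid=abc123&page=data'>
</FRAMESET>
HTML

print "$_\n" for frame_urls($html);
```

The URLs found this way are often relative, so they would still have to be resolved against the address of the page they came from before they can be fetched.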

When I do that, the login page is being displayed, and nothing happens when I 
press Ctrl-U.  Maybe it's because the page is already made with frames?

When I look at the source of the frame that contains the fields to enter a 
username and a password, I can see that there are inputs for those, like this:

<INPUT TYPE="TEXT" NAME="usrlogn" VALUE="" MAXLENGTH="15" SIZE="8">&nbsp;

The only URL is probably the one displayed in the address bar of the web 
browser when looking at the source of the frame.  That URL seems to point at 
the program running on the web server with parameters in the URL which have 
been created by the program.  One of the parameters seems to be a session ID.

Instead of viewing the source of the frame, I can open the frame in another tab.  
How does that help me?  There is no way to automatically get the URL for the 
frame because the parameters are being created by the program on the web 
server, and they are only valid for a short time.

> Copy that URL and try to paste it in a separate window of your web browser.
> You should see the list of topics you can select from. In fact it
> should display the contents of the 1st frame.

Well, yes, I can see the source of the frame that has the select list.  That 
doesn't help me either because to get the data I want, I need to select 
entries from the select list.  Selecting such an entry results in another 
frame being updated; that frame shows a table.

I can get the URL of that frame from the frame info of the web browser and 
download the frame and convert its table into a CSV and put the data into a 
database --- but I cannot get the URL of the frame other than copying it 
manually from the frame info of the web browser.
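
For the conversion step mentioned above (frame with a table to CSV), a minimal sketch could look like the following. It assumes that the table rows and cells have closing tags and that only the first table matters; the function names and the sample table are invented for illustration, and only the &nbsp; entity is handled.

```perl
use strict;
use warnings;

# Quote one CSV field: wrap it in double quotes when it contains a quote,
# comma or line break, and double any quotes inside it.
sub csv_field {
    my ($text) = @_;
    return $text unless $text =~ /[",\r\n]/;
    (my $quoted = $text) =~ s/"/""/g;
    return qq("$quoted");
}

# Convert the rows of the first <table> in $html to CSV text.
sub table_to_csv {
    my ($html) = @_;
    my ($table) = $html =~ m{<table\b[^>]*>(.*?)</table>}is or return '';
    my $csv = '';
    while ($table =~ m{<tr\b[^>]*>(.*?)</tr>}gis) {
        my $row = $1;
        my @cells;
        while ($row =~ m{<t[dh]\b[^>]*>(.*?)</t[dh]>}gis) {
            my $cell = $1;
            $cell =~ s/<[^>]+>//g;      # drop tags nested inside the cell
            $cell =~ s/&nbsp;/ /g;      # the only entity handled here
            $cell =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
            push @cells, csv_field($cell);
        }
        $csv .= join(',', @cells) . "\n" if @cells;
    }
    return $csv;
}

# Hypothetical table; the real frame's content is not known.
my $html = <<'HTML';
<TABLE>
<TR><TH>Item</TH><TH>Value</TH></TR>
<TR><TD>alpha</TD><TD>1,5</TD></TR>
</TABLE>
HTML

print table_to_csv($html);
```

Modules such as HTML::TableExtract and Text::CSV from CPAN do this more robustly; the sketch sticks to core Perl so it runs without installing anything.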

> If it does not, you are not in too good a shape.
> If it works, go back to the source code and locate the second <frame
> tag, find the URL, copy, new window, paste.
> The concept is to access the contents of the frames directly, without
> accessing the main page.
> Best regards,
> Olivier
> > Can WWW::Mechanize somehow trick the server into assuming that frames and
> > javascript are supported by the client?

Like I said, there are no frames to do anything with when the web site is 
being accessed with WWW::Mechanize.

I can only see that when I select an entry from the select list, the web 
browser sends a POST request for a subdocument and then right away makes a GET 
request for a style sheet.  Unfortunately, the browser doesn't tell me what 
the POST request looks like.  It should have something to do with what is 
selected from the list ...
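
The developer tools of Firefox and Chromium (F12, Network tab) list each request together with the form data that was sent, which is one way to see what the POST looks like. Once the URL and field names are known, the request can be repeated from a script. The sketch below uses HTTP::Tiny instead of WWW::Mechanize because it ships with Perl since 5.14; the field names sid and topic, their values and the agent string are assumptions, not taken from the real site.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# The agent string is arbitrary; some servers look at it, many do not.
my $http = HTTP::Tiny->new(agent => 'Mozilla/5.0 (sketch)');

# Hypothetical form fields: copy the real names and values from the
# request shown in the browser's network monitor.
my %form = (
    sid   => 'abc123',    # session ID parameter (name assumed)
    topic => 'entry1',    # name of the select list (assumed)
);

# This is the request body that is sent for such a form.
print $http->www_form_urlencode(\%form), "\n";

# Send the request only when a real URL is given on the command line.
if (@ARGV) {
    my $res = $http->post_form($ARGV[0], \%form);
    print "$res->{status} $res->{reason}\n";
    print $res->{content} if $res->{success};
}
```

If the session ID is only valid for a short time, the script would have to obtain it from an earlier response in the same run rather than from a copied URL.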
