Hi Fabrizio,
see below for my response.
On Fri, 6 Apr 2012 03:30:46 -0700 (PDT)
Fabrizio Di Carlo <dicarlo.fabrizio@gmail.com> wrote:
> Hello to all,
>
> I'm very new to Perl, and every day I understand better how powerful this language is, but I have a problem:
>
> I'm using Perl with Selenium to scrape data (for a job); the code looks like this:
>
> [code]
> use strict;
> use warnings;
> use Time::HiRes qw(sleep);
> use Test::WWW::Selenium;
> use Test::More "no_plan";
> use Test::Exception;
>
>
> open (INFO, '>>database.csv') or die "$!";
> print INFO ("titolo\;descrizione\;schedaTecnica\;URLFoto\n");
> my $sel = Test::WWW::Selenium->new( host => "localhost",
> port => 4444,
> browser => "*chrome",
> browser_url => "http://www.example.com/it/page.html" );
>
> sub estrai{
> $sel->wait_for_page_to_load_ok("30000");
> my $titolo = $sel->get_text("//h1");
> my $schedaTecnica = $sel->get_text("//td[3]/ul");
> my $img = $sel->get_attribute("//div/img\@src");
> my $descrizione = $sel->get_text("//td[2]");
> print INFO ("$titolo\;$descrizione\;$schedaTecnica\;$img\n");
> $sel->go_back_ok();
> $sel->wait_for_page_to_load_ok("30000");
> }
>
> $sel->open_ok("/it/page.html");
> $sel->click_ok("//div[2]/div/div/div[2]/h3/a");
> $sel->wait_for_page_to_load_ok("30000");
> $sel->click_ok("//div[2]/div/div/div[2]/h3/a");
> $sel->wait_for_page_to_load_ok("30000");
> estrai($sel);
> ...
> close (INFO);
> [/code]
>
> Unfortunately my CSV comes out badly because (sometimes) when I extract data from "//ul" the file looks like this:
>
> [code]
> Art. S500 Set Yoga "Siddhartha";Idea regalo ?SET YOGA Siddhartha? Elegante scatola in cartone lucido contenente:
> 2 mattoni in legno naturale mis. cm 20 x 12,5 x 7
>
> 1 cinghia in cotone mis. cm 4 x 235
>
> 1 stuoia in cotone mis. cm 70 x 170
>
> 1 manuale di introduzione allo yoga stampato
>
>
>
> Tutto rigorosamente realizzato con materiali naturali;€ 82,50;../images/S500%20(Custom).jpg
> [/code]
> So when I extract the data I need to apply UTF-8 encoding and remove the blank lines between entries. How can I do that?
>
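You should experiment with the encoding layer of file handles (e.g. «binmode $myfh, ":encoding(utf8)"»)
and with Encode.pm's decode() and encode() functions. The layer encodes characters as they are
written out, while decode() turns bytes you have read into Perl characters. For
me at least, it usually takes some trial and error.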
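
As a rough sketch of what that could look like (not a tested fix for your script): the snippet below decodes a byte string, collapses the blank lines and other whitespace runs into single spaces, and appends the result to database.csv through a UTF-8 output layer. The clean_field() helper and the sample string are made up for illustration. I am also assuming the text arrives as raw UTF-8 bytes; if get_text() already hands back decoded characters, the decode() step should be dropped, which is part of the trial and error.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: trim the field and squeeze every run of whitespace
# (including the newlines between list items) into a single space, so that
# each record stays on one line of the CSV.
sub clean_field {
    my ($text) = @_;
    $text =~ s/\A\s+|\s+\z//g;
    $text =~ s/\s+/ /g;
    return $text;
}

# Stand-in for what the browser returns; in the real script this would be
# the value of $sel->get_text(...). Assumed here to be UTF-8 bytes.
my $raw_bytes = "1 stuoia in cotone\n\n\n\nTutto naturale\n \xE2\x82\xAC 82,50\n";

# Bytes -> Perl characters, then tidy up the whitespace.
my $field = clean_field( decode('UTF-8', $raw_bytes) );

# Lexical file handle, three-argument open, and an encoding layer so that
# characters are written out as UTF-8.
open my $out, '>>:encoding(UTF-8)', 'database.csv'
    or die "Cannot open database.csv: $!";
print {$out} join(';', 'titolo', $field), "\n";
close $out
    or die "Cannot close database.csv: $!";
```

Calling clean_field() on each value just before the print in estrai() would be the natural place for it. Note that a field which itself contains a ";" would still break the columns.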
Regards,
Shlomi Fish