Hello all,
I'm very new to Perl, but every day I understand better how powerful this language is. I have a problem, though:
I'm using Perl with Selenium to scrape data (for a job). The code looks like this:
[code]
use strict;
use warnings;
use Time::HiRes qw(sleep);
use Test::WWW::Selenium;
use Test::More "no_plan";
use Test::Exception;
open (INFO, '>>database.csv') or die "$!";
print INFO ("titolo\;descrizione\;schedaTecnica\;URLFoto\n");
my $sel = Test::WWW::Selenium->new(
    host        => "localhost",
    port        => 4444,
    browser     => "*chrome",
    browser_url => "http://www.example.com/it/page.html",
);
sub estrai{
    $sel->wait_for_page_to_load_ok("30000");
    my $titolo = $sel->get_text("//h1");
    my $schedaTecnica = $sel->get_text("//td[3]/ul");
    my $img = $sel->get_attribute("//div/img\@src");
    my $descrizione = $sel->get_text("//td[2]");
    print INFO ("$titolo\;$descrizione\;$schedaTecnica\;$img\n");
    $sel->go_back_ok();
    $sel->wait_for_page_to_load_ok("30000");
}
$sel->open_ok("/it/page.html");
$sel->click_ok("//div[2]/div/div/div[2]/h3/a");
$sel->wait_for_page_to_load_ok("30000");
$sel->click_ok("//div[2]/div/div/div[2]/h3/a");
$sel->wait_for_page_to_load_ok("30000");
estrai($sel);
...
close (INFO);
[/code]
Unfortunately, my CSV comes out broken: sometimes, when I extract data from "//ul", the text contains line breaks and unreadable characters, and the file looks like this:
[code]
Art. S500 Set Yoga "Siddhartha";Idea regalo ?SET YOGA Siddhartha? Elegante scatola in cartone lucido contenente:
2 mattoni in legno naturale mis. cm 20 x 12,5 x 7
1 cinghia in cotone mis. cm 4 x 235
1 stuoia in cotone mis. cm 70 x 170
1 manuale di introduzione allo yoga stampato
Tutto rigorosamente realizzato con materiali naturali;€ 82,50;../images/S500%20(Custom).jpg
[/code]
So when I extract the data, I need to write it with UTF-8 encoding and remove the line breaks inside each field, so that every record stays on one line. How can I do that?
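To make the goal concrete, here is a minimal sketch of the kind of cleanup I have in mind. It is not tested against the real site. `clean_field` is a hypothetical helper, not part of Test::WWW::Selenium, and the sample string stands in for a value returned by `$sel->get_text(...)`. I am also assuming that the scraped text arrives as decoded character strings, so that an `:encoding(UTF-8)` layer on the output file is enough.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Test::WWW::Selenium): flatten one scraped
# value so that it fits into a single semicolon-separated field.
sub clean_field {
    my ($text) = @_;
    return '' unless defined $text;
    $text =~ s/^\s+|\s+$//g;   # trim leading and trailing whitespace
    $text =~ s/\s+/ /g;        # collapse newlines, tabs and repeated spaces
    $text =~ s/;/,/g;          # keep the field separator out of the data
    return $text;
}

# Sample values standing in for the results of $sel->get_text(...)
my $descrizione = "Idea regalo\n2 mattoni in legno naturale\n 1 cinghia in cotone ";
my $prezzo      = "\x{20AC} 82,50";   # euro sign followed by the price

# Three-argument open with a lexical filehandle; the :encoding(UTF-8)
# layer encodes characters as UTF-8 when they are written.
open(my $info, '>>:encoding(UTF-8)', 'database.csv')
    or die "Cannot open database.csv: $!";
print {$info} clean_field($descrizione), ";", clean_field($prezzo), "\n";
close($info) or die "Cannot close database.csv: $!";
```

Replacing ";" with "," inside the data is only one possible choice; quoting the fields, or using a CSV module, would be an alternative.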
Thanks in advance