perl.beginners | Postings from April 2012

Problems: UTF8 charset and print on line from data extract

From: Fabrizio Di Carlo
Date: April 9, 2012 03:03
Subject: Problems: UTF8 charset and print on line from data extract
Message ID: 15071503.573.1333708246645.JavaMail.geo-discussion-forums@vbug19
Hello everyone,

I'm very new to Perl, and the more I use it the more I understand how powerful this language is, but I have a problem:

I'm using Perl with Selenium to scrape data (for a job). The code looks like this:

[code]
use strict;
use warnings;
use Time::HiRes qw(sleep);
use Test::WWW::Selenium;
use Test::More "no_plan";
use Test::Exception;


open (INFO, '>>database.csv') or die "$!";	
print INFO ("titolo\;descrizione\;schedaTecnica\;URLFoto\n");									
my $sel = Test::WWW::Selenium->new( host => "localhost", 
                                    port => 4444, 
                                    browser => "*chrome", 
                                    browser_url => "http://www.example.com/it/page.html" );

sub estrai{
	$sel->wait_for_page_to_load_ok("30000");
	my $titolo = $sel->get_text("//h1");
	my $schedaTecnica = $sel->get_text("//td[3]/ul");
	my $img = $sel->get_attribute("//div/img\@src");
	my $descrizione = $sel->get_text("//td[2]");
	print INFO ("$titolo\;$descrizione\;$schedaTecnica\;$img\n");
	$sel->go_back_ok();
	$sel->wait_for_page_to_load_ok("30000");
}
									
$sel->open_ok("/it/page.html");
$sel->click_ok("//div[2]/div/div/div[2]/h3/a");
$sel->wait_for_page_to_load_ok("30000");
$sel->click_ok("//div[2]/div/div/div[2]/h3/a");
$sel->wait_for_page_to_load_ok("30000");
estrai($sel);
...
close (INFO);
[/code]

Unfortunately my CSV comes out badly formed because, sometimes, when I extract data from "//ul", the file looks like this:

[code]
Art. S500 Set Yoga "Siddhartha";Idea regalo ?SET YOGA Siddhartha? Elegante scatola in cartone lucido contenente:
 2 mattoni in legno naturale mis. cm 20 x 12,5 x 7
 
 1 cinghia in cotone mis. cm 4 x 235
 
 1 stuoia in cotone mis. cm 70 x 170
 
 1 manuale di introduzione allo yoga stampato
 
 
 
 Tutto rigorosamente realizzato con materiali naturali;€ 82,50;../images/S500%20(Custom).jpg
[/code]
So when I extract the data I need to write it with UTF-8 encoding and to eliminate the blank space between lines. How can I do that?
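
A minimal sketch of one way both points could be handled, not a verified fix for this site. It assumes that `get_text` returns decoded character strings; if it returns raw bytes, they would need to go through `Encode::decode('UTF-8', ...)` first. The helper names `pulisci`, `riga_csv` and `apri_csv` are hypothetical and are not part of the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse every run of whitespace (spaces, tabs,
# newlines) into a single space and trim both ends, so that one extracted
# field always stays on one line.
sub pulisci {
    my ($testo) = @_;
    $testo = '' unless defined $testo;
    $testo =~ s/\s+/ /g;   # newlines and blank lines become one space
    $testo =~ s/^ //;      # drop a leading space, if any
    $testo =~ s/ $//;      # drop a trailing space, if any
    return $testo;
}

# Hypothetical helper: build one semicolon-separated record from the
# cleaned fields. It does not quote fields that themselves contain ';'.
sub riga_csv {
    my @campi = @_;
    return join(';', map { pulisci($_) } @campi) . "\n";
}

# Hypothetical helper: open the CSV for appending through an encoding
# layer, so that characters such as the euro sign are written as UTF-8.
sub apri_csv {
    my ($percorso) = @_;
    open(my $fh, '>>:encoding(UTF-8)', $percorso)
        or die "Cannot open $percorso: $!";
    return $fh;
}
```

Inside `estrai`, the print could then become `print {$info} riga_csv($titolo, $descrizione, $schedaTecnica, $img);`, with `my $info = apri_csv('database.csv');` replacing the bareword `INFO` handle. If a field can contain a semicolon or a double quote, a CSV module such as Text::CSV would be a safer choice than joining the fields by hand.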

Thanks in advance

