List: perl.perl5.porters

Re: Whither encoding::warnings?

From: George Greer
Date: May 24, 2016 17:41
Subject: Re: Whither encoding::warnings?
Message ID: alpine.LFD.2.20.1605241238560.12422@drei.m-l.org

On Tue, 24 May 2016, Aristotle Pagaltzis wrote:

> * George Greer <perl@greerga.m-l.org> [2016-05-24 00:23]:
>> We rely on it at $WORK because we can't assume the source text is
>> Latin-1 and both sides should have already been upgraded before doing
>> operations.
>
> Do you rely on it to catch things in production? Or to catch problems
> during development? What problems would you have if it went silent and
> stopped warning?

We leave it on in both, in fatal mode.  We're mostly doing bulk backend 
ETL transfers, so we'd rather the program fail entirely than silently pass 
mojibake through.  A typical job loads data in one character set, does 
some transformations, then normalizes it to another character set, often 
among UTF-8, ASCII, and Windows-1252, but almost never Latin-1.
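
A minimal sketch of that fail-hard style of conversion, using only the core Encode module. The helper name transcode_or_die and the sample data are made up for illustration and are not taken from the programs described above; the only technique shown is passing FB_CROAK so that unmappable or malformed input dies instead of being substituted.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert octets from one named encoding to another,
# dying instead of substituting when input is malformed or has no mapping.
sub transcode_or_die {
    my ($octets, $from, $to) = @_;
    # LEAVE_SRC keeps Encode from modifying the caller's buffer in place.
    my $check = Encode::FB_CROAK | Encode::LEAVE_SRC;
    my $chars = decode($from, $octets, $check);   # octets -> characters
    # ... any transformations would operate on $chars here ...
    return encode($to, $chars, $check);           # characters -> octets
}

# UTF-8 octets for "na" + U+00EF + "ve " + U+20AC + "5"
my $utf8_in = "na\xC3\xAFve \xE2\x82\xAC5";

# Both characters exist in Windows-1252, so this succeeds.
my $cp1252 = transcode_or_die($utf8_in, "UTF-8", "cp1252");

# Neither exists in ASCII, so this dies; eval only keeps the demo running.
my $ascii_ok = eval { transcode_or_die($utf8_in, "UTF-8", "ascii"); 1 };
print "ASCII conversion refused: $@" unless $ascii_ok;
```

In a batch job the eval would normally be left out, so the die ends the run rather than letting damaged data through.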

During the programs' organic evolution to handle character sets properly 
(i.e., not just load everything and pretend it is bytes to Perl), the 
pragma was a great help in finding the places where we hadn't properly 
decoded the data, or had mixed it with other external data that wasn't 
decoded, or had done the decoding incorrectly, or where HTML::Entities 
surprised us with Latin-1 due to the chr() 128-255 window in %entity2char.
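
The class of bug described above can be reproduced with core modules alone. The following is an illustrative sketch with invented sample data, not code from the programs in question: one input is decoded, the other is not, and concatenating them makes Perl upgrade the undecoded octets as if they were Latin-1.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $check = Encode::FB_CROAK | Encode::LEAVE_SRC;

my $raw_a = "caf\xC3\xA9";     # UTF-8 octets for "caf" + U+00E9
my $raw_b = "\xE2\x82\xAC";    # UTF-8 octets for U+20AC (euro sign)

my $decoded_b = decode("UTF-8", $raw_b, $check);   # a single character

# Bug: $raw_a was never decoded. Joining it to a character string
# upgrades each of its octets as a Latin-1 character, with no warning.
my $mixed = $raw_a . " " . $decoded_b;
my $bad   = encode("UTF-8", $mixed, $check);

# Correct: decode every input before combining.
my $good = encode("UTF-8",
    decode("UTF-8", $raw_a, $check) . " " . $decoded_b, $check);

printf "bad:  %s\n", join " ", map { sprintf "%02X", ord } split //, $bad;
printf "good: %s\n", join " ", map { sprintf "%02X", ord } split //, $good;
```

The "bad" line shows the two octets C3 A9 turned into the four octets C3 83 C2 A9, which is the usual double-encoding signature. utf8::is_utf8() returns false for $raw_a and true for $decoded_b, which is the same distinction the workaround below keys on.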

I forget the details, but the workaround was to:

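   use Encode qw(find_encoding);
   use HTML::Entities ();

   my $iso_8859_1 = find_encoding("iso-8859-1");
   # Upgrade entity values that are still byte strings (the chr() 128-255 window).
   for (grep { not utf8::is_utf8($_) } values %HTML::Entities::entity2char) {
     $_ = $iso_8859_1->decode($_);
   }
   # Keys can't be changed in place, so re-insert each one under its decoded form.
   for (grep { not utf8::is_utf8($_) } keys %HTML::Entities::char2entity) {
     $HTML::Entities::char2entity{$iso_8859_1->decode($_)} = delete $HTML::Entities::char2entity{$_};
   }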

Note that we're on 5.10.0 (+1 local patch from later releases), so that 
horrible block above may not be needed on newer perls, although since 
"unicode_strings" is lexically scoped it might not help inside 
HTML::Entities.
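
For reference, a small sketch of what lexical scoping means for the "unicode_strings" feature (available from 5.12, so not on the 5.10.0 mentioned above). The sub names are invented for the example. It shows only that the feature applies to code compiled inside its scope, which is why enabling it in a caller would not change how code inside another module treats its own byte strings.

```perl
use strict;
use warnings;
no feature 'unicode_strings';   # make the default explicit for this file

my $e_acute = chr(0xE9);        # U+00E9 as a byte string, UTF8 flag off

# Compiled outside the feature's scope: uc() only knows ASCII here.
sub uc_outside { return uc $_[0] }

{
    use feature 'unicode_strings';
    # Compiled inside the scope: uc() applies Unicode rules.
    sub uc_inside { return uc $_[0] }
}

printf "outside: %02X\n", ord(uc_outside($e_acute));
printf "inside:  %02X\n", ord(uc_inside($e_acute));
```

The same string comes back unchanged (E9) from the first sub and uppercased (C9) from the second, even though both are called from the same place.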



-- 
George Greer
