On Tue, 24 May 2016, Aristotle Pagaltzis wrote:

> * George Greer <perl@greerga.m-l.org> [2016-05-24 00:23]:
>> We rely on it at $WORK because we can't assume the source text is
>> Latin-1 and both sides should have already been upgraded before doing
>> operations.
>
> Do you rely on it to catch things in production? Or to catch problems
> during development? What problems would you have if it went silent and
> stopped warning?

We leave it on during both in fatal mode. We're mostly doing bulk
backend ETL transfers, so we'd rather the program fail entirely than
silently send through mojibaked data.

Part of our typical operation is to load data in some character set, do
some transformations, then normalize it to another character set, often
between UTF-8, ASCII, and Windows-1252, but almost never Latin-1.

During the programs' organic evolution to handle character sets properly
(i.e., not just load everything and pretend it is bytes to Perl), the
pragma was of great help to find the places where we hadn't properly
decoded the data, or mixed it in with other external data that wasn't
decoded, or did the decoding incorrectly, or had HTML::Entities surprise
us with Latin-1 due to the chr() 128-255 window in %entity2char.

I forget the details, but the workaround was to:

    use Encode qw(find_encoding);
    use HTML::Entities;

    my $iso_8859_1 = find_encoding("iso-8859-1");

    # Upgrade the entity values that are still byte strings.
    for (grep { not utf8::is_utf8($_) } values %HTML::Entities::entity2char) {
        $_ = $iso_8859_1->decode($_);
    }

    # Re-key the reverse table under the decoded form of each key.
    for (grep { not utf8::is_utf8($_) } keys %HTML::Entities::char2entity) {
        $HTML::Entities::char2entity{$iso_8859_1->decode($_)} =
            delete $HTML::Entities::char2entity{$_};
    }

Note that we're using 5.10.0 (+1 local patch from later releases) so
that horrible block above may not be needed, although if
"unicode_strings" is lexical it might not help for HTML::Entities.

-- 
George Greer
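
For readers unfamiliar with the load / transform / normalize pattern described in the message, here is a minimal sketch of it. It is not the poster's actual code: the transcode() helper and the sample data are hypothetical, and it uses Encode's FB_CROAK check (rather than the pragma discussed in the thread) to make a bad conversion die instead of letting mojibake through.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert bytes from one character set to another,
# dying rather than passing along data that cannot be converted.
sub transcode {
    my ($bytes, $from, $to) = @_;

    # FB_CROAK makes decode/encode die on malformed or unmappable input.
    my $chars = decode($from, $bytes, Encode::FB_CROAK);  # bytes -> characters
    $chars =~ s/\s+\z//;                                   # example transformation
    return encode($to, $chars, Encode::FB_CROAK);          # characters -> bytes
}

# Windows-1252 input: 0xE9 is e-acute, 0x80 is the euro sign.
my $cp1252 = "caf\xE9 \x80 5\n";
my $utf8   = transcode($cp1252, 'cp1252', 'UTF-8');

# The euro sign has no ASCII equivalent, so this conversion dies.
my $ok = eval { transcode($cp1252, 'cp1252', 'ascii'); 1 };
print "ASCII conversion refused: $@" unless $ok;
```

Wrapping the call in eval, as in the last two lines, is only for demonstration; in a batch job that should fail outright, the exception would simply be left to propagate.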