
[perl #130199] Text::CSV::Encoded is incorrectly forced to parsewidechar

From: James E Keenan via RT
Date: November 28, 2016 23:04
Subject: [perl #130199] Text::CSV::Encoded is incorrectly forced to parsewidechar
Message ID: rt-4.0.24-30716-1480374232-1358.130199-15-0@perl.org

On Mon, 28 Nov 2016 12:34:02 GMT, rafal@zorro.ztk-rp.eu wrote:
> 
> This is a bug report for perl from rafal@zorro.ztk-rp.eu,
> generated with the help of perlbug 1.40 running under perl 5.20.2.
> 
> 
> -----------------------------------------------------------------
> [Please describe your issue here]
> After upgrading from debian-wheezy to debian-jessie, HTML::Mason started
> to behave strangely with respect to UTF8 encoding. Earlier, both web-pages
> and forms were working correctly (in UTF8) without any special setup. As
> of jessie with Apache 2.4, UTF8 no longer works.
> 1. I had to add binmode(STDOUT,'UTF8') to modules.
> 2. I had to decode_utf8($_) data from forms before passing them over
> to psql-db
> I file this report with example code showing the erratic behavior of
> Text::CSV::Encoded, since I could narrow the problem to just a few
> lines of test-case.
> 
> ========================
> #!/usr/bin/perl
> use Text::CSV::Encoded;
> open(my $FH, shift) or die "open";
> binmode($FH, ":encoding(cp1250) :raw :bytes");
> local $/ = "\r\n";
> my $csv = Text::CSV::Encoded->new ( { encoding_in  => "cp1250",
>                         binary => 1, eol => $/, sep_char => ';',
>                 } ) or die "Cannot use CSV: ".Text::CSV->error_diag
> ();
> $\ = "\n";
> while ( <$FH> ) {
>         s/\s+$//;
>         print;
>         if ($csv->parse( $_ )) {
>                 print $csv->fields();
>         }
> }
> __END__
> 10;"SPӣDZIELNIA
> WARSZAWA";62;"TEST"
> ======================
> 
> In this example:
> 1. the test file (provided "inline") as <DATA> contains two specific
> characters from CODE-PAGE-1250, one such char just after another.
> 1a. this test file IS-NOT UTF8 encoded.
> 2. the input stream is correctly marked as CP1250
> 3. the module gets correct information as to that file encoding
> ... and yet, the module complains about encountering a "wide-char",
> which in the above setup should not ever be possible (I think).
> 
> The result of the above program is:
> =======================
> $ ./wide-char test-input
> 10;"SPӣDZIELNIA
> WARSZAWA";62;"TEST"
> Wide character in subroutine entry at /usr/share/perl5/Text/CSV/Encoded/Coder/Encode.pm line 37, <$FH> chunk 1.
> $
> =======================
> 
> This result is incorrect, since the file does not contain any "wide chars".
> 


It appears that the file does indeed contain a character which, once decoded, satisfies the condition required for the "Wide character" warning. Here's what pod/perldiag.pod in perl-5.24.0 says:

#####
=item Wide character in %s

(S utf8) Perl met a wide character (>255) when it wasn't expecting
one.  This warning is by default on for I/O (like print).  The easiest
way to quiet this warning is simply to add the C<:utf8> layer to the
output, e.g. C<binmode STDOUT, ':utf8'>.  Another way to turn off the
warning is to add C<no warnings 'utf8';> but that is often closer to
cheating.  In general, you are supposed to explicitly mark the
filehandle with an encoding, see L<open> and L<perlfunc/binmode>.
#####

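If I put your test data into a file and run it through 'od -c', I observe two bytes in the high range (octal 323 and 243, i.e. 0xD3 and 0xA3). Decoded as cp1250 these are Ó (U+00D3) and Ł (U+0141), and the second of those is above 255.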

#####
$ od -c warsaw.txt 
0000000   1   0   ;   "   S   P 323 243   D   Z   I   E   L   N   I   A
0000020  \n   W   A   R   S   Z   A   W   A   "   ;   6   2   ;   "   T
0000040   E   S   T   "  \n
0000045
#####
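
To make the link between those two bytes and the ">255" condition concrete, here is a minimal sketch that decodes them with the core Encode module and prints the resulting code points. The variable names are mine, and the byte string is simply the pair shown in the od output above.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The two high bytes from the od output: octal 323 243 = 0xD3 0xA3.
my $bytes = "\xD3\xA3";

# Interpret them as cp1250, giving a string of Perl characters.
my $chars = decode('cp1250', $bytes);

# Report each code point and whether it exceeds 255.
for my $ch (split //, $chars) {
    printf "U+%04X %s\n", ord($ch), (ord($ch) > 255 ? 'wide' : 'not wide');
}
```

On my reading this prints "U+00D3 not wide" and "U+0141 wide", so it is the decoded Ł, not the raw bytes, that counts as a wide character.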

Text::CSV::Encoded is not part of the Perl 5 core distribution, so I think including it in the test script muddies the waters.  Here's a pure Perl reduction:

#####
$ cat 2-130199-text-csv-encoded.pl 
# perl
use strict;
use warnings;

open(my $FH, '<', 'warsaw.txt') or die "open";
binmode($FH, ":encoding(cp1250)");
while ( <$FH> ) {
	s/\s+$//;
	print "$_\n";
}
close $FH or die "close";
#####
$ perl 2-130199-text-csv-encoded.pl 
Wide character in print at 2-130199-text-csv-encoded.pl line 9, <$FH> line 1.
10;"SPÓŁDZIELNIA
WARSZAWA";62;"TEST"
#####

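I think that warning is appropriate: the input is decoded from cp1250 into characters, but STDOUT has no encoding layer when the wide character is printed. However, I concede that I don't have much experience with 'cp1250', so I'm unclear what the expected behavior is. Other people on the list should comment.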
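
For comparison, here is a sketch of the same reduction with an explicit output layer, following the perldiag advice quoted above. To keep it self-contained I assume an in-memory filehandle holding the cp1250 bytes instead of warsaw.txt, and I count warnings with a `$SIG{__WARN__}` handler; both are illustration choices of mine, not part of the original report.

```perl
use strict;
use warnings;

# Raw cp1250 bytes standing in for warsaw.txt (0xD3 0xA3 are the two high bytes).
my $input = qq{10;"SP\xD3\xA3DZIELNIA\nWARSZAWA";62;"TEST"\n};

# Collect any warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# Decode on input, exactly as the reduction does with binmode.
open(my $in, '<:encoding(cp1250)', \$input) or die "open: $!";

# Declare the output encoding so that wide characters can be written.
binmode(STDOUT, ':encoding(UTF-8)');

while (my $line = <$in>) {
    $line =~ s/\s+$//;
    print "$line\n";
}
close $in or die "close: $!";

print "warnings: ", scalar(@warnings), "\n";
```

With the output layer in place I would expect no "Wide character in print" warning. Whether that helps with the Text::CSV::Encoded case in the original report is a separate question.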

Thank you very much.


---
via perlbug:  queue: perl5 status: new
https://rt.perl.org/Ticket/Display.html?id=130199
