[perl #130199] Text::CSV::Encoded is incorrectly forced to parsewidechar
From:
James E Keenan via RT
Date:
November 28, 2016 23:16
Subject:
[perl #130199] Text::CSV::Encoded is incorrectly forced to parsewidechar
Message ID:
rt-4.0.24-27900-1480374963-1685.130199-15-0@perl.org
On Mon, 28 Nov 2016 23:03:51 GMT, jkeenan wrote:
> On Mon, 28 Nov 2016 12:34:02 GMT, rafal@zorro.ztk-rp.eu wrote:
> >
> > This is a bug report for perl from rafal@zorro.ztk-rp.eu,
> > generated with the help of perlbug 1.40 running under perl 5.20.2.
> >
> >
> > -----------------------------------------------------------------
> > [Please describe your issue here]
> > After upgrading from debian-wheezy to debian-jessie HTML::Mason
> > started
> > to behave strangely with respect to UTF8 encoding. Earlier both web-
> > pages
> > and forms were working correctly (in UTF8) without any special setup.
> > As
> > of jessie with Apache 2.4 UTF8 no longer works.
> > 1. I had to add binmode(STDOUT,'UTF8') to modules.
> > 2. I had to decode_utf8($_) data from forms before passing them over
> > to psql-db
> > This report I file with example code of erratic behavior of
> > Text::CSV::Encoded
> > since I could narrow the problem to just a few lines of test-case.
> >
> > ========================
> > #!/usr/bin/perl
> > use Text::CSV::Encoded;
> > open(my $FH, shift) or die "open";
> > binmode($FH, ":encoding(cp1250) :raw :bytes");
> > local $/ = "\r\n";
> > my $csv = Text::CSV::Encoded->new ( { encoding_in => "cp1250",
> > binary => 1, eol => $/, sep_char => ';',
> > } ) or die "Cannot use CSV: ".Text::CSV->error_diag
> > ();
> > $\ = "\n";
> > while ( <$FH> ) {
> > s/\s+$//;
> > print;
> > if ($csv->parse( $_ )) {
> > print $csv->fields();
> > }
> > }
> > __END__
> > 10;"SPӣDZIELNIA
> > WARSZAWA";62;"TEST"
> > ======================
> >
> > In this example:
> > 1. the test file (provided "inline") as <DATA> contains two specific
> > characters from CODE-PAGE-1250, one such char just after another.
> > 1a. this test file IS-NOT UTF8 encoded.
> > 2. the input stream is correctly marked as CP1250
> > 3. the module gets correct information as to that file encoding
> > ... and yet, the module complains about encountering a "wide-char",
> > which in
> > the above setup should not ever be possible (I think).
> >
> > The result of the above program is:
> > =======================
> > $ ./wide-char test-input
> > 10;"SPӣDZIELNIA
> > WARSZAWA";62;"TEST"
> > Wide character in subroutine entry at
> > /usr/share/perl5/Text/CSV/Encoded/Coder/Encode.pm line 37, <$FH>
> > chunk
> > 1.
> > $
> > =======================
> >
> > This result is incorrect, since the file does not contain any "wide
> > chars".
> >
>
>
> It appears that the file does indeed contain characters which satisfy
> the condition required for the "Wide characters" warning. Here's what
> pod/perldiag.pod in perl-5.24.0 says:
>
> #####
> =item Wide character in %s
>
> (S utf8) Perl met a wide character (>255) when it wasn't expecting
> one. This warning is by default on for I/O (like print). The easiest
> way to quiet this warning is simply to add the C<:utf8> layer to the
> output, e.g. C<binmode STDOUT, ':utf8'>. Another way to turn off the
> warning is to add C<no warnings 'utf8';> but that is often closer to
> cheating. In general, you are supposed to explicitly mark the
> filehandle with an encoding, see L<open> and L<perlfunc/binmode>.
> #####
>
> If I put your test data into a file and run it through 'od -c', I
> observe two characters in the >255 range.
>
> #####
> $ od -c warsaw.txt
> 0000000 1 0 ; " S P 323 243 D Z I E L N I
> A
> 0000020 \n W A R S Z A W A " ; 6 2 ; "
> T
> 0000040 E S T " \n
> 0000045
> #####
>
> Text::CSV::Encoded is not part of the Perl 5 core distribution, so I
> think including it in the test script muddies the waters. Here's a
> pure Perl reduction:
>
> #####
> $ cat 2-130199-text-csv-encoded.pl
> # perl
> use strict;
> use warnings;
>
> open(my $FH, '<', 'warsaw.txt') or die "open";
> binmode($FH, ":encoding(cp1250)");
> while ( <$FH> ) {
> s/\s+$//;
> print "$_\n";
> }
> close $FH or die "close";
> #####
> $ perl 2-130199-text-csv-encoded.pl
> Wide character in print at 2-130199-text-csv-encoded.pl line 9, <$FH>
> line 1.
> 10;"SPÓŁDZIELNIA
> WARSZAWA";62;"TEST"
> #####
>
> I think that warning is appropriate. However, I concede that I don't
> have much experience with 'cp1250' so I'm unclear what the expected
> behavior is. Other people on list should comment.
>
> Thank you very much.
On #p5p khw has pointed out an error in my analysis: 'od -c' prints non-printable bytes in octal, so 323 and 243 are 211 and 163 in decimal, both below \0377 (decimal 255).
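To make the arithmetic concrete, here is a small sketch (not part of the original report) that converts the two octal values and then decodes the same bytes as cp1250. It assumes only the core Encode module; the variable names are illustrative. The raw bytes fit in one byte each, but after decoding the second character becomes U+0141, which is above 255, and that is the condition the "Wide character" warning is about.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The two bytes that 'od -c' displayed as octal 323 and 243.
my @bytes = (oct('323'), oct('243'));
printf "byte values: %d %d\n", @bytes;

# Decode the same bytes as cp1250 and inspect the resulting code points.
my $text = decode('cp1250', pack('C*', @bytes));
my @code_points = map { ord } split //, $text;
printf "code points: U+%04X U+%04X\n", @code_points;
```

So the file itself holds no wide bytes, yet the decoded string can still hold a character that needs an encoding layer on output.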
Also, in my test program I should have applied binmode to STDOUT as well.
#####
# perl
use strict;
use warnings;
open(my $FH, '<', 'warsaw.txt') or die "open: $!";
binmode($FH, ":encoding(cp1250)");      # decode input bytes from cp1250
binmode(STDOUT, ":encoding(cp1250)");   # encode output back to cp1250
while ( <$FH> ) {
    s/\s+$//;
    print "$_\n";
}
close $FH or die "close: $!";
#####
$ perl 2-130199-text-csv-encoded.pl
10;"SPӣDZIELNIA
WARSZAWA";62;"TEST"
#####
And once I 'binmode' STDOUT, the "Wide character" warning goes away. So, notwithstanding my errors, I still think this is not a bug -- at least not in perl-5.24.0.
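For anyone who wants to reproduce the difference without a terminal or an input file, the following is a rough sketch under a few assumptions: it prints to in-memory filehandles instead of STDOUT, and warnings_when_printing is a hypothetical helper written for this illustration. It collects any warnings raised while printing the decoded string, once with no output layer and once with an :encoding(cp1250) layer.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decoded text containing U+0141 (cp1250 byte 0xA3), which is above 255.
my $text = decode('cp1250', "SP\xD3\xA3DZIELNIA");

# Hypothetical helper: print $text to an in-memory handle opened with the
# given mode and return the warnings raised while doing so.
sub warnings_when_printing {
    my ($mode) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    open(my $out, $mode, \my $buffer) or die "open: $!";
    print {$out} $text;
    close $out or die "close: $!";
    return @warnings;
}

my @plain   = warnings_when_printing('>');
my @encoded = warnings_when_printing('>:encoding(cp1250)');

printf "no layer:       %d warning(s)\n", scalar @plain;
printf "encoding layer: %d warning(s)\n", scalar @encoded;
```

The in-memory handle only stands in for STDOUT here; the point is that the warning depends on whether the output handle has an encoding layer, not on the contents of the input file.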
Thank you very much.
--
James E Keenan (jkeenan@cpan.org)
---
via perlbug: queue: perl5 status: open
https://rt.perl.org/Ticket/Display.html?id=130199