
Re: [perl #133101] Anomalies in handling malformed utf8 input

From: Karl Williamson
Date: April 11, 2018 20:47
Subject: Re: [perl #133101] Anomalies in handling malformed utf8 input
Message ID: cc17e4b6-847d-cb35-da6b-e0207b6f6798@khwilliamson.com

On 04/11/2018 11:17 AM, Ricardo SIGNES (via RT) wrote:
> # New Ticket Created by  Ricardo SIGNES
> # Please include the string:  [perl #133101]
> # in the subject line of all future correspondence about this issue.
> # <URL: https://rt.perl.org/Ticket/Display.html?id=133101 >
> 
> 
> Mark Dominus sent me a bug report that he couldn't get perlbug to accept.
> 
> -------
> 
> This is a bug report.
> 
> The attached input file “bad” is a one-line summary of an email message
> whose subject field was malformed. The subject field is encoded in GB-2312
> and its raw bytes are invalid when interpreted as utf8.  Let us suppose
> that this data is saved in a file named bad.  Now consider the following
> invocations:
> 
> 1$ perl -lne 'print if /[ąę]/' bad                       > /dev/null
> 2$ PERL_UNICODE=39 perl -lne 'print if /[ąę]/' bad       > /dev/null
> 3$ cat bad | perl -lne 'print if /[ąę]/'                 > /dev/null
> 4$ cat bad | PERL_UNICODE=39 perl -lne 'print if /[ąę]/' > /dev/null
> Malformed UTF-8 character (fatal) at -e line 1, <> line 1.
> 
> 5$ perl -lne 'print if /ą/' bad                          > /dev/null
> 6$ PERL_UNICODE=39 perl -lne 'print if /ą/' bad          > /dev/null
> 7$ cat bad | perl -lne 'print if /ą/'                    > /dev/null
> 8$ cat bad | PERL_UNICODE=39 perl -lne 'print if /ą/'    > /dev/null

Shouldn't

use utf8

be used, given that the patterns contain literal non-ASCII characters?
> 
> There are at least two anomalies here.
> 
> Invocation 4 properly fails.  (PERL_UNICODE=39 is equivalent to supplying
> the -CAS flag to Perl.)  But invocation 8 is identical, except that the
> pattern is /ą/ instead of /[ąę]/; why doesn't this fail as well?
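
The equivalence between PERL_UNICODE=39 and -CAS can be checked by adding up the bit values that perlrun documents for the -C letters. The variable names below are only labels for this sketch.

```shell
# Bit values for the -C / PERL_UNICODE letters, as documented in perlrun.
I=1    # STDIN is assumed to be in UTF-8
O=2    # STDOUT will be in UTF-8
E=4    # STDERR will be in UTF-8
A=32   # @ARGV elements are expected to be UTF-8 strings

S=$(( I + O + E ))    # "S" is shorthand for I + O + E
CAS=$(( A + S ))      # numeric form of -CAS

# The lowercase "i" flag (8, default layer for input streams) is not part of 39.
i_bit=$(( CAS & 8 ))

echo "S=$S CAS=$CAS i_bit=$i_bit"
```

This prints S=7 CAS=39 i_bit=0, so the setting covers the standard streams and @ARGV strings but not the default layer for other input streams.
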
> 
> Invocation 2 is identical to invocation 4, except that the data comes
> from ARGV rather than being delivered on stdin.  Why doesn't this fail
> as well?  (The data itself is identical, as confirmed by
> cat bad | cmp - bad).
> 
> The complete message header is also attached (msg-hdr.txt), and the
> examples above all behave the same when I use it in place of the shorter
> excerpt.
> 
> This is perl 5, version 22, subversion 1 (v5.22.1) built for
> x86_64-linux-gnu-thread-multi
> (with 60 registered patches, see the attached output of perl -V for more
> detail)
> 
> Please cc me on replies, as I do not regularly read this list.
> 
