perl.perl5.porters | Postings from April 2018

[perl #133101] Anomalies in handling malformed utf8 input

From: Ricardo SIGNES
Date: April 11, 2018 17:17
Subject: [perl #133101] Anomalies in handling malformed utf8 input
Message ID: rt-4.0.24-25826-1523467027-1803.133101-75-0@perl.org

# New Ticket Created by  Ricardo SIGNES 
# Please include the string:  [perl #133101]
# in the subject line of all future correspondence about this issue. 
# <URL: https://rt.perl.org/Ticket/Display.html?id=133101 >


Mark Dominus sent me a bug report that he couldn't get perlbug to accept.

-------

This is a bug report.

The attached input file “bad” is a one-line summary of an email message
whose subject field was malformed. The subject field is encoded in GB-2312
and its raw bytes are invalid when interpreted as utf8.  Let us suppose
that this data is saved in a file named bad.  Now consider the following
invocations:

1$ perl -lne 'print if /[ąę]/' bad                       > /dev/null
2$ PERL_UNICODE=39 perl -lne 'print if /[ąę]/' bad       > /dev/null
3$ cat bad | perl -lne 'print if /[ąę]/'                 > /dev/null
4$ cat bad | PERL_UNICODE=39 perl -lne 'print if /[ąę]/' > /dev/null
Malformed UTF-8 character (fatal) at -e line 1, <> line 1.

5$ perl -lne 'print if /ą/' bad                          > /dev/null
6$ PERL_UNICODE=39 perl -lne 'print if /ą/' bad          > /dev/null
7$ cat bad | perl -lne 'print if /ą/'                    > /dev/null
8$ cat bad | PERL_UNICODE=39 perl -lne 'print if /ą/'    > /dev/null

There are at least two anomalies here.

Invocation 4 properly fails.  (PERL_UNICODE=39 is equivalent to supplying
the -CAS flag to Perl.)  But invocation 8 is identical, except that the
pattern is /ą/ instead of /[ąę]/; why doesn't this fail as well?
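
For reference, the value 39 can be checked against the bit values that
perlrun documents for the -C switch. The following is only a small shell
sketch of that arithmetic; the variable names are illustrative and are not
part of the original report.

```shell
# Bit values of the -C / PERL_UNICODE letters, as listed in perlrun.
I=1                  # STDIN is assumed to be UTF-8
O=2                  # STDOUT will be UTF-8
E=4                  # STDERR will be UTF-8
S=$(( I + O + E ))   # S is shorthand for all three standard streams
A=32                 # @ARGV elements are expected to be UTF-8 strings

# -CAS is the sum of the A and S values.
total=$(( A + S ))
echo "PERL_UNICODE for -CAS: $total"
```

Per perlrun, A applies to the @ARGV elements themselves rather than to the
contents of files named there, and the default input layer is governed by a
separate letter, i (value 8), which 39 does not include. Whether that bears
on the behavior reported here is left open.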

Invocation 2 is identical to invocation 4, except that the data comes from
a file named in ARGV rather than being delivered on stdin.  Why doesn't it
fail as well?  (The data itself is identical, as confirmed by
cat bad | cmp - bad.)

The complete message header is also attached (msg-hdr.txt), and the
examples above all behave the same when I use it in place of the shorter
excerpt.

This is perl 5, version 22, subversion 1 (v5.22.1) built for
x86_64-linux-gnu-thread-multi
(with 60 registered patches, see the attached output of perl -V for more
detail)

Please cc me on replies, as I do not regularly read this list.

-- 
rjbs
