Re: [perl #81454] perl cannot read UTF-16 files with illegal unicode
From: Andrew Pimlott
Date: January 30, 2011 11:05
Subject: Re: [perl #81454] perl cannot read UTF-16 files with illegal unicode
Message ID: 1296410400-sup-9890@pimlott.net
Excerpts from Karl Williamson's message of Sat Jan 22 13:16:45 -0800 2011:
> I'm sorry that you're being exposed to Perl's internal organizational
> structure.
No prob about that. I've filed the bug with Encode:
https://rt.cpan.org/Public/Bug/Display.html?id=64788
I'm just suggesting that some coordination with the core perl maintainers
is warranted, since Encode is so closely integrated.
> As an aside, UTF-16 parallels UTF-8, in that if you use Encode with that
> spelling of the encoding, you get the same strict behavior you do with
> UTF-16.
I don't think that's accurate--see the original example I posted:
    binmode(STDIN, ':encoding(UTF-8)');
    while (<STDIN>) { }
Input is EF B7 93, which decodes to U+FDD3, a noncharacter. There is no
diagnostic. (Unless this is changed in a recent dev version.)
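
The decoding of those three bytes can be checked by hand. The following is a minimal sketch, not code from perl or Encode: `decode_3byte` and `is_noncharacter` are hypothetical helper names, and the noncharacter test assumes the Unicode definition (U+FDD0..U+FDEF plus the last two code points of every plane).

```perl
use strict;
use warnings;

# Hypothetical helper: decode one well-formed 3-byte UTF-8 sequence by hand.
# Bit layout: 1110xxxx 10xxxxxx 10xxxxxx
sub decode_3byte {
    my ($b1, $b2, $b3) = @_;
    return (($b1 & 0x0F) << 12) | (($b2 & 0x3F) << 6) | ($b3 & 0x3F);
}

# Hypothetical helper: true for the 66 Unicode noncharacters.
sub is_noncharacter {
    my ($cp) = @_;
    return 0 if $cp > 0x10FFFF;                  # beyond Unicode entirely
    return 1 if $cp >= 0xFDD0 && $cp <= 0xFDEF;  # the contiguous block
    return (($cp & 0xFFFE) == 0xFFFE) ? 1 : 0;   # xxFFFE / xxFFFF in any plane
}

my $cp = decode_3byte(0xEF, 0xB7, 0x93);
printf "U+%04X noncharacter=%d\n", $cp, is_noncharacter($cp);
# prints "U+FDD3 noncharacter=1"
```

This only confirms the arithmetic; whether a given perl or Encode version warns on this input is a separate question.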
> The default behavior can't be just a warning when a server is facing the
> wide-world of hackers.
That's a fair point, but consider the other side: I write code that is
correct according to Unicode and my application's semantics. One day,
my application fails because Encode surprisingly considers some valid
input illegal. It's a judgement call, and I don't think the best default
policy is obvious. A warning is a reasonable compromise, IMO.
Even better would be to document the situation clearly:
    As a security precaution, the following encodings consider Unicode
    "noncharacters" to be malformed. If you want to decode Unicode
    noncharacters, ...
> > Interesting point about the security implication. But U+FEFF could as
> > well be used maliciously, but it is accepted. And an attack might be
> > routed through UTF-8 or another encoding, so by the same reasoning it
> > should be an error there. (It might even be routed through a perl
> > program that constructs strings with chr(), so chr(0xFDD0) should be an
> > error.)
>
> I'm not sure if you were being ironic here, as it is self-contradictory;
> so I have to assume you were being straight. You said in an earlier
> post that the non-characters are legal internally. So chr() has to
> accept them.
I'd put it differently: perl and Encode have no idea what data is
"internal". Input being read from a filehandle might well be internal,
e.g. from a file the application itself produced. An argument to chr()
might well be external, e.g. a number supplied by a potentially malicious
agent. Saying Encode::decode() should be more strict than chr() is
merely a rough heuristic.
By the way, to be clear, chr(0xFDD0) does throw a warning, so someone
already decided that this should be checked here. I say, the best thing
for the programmer is to be consistent: Have one way to say whether
noncharacters should be errors, warnings, or non-issues, and honor it
everywhere. The "no warnings 'utf8'" pragma would be the natural way to
do this.
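
As a rough illustration of a lexically scoped switch, the sketch below performs the same operation twice, once with warnings enabled and once under "no warnings 'utf8'", and counts the warnings each run emits. The helper names (`warnings_from`, `write_checked`, `write_unchecked`) are hypothetical, and writing to an in-memory `:utf8` handle is just one convenient code path. Which operations warn about noncharacters differs between perl versions, so the count for the checked run is deliberately left unstated.

```perl
use strict;
use warnings;

# Hypothetical helper: run a code ref and collect any warnings it emits.
sub warnings_from {
    my ($code) = @_;
    my @seen;
    local $SIG{__WARN__} = sub { push @seen, $_[0] };
    $code->();
    return @seen;
}

# Write a noncharacter to an in-memory :utf8 handle with warnings enabled.
sub write_checked {
    open my $out, '>:utf8', \my $buf or die "open: $!";
    print {$out} chr(0xFDD0);
    close $out;
    return;
}

# Same operation, with the 'utf8' warning category switched off lexically.
sub write_unchecked {
    no warnings 'utf8';
    open my $out, '>:utf8', \my $buf or die "open: $!";
    print {$out} chr(0xFDD0);
    close $out;
    return;
}

my @checked   = warnings_from(\&write_checked);
my @unchecked = warnings_from(\&write_unchecked);
printf "with warnings: %d, under no warnings 'utf8': %d\n",
    scalar @checked, scalar @unchecked;
```

The pragma only governs code that consults lexical warnings; a decoding layer that applies its own strictness policy is not necessarily affected by it, which is the inconsistency at issue.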
> I don't see how U+FEFF or encoding them in UTF-8 could be
> security implications.
Oops, I got that one wrong. But according to my understanding, U+FFFE
is the only character with this security implication. So this rationale
should not be used to brand other noncharacters "illegal".
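
As background on the mix-up between the two code points: U+FEFF is the byte order mark, while U+FFFE is the noncharacter that results when the BOM's two bytes are read in the wrong order. The thread does not spell out the exact security concern, so this sketch only shows how the two values relate; the variable names are illustrative.

```perl
use strict;
use warnings;

my $bom_be = "\xFE\xFF";    # U+FEFF serialized as UTF-16BE

my $as_big_endian    = unpack 'n', $bom_be;   # 16-bit big-endian read
my $as_little_endian = unpack 'v', $bom_be;   # same bytes, opposite byte order

printf "BE read: U+%04X, LE read: U+%04X\n",
    $as_big_endian, $as_little_endian;
# prints "BE read: U+FEFF, LE read: U+FFFE"
```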
> Since Encode doesn't know anything about the context, it has to assume
> the worst case. To do otherwise is to leave users unknowingly open to
> attack.
I appreciate your caution. But it's a judgement call as to how paranoid
you should be, when you may cause unexpected errors in valid programs.
"has to assume the worst case" is an extreme philosophy.
> There was no biasing involved. I've been trying to root out all
> instances of calling these "illegal" in the 5.13.X series.
Great--clear diagnostics and documentation will make these issues less
trouble for everyone.
Andrew