develooper Front page | perl.perl5.porters | Postings from January 2011

Re: [perl #81454] perl cannot read UTF-16 files with illegal unicode

From: Andrew Pimlott
Date: January 30, 2011 11:05
Subject: Re: [perl #81454] perl cannot read UTF-16 files with illegal unicode
Message ID: 1296410400-sup-9890@pimlott.net

Excerpts from Karl Williamson's message of Sat Jan 22 13:16:45 -0800 2011:
> I'm sorry that you're being exposed to Perl's internal organizational 
> structure.

No prob about that.  I've filed the bug with Encode:
https://rt.cpan.org/Public/Bug/Display.html?id=64788

I'm just suggesting that some coordination with the core perl maintainers
is warranted, since Encode is so closely integrated.

> As an aside, UTF-16 parallels UTF-8, in that if you use Encode with that 
> spelling of the encoding, you get the same strict behavior you do with 
> UTF-16.

I don't think that's accurate--see the original example I posted:

    binmode(STDIN, ':encoding(UTF-8)');
    while (<STDIN>) { }

Input is EF B7 93, which decodes to U+FDD3, a noncharacter.  There is no
diagnostic.  (Unless this has changed in a recent dev version.)
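
To make the byte arithmetic concrete, here is a minimal sketch that decodes
the three bytes by hand and checks the result against the Unicode definition
of a noncharacter.  The helper names decode_3byte_utf8 and is_noncharacter
are made up for this illustration; they are not part of perl or Encode, and
the decoder does no validation at all.

```perl
use strict;
use warnings;

# Decode one three-byte UTF-8 sequence by hand (no validation; illustration only).
sub decode_3byte_utf8 {
    my @b = unpack 'C3', $_[0];
    return (($b[0] & 0x0F) << 12) | (($b[1] & 0x3F) << 6) | ($b[2] & 0x3F);
}

# Unicode noncharacters: U+FDD0..U+FDEF, plus the last two code points
# of every plane (U+nFFFE and U+nFFFF).
sub is_noncharacter {
    my ($cp) = @_;
    return 1 if $cp >= 0xFDD0 && $cp <= 0xFDEF;
    return 1 if $cp <= 0x10FFFF && ($cp & 0xFFFE) == 0xFFFE;
    return 0;
}

my $cp = decode_3byte_utf8("\xEF\xB7\x93");
printf "U+%04X noncharacter=%d\n", $cp, is_noncharacter($cp);
# prints "U+FDD3 noncharacter=1"
```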

> The default behavior can't be just a warning when a server is facing the 
> wide-world of hackers.

That's a fair point, but consider the other side:  I write code that is
correct according to Unicode and my application's semantics.  One day,
my application fails because Encode surprisingly considers some valid
input illegal.  It's a judgement call, and I don't think the best default
policy is obvious.  A warning is a reasonable compromise, IMO.
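
As a rough sketch of what a caller-chosen policy looks like today, Encode's
decode() already takes a CHECK argument that selects between dying and
warning on input it considers malformed.  The wrapper decode_with_policy and
its 'die'/'warn' policy names are hypothetical, invented for this example.
Whether a given noncharacter is accepted or rejected depends on the encoding
and the Encode version, so no outcome is claimed for the last two lines.

```perl
use strict;
use warnings;
use Encode ();

# Decode $bytes as $encoding under a caller-chosen policy.
# 'die' and 'warn' are hypothetical policy names used only in this sketch.
sub decode_with_policy {
    my ($encoding, $bytes, $policy) = @_;
    my $check = ($policy eq 'die' ? Encode::FB_CROAK() : Encode::FB_WARN())
              | Encode::LEAVE_SRC();
    my $text = eval { Encode::decode($encoding, $bytes, $check) };
    return defined $text ? (1, $text) : (0, $@);
}

# Outcome for a noncharacter is version-dependent, so it is only reported.
my ($ok, $result) = decode_with_policy('UTF-8', "\xEF\xB7\x93", 'die');
print $ok ? "accepted\n" : "rejected: $result";
```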

Even better would be to document the situation clearly:

    As a security precaution, the following encodings consider Unicode
    "noncharacters" to be malformed.  If you want to decode Unicode
    noncharacters, ...

> > Interesting point about the security implication.  But U+FEFF could as
> > well be used maliciously, but it is accepted.  And an attack might be
> > routed through UTF-8 or another encoding, so by the same reasoning it
> > should be an error there.  (It might even be routed through a perl
> > program that constructs strings with chr(), so chr(0xFDD0) should be an
> > error.)
> 
> I'm not sure if you were being ironic here, as it is self-contradictory; 
> so I have to assume you were being straight.  You said in an earlier 
> post that the non-characters are legal internally.  So chr() has to 
> accept them.

I'd put it differently:  perl and Encode have no idea what data is
"internal".  Input being read from a filehandle might well be internal,
e.g., from a file the application itself produced.  An argument to chr()
might well be external, e.g., a number supplied by a potentially malicious
agent.  Saying Encode::decode() should be more strict than chr() is
merely a rough heuristic.

By the way, to be clear, chr(0xFDD0) does throw a warning, so someone
already decided that this should be checked here.  I say, the best thing
for the programmer is to be consistent:  Have one way to say whether
noncharacters should be errors, warnings, or non-issues, and honor it
everywhere.  The "no warnings 'utf8'" pragma would be the natural way to
do this.
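
A minimal sketch of how that pragma scopes the check, assuming a helper
named warnings_from (made up for this example) that collects whatever a
block of code warns about.  Whether chr() itself warns for a noncharacter
varies between perl releases, so only the silenced case is certain; the
first count is printed rather than assumed.

```perl
use strict;
use warnings;

# Run a code ref and return the warnings it emitted (helper name is illustrative).
sub warnings_from {
    my ($code) = @_;
    my @seen;
    local $SIG{__WARN__} = sub { push @seen, $_[0] };
    $code->();
    return @seen;
}

my $cp = 0xFDD0;    # a noncharacter, held in a variable so it is evaluated at run time

my @checked = warnings_from(sub {
    use warnings 'utf8';
    my $c = chr($cp);    # may or may not warn, depending on the perl version
});

my @quiet = warnings_from(sub {
    no warnings 'utf8';
    my $c = chr($cp);    # utf8-category warnings are disabled in this scope
});

printf "with utf8 warnings: %d, without: %d\n", scalar @checked, scalar @quiet;
```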

> I don't see how U+FEFF or encoding them in UTF-8 could be 
> security implications.

Oops, I got that one wrong.  But according to my understanding, U+FFFE
is the only character with this security implication.  So this rationale
should not be used to brand other noncharacters "illegal".
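
For reference, the two code points are easy to mix up because U+FFFE is what
the byte order mark U+FEFF turns into when its two bytes are read in the
wrong order.  The sketch below only illustrates that relationship with
unpack; it does not use Encode and says nothing about how Encode handles
either code point.

```perl
use strict;
use warnings;

my $bom_bytes = "\xFE\xFF";            # U+FEFF serialized as UTF-16BE
my $as_be = unpack 'n', $bom_bytes;    # 16-bit big-endian read
my $as_le = unpack 'v', $bom_bytes;    # 16-bit little-endian read

printf "BE: U+%04X, LE: U+%04X\n", $as_be, $as_le;
# prints "BE: U+FEFF, LE: U+FFFE"
```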

> Since Encode doesn't know anything about the context, it has to assume 
> the worst case.  To do otherwise is to leave users unknowingly open to 
> attack.

I appreciate your caution.  But it's a judgement call as to how paranoid
you should be, when you may cause unexpected errors in valid programs.
"has to assume the worst case" is an extreme philosophy.

> There was no biasing involved.  I've been trying to root out all 
> instances of calling these "illegal" in the 5.13.X series.

Great--clear diagnostics and documentation will make these issues less
trouble for everyone.

Andrew
