perl.perl5.porters | Postings from August 2011

Re: BOMs as noncharacters

Karl Williamson
August 17, 2011 13:10
On 08/14/2011 06:01 PM, Tom Christiansen wrote:
> Is this correct behavior for loose and strict UTF-8 respectively?
>      % perl -CS -Wle 'print chr(0xFFFE)' | perl -ne 'BEGIN { binmode(STDIN, ":encoding(UTF-8)") || die; } printf "%v04X\n", $_'
>      Unicode non-character U+FFFE is illegal for open interchange at -e line 1.
>      utf8 "\xFFFE" does not map to Unicode.
>      005C.0078.007B.0046.0046.0046.0045.007D.000A
>      % perl -CS -Xle 'print chr(0xFFFE)' | perl -ne 'BEGIN { binmode(STDIN, ":encoding(UTF-8)") || die; } printf "%v04X\n", $_'
>      utf8 "\xFFFE" does not map to Unicode.
>      005C.0078.007B.0046.0046.0046.0045.007D.000A
>      % perl -CS -Xle 'print chr(0xFFFE)' | perl -ne 'use warnings FATAL =>  "utf8"; BEGIN { binmode(STDIN, ":encoding(UTF-8)") || die; } print'
>      utf8 "\xFFFE" does not map to Unicode.
>      \x{FFFE}
> Why am I getting a warning instead of an exception, no matter how hard I try?
> Do I misunderstand strict UTF-8?  I thought U+FFFE was a noncharacter guaranteed
> not to occur in a conformant UTF-8 stream.  It is coming back \x{} escaped.
> Do I have to somehow set a warn handler beyond what I've done there?

U+FFFE should not occur in a conformant stream, and it is my understanding
that UTF-8 as an encoding should enforce strict conformance in Perl, and
forbid it.  I don't know where the code that does the check lives; please
file a bug report.  Further, some people would say that this is a
security hole; see below.
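
For comparison, here is a minimal sketch of getting an exception rather than a warning by decoding explicitly with Encode and FB_CROAK instead of going through the :encoding layer. The PerlIO::encoding documentation describes a $PerlIO::encoding::fallback variable that governs what the layer does on bad input, which is a likely reason the commands above warn and escape instead of dying. The helper name decode_strict is hypothetical, and the bad input here is a lone 0xFF byte, which is malformed in any UTF-8 variant; whether a given Encode release also rejects U+FFFE is exactly the open question, so the sketch does not assume it.

```perl
use strict;
use warnings;
use Encode ();

# Decode a byte string as strict UTF-8, dying on the first malformed sequence.
# decode_strict() is a hypothetical helper name, not part of Encode.
sub decode_strict {
    my ($bytes) = @_;
    # FB_CROAK makes Encode die instead of substituting or warning.
    # $bytes is a private copy, so decode() modifying its source is harmless.
    return Encode::decode('UTF-8', $bytes, Encode::FB_CROAK);
}

# Well-formed input: "caf" followed by U+00E9 encoded as C3 A9.
my $ok = eval { decode_strict("caf\xC3\xA9") };
print defined $ok ? "decoded " . length($ok) . " characters\n" : "died: $@";

# Malformed input: 0xFF can never appear in UTF-8.
my $bad = eval { decode_strict("abc\xFFdef") };
print defined $bad ? "decoded\n" : "died: $@";
```

Wrapping the call in eval keeps the failure local, so the caller can decide whether to abort, log, or substitute U+FFFD.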

> Also, does this mean that otherwise apparently UTF-8 files with a U+FFFE
> at the start of them are not in fact conformant UTF-8 files?  Or is this
> a utility program trying to handle mangled text?  Or are both true, that
> it isn't UTF-8 and also the utility program thing?
> From the Unicode Standard, version 6.0.0, Chapter 3, Conformance:
>       C2 A process shall not interpret a noncharacter code point as an abstract character.
>         · The noncharacter code points may be used internally, such as for sentinel
>           values or delimiters, but should not be exchanged.
> "Should not".  What happens if they are?   Does that disqualify from being a valid
> Unicode character encoding form like UTF-8?  And yet, UTF-16 and UTF-32 have to
> have BOMs, at least at the fronts.  That can't disqualify them.  Does it disqualify
> UTF-8?  Why or why not?
> [...]
>      C10 When a process interprets a code unit sequence which purports to be in a Unicode
>           character encoding form, it shall treat ill-formed code unit sequences as an error condition
>           and shall not interpret such sequences as characters.
>         · For example, in UTF-8 every code unit of the form 110xxxxx₂ must be followed
>           by a code unit of the form 10xxxxxx₂. A sequence such as 110xxxxx₂ 0xxxxxxx₂
>           is ill-formed and must never be generated. When faced with this ill-formed
>           code unit sequence while transforming or interpreting text, a conformant
>           process must treat the first code unit 110xxxxx₂ as an illegally terminated code unit
>           sequence--for example, by signaling an error, filtering the code unit out, or
>           representing the code unit with a marker such as U+FFFD replacement
>           character.
>         · Conformant processes cannot interpret ill-formed code unit sequences.
>           However, the conformance clauses do not prevent processes from operating on code
>           unit sequences that do not purport to be in a Unicode character encoding form.
>           For example, for performance reasons a low-level string operation may simply
>           operate directly on code units, without interpreting them as characters. See,
>           especially, the discussion under D89.
>         · Utility programs are not prevented from operating on "mangled" text. For
>           example, a UTF-8 file could have had CRLF sequences introduced at every 80
>           bytes by a bad mailer program. This could result in some UTF-8 byte sequences
>           being interrupted by CRLFs, producing illegal byte sequences. This mangled
>           text is no longer UTF-8. It is permissible for a conformant program to repair
>           such text, recognizing that the mangled text was originally well-formed UTF-8
>           byte sequences. However, such repair of mangled data is a special case, and it
>           must not be used in circumstances where it would cause security problems.
>           There are important security issues associated with encoding conversion,
>           especially with the conversion of malformed text. For more information, see Uni-
> [...]
>       D14 Noncharacter: A code point that is permanently reserved for internal use and that
>             should never be interchanged. Noncharacters consist of the values U+nFFFE and
>             U+nFFFF (where n is from 0 to 10₁₆) and the values U+FDD0..U+FDEF.
>           · For more information, see Section 16.7, Noncharacters.
>           · These code points are permanently reserved as noncharacters.
>       D15 Reserved code point: Any code point of the Unicode Standard that is reserved for
>             future assignment. Also known as an unassigned code point.
>           · Surrogate code points and noncharacters are considered assigned code points,
>             but not assigned characters.
>           · For a summary classification of reserved and other types of code points, see
>             Table 2-3.
>      In general, a conforming process may indicate the presence of a code point whose use has
>      not been designated (for example, by showing a missing glyph in rendering or by signaling
>      an appropriate error in a streaming protocol), even though it is forbidden by the standard
>      from interpreting that code point as an abstract character.
> I could swear D14 says that all these files with dummy UTF-8 BOMs that
> "purport to be in a Unicode character encoding form" are
> nonconformant (read: they're lying).
> I know that Unicode says UTF-8 BOMs "are neither required nor recommended".
> But doesn't Chapter 3 go further than that, saying that UTF-8 BOMs must not
> occur in conformant UTF-8 streams, and that a conformant process must
> not interpret them as abstract characters, since they are by definition
> noncharacters?  But they aren't "malformed sequences" either.
> Obviously, I'm very confused.
> --tom

I'm wondering if you are confusing U+FFFE, a noncharacter code point
that is invalid in open interchange, with U+FEFF, the BYTE ORDER MARK,
which is used in UTF-16 and UTF-32 to give the endianness of the stream.
Unicode now discourages BOMs, but they are not forbidden, and there is
no real use for them in UTF-8, as that encoding does not have endianness.
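
To make the difference concrete, here is a small sketch that prints the UTF-8 bytes of each code point. The helper name utf8_hex is hypothetical. It uses utf8::encode, Perl's lax internal encoder, precisely because that one does not reject noncharacters; it says nothing about what the strict "UTF-8" encoding should accept.

```perl
use strict;
use warnings;

# Return the UTF-8 bytes of a code point as an uppercase hex string.
# utf8_hex() is a hypothetical helper for illustration only.
sub utf8_hex {
    my ($cp) = @_;
    my $s = chr($cp);
    utf8::encode($s);    # lax encoder: turns the character into its UTF-8 bytes
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $s;
}

printf "U+FEFF (BOM)          -> %s\n", utf8_hex(0xFEFF);   # EF BB BF
printf "U+FFFE (noncharacter) -> %s\n", utf8_hex(0xFFFE);   # EF BF BE
```

The familiar three-byte "UTF-8 BOM" EF BB BF is U+FEFF; a file that starts with EF BF BE starts with the noncharacter instead.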

Putting a BOM in a UTF-8 stream does not make the stream illegal.
However, putting U+FFFE in such a stream does make the stream illegal
(for open interchange; it's fine for a set of cooperating processes
that are expecting it).
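
The D14 definition quoted above translates directly into a code point test. This is a sketch under that definition only; is_noncharacter is a hypothetical helper name, not a function Perl provides.

```perl
use strict;
use warnings;

# True if $cp is a noncharacter per D14: U+FDD0..U+FDEF, or the last two
# code points (xxFFFE, xxFFFF) of any of the 17 planes.
# is_noncharacter() is a hypothetical helper for illustration only.
sub is_noncharacter {
    my ($cp) = @_;
    return 0 if $cp < 0 || $cp > 0x10FFFF;         # outside the Unicode code space
    return 1 if $cp >= 0xFDD0 && $cp <= 0xFDEF;    # the contiguous block of 32
    return ( ( $cp & 0xFFFE ) == 0xFFFE ) ? 1 : 0; # low 16 bits are FFFE or FFFF
}

for my $cp ( 0xFEFF, 0xFFFE, 0xFFFF, 0xFDD0, 0x1FFFE, 0x10FFFF ) {
    printf "U+%04X %s\n", $cp,
        is_noncharacter($cp) ? "noncharacter" : "not a noncharacter";
}
```

U+FEFF is the only value in that list that is not a noncharacter, which is why a BOM is legal in interchange and U+FFFE is not.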

The reason our accepting U+FFFE is a potential security hole is that it
has the same bit pattern as the BOM U+FEFF under the opposite endianness.
Malware could send it to an application that is expecting a BOM, with
the consequence that the application thinks the endianness is the
opposite of what it really is, and so the malware could inject something
that would normally be illegal input.  UTF-8 does not have endianness,
but I imagine there are some games that could be played, so that it is
important to reject it here too.  I have not tried to figure out how
this all could happen, but people do say it could.
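
The byte-swap relationship is easy to see with pack and unpack. The sketch below also includes a naive BOM sniffer, sniff_utf16_byte_order, which is a hypothetical helper written only to show how a stream that begins with U+FFFE could be mistaken for one that begins with a BOM in the other byte order. It is not a description of how any particular application behaves.

```perl
use strict;
use warnings;

# Guess UTF-16 byte order from the first two bytes of a stream.
# sniff_utf16_byte_order() is a hypothetical helper for illustration only.
sub sniff_utf16_byte_order {
    my ($bytes) = @_;
    return if length($bytes) < 2;
    my $head = substr $bytes, 0, 2;
    return 'BE' if $head eq "\xFE\xFF";    # U+FEFF written big-endian
    return 'LE' if $head eq "\xFF\xFE";    # U+FEFF written little-endian
    return;                                # no BOM found
}

# The same two bytes read with opposite byte orders.
my $bom_be = pack 'n', 0xFEFF;                                  # bytes FE FF
printf "read big-endian:    U+%04X\n", unpack 'n', $bom_be;     # U+FEFF
printf "read little-endian: U+%04X\n", unpack 'v', $bom_be;     # U+FFFE

# A big-endian stream that starts with the noncharacter U+FFFE, then "A".
my $stream = pack 'n*', 0xFFFE, 0x0041;
printf "sniffed byte order: %s\n",
    sniff_utf16_byte_order($stream) // 'unknown';               # LE
```

A decoder that trusts the sniffed result would read the rest of that big-endian stream as little-endian, so every following code unit comes out byte-swapped.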

So this is a bug.  We should be throwing an exception, or changing it 
into a REPLACEMENT CHARACTER.  Filtering it out is not considered a good 
