BOMs as noncharacters

Tom Christiansen
August 14, 2011 17:29
Is this correct behavior for loose and strict UTF-8 respectively?

    % perl -CS -Wle 'print chr(0xFFFE)' | perl -ne 'BEGIN { binmode(STDIN, ":encoding(UTF-8)") || die; } printf "%v04X\n", $_'
    Unicode non-character U+FFFE is illegal for open interchange at -e line 1.
    utf8 "\xFFFE" does not map to Unicode.

    % perl -CS -Xle 'print chr(0xFFFE)' | perl -ne 'BEGIN { binmode(STDIN, ":encoding(UTF-8)") || die; } printf "%v04X\n", $_'
    utf8 "\xFFFE" does not map to Unicode.

    % perl -CS -Xle 'print chr(0xFFFE)' | perl -ne 'use warnings FATAL => "utf8"; BEGIN { binmode(STDIN, ":encoding(UTF-8)") || die; } print'
    utf8 "\xFFFE" does not map to Unicode.

Why am I getting a warning instead of an exception, no matter how hard I try?
Do I misunderstand strict UTF-8?  I thought U+FFFE was a noncharacter guaranteed
not to occur in a conformant UTF-8 stream.  It is coming back \x{} escaped.
Do I have to somehow set a warn handler beyond what I've done there?

Also, does this mean that otherwise apparently UTF-8 files with a U+FFFE
at the start of them are not in fact conformant UTF-8 files?  Or is this
a utility program trying to handle mangled text?  Or are both true: the file
isn't UTF-8, and the utility program is handling mangled text?
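
For reference when comparing byte dumps, here is a minimal sketch that prints the UTF-8 bytes of the two code points that are easy to mix up here. The helper name `utf8_hex` is made up for illustration. It relies on `utf8::encode`, Perl's lax in-place encoder, so it says nothing about what a strict decoder accepts. The signature usually found at the start of "UTF-8 with BOM" files is EF BB BF, which is U+FEFF; U+FFFE, the code point printed by the one-liners above, encodes to different bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: return the UTF-8 bytes of a code point as
# space-separated hex. utf8::encode() is the lax encoder, so it does
# not reject noncharacters.
sub utf8_hex {
    my ($cp) = @_;
    my $s = chr($cp);
    utf8::encode($s);    # in place: characters become UTF-8 bytes
    return join ' ', map { sprintf '%02X', ord } split //, $s;
}

print 'U+FEFF: ', utf8_hex(0xFEFF), "\n";    # EF BB BF
print 'U+FFFE: ', utf8_hex(0xFFFE), "\n";    # EF BF BE
```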

From the Unicode Standard version 6.0.0, Chapter 3, conformance:

     C2 A process shall not interpret a noncharacter code point as an abstract character.
       · The noncharacter code points may be used internally, such as for sentinel val-
         ues or delimiters, but should not be exchanged.

"Should not".  What happens if they are?  Does that disqualify the stream from being in a valid
Unicode character encoding form like UTF-8?  And yet, UTF-16 and UTF-32 have to
have BOMs, at least at the fronts.  That can't disqualify them.  Does it disqualify
UTF-8?  Why or why not?


    C10 When a process interprets a code unit sequence which purports to be in a Unicode char-
         acter encoding form, it shall treat ill-formed code unit sequences as an error condition
         and shall not interpret such sequences as characters.

       · For example, in UTF-8 every code unit of the form 110xxxxx₂ must be followed
         by a code unit of the form 10xxxxxx₂. A sequence such as 110xxxxx₂ 0xxxxxxx₂
         is ill-formed and must never be generated. When faced with this ill-formed
         code unit sequence while transforming or interpreting text, a conformant pro-
         cess must treat the first code unit 110xxxxx₂ as an illegally terminated code unit
         sequence--for example, by signaling an error, filtering the code unit out, or
         representing the code unit with a marker such as U+FFFD replacement
         character.

       · Conformant processes cannot interpret ill-formed code unit sequences. How-
         ever, the conformance clauses do not prevent processes from operating on code
         unit sequences that do not purport to be in a Unicode character encoding form.
         For example, for performance reasons a low-level string operation may simply
         operate directly on code units, without interpreting them as characters. See,
         especially, the discussion under D89.

       · Utility programs are not prevented from operating on "mangled" text. For
         example, a UTF-8 file could have had CRLF sequences introduced at every 80
         bytes by a bad mailer program. This could result in some UTF-8 byte sequences
         being interrupted by CRLFs, producing illegal byte sequences. This mangled
         text is no longer UTF-8. It is permissible for a conformant program to repair
         such text, recognizing that the mangled text was originally well-formed UTF-8
         byte sequences. However, such repair of mangled data is a special case, and it
         must not be used in circumstances where it would cause security problems.
         There are important security issues associated with encoding conversion, espe-
         cially with the conversion of malformed text. For more information, see Uni-
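
The two-byte example quoted under C10 can be sketched in a few lines. This is only an illustration of that one rule, not a UTF-8 validator, and `bad_two_byte_leads` is a made-up name: it reports the offset of every 110xxxxx code unit that is not followed by a 10xxxxxx code unit. What a process then does with such an offset (raise an error, drop the byte, substitute U+FFFD) is left open, as in the quoted text.

```perl
use strict;
use warnings;

# Hypothetical helper: byte offsets of each 110xxxxx lead byte that is
# not followed by a 10xxxxxx continuation byte. Covers only the two-byte
# case from the quoted example; longer sequences, overlongs, and
# surrogates are not checked.
sub bad_two_byte_leads {
    my ($bytes) = @_;
    my @o = unpack 'C*', $bytes;
    my @bad;
    for my $i (0 .. $#o) {
        next unless ($o[$i] & 0xE0) == 0xC0;                 # 110xxxxx
        my $next = $o[$i + 1];                               # undef at end of input
        push @bad, $i
            unless defined($next) && ($next & 0xC0) == 0x80; # 10xxxxxx
    }
    return @bad;
}

my @ok  = bad_two_byte_leads("\xC3\xA9");    # lead byte plus continuation byte
my @bad = bad_two_byte_leads("\xC3\x41");    # 110xxxxx followed by 0xxxxxxx
printf "first input:  %d problem(s)\n", scalar @ok;
printf "second input: %d problem(s), first at offset %d\n", scalar @bad, $bad[0];
```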


     D14 Noncharacter: A code point that is permanently reserved for internal use and that
           should never be interchanged. Noncharacters consist of the values U+nFFFE and
           U+nFFFF (where n is from 0 to 10₁₆) and the values U+FDD0..U+FDEF.
         · For more information, see Section 16.7, Noncharacters.
         · These code points are permanently reserved as noncharacters.
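
The ranges in D14 are small enough to check directly. The following is a sketch under the assumption that the quoted definition is the whole story; `is_noncharacter` is a made-up name, not a core Perl function. Note that U+FEFF does not fall in any of the D14 ranges, while U+FFFE does.

```perl
use strict;
use warnings;

# Hypothetical helper implementing the D14 ranges as quoted:
# U+nFFFE and U+nFFFF for n in 0..0x10, plus U+FDD0..U+FDEF.
sub is_noncharacter {
    my ($cp) = @_;
    return 0 if $cp < 0 || $cp > 0x10FFFF;          # outside the code space
    return 1 if $cp >= 0xFDD0 && $cp <= 0xFDEF;     # the contiguous block
    return (($cp & 0xFFFE) == 0xFFFE) ? 1 : 0;      # last two of each plane
}

for my $cp (0xFFFE, 0xFFFF, 0xFDD0, 0x10FFFF, 0xFEFF, 0xFFFD) {
    printf "U+%04X %s\n", $cp,
        is_noncharacter($cp) ? 'noncharacter' : 'not a noncharacter';
}
```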

     D15 Reserved code point: Any code point of the Unicode Standard that is reserved for
           future assignment. Also known as an unassigned code point.
         · Surrogate code points and noncharacters are considered assigned code points,
           but not assigned characters.
         · For a summary classification of reserved and other types of code points, see
           Table 2-3.

    In general, a conforming process may indicate the presence of a code point whose use has
    not been designated (for example, by showing a missing glyph in rendering or by signaling
    an appropriate error in a streaming protocol), even though it is forbidden by the standard
    from interpreting that code point as an abstract character.

I could swear D14 says that all these files with dummy UTF-8 BOMs that
"purport to be in a Unicode character encoding form" are nonconformant
(read: they're lying).

I know that Unicode says UTF-8 BOMs "are neither required nor recommended".

But doesn't Chapter 3 go further than that, saying that UTF-8 BOMs must not
occur in conformant UTF-8 streams, and that a conformant process must
not interpret them as abstract characters, since they are by definition
noncharacters?  But they aren't "malformed sequences" either.

Obviously, I'm very confused.  
