
BOMs as noncharacters

From: Tom Christiansen
Date: August 14, 2011 17:29
Subject: BOMs as noncharacters
Message ID: 22810.1313366471@chthon

Is this correct behavior for loose and strict UTF-8 respectively?

    % perl -CS -Wle 'print chr(0xFFFE)' | perl -ne 'BEGIN { binmode(STDIN, ":encoding(UTF-8)") || die; } printf "%v04X\n", $_'
    Unicode non-character U+FFFE is illegal for open interchange at -e line 1.
    utf8 "\xFFFE" does not map to Unicode.
    005C.0078.007B.0046.0046.0046.0045.007D.000A

    % perl -CS -Xle 'print chr(0xFFFE)' | perl -ne 'BEGIN { binmode(STDIN, ":encoding(UTF-8)") || die; } printf "%v04X\n", $_'
    utf8 "\xFFFE" does not map to Unicode.
    005C.0078.007B.0046.0046.0046.0045.007D.000A

    % perl -CS -Xle 'print chr(0xFFFE)' | perl -ne 'use warnings FATAL => "utf8"; BEGIN { binmode(STDIN, ":encoding(UTF-8)") || die; } print'
    utf8 "\xFFFE" does not map to Unicode.
    \x{FFFE}


Why am I getting a warning instead of an exception, no matter how hard I try?
Do I misunderstand strict UTF-8?  I thought U+FFFE was a noncharacter guaranteed
not to occur in a conformant UTF-8 stream.  It is coming back \x{} escaped.
Do I have to somehow set a warn handler beyond what I've done there?
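
One way to get an exception rather than a warning, assuming it is acceptable
to decode explicitly with Encode instead of going through the :encoding
layer, is to pass FB_CROAK as the CHECK argument. This is only a sketch:
decode_strict_or_die is a made-up helper name, and whether U+FFFE itself is
rejected depends on the installed Encode version, so no output is claimed
for that case.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode octets as strict UTF-8 and die on anything
# Encode rejects, instead of substituting a \x{...} escape and warning.
sub decode_strict_or_die {
    my ($octets) = @_;
    # LEAVE_SRC keeps decode() from modifying the caller's buffer.
    return decode("UTF-8", $octets,
                  Encode::FB_CROAK() | Encode::LEAVE_SRC());
}

# Well-formed text, an ill-formed sequence (lead byte followed by ASCII),
# and the three octets that encode U+FFFE.
for my $octets ("caf\xC3\xA9", "\xC3\x28", "\xEF\xBF\xBE") {
    my $text = eval { decode_strict_or_die($octets) };
    my $hex  = join " ", map { sprintf("%02X", ord($_)) } split(//, $octets);
    printf "%-16s %s\n", $hex, defined $text ? "decoded" : "died";
}
```

Wrapping the call in eval, as the loop does, lets the caller decide what to
do with a rejected stream.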

Also, does this mean that otherwise apparently UTF-8 files with a U+FFFE 
at the start of them are not in fact conformant UTF-8 files?  Or is this
a utility program trying to handle mangled text?  Or are both true, that
it isn't UTF-8 and also the utility program thing?

From the Unicode Standard version 6.0.0, Chapter 3, conformance:

     C2 A process shall not interpret a noncharacter code point as an abstract character.
       · The noncharacter code points may be used internally, such as for sentinel val-
         ues or delimiters, but should not be exchanged.

"Should not".  What happens if they are?  Does that disqualify the stream from
being in a valid Unicode character encoding form like UTF-8?  And yet, UTF-16 and
UTF-32 have to have BOMs, at least at the fronts.  That can't disqualify them.
Does it disqualify UTF-8?  Why or why not?

[...]

    C10 When a process interprets a code unit sequence which purports to be in a Unicode char-
         acter encoding form, it shall treat ill-formed code unit sequences as an error condition
         and shall not interpret such sequences as characters.

       · For example, in UTF-8 every code unit of the form 110xxxxx₂ must be followed
         by a code unit of the form 10xxxxxx₂. A sequence such as 110xxxxx₂ 0xxxxxxx₂
         is ill-formed and must never be generated. When faced with this ill-formed
         code unit sequence while transforming or interpreting text, a conformant
         process must treat the first code unit 110xxxxx₂ as an illegally terminated
         code unit sequence--for example, by signaling an error, filtering the code
         unit out, or representing the code unit with a marker such as U+FFFD
         replacement character.

       · Conformant processes cannot interpret ill-formed code unit sequences. How-
         ever, the conformance clauses do not prevent processes from operating on code
         unit sequences that do not purport to be in a Unicode character encoding form.
         For example, for performance reasons a low-level string operation may simply
         operate directly on code units, without interpreting them as characters. See,
         especially, the discussion under D89.

       · Utility programs are not prevented from operating on "mangled" text. For
         example, a UTF-8 file could have had CRLF sequences introduced at every 80
         bytes by a bad mailer program. This could result in some UTF-8 byte sequences
         being interrupted by CRLFs, producing illegal byte sequences. This mangled
         text is no longer UTF-8. It is permissible for a conformant program to repair
         such text, recognizing that the mangled text was originally well-formed UTF-8
         byte sequences. However, such repair of mangled data is a special case, and it
         must not be used in circumstances where it would cause security problems.
         There are important security issues associated with encoding conversion, espe-
         cially with the conversion of malformed text. For more information, see Uni-

[...]
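
Two of the handling options named in the C10 example can be sketched with
Encode's CHECK argument. This assumes explicit decoding with Encode; the
sample octets are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# 0xC3 has the form 110xxxxx and 0x28 has the form 0xxxxxxx, so this is
# the ill-formed pattern described in the C10 example.
my $octets = "a\xC3\x28b";

# Default CHECK: replace the offending code unit with U+FFFD.
my $replaced = decode("UTF-8", $octets);

# FB_PERLQQ: represent the offending code unit as a visible \xHH escape.
my $escaped = decode("UTF-8", $octets,
                     Encode::FB_PERLQQ() | Encode::LEAVE_SRC());

printf "%vX\n", $replaced;   # code points of the decoded string
print  "$escaped\n";
```

The third option, signaling an error, corresponds to passing FB_CROAK.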


     D14 Noncharacter: A code point that is permanently reserved for internal use and that
           should never be interchanged. Noncharacters consist of the values U+nFFFE and
           U+nFFFF (where n is from 0 to 10₁₆) and the values U+FDD0..U+FDEF.
         · For more information, see Section 16.7, Noncharacters.
         · These code points are permanently reserved as noncharacters.

     D15 Reserved code point: Any code point of the Unicode Standard that is reserved for
           future assignment. Also known as an unassigned code point.
         · Surrogate code points and noncharacters are considered assigned code points,
           but not assigned characters.
         · For a summary classification of reserved and other types of code points, see
           Table 2-3.

    In general, a conforming process may indicate the presence of a code point whose use has
    not been designated (for example, by showing a missing glyph in rendering or by signaling
    an appropriate error in a streaming protocol), even though it is forbidden by the standard
    from interpreting that code point as an abstract character.
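
The D14 ranges quoted above can be written as a small predicate. This is a
sketch only: is_noncharacter is a made-up name, not a core Perl function.
The sample list includes U+FEFF, the code point Unicode assigns to the byte
order mark, alongside U+FFFE.

```perl
use strict;
use warnings;

# Hypothetical helper implementing the D14 ranges: U+nFFFE and U+nFFFF
# for n from 0 to 0x10, plus U+FDD0..U+FDEF.
sub is_noncharacter {
    my ($cp) = @_;
    return 0 if $cp < 0 || $cp > 0x10FFFF;
    return 1 if $cp >= 0xFDD0 && $cp <= 0xFDEF;
    # The low 16 bits are FFFE or FFFF exactly when masking off the
    # lowest bit leaves FFFE.
    return (($cp & 0xFFFE) == 0xFFFE) ? 1 : 0;
}

for my $cp (0xFEFF, 0xFFFE, 0xFFFF, 0xFDD0, 0x1FFFE, 0x10FFFF, 0xFFFD) {
    printf "U+%04X %s\n", $cp,
        is_noncharacter($cp) ? "noncharacter" : "not a noncharacter";
}
```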


I could swear D14 says that all these files with dummy UTF-8 BOMs that
"purport to be in a Unicode character encoding form" are
nonconformant (read: they're lying).

I know that Unicode says UTF-8 BOMs "are neither required nor recommended".

But doesn't Chapter 3 go further than that, saying that UTF-8 BOMs must not
occur in conformant UTF-8 streams, and that a conformant process must
not interpret them as abstract characters, since they are by definition
noncharacters?  But they aren't "malformed sequences" either.

Obviously, I'm very confused.  

--tom
