Front page | perl.perl5.porters |
Postings from August 2011
BOMs as noncharacters
From:
Tom Christiansen
Date:
August 14, 2011 17:29
Subject:
BOMs as noncharacters
Message ID:
22810.1313366471@chthon
Is this correct behavior for loose and strict UTF-8 respectively?
% perl -CS -Wle 'print chr(0xFFFE)' | perl -ne 'BEGIN { binmode(STDIN, ":encoding(UTF-8)") || die; } printf "%v04X\n", $_'
Unicode non-character U+FFFE is illegal for open interchange at -e line 1.
utf8 "\xFFFE" does not map to Unicode.
005C.0078.007B.0046.0046.0046.0045.007D.000A
% perl -CS -Xle 'print chr(0xFFFE)' | perl -ne 'BEGIN { binmode(STDIN, ":encoding(UTF-8)") || die; } printf "%v04X\n", $_'
utf8 "\xFFFE" does not map to Unicode.
005C.0078.007B.0046.0046.0046.0045.007D.000A
% perl -CS -Xle 'print chr(0xFFFE)' | perl -ne 'use warnings FATAL => "utf8"; BEGIN { binmode(STDIN, ":encoding(UTF-8)") || die; } print'
utf8 "\xFFFE" does not map to Unicode.
\x{FFFE}
Why am I getting a warning instead of an exception, no matter how hard I try?
Do I misunderstand strict UTF-8? I thought U+FFFE was a noncharacter guaranteed
not to occur in a conformant UTF-8 stream. It is coming back \x{} escaped.
Do I have to somehow set a warn handler beyond what I've done there?
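
For anyone who would rather not read dotted hex, here is a minimal sketch that turns the `%v04X` output from the first two runs back into text. The variable names are mine, purely for illustration; the input string is copied from the output above.

```perl
use strict;
use warnings;

# Dotted-hex string exactly as printed by printf "%v04X" in the runs above.
my $dotted = '005C.0078.007B.0046.0046.0046.0045.007D.000A';

# Convert each hex field back into the character with that ordinal.
my $text = join '', map { chr(hex($_)) } split /\./, $dotted;

# Prints the literal eight characters \x{FFFE} and then a newline,
# i.e. the escaped form, not the code point U+FFFE itself.
print $text;
```

So what arrives on the reading side is the escape sequence spelled out in ASCII, which matches what the third run prints directly.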
Also, does this mean that otherwise apparently UTF-8 files with a U+FFFE
at the start of them are not in fact conformant UTF-8 files? Or is this
a utility program trying to handle mangled text? Or are both true, that
it isn't UTF-8 and also the utility program thing?
From the Unicode Standard version 6.0.0, Chapter 3, conformance:
C2 A process shall not interpret a noncharacter code point as an abstract character.
· The noncharacter code points may be used internally, such as for sentinel
values or delimiters, but should not be exchanged.
"Should not". What happens if they are? Does that disqualify it from being a valid
Unicode character encoding form like UTF-8? And yet, UTF-16 and UTF-32 have to
have BOMs, at least at the fronts. That can't disqualify them. Does it disqualify
UTF-8? Why or why not?
[...]
C10 When a process interprets a code unit sequence which purports to be in a Unicode
character encoding form, it shall treat ill-formed code unit sequences as an error
condition and shall not interpret such sequences as characters.
· For example, in UTF-8 every code unit of the form 110xxxxx₂ must be followed
by a code unit of the form 10xxxxxx₂. A sequence such as 110xxxxx₂ 0xxxxxxx₂
is ill-formed and must never be generated. When faced with this ill-formed
code unit sequence while transforming or interpreting text, a conformant
process must treat the first code unit 110xxxxx₂ as an illegally terminated code
unit sequence--for example, by signaling an error, filtering the code unit out, or
representing the code unit with a marker such as U+FFFD replacement
character.
· Conformant processes cannot interpret ill-formed code unit sequences.
However, the conformance clauses do not prevent processes from operating on code
unit sequences that do not purport to be in a Unicode character encoding form.
For example, for performance reasons a low-level string operation may simply
operate directly on code units, without interpreting them as characters. See,
especially, the discussion under D89.
· Utility programs are not prevented from operating on "mangled" text. For
example, a UTF-8 file could have had CRLF sequences introduced at every 80
bytes by a bad mailer program. This could result in some UTF-8 byte sequences
being interrupted by CRLFs, producing illegal byte sequences. This mangled
text is no longer UTF-8. It is permissible for a conformant program to repair
such text, recognizing that the mangled text was originally well-formed UTF-8
byte sequences. However, such repair of mangled data is a special case, and it
must not be used in circumstances where it would cause security problems.
There are important security issues associated with encoding conversion,
especially with the conversion of malformed text. For more information, see Uni-
[...]
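
To make the C10 example concrete, here is a minimal sketch of two of the options it lists (signaling an error, or substituting U+FFFD), using the core Encode module. The byte pair is my own choice of an instance of the 110xxxxx 0xxxxxxx pattern. This only illustrates the handling of ill-formed sequences; it does not claim anything about how noncharacters are treated.

```perl
use strict;
use warnings;
use Encode ();

# 0xC3 is 11000011 (form 110xxxxx) and 0x28 is 00101000 (form 0xxxxxxx),
# so this pair is an instance of the ill-formed sequence described in C10.
my $bytes = "\xC3\x28";

# Option 1: signal an error. FB_CROAK makes decode() die on malformed
# input; LEAVE_SRC keeps decode() from modifying $bytes in place.
my $ok = eval {
    Encode::decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    1;
};
print "strict decode raised an exception: $@" unless $ok;

# Option 2: with no CHECK argument, decode() replaces the offending
# code unit with U+FFFD and carries on with the rest of the input.
my $repaired = Encode::decode('UTF-8', $bytes);
printf "%vX\n", $repaired;
```

Whether a program should die or substitute is a policy decision; the point of the sketch is only that both are available when decoding by hand.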
D14 Noncharacter: A code point that is permanently reserved for internal use and that
should never be interchanged. Noncharacters consist of the values U+nFFFE and
U+nFFFF (where n is from 0 to 10₁₆) and the values U+FDD0..U+FDEF.
· For more information, see Section 16.7, Noncharacters.
· These code points are permanently reserved as noncharacters.
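
The D14 ranges are easy to get wrong by eye, so here is a minimal sketch of a predicate built straight from that definition. The function name `is_noncharacter` and the sample code points are my own, for illustration only.

```perl
use strict;
use warnings;

# True if $cp is one of the 66 noncharacter code points listed in D14:
# 32 in U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF for each of the 17 planes.
sub is_noncharacter {
    my ($cp) = @_;
    return 0 if $cp < 0 || $cp > 0x10FFFF;          # outside the codespace
    return 1 if $cp >= 0xFDD0 && $cp <= 0xFDEF;     # U+FDD0..U+FDEF
    return 1 if ($cp & 0xFFFE) == 0xFFFE;           # low 16 bits FFFE or FFFF
    return 0;
}

# Sample code points, including U+FFFE and U+FEFF for comparison.
for my $cp (0xFFFE, 0xFFFF, 0xFDD0, 0x10FFFF, 0xFEFF, 0x0041) {
    printf "U+%04X %s\n", $cp,
        is_noncharacter($cp) ? 'noncharacter' : 'not a noncharacter';
}
```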
D15 Reserved code point: Any code point of the Unicode Standard that is reserved for
future assignment. Also known as an unassigned code point.
· Surrogate code points and noncharacters are considered assigned code points,
but not assigned characters.
· For a summary classification of reserved and other types of code points, see
Table 2-3.
In general, a conforming process may indicate the presence of a code point whose use has
not been designated (for example, by showing a missing glyph in rendering or by signaling
an appropriate error in a streaming protocol), even though it is forbidden by the standard
from interpreting that code point as an abstract character.
I could swear D14 says that all these files with dummy UTF-8 BOMs that
"purport to be in a Unicode character encoding form" are
nonconformant (read: they're lying).
I know that Unicode says UTF-8 BOMs "are neither required nor recommended".
But doesn't Chapter 3 go further than that, saying that UTF-8 BOMs must not
occur in conformant UTF-8 streams, and that a conformant process must
not interpret them as abstract characters, since they are by definition
noncharacters? But they aren't "malformed sequences" either.
Obviously, I'm very confused.
--tom