[ ...continuing "C<use encoding> Considered Harmful" ]
Wasn't *that* fun?
Sure, once you C<use encoding "utf8">, then you can no longer
use the ISO8859-1 encoding of "tsch\xFC\xDF". You need the UTF-8
one. So a literal string you want read as "tschüß" must have a
different set of bytes, the UTF-8 encoded version.
Fine; that's all good and expected.
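For reference, the two byte sequences can be inspected without the pragma at all. This is only a sketch using the core Encode module; the helper name hex_octets is made up for the illustration, and the code points are written symbolically so the script itself stays 7-bit clean.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex octets.
sub hex_octets {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

# "tschüß" as abstract characters; no "use encoding" is in effect here.
my $word = "tsch\x{FC}\x{DF}";

my $latin1 = encode('ISO-8859-1', $word);
my $utf8   = encode('UTF-8',      $word);

print "ISO-8859-1: ", hex_octets($latin1), "\n";   # 74 73 63 68 FC DF
print "UTF-8:      ", hex_octets($utf8),   "\n";   # 74 73 63 68 C3 BC C3 9F
```

Each high-bit character takes one octet in ISO-8859-1 and two in UTF-8, which is why the literal bytes in the source file have to change.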
But there's more. You can no longer write "tsch\xFC\xDF" under
C<use encoding "utf8">. You must now write the octets as UTF-8 wants
to see them: "tsch\xC3\xBC\xC3\x9F". So not only must all high-bit
literal data be exactly encoded, you must also pre-(re-?)encode every
7-bit-clean SYMBOLIC mention of all code points over 127, each in its
precise physical bitwise layout according to the encoding you've used.
You can't dodge by writing "tsch".chr(0xFC).chr(0xDF) or any other string-
composing trick. You really do have to write out the blinking octets as
they encode. Get that? Under C<use encoding>, "\xFC" is *not* the
character whose code point is 0xFC!! The old equality of chr(0xFC) eq
"\xFC" is out the door. Now chr(0xFC) eq "\xC3\xBC", and chr(0xDF) ne
"\xDF" as chr(0xDF) must be written eq "\xC3\x9F". If this seems like
fun, try UTF-16 where it's out with familiar "tsch\xFC\xDF" and in with
"\x00t\x00s\x00c\x00h\x00\xFC\x00\xDF", maybe +"\xFE\xFF" in front.
In UTF-7, it's "tsch+APwA3w-".
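The octets for the wider encodings are easy to get wrong by hand, so here is a hedged sketch that asks the core Encode module for them. It deliberately runs without C<use encoding>, so it shows the plain-Perl baseline (where chr(0xFC) and "\xFC" are the same string), not the pragma's behavior.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Without "use encoding", the symbolic escape and chr() agree.
my $same = chr(0xFC) eq "\xFC" ? "equal" : "different";
print "chr(0xFC) vs \\xFC: $same\n";

my $word = "tsch" . chr(0xFC) . chr(0xDF);

# Encode's "UTF-16" is big-endian with a BOM prepended;
# "UTF-16BE" is the same layout with no BOM.
my $utf16be = encode('UTF-16BE', $word);
my $utf16   = encode('UTF-16',   $word);
my $utf7    = encode('UTF-7',    $word);

print "UTF-16BE: ", uc(unpack('H*', $utf16be)), "\n";  # 007400730063006800FC00DF
print "UTF-16:   ", uc(unpack('H*', $utf16)),   "\n";  # FEFF007400730063006800FC00DF
print "UTF-7:    $utf7\n";
```

Asking the encoder for the octets keeps the source readable and leaves only one place to change if the target encoding changes.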
You have to know and write all these. How ridiculous is that, and why
would anyone (knowingly) inflict this on themself, or others? What a
maintenance nightmare! How many of you really already knew about this?
Honestly, please; am I truly the only one here caught unaware by what
appears to me a gross failure of abstraction? I didn't realize the
holes in my head were as big as they plainly are.
At some point this failed to follow Perl's prime directive that
"easy things should be easy." This seems hard to understand, hard
to explain, and hard to work with, and [I believe] few can correctly
predict what it will do. I wish I were wrong.
I don't even think it fixable, since surely there's code "out there"
that relies upon this unsane state of affairs.
Speaking of C<use encoding>, for yet another good time--and I've
plenty more where these come from--guess before running it the
exact output *this* produces:

    #!/usr/bin/perl
    use encoding;
    print "Hello, brave new world!\n";

That's enough. I won't ask anyone to guess how to *reliably* write

    if ($data =~ s/^$BOM//) { $byte_order = XXX; }

where BOM is the two-byte sequence FF FE or FE FF, depending. It's
probably not what you may think it is :(, since C<use encoding "utf8">
renders that otherwise straightforward problem pathetically tortuous.
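For contrast, here is a sketch of the straightforward version. It assumes no C<use encoding> is in effect and that $data holds raw octets (for example, read from a binmode'd handle). The helper name strip_utf16_bom and the 'UTF-16LE'/'UTF-16BE' labels are illustrative choices, not anything prescribed above, and this makes no claim about what works under the pragma.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: strip a leading two-byte UTF-16 BOM from raw octets
# (passed by reference) and return the byte order it indicated, or nothing
# if no BOM was present.
sub strip_utf16_bom {
    my ($data_ref) = @_;
    return 'UTF-16LE' if $$data_ref =~ s/\A\xFF\xFE//;
    return 'UTF-16BE' if $$data_ref =~ s/\A\xFE\xFF//;
    return;
}

# Sample input: a big-endian BOM followed by "ts" in UTF-16BE.
my $data = "\xFE\xFF\x00t\x00s";

my $byte_order = strip_utf16_bom(\$data);
my $label = defined $byte_order ? $byte_order : 'no BOM';
print "$label, ", length($data), " octets remain\n";
```

This depends on "\xFF" and "\xFE" each meaning a single octet, which is exactly the assumption the pragma takes away.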
--tom, who's running short of toes to stub