Re: [perl #69414] Case-insensitive utf8 matching problem

From: Tom Christiansen
Date: September 27, 2009 18:56
Subject: Re: [perl #69414] Case-insensitive utf8 matching problem
Message ID: 382.1254102931@chthon

[ ...continuing "C<use encoding> Considered Harmful" ]

Wasn't *that* fun?  

Sure, once you C<use encoding "utf8">, then you can no longer 
use the ISO8859-1 encoding of "tsch\xFC\xDF".  You need the UTF-8 
one.  So a literal string you want read as "tschüß" must have a 
different set of bytes, the UTF-8 encoded version.

Fine; that's all good and expected.

But there's more.  You can no longer write "tsch\xFC\xDF" under 
C<use encoding "utf8">.  You must now write the octets as UTF-8 wants 
to see them:  "tsch\xC3\xBC\xC3\x9F".  So not only must all high-bit 
literal data be exactly encoded, you must also pre-(re-?)encode every 
7-bit-clean SYMBOLIC mention of all code points over 127, each in its
precise physical bitwise layout according to the encoding you've used.

You can't dodge by writing "tsch".chr(0xFC).chr(0xDF) or any other string-
composing trick.  You really do have to write out the blinking octets as
they encode.  Get that?  Under C<use encoding>, "\xFC" is *not* the
character whose code point is 0xFC!! The old equality of chr(0xFC) eq
"\xFC" is out the door.  Now chr(0xFC) eq "\xC3\xBC", and chr(0xDF) ne
"\xDF" as chr(0xDF) must be written eq "\xC3\x9F".  If this seems like 
fun, try UTF-16 where it's out with familiar "tsch\xFC\xDF" and in with
"\x00t\x00s\x00c\x00h\x00\xFC\x00\xDF", maybe +"\xFE\xFF" in front.  
In UTF-7, it's "tsch+APwA3w-".
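
As a point of comparison, here is a minimal sketch of the baseline
*without* C<use encoding>: chr(0xFC) and "\xFC" are the same character,
and the octets differ only once the string is explicitly encoded.  It
assumes the core Encode module; hex_octets is a hypothetical helper
name, not something from this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not from the original post): return the octets of
# $string in $encoding as space-separated uppercase hex pairs.
sub hex_octets {
    my ($encoding, $string) = @_;
    my $octets = encode($encoding, $string);
    return join " ", map { sprintf "%02X", ord($_) } split //, $octets;
}

# With no "use encoding" in effect, "\xFC" is simply code point 0xFC.
my $word = "tsch\xFC\xDF";
print 'chr(0xFC) eq "\xFC": ', (chr(0xFC) eq "\xFC" ? "yes" : "no"), "\n";

# Show the physical octets of the same six characters in each encoding.
for my $encoding (qw(ISO-8859-1 UTF-8 UTF-16BE UTF-16LE UTF-7)) {
    printf "%-10s %s\n", $encoding, hex_octets($encoding, $word);
}
```

The ISO-8859-1 row should read 74 73 63 68 FC DF and the UTF-8 row
74 73 63 68 C3 BC C3 9F: same characters, different octets, with the
encoding chosen at one explicit point rather than baked into every
escape in the source.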

You have to know and write all these.  How ridiculous is that, and why
would anyone (knowingly) inflict this on themselves, or others?  What a
maintenance nightmare!  How many of you really already knew about this?
Honestly, please; am I truly the only one here caught unaware by what
appears to me a gross failure of abstraction?  I didn't realize the
holes in my head were as big as they plainly are.

At some point this failed to follow Perl's prime directive that 
"easy things should be easy."  This seems hard to understand, hard 
to explain, and hard to work with, and [I believe] few can correctly 
predict what it will do.  I wish I were wrong.

I don't even think it fixable, since surely there's code "out there" 
that relies upon this unsane state of affairs.

Speaking of C<use encoding>, for yet another good time--and I've 
plenty more where these come from--guess before running it the 
exact output *this* produces:

    #!/usr/bin/perl
    use encoding;
    print "Hello, brave new world!\n";

That's enough.  I won't ask anyone to guess how to *reliably* write 

    if ($data =~ s/^$BOM//)  { $byte_order = XXX; }

where BOM is the two-byte sequence FF FE or FE FF, depending.  It's
probably not what you may think it is :(, since C<use encoding "utf8">
renders that otherwise straightforward problem pathetically tortuous.
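
For contrast, a sketch of the straightforward version, assuming $data
holds raw octets and no C<use encoding> is in effect, so that "\xFF" in
the pattern means the single octet FF.  strip_utf16_bom is a
hypothetical name, and this only looks for the two-byte UTF-16 marks
mentioned above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: strip a leading two-byte BOM from a string of raw
# octets (modified in place through the reference) and return the byte
# order it signalled, or undef if no BOM is present.
sub strip_utf16_bom {
    my ($data_ref) = @_;
    return "big-endian"    if $$data_ref =~ s/\A\xFE\xFF//;
    return "little-endian" if $$data_ref =~ s/\A\xFF\xFE//;
    return;    # no BOM found
}

# Sample input: little-endian BOM followed by "ts" in UTF-16LE.
my $data       = "\xFF\xFE" . "t\x00s\x00";
my $byte_order = strip_utf16_bom(\$data);
print defined($byte_order) ? $byte_order : "unknown", "\n";
```

This is the two-line problem as it looks on plain byte strings; it is
not a claim about what the same pattern does once the pragma has
reinterpreted the escapes.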

--tom, who's running short of toes to stub
