
Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)

March 31, 2007 03:40


On Saturday 31 March 2007 00:33:55 Juerd Waalboer wrote:
> Tels skribis 2007-03-31  1:39 (+0000):
> > My question was posed because I wanted to know how to *keep* a KOI8 (or
> > any other random binary) string in Perl without converting it to
> > Unicode. It seems to me this is not easily possible because there are
> > literally dozens of places where your KOI8 string might get suddenly
> > upgraded to UTF-8 (and thus get corrupted because Perl treats it as
> > ISO-8859-1). Or did I get this wrong?
> A koi8r string is a byte string. If you keep it separated from text
> strings properly, it should not be upgraded and thus treated as latin1.
> I'm very curious as to "sudden upgrades" that aren't related to mixing
> with text strings. Should you encounter them, please let me know.

"Keeping things separate" does not work in the Real World[tm], as far as I
can see. Consider:

	#!/usr/bin/perl -w
	use Encode qw/decode/;
	my $random = "\xc3\xc3";        # some random bytes
	my $ascii = "a";		# some 7bit data

	# Somebody "helpful" decodes the ascii string:
	# The encoding doesn't actually matter, since it is 7bit anyway.
	# This step happens out of my control (e.g. in third party code)
	my $string = decode('ISO-8859-1', $ascii);

	# now take our random binary data and a 7bit ascii string and do:
	print join (" ", unpack("CCC", "$random$string")), "\n";
	print join (" ", unpack("CCC", "$random$ascii")), "\n";

Now explain to me why this prints different things even tho $random is the 
same string in both cases, and $string and $ascii should be the same, 
too. :) Bonus points if you manage to not mention the uhh -- ut - utf -- 
uhm -- er The Flag[tm].
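
The difference between the two concatenated strings can also be made
visible without going through unpack. What follows is only a rough sketch
reusing the variable names from the example above; utf8::is_utf8() and
bytes::length() are core functions, while $mixed and $plain are names made
up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw/decode/;
use bytes ();    # load bytes::length() without enabling byte semantics

my $random = "\xc3\xc3";                       # two raw bytes
my $ascii  = "a";                              # some 7bit data
my $string = decode('ISO-8859-1', $ascii);     # same character, but decoded

my $mixed = "$random$string";    # bytes joined with the decoded string
my $plain = "$random$ascii";     # bytes joined with the untouched string

for my $pair ([mixed => $mixed], [plain => $plain]) {
    my ($name, $value) = @$pair;
    # length() counts characters, bytes::length() counts the octets
    # Perl stores internally for that string.
    printf "%s: flag=%d chars=%d internal bytes=%d\n",
        $name,
        utf8::is_utf8($value) ? 1 : 0,
        length($value),
        bytes::length($value);
}
```

Both strings report the same number of characters. Whenever the flag is
set on one of them, its internal byte count is larger, because each \xc3
is then stored as two octets.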

So far, I can see the ways to handle this are:

* replace C with U (lots of code review work, plus it still means your
  200Mbyte TIFF file might make a trip to UTF-8 land and back)
* always forcefully downgrade stuff in 7bit ASCII (wasteful) and just hope
  your 8bit data never gets in contact with anything with The Flag[tm]
* never mix fire and water er dogs and cats er I mean text and bytes, and
  pray that every piece of code out there adheres to this, too.
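
A variation on the downgrade idea, applied to the binary data right before
it is unpacked, might look roughly like this. It is a sketch only:
octets_of() is a hypothetical helper name, and it copies the data, so it
has the same cost problem as above for large strings.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw/decode/;

# Hypothetical helper: return the octets of $data as a list of integers,
# whether or not the string got upgraded somewhere along the way.
sub octets_of {
    my ($data) = @_;    # works on a copy, the caller's string is untouched
    # With a true second argument utf8::downgrade() returns false instead
    # of dying when the string holds characters above 0xFF.
    utf8::downgrade($data, 1)
        or die "string contains characters > 0xFF, not binary data\n";
    return unpack("C*", $data);
}

my $random = "\xc3\xc3";                     # some random bytes
my $ascii  = "a";                            # some 7bit data
my $string = decode('ISO-8859-1', $ascii);   # decoded by "somebody else"

print join(" ", octets_of("$random$string")), "\n";    # 195 195 97
print join(" ", octets_of("$random$ascii")), "\n";     # 195 195 97
```

This only works as long as nothing above 0xFF ever got mixed into the
string; once that happens the original bytes cannot be told apart from
text anymore and the helper dies.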

I think the Pray and Hope[tm] strategy doesn't really work, tho.

All the best,


-- 
 Signed on Sat Mar 31 12:09:53 2007 with key 0x93B84C15.
 Get one of my photo posters:
 PGP key on or per email.

 "Sundials don't work, the one I've had in my basement hasn't changed
 time since I installed it." grub (11606) on 2004-12-03 on /.
