-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Moin,

On Saturday 31 March 2007 00:33:55 Juerd Waalboer wrote:
> Tels skribis 2007-03-31 1:39 (+0000):
> > My question was posed because I wanted to know how to *keep* a KOI8 (or
> > any other random binary) string in Perl without converting it to
> > Unicode. It seems to me this is not easily possible because there are
> > literally dozens of places where your KOI8 string might get suddenly
> > upgraded to UTF-8 (and thus get corrupted because Perl treats it as
> > ISO-8859-1). Or did I get this wrong?
>
> A koi8r string is a byte string. If you keep it separated from text
> strings properly, it should not be upgraded and thus treated as latin1.
> I'm very curious as to "sudden upgrades" that aren't related to mixing
> with text strings. Should you encounter them, please let me know.

"Keeping things separate" does not work in the Real World[tm], as far as
I can see. Take this script:

    #!/usr/bin/perl -w

    use Encode qw/decode/;

    my $random = "\xc3\xc3";    # some random bytes
    my $ascii = "a";            # some 7bit data

    # Somebody "helpful" decodes the ascii string:
    # The encoding doesn't actually matter, since it is 7bit anyway.
    # This step happens out of my control (e.g. in third party code)
    my $string = decode('ISO-8859-1', $ascii);

    # now take our random binary data and a 7bit ascii string and do:
    print join (" ", unpack("CCC", "$random$string")), "\n";
    print join (" ", unpack("CCC", "$random$ascii")), "\n";

Now explain to me why this prints different things even though $random is
the same string in both cases, and $string and $ascii should be the same,
too. :)

Bonus points if you manage to not mention the uhh -- ut - utf -- uhm -- er
The Flag[tm].
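A minimal sketch of what goes on inside the script above, assuming the
decoded string comes back with Perl's internal UTF-8 flag set. To keep it
independent of the Encode version, utf8::upgrade() stands in for the
third-party decode() call; the describe() helper is a hypothetical name
used only for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use bytes ();    # load bytes::length() without enabling byte semantics

# Report how Perl currently stores a string: character count,
# internal byte count, and whether the internal UTF-8 flag is set.
sub describe {
    my ($s) = @_;
    return sprintf "chars=%d bytes=%d flagged=%d",
        length($s), bytes::length($s), utf8::is_utf8($s) ? 1 : 0;
}

my $random = "\xc3\xc3";    # some random bytes
my $ascii  = "a";           # some 7bit data

# Assumption: this simulates the "helpful" third-party decode step by
# forcing the internal UTF-8 representation on a pure ASCII string.
my $string = "a";
utf8::upgrade($string);

my $mixed = "$random$string";    # concatenation upgrades $random's bytes
my $plain = "$random$ascii";     # stays a plain byte string

print describe($mixed), "\n";    # chars=3 bytes=5 flagged=1
print describe($plain), "\n";    # chars=3 bytes=3 flagged=0

# At the character level Perl still considers the two strings equal.
print "equal: ", ($mixed eq $plain ? "yes" : "no"), "\n";

# Forcing the string back to its byte representation; utf8::downgrade()
# returns false if a character above 0xFF makes that impossible.
my $restored = $mixed;
utf8::downgrade($restored, 1) or die "cannot downgrade";
print describe($restored), "\n"; # chars=3 bytes=3 flagged=0
```

The two strings compare equal, yet they are stored differently, and any
operation that looks at the internal bytes sees that difference.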

So far, the ways I can see to handle this are:

* replace C with U (lots of code review work, plus it still means your
  200 MByte TIFF file might make a trip to UTF-8 land and back)
* always forcefully downgrade stuff in 7bit ASCII (wasteful) and just
  hope your 8bit data never gets in contact with anything with The
  Flag[tm]
* never mix fire and water, er, dogs and cats, er, I mean text and bytes,
  and pray that every piece of code out there adheres to this, too.

I think the Pray and Hope[tm] strategy doesn't really work, though.

All the best,

Tels

- -- 
 Signed on Sat Mar 31 12:09:53 2007 with key 0x93B84C15.
 Get one of my photo posters: http://bloodgate.com/posters
 PGP key on http://bloodgate.com/tels.asc or per email.

 "Sundials don't work, the one I've had in my basement hasn't changed time
 since I installed it."

  -- grub (11606) on 2004-12-03 on /.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)

iQEVAwUBRg5Sz3cLPEOTuEwVAQJvegf+OVl0Ha2tJ3QIXmkUs+XHXWdYIqtu9xJe
VeBwrelub65lfgIfD8FnNmft+KgZDE8S8QU3sjFo5NArtVT56tFsAeIwtdtC23au
BcobxZxkI9iHWJtkJYlxKHEdSPbWSgJiWfJ7J3fc4zprme3/Zlxgpcd3pyiRee0m
AhpnZ6dui033dNakhZCHu1L/YeUyP72OmGmtWOAJLHGIQ/w0nUrUJrx5kg3WuV88
ATfl7EFVZOxqavSSWJCgBHXvU8iRUg4mmqpoVPY4S9uqMi9IYCZBPZNAc++MSjbn
b0e8+qPTB43zah6EfNSc5Xq22EDEjx7mu0n62FQhajV1lOIoc0kV7g==
=CfKu
-----END PGP SIGNATURE-----