develooper Front page | perl.perl5.porters | Postings from March 2007

Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)

Juerd Waalboer
March 31, 2007 09:09
Tels wrote 2007-03-31 12:23 (+0000):
> 	#!/usr/bin/perl -w
> 	use Encode qw/decode/;
> 	my $random = "\xc3\xc3";        # some random bytes
> 	my $ascii = "a";		# some 7bit data
> 	# Somebody "helpful" decodes the ascii string:
> 	# The encoding doesn't actually matter, since it is 7bit anyway.
> 	# This step happens out of my control (e.g. in third party code)
> 	$string = decode('ISO-8859-1', $ascii);

$string is a text string, now. Remember, decoding is going from byte
string to text string.

Using unpack "C" on a text string makes no sense once you realize that
this "C" doesn't stand for "character" in the sense that the
documentation for chr, ord, length, split, etcetera uses. It stands for
"char", the C datatype that holds one byte.

As such, unpack "C" is a byte operation and makes sense on byte strings
only. $string is a text string, and you can tell by looking at the
decode() step. 
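
To make that concrete, here is a minimal sketch of keeping the two kinds
of string apart: the decoded text is encoded back to bytes before any
byte operation touches it. The variable names $text, $bytes and @octets
are mine, and ISO-8859-1 is assumed only because the quoted code uses it.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $random = "\xc3\xc3";                  # byte string: two raw octets
my $text   = decode('ISO-8859-1', "a");   # text string, as in the quoted code

# Go back from text to bytes explicitly before doing byte operations.
my $bytes  = $random . encode('ISO-8859-1', $text);

# Both halves are byte strings now, so unpack "C" is well defined.
my @octets = unpack("CCC", $bytes);
print join(" ", @octets), "\n";           # prints "195 195 97"
```

The encode() call is the only new step. It is what makes the result
independent of how perl happens to store $text internally.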

> 	# now take our random binary data and a 7bit ascii string and do:
> 	print join (" ", unpack("CCC", "$random$string")), "\n";

Dangerous, and that's why I suggested adding a "wide character in..."
warning earlier in this thread.
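
As a rough illustration only, a check of that kind could look like this
in user code. unpack_bytes is a hypothetical helper, not something in
perl or Encode, and the warning suggested in this thread would live in
perl itself, in a form not spelled out here.

```perl
use strict;
use warnings;
use Carp qw(carp);

# Hypothetical helper: not part of perl core or Encode.
sub unpack_bytes {
    my ($template, $data) = @_;
    # A character above 255 cannot be a byte, so $data must be text.
    carp "Wide character in unpack_bytes" if $data =~ /[^\x00-\xff]/;
    return unpack($template, $data);
}

my @octets = unpack_bytes("CC", "\xc3\xc3");   # byte string: no warning
```

This sketch only catches characters above 255. A text string whose
characters all fit in one byte, like the decoded "a" in the quoted code,
passes without a warning, so it is no substitute for keeping text and
bytes apart in the first place.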

> Now explain to me why this prints different things even tho $random is the 
> same string in both cases, and $string and $ascii should be the same, 
> too. :) Bonus points if you manage to not mention the uhh -- ut - utf -- 
> uhm -- er The Flag[tm].

I get the bonus points! Hurrah! :)

The only explanation that I used is the separation between text strings
and binary strings. It's also the only thing you need to know. You'll
benefit from knowing more, certainly, but I see red flags in your code.

> So far, I can see the ways to handle this are:
> (..)
> * never mix fire and water er dogs and cats er I mean text and bytes, and
>   pray that every piece of code out there adheres to this, too.


> I think the Pray and Hope[tm] strategy doesn't really work, tho.

It doesn't always work, because people can't be trusted to do the right
thing, but it can always be fixed.

Kind regards,

  juerd waalboer:  perl hacker  <>  <>
  convolution:     ict solutions and consultancy <>

I don't trust voting computers.
See <>.
