Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)

From:
Tels
Date:
March 31, 2007 09:38
Subject:
Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)
Message ID:
200703311838.33237@bloodgate.com
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Moin,

On Saturday 31 March 2007 16:09:18 Juerd Waalboer wrote:
> Tels skribis 2007-03-31 12:23 (+0000):
> > 	#!/usr/bin/perl -w
> > 	use Encode qw/decode/;
> > 	my $random = "\xc3\xc3";        # some random bytes
> > 	my $ascii = "a";		# some 7bit data
> >
> > 	# Somebody "helpful" decodes the ascii string:
> > 	# The encoding doesn't actually matter, since it is 7bit anyway.
> > 	# This step happens out of my control (e.g. in third party code)
> > 	$string = decode('ISO-8859-1', $ascii);
>
> $string is a text string, now. Remember, decoding is going from byte
> string to text string.

Yes, but my point was that I:

* might not be the one who "decoded" $string, or even the one who produced it.
* do not know whether I am being passed a "text" string, as there is only the
  flag-you-should-not-know-about to distinguish the two kinds.

> Using unpack "C" on a text string makes no sense if you consider that
> this "C" doesn't stand for "character" in the sense that the
> documentation for chr, ord, length, split, etcetera use. It stands for
> "char", which is a C datatype that contains one byte.
>
> As such, unpack "C" is a byte operation and makes sense on byte strings
> only. $string is a text string, and you can tell by looking at the
> decode() step.
>
> > 	# now take our random binary data and a 7bit ascii string and do:
> > 	print join (" ", unpack("CCC", "$random$string")), "\n";
>
> Dangerous, and that's why I suggested adding a "wide character in..."
> warning earlier in this thread.
>
> > Now explain to me why this prints different things even tho $random is
> > the same string in both cases, and $string and $ascii should be the
> > same, too. :) Bonus points if you manage to not mention the uhh -- ut -
> > utf -- uhm -- er The Flag[tm].
>
> I get the bonus points! Hurrah! :)

Not really, as you didn't explain the difference. You merely told me "there
is a difference" (where I personally don't expect there to be one).
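
A minimal sketch of the difference in question. It assumes that the
third-party decode() step leaves "a" stored in upgraded (UTF-8) form; since
whether decode() really does that for 7-bit input is what the bug report is
about, utf8::upgrade() is used here as a stand-in that is guaranteed to do
so. The names $plain and $mixed are only for illustration.

```perl
use strict;
use warnings;
use bytes ();    # gives us bytes::length() without enabling the pragma

my $random = "\xc3\xc3";    # two arbitrary bytes, as in the quoted script
my $ascii  = "a";           # 7-bit data, stored one byte per character

# Stand-in (assumption) for the decode() call: same characters as $ascii,
# but stored internally in upgraded form.
my $string = $ascii;
utf8::upgrade($string);

my $plain = $random . $ascii;     # bytes joined with bytes
my $mixed = $random . $string;    # bytes joined with an upgraded string

printf "equal as strings:  %s\n", ($plain eq $mixed ? "yes" : "no");
printf "characters:        %d vs %d\n", length($plain), length($mixed);
printf "internal bytes:    %d vs %d\n",
    bytes::length($plain), bytes::length($mixed);
printf "internal flag:     %s vs %s\n",
    (utf8::is_utf8($plain) ? "on" : "off"),
    (utf8::is_utf8($mixed) ? "on" : "off");
```

The two results compare equal and have the same number of characters, yet
the second one is stored differently, because the concatenation upgraded
the two random bytes as well. Any operation that looks at the internal
storage rather than at the characters can therefore see different data. As
far as I know that was the case for unpack "C" in perls of that time, and
later releases changed unpack to work per character, so what the quoted
one-liner prints depends on the perl version.
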

> The only explanation that I used is the separation between text strings
> and binary strings. It's also the only thing you need to know. You'll  
> benefit from knowing more, certainly, but I see red flags in your code.

Ok, and how am I supposed to know whether, in:

	sub dosomething {
		my $a = shift;
	}

$a is a text string or a binary string? :)
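
One possible way for a routine that wants bytes to protect itself, sketched
under the assumption that it is acceptable to die on input that cannot be
bytes. Nobody in this thread proposed it; as_bytes() is a hypothetical
helper name, and the argument is called $arg rather than $a only to stay
clear of sort's special variable.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of the argument that is stored one
# byte per character, or die if it holds characters above 0xFF.
sub as_bytes {
    my ($s) = @_;                 # copy, so the caller's value is untouched
    utf8::downgrade($s, 1)        # true 2nd argument: return false, don't die
        or die "as_bytes: string contains characters above 0xFF\n";
    return $s;
}

sub dosomething {
    my $arg = as_bytes(shift);    # from here on, byte operations are safe
    return join " ", unpack("C*", $arg);
}

my $upgraded = "\xc3\xc3a";
utf8::upgrade($upgraded);         # same characters, different storage

print dosomething("\xc3\xc3a"), "\n";
print dosomething($upgraded), "\n";
```

Both calls should print the same three numbers, because the internal
representation is normalised before any byte-level operation runs. This
does not answer whether the caller meant text or bytes; it only makes the
result independent of how the string happens to be stored.
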


> > So far, I can see the ways to handle this are:
> > (..)
> > * never mix fire and water er dogs and cats er I mean text and bytes,
> > and pray that every piece of code out there adheres to this, too.
>
> Exactly.

This is not a working strategy.

> > I think the Pray and Hope[tm] strategy doesn't really work, tho.
>
> It doesn't always work, because people can't be trusted to do the right
> thing, but it can always be fixed.

Only if you consider your own code. But data is sometimes processed by other
code (Perl itself, some module, etc.).
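
For code that is under one's own control, the "fix" presumably looks like
the following sketch: encode the text explicitly at the point where it
meets bytes. The choice of UTF-8 as the output encoding and the variable
names are assumptions made for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $random = "\xc3\xc3";                     # binary data
my $text   = decode('ISO-8859-1', "a");      # text string from elsewhere

# Turn the text back into bytes before joining it with binary data, so
# that no implicit upgrade of $random can take place.
my $bytes = $random . encode('UTF-8', $text);

print join(" ", unpack("C*", $bytes)), "\n";
```

This only helps where the mixing happens in code I can edit; it does
nothing for strings that are joined inside a module or inside perl itself.
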

All the best,

Tels

- -- 
 Signed on Sat Mar 31 18:33:51 2007 with key 0x93B84C15.
 Get one of my photo posters: http://bloodgate.com/posters
 PGP key on http://bloodgate.com/tels.asc or per email.

 "We're looking at a future where only the very largest companies will be
 able to implement software, and it will technically be illegal for other
 people to do so."

  -- Bruce Perens, 2004-01-23
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)

iQEVAwUBRg6qqXcLPEOTuEwVAQINCAf/QWq653liE6ZUnR5sUrO8YFVXU0Gi5s/m
wm4teby4dypHRuyjKov7a2XeheRCZU+iYXnlNFk8Tioqd3ZOwlZC5uGbufX1QnpO
H9lYRtDTG14BHH2D+QsMgSrPcAXwsnvSdlePAmy4m9TJ3xQTtzcPLTWt2p8tgiul
URl0lgMHv7I9ASJusYwPa00YRFDexpdVuYpclTtnzzVPoGkuMxAKIDhhAuKp9uSl
gWJXGiha9hvGEZOh2k6mGZ/bkstEMhp3vrqU1ccp11jfahsaAwvU9EVS7254t22R
KqXh3Ca4/lMxs+2+1xW0j518Asq0sB/L6gkyGr0tHdFgQwX7S71yoA==
=K82l
-----END PGP SIGNATURE-----