develooper Front page | perl.perl5.porters | Postings from March 2007

Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)

From: Tels
Date: March 30, 2007 16:42
Subject: Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)
Message ID: 200703310139.06384@bloodgate.com
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Moin,

On Friday 30 March 2007 22:38:19 Juerd Waalboer wrote:
> Tels skribis 2007-03-31  0:19 (+0000):
> > Anyway, I wasn't aware that any non-utf8 data in Perl is *always*
> > ISO-8859-1, I thought that, when not specified, this depended on some
> > other stuff. Guess I need to reread the tutorials. :)
>
> Note that they are unicode strings, and that Perl is theoretically free
> to change the internal representation at any time.
>
> > However, this also poses the question: How does Perl know that your
> > data is in KOI8-R?
>
> Because you tell it that it is with "decode". The resulting string is a
> unicode string, which may have any encoding internally. (Practically,
> this is limited to latin1 and utf8.)
>
>     my $text_string = decode("koi8-r", $byte_string);
>
> or, if you prefer different terminology:
>
>     my $unicode_string = decode("koi8-r", $koi8r_string);

I thought you would say this :)

My question was posed because I wanted to know how to *keep* a KOI8 (or any
other random binary) string in Perl without converting it to Unicode. It
seems to me this is not easily possible because there are literally dozens
of places where your KOI8 string might suddenly get upgraded to UTF-8 (and
thus get corrupted because Perl treats it as ISO-8859-1). Or did I get this
wrong?
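
The corruption described above can be reproduced in a few lines. This is a minimal sketch, assuming the core Encode module; the variable names and the sample KOI8-R bytes are illustrative and do not come from the thread.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Six raw KOI8-R bytes (a Russian greeting). Perl does not know the encoding.
my $koi8_bytes = "\xF0\xD2\xC9\xD7\xC5\xD4";

# A text string with a character above 255, so its utf8 flag is on.
my $text = "\x{263A}";

# Concatenation upgrades the byte string as if it were ISO-8859-1.
my $mixed = $koi8_bytes . $text;

# Encoding to UTF-8 now emits the Latin-1 interpretation of each KOI8-R byte.
my $wire = encode('UTF-8', $mixed);

# Decoding first turns the bytes into text, which survives the same steps.
my $decoded = decode('koi8-r', $koi8_bytes);
my $safe    = encode('UTF-8', $decoded . $text);

printf "mangled: %vX\n", substr($wire, 0, 2);
printf "intact:  %vX\n", substr($safe, 0, 2);
```

The first output line shows the UTF-8 encoding of U+00F0 (the Latin-1 reading of byte 0xF0), the second the UTF-8 encoding of the Cyrillic letter that the byte actually stood for.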

In an ideal world, you could either just keep everything in utf-8 (that's 
too slow for some things and not fool-proof either), or rely on no other 
code to corrupt your data - especially this random third party module you 
pulled from CPAN last night. :)

IMHO the problem arises from the fact that Perl makes no distinction between
a byte string like "a" and a text string like "a", and furthermore,
manipulating a byte string (for instance appending a byte) is done with
typical string operators. So:

	$byte_string = 'something random bytes';

	# works if $y is 7bit and no utf8 flag
	# but fails if $y is 7bit with utf8 flag
	$byte_string .= $y;

As you said, all is well as long as you can keep these two beasts separate,
but the slightest problem might mangle your data. Such as a decode_utf8
setting the UTF8 bit on a 7bit ASCII string, therefore changing the 7bit
byte string to a text string.
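
A small sketch of the effect of such a flagged 7bit string on a byte string. Here utf8::upgrade stands in for whatever produced the flagged string (an assumption; the thread is about decode_utf8), and bytes::length is used only to peek at the internal storage size.

```perl
use strict;
use warnings;
use bytes ();    # loaded only for bytes::length

my $byte_string = "\xE9\x00\xFF";    # three raw bytes, utf8 flag off

my $y = "abc";
utf8::upgrade($y);                   # 7bit content, utf8 flag now on

my $result = $byte_string . $y;

printf "flag=%d chars=%d internal_bytes=%d\n",
    (utf8::is_utf8($result) ? 1 : 0), length($result), bytes::length($result);

# The characters themselves are unchanged, so a downgrade recovers the bytes.
my $recovered = $result;
utf8::downgrade($recovered);
```

Code that only looks at characters sees no difference; code that looks at the internal buffer (XS modules, for instance) sees the high bytes stored as two bytes each.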

Hm, maybe one could write a module that always attaches the encoding to an SV
via magic. And then you could have a special encoding called "BINARY" (or
absence of an encoding means it is treated as binary), so that if you ever
try to fuse two strings together where one of them is tagged binary, you
get an exception (but only then!).
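
A rough sketch of that idea follows. It uses a wrapper object with operator overloading instead of SV magic, which is a deliberate simplification; the package name TaggedString and its methods are hypothetical and not an existing module. It assumes the exception is wanted only when exactly one side is binary, and it does not check mismatches between two different text encodings.

```perl
use strict;
use warnings;

package TaggedString;    # hypothetical name, not an existing CPAN module

use Scalar::Util qw(blessed);
use overload
    '.'      => \&_concat,
    '""'     => sub { $_[0]{value} },
    fallback => 1;

sub new {
    my ($class, $value, $encoding) = @_;
    # Absence of an encoding means the string is treated as binary.
    return bless { value => $value, encoding => $encoding // 'BINARY' }, $class;
}

sub encoding { return $_[0]{encoding} }

sub _concat {
    my ($self, $other, $swapped) = @_;
    my $tagged    = blessed($other) && $other->isa(__PACKAGE__);
    my $other_enc = $tagged ? $other->{encoding} : 'BINARY';
    my $other_val = $tagged ? $other->{value}    : $other;

    # Throw only when exactly one side is binary.
    die "refusing to mix BINARY and text strings\n"
        if ($self->{encoding} eq 'BINARY') xor ($other_enc eq 'BINARY');

    my $joined = $swapped ? $other_val . $self->{value}
                          : $self->{value} . $other_val;
    return __PACKAGE__->new($joined, $self->{encoding});
}

package main;

my $binary = TaggedString->new("\xF0\xD2");
my $text   = TaggedString->new("abc", 'UTF-8');

my $fine  = $binary . "\x00";            # binary + untagged: allowed
my $mixed = eval { $binary . $text };    # binary + text: dies
print "mixing failed: $@" unless defined $mixed;
```

A wrapper like this only protects code that keeps the object around; any third-party code that stringifies it gets a plain scalar again, which is why the magic-based approach would be more robust.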

As you said, the current warnings::encode can't decide between the case 
of "BINARY + UTF_8" and "ISO-8859-1 + UTF_8" as Perl makes no distinction 
between binary data and ISO-8859-1. And this missing distinction is 
certainly a bother :)

> > One of the limitations of the "there can be only two encodings" of Perl
> > seems to be that strings are permanently upgraded:
> > 	$iso_8859_1 = '...';
> > 	$utf8 = '...';
> > 	if ($iso_8859_1 eq $utf8) { ... }
>
> $iso_8859_1 is temporarily upgraded to utf8 for this comparison.
>
> (Yes, this copies data, and then throws it away. Again, optimization
> does require knowing internals. The easiest optimization here is to
> utf8::upgrade $iso_8859_1, after which the variable name no longer makes
> sense :))

Nah, in this case I wanted the temporary upgrade :)
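
For the record, a short sketch of that comparison, with sample values filled in for illustration. It uses only the built-in utf8:: functions.

```perl
use strict;
use warnings;

my $iso_8859_1 = "caf\xE9";    # stored as one byte per character
my $utf8       = "caf\xE9";
utf8::upgrade($utf8);          # same characters, stored as UTF-8

# eq compares characters, not the internal representation.
my $equal = $iso_8859_1 eq $utf8;

# The left operand itself is not modified by the comparison.
my $still_unflagged = !utf8::is_utf8($iso_8859_1);

print "equal: ", ($equal ? "yes" : "no"), "\n";
print "left operand still unflagged: ", ($still_unflagged ? "yes" : "no"), "\n";
```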

> > Just like 1 + 2.0 will result in 3.0 and not 3 and we all know how
> > much confusion this creates :) (heh, I fell for it today, even tho I
> > should have known better :)
>
> Doesn't really cause me any headaches, to be honest.

Yeah, I am not a genius :/ (Sometimes I wish I could upgrade my brain :)

> > > The same type of string can be used for binary data, because in the
> > > unicode encoding "latin1", all 256 codepoints map to the same byte
> > > values.
> >
> > This sounds like a circular definition, because in CP1250, also all 256
> > codepoints map to the same byte values. Except they are different byte
> > values :)
>
> I said "unicode encoding", but should have said "unicode codepoints".
>
> Codepoints 0..255 in latin1 map to byte values 0..255. That makes it
> special.

Erm, I don't buy this because:

Codepoints 0..255 in KOI8-R (to pick one) map to byte values 0..255. That
would make it special, too.

(I don't necessarily disagree with you, I just don't understand what you mean).
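
One way to test the quoted claim is to count, per encoding, how many byte values decode to the Unicode codepoint with the same number. This is a minimal sketch assuming the core Encode module; identity_count is a helper name made up for this example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Count the byte values that decode to the codepoint with the same number.
sub identity_count {
    my ($encoding) = @_;
    my $count = 0;
    for my $byte (0 .. 255) {
        my $char = decode($encoding, chr($byte));
        $count++ if length($char) == 1 && ord($char) == $byte;
    }
    return $count;
}

my $latin1_hits = identity_count('iso-8859-1');
my $koi8r_hits  = identity_count('koi8-r');

print "iso-8859-1: $latin1_hits of 256\n";
print "koi8-r:     $koi8r_hits of 256\n";
```

If the quoted statement is read as being about Unicode codepoints, only the first encoding should reach all 256, since the upper half of KOI8-R decodes to Cyrillic and box-drawing codepoints well above 255.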

> > > > In short, it becomes a mess.
> > >
> > > Yes, with strong typing, especially with string subtypes for
> > > arbitrary encodings, it would be cleaner. But it would also not look
> > > like Perl 5.
> >
> > Over the years, I come to the insight that I want to build reliable and
> > fast programs. (easy to maintain, reliable, fast, pick two :-)
>
> I do that with Perl. Really, you should check that language out! You'll
> LOVE it! :)

Yeah, maybe one day I actually start real programming work in Perl. ;)

All the best,

Tels

PS: I think this discussion has become a bit off-topic, so we should 
probably keep it off-list. Just for the original topic and the record, when 
you have pure 7bit ASCII data, Perl (decode etc) should not set the utf8 
flag on the data, as that makes things go slower and is just a waste. In 
fact, it shouldn't even copy the data around etc., it should only make 
exactly one run through the data to count the high-bit bytes.
PPS: Thanx for the discussion, this really helps me to understand things 
better.
P³S: Unrelated to this thread, I was working on benchmarking Encode and the 
ISO-8859-1 to UTF-8 upgrade code. Stay tuned :)
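
The single pass over the data mentioned in the PS could look roughly like this in pure Perl. It is only an illustration of counting high-bit bytes; count_high_bit_bytes is a made-up name, and the real work would happen in C inside Perl or Encode.

```perl
use strict;
use warnings;

# Returns the number of bytes with the high bit set (0x80..0xFF).
# Expects a byte string, i.e. one without the utf8 flag.
sub count_high_bit_bytes {
    my ($bytes) = @_;
    # tr with an empty replacement list counts matches without modifying.
    return ($bytes =~ tr/\x80-\xFF//);
}

my $ascii   = "plain ascii";
my $encoded = "caf\xC3\xA9";    # UTF-8 bytes for "cafe" with an acute accent

printf "%s: %d high-bit bytes\n", 'ascii',   count_high_bit_bytes($ascii);
printf "%s: %d high-bit bytes\n", 'encoded', count_high_bit_bytes($encoded);
```

A count of zero means the data is pure 7bit ASCII, so its bytes are identical in ISO-8859-1 and UTF-8 and no conversion or copy is needed.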

- -- 
 Signed on Sat Mar 31 01:18:34 2007 with key 0x93B84C15.
 View my photo gallery: http://bloodgate.com/photos
 PGP key on http://bloodgate.com/tels.asc or per email.

 ". . . my work, which I've done for a long time, was not pursued in
 order to gain the praise I now enjoy, but chiefly from a craving after
 knowledge, which I notice resides in me more than in most other men. And
 therewithal, whenever I found out anything remarkable, I have thought it
 my duty to put down my discovery on paper, so that all ingenious people
 might be informed thereof."

  -- Antony van Leeuwenhoek. Letter of June 12, 1716
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)

iQEVAwUBRg27uncLPEOTuEwVAQIkXAf+O+FgERCl2lcyr28XpeLcCl17pKtfeVBd
kQn/j7sqMGLYuqzcZMrNIn4gKskw8L1T19Q0XcoJBVb4phlHHKrZttmbBrhN++KA
YfXPd9WH/qg9exYHH/+TDdAWCaJYDYcG2B8xI1NTKrDgwFBt8sJJyt9J2jrJoPJE
6rPpAL9vun1wqv6MJeRacxHWmWk7wXflCIrUt9bf8c+feEpMJ51/331Kgb0tjcFs
85IpfzV9TuFn8I17it//7rPrzJfb1NOSwOcgk/6dj5msIoZv1psmNYZcaysAIGpu
evEdhAjpmiVh+DSnGRZEoWfzGwoJfVwGCOmoaQ2O44e9u+AVmx6x0A==
=gDih
-----END PGP SIGNATURE-----
