perl.perl5.porters | Postings from March 2007

Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)

Marc Lehmann
March 30, 2007 17:21
On Sat, Mar 31, 2007 at 01:39:06AM +0000, Tels <> wrote:
> My question was posed because I wanted to know how to *keep* a KOI8 (or any 
> other random binary) string in Perl without converting it to Unicode. It 
> seems to me this is not easily possible because there are literally dozens of 
> places where your KOI8 string might get suddenly upgraded to UTF-8 (and 
> thus get corrupted because Perl treats it as ISO-8859-1). Or did I get this 
> wrong?

Yes, you did get that wrong, likely because Juerd wants users to care about
that. But in fact, if you try it, nothing will get corrupted unless you use
unpack "C" to get the first byte of your KOI8 string. Then you might get a
surprise (current perl) or an exception (Juerd's idea).
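
A minimal sketch of "if you try it", assuming six arbitrary high-bit bytes
as a stand-in for KOI8 text and using utf8::upgrade to force the internal
UTF-X representation by hand (the variable names are illustrative only). It
deliberately leaves unpack "C" out and only checks the ordinary string
operations:

```perl
use strict;
use warnings;

# Six arbitrary high-bit bytes standing in for KOI8 text (assumed sample data).
my $koi8 = "\xf0\xd2\xc9\xd7\xc5\xd4";

# Force the internal UTF-X representation on a copy.
my $upgraded = $koi8;
utf8::upgrade($upgraded);

# The internal representation differs, but the string value does not.
my $same_value  = $upgraded eq $koi8 ? 1 : 0;
my $same_length = length($upgraded) == length($koi8) ? 1 : 0;
my $first_char  = ord(substr $upgraded, 0, 1);

printf "equal=%d same_length=%d first=0x%02x\n",
    $same_value, $same_length, $first_char;
```

At the Perl level the upgraded copy still has six characters with the same
ordinal values as the original bytes.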

> In an ideal world, you could either just keep everything in utf-8 (that's 
> too slow for some things and not fool-proof either), or rely on no other 
> code to corrupt your data - especially this random third party module you 
> pulled from CPAN last night. :)

In an ideal world, you would just manipulate bytes == characters in Perl
and not care about how it treats them internally. It should handle them
as fast as possible, of course.

The same is true for other things in perl: you do not want to care whether
your scalar contains an integer, a floating-point number, or a string. Use
decides that in perl: if you print an integer scalar, it (also) turns into a
string. If you add a floating-point number to an integer-only scalar, you get
the expected floating-point result.

Perl converts between all those "encodings" transparently in a way that makes
most sense. And the same thing is true for character data.
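
To make the number/string comparison concrete, here is a small sketch (the
variable names are made up for illustration) in which one scalar is used as
a string and as a floating-point operand without any explicit conversion:

```perl
use strict;
use warnings;

my $n = 5;               # starts out holding an integer

my $text = "n is $n";    # used as a string: stringified on demand
my $sum  = $n + 1.5;     # used with a float: the result is a float

print "$text, sum is $sum\n";   # prints "n is 5, sum is 6.5"
```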

There is a small difference, as Perl can have scalars that have both a string
and a double value, for example, and can then choose the fastest
representation. Perl could just as well keep both a UTF-X-encoded and an
octet-encoded version of a string around to optimise for speed.

Of course, that optimisation would need a lot of memory, so the trade-off
chosen in the current implementation is to upgrade/downgrade when needed,
transparently, so your KOI8 bytes stay KOI8 bytes all the time.
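
A sketch of that upgrade/downgrade round trip, again assuming arbitrary
sample bytes in place of real KOI8 text: joining them with a character above
0xff forces the result into UTF-X, yet the part cut back out compares equal
to the input and can be downgraded to plain octets again.

```perl
use strict;
use warnings;

my $koi8 = "\xf0\xd2\xc9\xd7\xc5\xd4";   # assumed sample bytes
my $wide = "\x{263a}";                     # a character above 0xff

# Joining the two forces the result into the UTF-X representation.
my $joined = $koi8 . $wide;

# Cut the original part back out; it compares equal to the input.
my $back   = substr $joined, 0, length $koi8;
my $intact = $back eq $koi8 ? 1 : 0;

# It can also be stored as plain octets again without loss.
# The second argument (FAIL_OK) makes a failure return false instead of dying.
my $downgraded_ok = utf8::downgrade($back, 1) ? 1 : 0;

print "intact=$intact downgraded_ok=$downgraded_ok\n";
```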

It is the few cases where perl doesn't do that I am concerned about.

> IMHO the problem arises from the fact that Perl makes no distinction between 
> a byte string like "a" and a text string like "a", and furthermore, 
> manipulating a byte string (for instance appending a byte) is done with 
> typical string operators. So:

Yeah. It also makes no distinction between numbers and strings. That's Perl.

> 	# works if $y is 7bit and no utf8 flag
> 	# but fails if $y is 7bit with utf8 flag
> 	$byte_string .= $y;
> As you said, all is well as long as you can keep these two beasts separate, 
> but the slightest problem might mangle your data. Such as a decode_utf8 
> setting the UTF8 bit on a 7bit ASCII string, therefore changing the 7bit 
> byte string to a text string.

No, only in Juerd's model, where binary data encoded in UTF-X is a bug. In
real-world perl, that just works fine, and that's what I expect, and that's, I
think, what users expect, too: not having to deal with the internal types.
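
The quoted `$byte_string .= $y` case can be tried directly. This sketch
assumes two sample high-bit bytes and uses utf8::upgrade as a stand-in for
whatever set the flag on the 7-bit string; it then compares the result
character by character:

```perl
use strict;
use warnings;

my $byte_string = "\xff\xfe";   # assumed sample binary data

my $y = "abc";                  # 7-bit text ...
utf8::upgrade($y);              # ... now carrying the utf8 flag

$byte_string .= $y;

# Inspect the result one character at a time.
my @codes = map { ord } split //, $byte_string;

print "@codes\n";               # prints "255 254 97 98 99"
```

The result may be stored in UTF-X internally, but it still compares equal to
"\xff\xfeabc".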

In the same way, you do not have a module that converts numbers to strings,
you just print them:

   my $x = 5;
   print $x;

Again, perl transparently handles the details (which includes(!) character
encoding for the outside world!).
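
As one illustration of encoding for the outside world, this sketch assumes
the output should be UTF-8 and writes through an :encoding layer into an
in-memory buffer (a stand-in for a real file or socket), so the program only
ever prints characters:

```perl
use strict;
use warnings;

my $text = "\x{263a}";   # one character, code point above 0xff

# Write through an encoding layer into an in-memory buffer.
open my $fh, '>:encoding(UTF-8)', \my $buffer
    or die "open failed: $!";
print {$fh} $text;
close $fh or die "close failed: $!";

# The buffer now holds the encoded octets, not the character.
my $octets = length $buffer;

print "characters=", length($text), " octets=$octets\n";
```

The layer does the conversion at the boundary; the code that builds $text
never sees the encoded form.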

> As you said, the current warnings::encode can't decide between the case 
> of "BINARY + UTF_8" and "ISO-8859-1 + UTF_8" as Perl makes no distinction 
> between binary data and ISO-8859-1. And this missing distinction is 
> certainly a bother :)

Only when you hit bugs, or unpack.


                The choice of a
      -----==-     _GNU_
      ----==-- _       generation     Marc Lehmann
      ---==---(_)__  __ ____  __
      --==---/ / _ \/ // /\ \/ /
      -=====/_/_//_/\_,_/ /_/\_\      XX11-RIPE
