develooper Front page | perl.perl5.porters | Postings from March 2007

Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)

Juerd Waalboer
March 30, 2007 15:38
Tels wrote 2007-03-31  0:19 (+0000):
> Anyway, I wasn't aware that any non-utf8 data in Perl is *always* 
> ISO-8859-1, I thought that, when not specified, this depended on some other 
> stuff. Guess I need to reread the tutorials. :)

Note that they are unicode strings, and that Perl is theoretically free
to change the internal representation at any time.

> However, this also poses the question: How does Perl know that your data is 
> in KOI8-R?

Because you tell it that it is with "decode". The resulting string is a
unicode string, which may have any encoding internally. (Practically,
this is limited to latin1 and utf8.)

    my $text_string = decode("koi8-r", $byte_string);

or, if you prefer different terminology:

    my $unicode_string = decode("koi8-r", $koi8r_string);
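A minimal sketch of that round trip, assuming the Encode module that ships with Perl. The sample bytes and the extra variable names are only for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Three bytes that spell the Russian word "mir" in KOI8-R.
my $byte_string = "\xCD\xC9\xD2";

# decode() is where you tell Perl what the bytes mean. The result is a
# text (unicode) string of three characters.
my $text_string = decode("koi8-r", $byte_string);
my $char_count  = length $text_string;

# The characters are unicode codepoints, not KOI8-R byte values.
my $first_codepoint = ord substr($text_string, 0, 1);    # U+043C

# encode() goes the other way, to whatever byte encoding you ask for.
my $utf8_bytes  = encode("UTF-8",  $text_string);
my $koi8r_bytes = encode("koi8-r", $text_string);
```

Nothing here looks at the internal representation of $text_string; only the encoding names passed to decode and encode matter.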

> One of the limitations of the "there can be only two encodings" of Perl 
> seems to be that strings are permanently upgraded:
> 	$iso_8859_1 = '...';
> 	$utf8 = '...';
> 	if ($iso_8859_1 eq $utf8) { ... }

$iso_8859_1 is temporarily upgraded to utf8 for this comparison.

(Yes, this copies data, and then throws it away. Again, optimization
does require knowing internals. The easiest optimization here is to
utf8::upgrade $iso_8859_1, after which the variable name no longer makes
sense :))
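A small sketch of what that looks like from the outside, using only the built-in utf8:: functions. The variable names and the sample string are mine, and utf8::is_utf8 is used here purely to peek at internals, which normal code should not need to do:

```perl
use strict;
use warnings;

# Two strings with the same four characters ("caf" plus e-acute).
my $latin1 = "caf\xE9";    # stored internally as latin1
my $utf8   = "caf\xE9";
utf8::upgrade($utf8);      # same characters, stored internally as utf8

# eq compares characters, so the different internal encodings do not matter.
my $equal = $latin1 eq $utf8;

# The comparison worked on a temporary copy: $latin1 itself is unchanged.
my $flag_before = utf8::is_utf8($latin1);

# Upgrading once, explicitly, avoids the temporary copy on later comparisons.
utf8::upgrade($latin1);
my $flag_after = utf8::is_utf8($latin1);
my $still_four = length $latin1;    # character count does not change
```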

> Just like 1 + 2.0 will result in 3.0 and not 3 and we all know how
> much confusion this creates :) (heh, I fell for it today, even though I
> should have known better :)

Doesn't really cause me any headaches, to be honest.

> > The same type of string can be used for binary data, because in the
> > unicode encoding "latin1", all 256 codepoints map to the same byte
> > values.
> This sounds like a circular definition, because in CP1250, also all 256 
> codepoints map to the same byte values. Except they are different byte 
> values :)

I said "unicode encoding", but should have said "unicode codepoints".

Codepoints 0..255 in latin1 map to byte values 0..255. That makes it
suitable for binary data.
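A quick sketch to illustrate, assuming nothing beyond the built-in utf8:: functions (the variable names are made up): a string holding every byte value from 0 through 255 keeps the same characters whichever internal representation Perl happens to use.

```perl
use strict;
use warnings;

# "Binary data": every possible byte value exactly once.
my $binary = join '', map { chr } 0 .. 255;

# Force the other internal representation on a copy.
my $copy = $binary;
utf8::upgrade($copy);

# Still 256 characters with the same codepoints, so the strings are equal.
my $same   = $copy eq $binary;
my $length = length $copy;

# Going back to one byte per character always works for codepoints 0..255.
utf8::downgrade($copy);
my $flag = utf8::is_utf8($copy);
```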

> > > In short, it becomes a mess.
> > Yes, with strong typing, especially with string subtypes for arbitrary
> > encodings, it would be cleaner. But it would also not look like Perl 5.
> Over the years, I have come to the insight that I want to build reliable
> and fast programs. (easy to maintain, reliable, fast, pick two :-)

I do that with Perl. Really, you should check that language out! You'll
LOVE it! :)
Kind regards,

  juerd waalboer:  perl hacker  <>  <>
  convolution:     ict solutions and consultancy <>

I do not trust voting computers.
See <>.
