develooper Front page | perl.perl5.porters | Postings from March 2007

Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)

From: Juerd Waalboer
Date: March 30, 2007 17:34
Subject: Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)
Message ID: 20070331003355.GK31277@c4.convolution.nl

Tels wrote on 2007-03-31 at 1:39 (+0000):
> My question was posed because I wanted to know how to *keep* a KOI8 (or any
> other random binary) string in Perl without converting it to Unicode. It
> seems to me this is not easily possible because there are literally dozens
> of places where your KOI8 string might get suddenly upgraded to UTF-8 (and
> thus get corrupted because Perl treats it as ISO-8859-1). Or did I get this
> wrong?

A koi8r string is a byte string. If you keep it properly separated from
text strings, it will not be upgraded, and so will not be treated as
latin1. I'm very curious about "sudden upgrades" that aren't related to
mixing with text strings. Should you encounter them, please let me know.
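
To make the mixing case concrete, here is a minimal sketch, assuming the
Encode module that ships with perl 5.8 and later. The variable names and
the sample word are mine, for illustration only.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "привет" as KOI8-R octets, written as escapes so that the encoding of
# this source file does not matter.
my $koi8r_bytes = "\xD0\xD2\xC9\xD7\xC5\xD4";

# A text string holding one character above 255 (U+263A).
my $text = "\x{263A}";

# Mixing: the result is a text string, and each koi8r byte is now read
# as the latin1 character with the same number.
my $mixed      = $koi8r_bytes . $text;
my $mixed_utf8 = encode('UTF-8', $mixed);   # begins with C3 90, not D0

# Decoding first keeps the meaning intact.
my $decoded = decode('koi8-r', $koi8r_bytes) . $text;

printf "bytes flagged: %s, mixed flagged: %s\n",
    (utf8::is_utf8($koi8r_bytes) ? 'yes' : 'no'),
    (utf8::is_utf8($mixed)       ? 'yes' : 'no');
printf "first char of mixed:   U+%04X\n", ord $mixed;
printf "first char of decoded: U+%04X\n", ord $decoded;
```

The byte string itself is never touched here. Only the result of the
concatenation is a text string, and that is where the latin1
interpretation shows up.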

Indeed, some functions and operations will not work properly on koi8r
with regard to character properties. For example, the regex engine has
no idea which bytes are word characters, and which are Cyrillic. It can
only assume the data is either ascii or latin1. For full functionality,
you must decode the string.
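
A small sketch of the regex point, assuming default regex semantics (no
"use locale", no unicode_strings feature) and the core Encode module.
count_word_chars is a helper made up for this example.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $koi8r_bytes = "\xD0\xD2\xC9\xD7\xC5\xD4";   # "привет" in KOI8-R

# Count how many characters of a string match \w.
sub count_word_chars {
    my ($string) = @_;
    my $count = () = $string =~ /\w/g;
    return $count;
}

# As a byte string: only ascii rules apply, so no byte counts as a word
# character.
my $as_bytes = count_word_chars($koi8r_bytes);

# Upgraded without decoding: latin1 rules apply. 0xD7 is the
# multiplication sign in latin1, so one "letter" is not even a word
# character.
my $upgraded = $koi8r_bytes;
utf8::upgrade($upgraded);
my $as_latin1 = count_word_chars($upgraded);

# Decoded: the regex engine sees six Cyrillic letters.
my $decoded     = decode('koi8-r', $koi8r_bytes);
my $as_text     = count_word_chars($decoded);
my $is_cyrillic = $decoded =~ /\A\p{Cyrillic}+\z/ ? 1 : 0;

print "bytes: $as_bytes, latin1: $as_latin1, decoded: $as_text, ",
      "cyrillic: $is_cyrillic\n";
```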

If your program is just a gateway in between other things, and doesn't
do any text processing, just keep the thing a byte string.

Just like $jpeg_image is a byte string that contains JPEG data, and this
can be safely used, $koi8r_string can be a byte string that contains
koi8r text data.
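
As a rough sketch of such a gateway, assuming in-memory filehandles as
stand-ins for whatever the real input and output are: open both ends
with the :raw layer and never mix the data with text strings.

```perl
use strict;
use warnings;

# Stand-ins for the real endpoints; $source holds KOI8-R octets.
my $source = "\xD0\xD2\xC9\xD7\xC5\xD4\n";
my $sink   = '';

open my $in,  '<:raw', \$source or die "cannot open input: $!";
open my $out, '>:raw', \$sink   or die "cannot open output: $!";

# Copy in chunks. The data stays a byte string from start to finish.
while (read($in, my $chunk, 4096)) {
    print {$out} $chunk;
}

close $out or die "cannot close output: $!";
close $in;

print "passed through unchanged\n" if $sink eq $source;
```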

> especially this random third party module you pulled from CPAN last
> night. :)

Well, yes, modules sometimes have bugs. That's something we have to
learn to live with.

> As you said, all is well as long as you can keep these two beasts separate,
> but the slightest problem might mangle your data.

That is true. Programming can be a delicate job. Has always been like
that :)

> Hm, maybe one could write a module that always tacks the encoding onto an SV
> via magic.  (...) so that if you ever try to fuse two strings together
> where one of them is tagged binary, you get an exception (but only
> then!).

That would be neat. You'd effectively have strong typing. I don't think
you can do this in a module, though. It requires checks all over the
place. Maybe Scott Walters' typesafety module can be of help or
inspiration: http://search.cpan.org/~swalters/typesafety-0.05/
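
For the concatenation case alone, a rough sketch is possible with an
object wrapper and the core overload pragma, which is a different
technique from magic on the SV. BinaryString and its bytes method are
names made up for this example, not an existing module, and this
catches only "." and interpolation, not the checks you would need all
over the place.

```perl
use strict;
use warnings;

{
    package BinaryString;   # hypothetical name, for illustration only
    use Carp qw(croak);

    # Any concatenation involving the object is refused. This includes
    # string interpolation and ".=".
    use overload '.' => sub {
        croak "refusing to concatenate a binary-tagged string";
    };

    sub new {
        my ($class, $bytes) = @_;
        return bless { bytes => $bytes }, $class;
    }

    # Explicit access to the raw data.
    sub bytes { return $_[0]{bytes} }
}

my $jpeg = BinaryString->new("\xFF\xD8\xFF\xE0");

# Fusing the tagged string with another string throws an exception.
my $joined = eval { "caption: " . $jpeg };
my $error  = $@;

print "concatenation refused: $error" unless defined $joined;
print "explicit access still works\n"
    if $jpeg->bytes eq "\xFF\xD8\xFF\xE0";
```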

> Yeah, I am not a genius :/ (Sometimes I wish I could upgrade my brain :)

But then, it would be much slower! ;)

> > Codepoints 0..256 in latin1 map to byte values 0..256. That makes it
> > special.
> Erm, I don't buy this because:
> Codepoints 0..256 in KOI8-R (to pick one) map to byte values 0..256. That
> would make it special, too.

I should have said "unicode codepoints 0..255 in latin1 map ...".

The interesting thing about latin1 is that its 0..255 are the same as
unicode codepoints 0..255. The 0..255 (not 256 btw, silly mistake) in
koi8-r can all be found in unicode somewhere, but they're not all in
exactly the same places.
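
This can be checked with a short loop. A minimal sketch, assuming the
core Encode module and its 'iso-8859-1' and 'koi8-r' encoding names:

```perl
use strict;
use warnings;
use Encode qw(decode);

# For every byte value, decode it under both encodings and note when
# the resulting unicode codepoint differs from the byte value.
my (@latin1_moved, @koi8r_moved);
for my $byte (0 .. 255) {
    my $octet = chr $byte;
    push @latin1_moved, $byte
        if ord(decode('iso-8859-1', $octet)) != $byte;
    push @koi8r_moved, $byte
        if ord(decode('koi8-r', $octet)) != $byte;
}

printf "latin1: %d of 256 bytes land on a different codepoint\n",
    scalar @latin1_moved;
printf "koi8-r: %d of 256 bytes land on a different codepoint\n",
    scalar @koi8r_moved;
printf "koi8-r byte 0xD0 decodes to U+%04X\n",
    ord(decode('koi8-r', "\xD0"));
```

For latin1 the list stays empty. For koi8-r only the ascii half stays
in place.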
-- 
kind regards,

  juerd waalboer:  perl hacker  <juerd@juerd.nl>  <http://juerd.nl/sig>
  convolution:     ict solutions and consultancy <sales@convolution.nl>

I do not trust voting computers.
See <http://www.wijvertrouwenstemcomputersniet.nl/>.
