
Re: on broken manpages, trolling, inconsistent implementation and the difficulty to fix bugs

Juerd Waalboer
May 20, 2008 11:06
demerphq wrote 2008-05-20 18:03 (+0200):
> > The Encode suite treats character sets as properties of encodings;
> Given how perl works internally does Encode have any other choice?

Sure. Since a string in Perl is just a sequence of numbered characters,
it could theoretically be used to represent any character set, not just
Unicode. We tend to call Perl strings Unicode strings, but in reality
the Unicode-ness is a property not of the string, but of the operations
performed on it. It's a fair coincidence that the multibyte encoding
chosen for the internal representation happens to be a Unicode
encoding ;)
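
To make the "sequence of numbered characters" point concrete, here is a
minimal sketch. The three numbers are arbitrary values picked for
illustration; nothing in the string itself records which character set
they belong to.

```perl
use strict;
use warnings;

# Build a string from three arbitrary character numbers.
my $string = join '', map { chr($_) } (72, 8364, 233);

# Reading it back yields the same numbers; the string carries no
# "this is Unicode" label, only the sequence of integers.
my @numbers = map { ord($_) } split //, $string;

print "@numbers\n";            # 72 8364 233
print length($string), "\n";   # 3
```

Only when an operation such as uc() or a character class is applied do
those numbers get interpreted as Unicode code points.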

> > the user only has to deal with a single character set, namely Unicode.
> Except, er, they don't. As we've been discussing for ages now.

Encode combines "character set" and "byte encoding" into a single
mapping, which it calls "encoding". Perl users can treat binary data as
encoded text.

A Perl programmer decodes the binary data, and later encodes the text
data back to binary. They only specify the "encoding", and the character
set is handled transparently.

Let's call the latin1 character set "l1cs" and the latin1 encoding
"l1enc". The real transformation from UTF-8 to l1enc would be:

    UTF-8 -> unicode -> l1cs -> l1enc

However, Perl provides a unified view of encodings, and bundles the
character set into each of them. What you're actually doing is:

    UTF-8 -> (string of Unicode code points) -> latin1

And you don't have to care about the difference between l1cs and l1enc.
That's what I meant by: the character set is Unicode, and all other
character sets are handled by their encoding implementations.
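
As a rough sketch of that two-step conversion with the Encode module:
decode() and encode() are the module's real functions, while the sample
bytes and variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Binary input: "caf" followed by the UTF-8 bytes C3 A9 (e with acute).
my $utf8_bytes = "caf\xC3\xA9";
my $bytes_in   = length $utf8_bytes;

# UTF-8 -> string of Unicode code points
my $text = decode('UTF-8', $utf8_bytes);

# string of Unicode code points -> latin1; the latin1 character set and
# its byte encoding are handled together by the one "encoding" name.
my $latin1_bytes = encode('ISO-8859-1', $text);

printf "%d bytes in, %d characters, %d bytes out\n",
    $bytes_in, length($text), length($latin1_bytes);
# 5 bytes in, 4 characters, 4 bytes out

printf "last byte: 0x%02X\n", ord(substr($latin1_bytes, -1));
# last byte: 0xE9
```

The program never names "l1cs" separately; it only names the encoding on
each side of the Unicode string.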

> > I think that's the only sane approach. Information about the
> > charset/encoding does not have to be in the string, but belongs to
> > operations, as Marc aptly describes in the first post carrying this
> > subject.
> I don't get you really. If you don't know what type of data is
> contained in a string, how can you know what the correct behaviour is
> for it for a given operation?

By declaring what you expect, so you don't have to know or guess. Perl
operators would expect Unicode text.

uc(), lc(), character classes, et cetera are all text operations. You
don't use them on binary data. Perl assumes that the character set of
the string is Unicode, and uses Unicode semantics. Or, it should.
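
A small sketch of text operations applying Unicode semantics. It
deliberately uses code points above 255 (Greek letters, chosen only as
an example) so that it does not depend on how a given perl treats the
128..255 range.

```perl
use strict;
use warnings;

# Greek small letters alpha, beta, gamma.
my $lower = "\x{3B1}\x{3B2}\x{3B3}";

# uc() maps them by Unicode case rules to the capital letters.
my $upper = uc $lower;
my $codes = join ' ', map { sprintf 'U+%04X', ord($_) } split //, $upper;
print "$codes\n";   # U+0391 U+0392 U+0393

# Character classes follow Unicode too: these are word characters.
my $is_word = $lower =~ /^\w+$/ ? 1 : 0;
print "word characters: $is_word\n";   # word characters: 1
```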

In fact, I couldn't even *find* any other character set with clearly
defined semantics for things like upper/lower case. Unicode appears to
be unique in that. Oh, and ASCII of course :).

Met vriendelijke groet,  Kind regards,  Korajn salutojn,

  Juerd Waalboer:  Perl hacker
  Convolution:     ICT solutions and consultancy