develooper Front page | perl.perl5.porters | Postings from May 2008

Re: on broken manpages, trolling, inconsistent implementation and the difficulty to fix bugs

Juerd Waalboer
May 20, 2008 08:01
demerphq skribis 2008-05-20 14:10 (+0200):
> Where this gets confusing is that Perl does in fact assume Latin-1
> semantics for its octet based strings in a number of common cases,

I think you mean "ASCII semantics" there. In these cases, the second
half of latin1 is ignored and left alone.

Latin1 was (re)defined as a Unicode encoding in 1998, which means that
0xE9 is no longer just something that looks like é, but defined as
U+00E9 LATIN SMALL LETTER E WITH ACUTE. This does, of course, have the
implication that the letter now has an uppercase variant in U+00C9 LATIN
CAPITAL LETTER E WITH ACUTE which is encoded as the 0xC9 byte in latin1.
Perl ignores this part of the specification, and that's why I think it's
incorrect to call what Perl does "Latin-1 semantics".
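To make that concrete, here is a minimal sketch. It assumes a perl without "use locale" and without the later "unicode_strings" feature in effect, on an ASCII (non-EBCDIC) platform; the variable names are only for illustration.

```perl
use strict;
use warnings;

# Byte string (SvUTF8 off): uc only applies ASCII case mapping,
# so the upper half of latin1 is left alone.
my $bytes       = "\xE9";
my $upper_bytes = uc $bytes;

# The same character, but stored with SvUTF8 on: Unicode case mapping applies.
my $chars = "\xE9";
utf8::upgrade($chars);
my $upper_chars = uc $chars;

printf "SvUTF8 off: uc -> %02X\n", ord $upper_bytes;   # E9 under the assumptions above
printf "SvUTF8 on:  uc -> %02X\n", ord $upper_chars;   # C9
```

Same logical string, different result, depending only on the internal flag.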

In fact, latin1 semantics are pretty hard to describe: uc("\xff")
(\xFF is U+00FF LATIN SMALL LETTER Y WITH DIAERESIS) cannot be expressed
in latin1, because the uppercase of U+00FF is U+0178, which has no
representation in latin1.
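A small sketch of that corner case, using the Encode module. The string is upgraded first so that Unicode case mapping applies; "iso-8859-1" is Encode's name for latin1, and the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $y = "\xFF";                 # U+00FF LATIN SMALL LETTER Y WITH DIAERESIS
utf8::upgrade($y);              # force Unicode semantics for uc
my $upper = uc $y;

printf "uc gives U+%04X\n", ord $upper;

# encode() with a CHECK argument may modify its input, so work on a copy.
my $copy = $upper;
my $encodable = eval { encode('iso-8859-1', $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
print $encodable ? "fits in latin1\n" : "does not fit in latin1\n";
```

With FB_CROAK, Encode dies instead of silently substituting a "?", which is why the eval is there.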

> I think Marc is right, the utf8 flag being off doesn't say "this data
> is latin1" and the utf8 flag being on doesn't say "this data is
> Unicode". The flag instead says (when off) "this is array of
> characters" or "this is an array of integers encoded as utf8" (when
> on). 

You're making a distinction between "characters" (SvUTF8 off) and
"integers" (SvUTF8 on) that I don't understand. Could you explain why
there is a difference and what that is?
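For reference, this is how I look at the flag: a minimal sketch showing that the same string value can be stored either way. utf8::upgrade and utf8::is_utf8 are core functions; the variable names are just for the example.

```perl
use strict;
use warnings;

my $off = "caf\xE9";            # 4 characters, stored as 4 octets, SvUTF8 off
my $on  = $off;
utf8::upgrade($on);             # same 4 characters, stored as UTF-8, SvUTF8 on

my $flag_off = utf8::is_utf8($off) ? 1 : 0;
my $flag_on  = utf8::is_utf8($on)  ? 1 : 0;
my $same     = ($off eq $on)       ? 1 : 0;

print "flags:   $flag_off / $flag_on\n";
print "lengths: ", length($off), " / ", length($on), "\n";
print "equal:   $same\n";
```

Both strings have length 4 and compare equal with eq; only the internal storage differs.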

> Latin-1 is a character set.

Latin-1 is both a character set and an encoding. The character set is
defined as equal to the first 256 characters in Unicode (U+0000 ..
U+00FF), and the encoding is defined as a straightforward 8-bit
encoding: U+0000 => 0x00 .. U+00FF => 0xFF. They even went as far as
describing how the individual bits are to be laid out in the byte. Not
surprisingly, the 8 bits have weights from 128 to 1, where each
subsequent bit is half the value of the one before it :)
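The identity mapping is easy to check with Encode. A quick sketch, assuming Encode's "iso-8859-1" is the encoding meant here:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Every code point U+0000 .. U+00FF should encode to the single byte
# with the same numeric value.
my $mismatches = 0;
for my $cp (0 .. 0xFF) {
    my $byte = encode('iso-8859-1', chr $cp);
    $mismatches++ unless length($byte) == 1 && ord($byte) == $cp;
}

print "mismatches: $mismatches\n";
```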

The specification uses the term "coded representation" rather than

> The issues i see are this:
> 1. We don't have a binary data type.

I intend to release a module that handles this in Perl space in a way
that is backward compatible to 5.000. Its name is BLOB.

One thing it doesn't do is prevent concatenation with non-BLOBs. I'd
like to learn whether this can be done at all.

> 3. We use the name of an encoding of Unicode as the name of for the
> encoding of a string causing confusion.

Indeed. Maybe it would be wise to start calling the internal
representation SvUTF8 encoding, rather than UTF8 encoding. Or maybe a
wholly different name.

> Maybe by making PV's store more information about their character set.

The Encode suite treats character sets as properties of encodings; the
user only has to deal with a single character set, namely Unicode. I
think that's the only sane approach. Information about the
charset/encoding does not have to be in the string, but belongs to
operations, as Marc aptly describes in the first post carrying this subject.
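A sketch of what "belongs to operations" means in practice: the encoding is named when decoding or encoding, and the string in between is just Unicode characters. The byte values and variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Two different byte sequences for the same character, U+00E9.
my $latin1_bytes = "\xE9";
my $utf8_bytes   = "\xC3\xA9";

# The encoding is a property of the decode operation, not of the result.
my $from_latin1 = decode('iso-8859-1', $latin1_bytes);
my $from_utf8   = decode('UTF-8',      $utf8_bytes);

print $from_latin1 eq $from_utf8 ? "same character string\n" : "different\n";

# On output, the encoding is chosen again, independently of the input.
my $out = encode('UTF-8', $from_latin1);
printf "output bytes: %s\n", join ' ', map { sprintf '%02X', ord } split //, $out;
```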

Met vriendelijke groet,  Kind regards,  Korajn salutojn,

  Juerd Waalboer:  Perl hacker
  Convolution:     ICT solutions and consultancy
