Re: on broken manpages, trolling, inconsistent implementation and the difficulty to fix bugs

From: Juerd Waalboer
Date: May 20, 2008 08:01
Subject: Re: on broken manpages, trolling, inconsistent implementation and the difficulty to fix bugs
Message ID: 20080520150136.GH2842@c4.convolution.nl

demerphq skribis 2008-05-20 14:10 (+0200):
> Where this gets confusing is that Perl does in fact assume Latin-1
> semantics for its octet based strings in a number of common cases,

I think you mean "ASCII semantics" there. In these cases, the second
half of latin1 is ignored and left alone.

Latin1 was (re)defined as a Unicode encoding in 1998, which means that
0xE9 is no longer just something that looks like é, but defined as
U+00E9 LATIN SMALL LETTER E WITH ACUTE. This does, of course, have the
implication that the letter now has an uppercase variant in U+00C9 LATIN
CAPITAL LETTER E WITH ACUTE which is encoded as the 0xC9 byte in latin1.
Perl ignores this part of the specification, and that's why I think it's
incorrect to call what Perl does "Latin-1 semantics".
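
A minimal sketch of that behaviour, assuming a perl that runs without
"use locale" and without any Unicode-strings feature enabled. The helper
uc_codepoints is made up for this example and is not part of Perl.

```perl
use strict;
use warnings;

# Hypothetical helper: uppercase a string and list the resulting
# characters as U+XXXX code points.
sub uc_codepoints {
    my ($str) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split(//, uc($str));
}

my $bytes = "\xE9";        # one octet, UTF8 flag off
my $chars = "\xE9";
utf8::upgrade($chars);     # same character, stored with the UTF8 flag on

print uc_codepoints("e"),    "\n";   # ASCII letter: case-mapped
print uc_codepoints($bytes), "\n";   # flag off: high half left alone
print uc_codepoints($chars), "\n";   # flag on: Unicode case mapping applies
```

Under those assumptions the three lines should read U+0045, U+00E9 and
U+00C9: only the flagged string gets the uppercase that latin1 (as a
Unicode subset) defines.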

In fact, latin1 semantics are pretty hard to describe: the result of
uc("\xff") (\xFF is U+00FF LATIN SMALL LETTER Y WITH DIAERESIS) cannot be
expressed in latin1, because the uppercase of U+00FF is U+0178, which has
no representation in latin1.
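
This can be checked with the Encode module. The following is only a
sketch: fits_in_latin1 is a hypothetical helper, and the string is upgraded
first so that uc() applies Unicode case mapping regardless of perl version.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if every character of $str has a latin1 octet.
sub fits_in_latin1 {
    my ($str) = @_;
    # FB_CROAK makes encode() die on an unmappable character. encode() may
    # modify its argument when a CHECK value is given, so only the local
    # copy in $str is passed in.
    return eval { Encode::encode('ISO-8859-1', $str, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

my $y = "\xFF";
utf8::upgrade($y);         # force Unicode semantics for uc()
my $upper = uc($y);

printf "uppercase code point: U+%04X\n", ord($upper);
print  "fits in latin1: ", (fits_in_latin1($upper) ? "yes" : "no"), "\n";
```

The uppercase comes out as U+0178, and encoding it to ISO-8859-1 fails,
while the lowercase original encodes without trouble.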

> I think Marc is right, the utf8 flag being off doesn't say "this data
> is latin1" and the utf8 flag being on doesn't say "this data is
> Unicode". The flag instead says (when off) "this is array of
> characters" or "this is an array of integers encoded as utf8" (when
> on). 

You're making a distinction between "characters" (SvUTF8 off) and
"integers" (SvUTF8 on) that I don't understand. Could you explain why
there is a difference and what that is?

> Latin-1 is a character set.

Latin-1 is both a character set and an encoding. The character set is
defined as equal to the first 256 characters in Unicode (U+0000 ..
U+00FF), and the encoding is defined as a straightforward 8-bit
encoding: U+0000 => 0x00 .. U+00FF => 0xFF. They even went as far as
describing how the individual bits are to be laid out in the byte. Not
surprisingly, the 8 bits have weights from 128 to 1, where each
subsequent bit is half the value of the one before it :)
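
As a sketch of that mapping, the loop below asks Encode for the latin1
octet of every code point from U+0000 to U+00FF and counts the ones whose
octet value differs from the code point number. The sub names are made up
for this example.

```perl
use strict;
use warnings;
use Encode ();

# Count code points in U+0000..U+00FF whose latin1 encoding is not a
# single octet with the same numeric value.
sub latin1_mismatches {
    my $bad = 0;
    for my $cp (0x00 .. 0xFF) {
        my $octets = Encode::encode('ISO-8859-1', chr($cp));
        $bad++ unless length($octets) == 1 && ord($octets) == $cp;
    }
    return $bad;
}

# Bit weights of one octet, most significant first.
sub bit_weights {
    my @weights = map(2 ** $_, reverse(0 .. 7));
    return @weights;
}

print "mismatches: ", latin1_mismatches(), "\n";
print "bit weights: ", join(' ', bit_weights()), "\n";
```

No mismatches are expected, and the weights print as 128 64 32 16 8 4 2 1.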

The specification uses the term "coded representation" rather than
"encoding".

> The issues i see are this:
> 1. We don't have a binary data type.

I intend to release a module that handles this in Perl space in a way
that is backward compatible to 5.000. Its name is BLOB.

One thing it doesn't do is prevent concatenation with non-BLOBs. I'd
like to learn whether this can be done at all.

> 3. We use the name of an encoding of Unicode as the name for the
> encoding of a string, causing confusion.

Indeed. Maybe it would be wise to start calling the internal
representation SvUTF8 encoding, rather than UTF8 encoding. Or maybe a
wholly different name.

> Maybe by making PV's store more information about their character set.

The Encode suite treats character sets as properties of encodings; the
user only has to deal with a single character set, namely Unicode. I
think that's the only sane approach. Information about the
charset/encoding does not have to be in the string, but belongs to
operations, as Marc aptly describes in the first post carrying this subject.
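
A small sketch of what "belongs to operations" looks like with Encode:
the encoding names are arguments to decode() and encode(), and the string
in between is just Unicode text. The transcode sub is hypothetical, written
only for this illustration.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: convert octets from one encoding to another.
# The encodings are named at the operations; nothing about them is
# stored in the intermediate text string.
sub transcode {
    my ($from, $to, $octets) = @_;
    my $text = Encode::decode($from, $octets);   # octets -> Unicode text
    return Encode::encode($to, $text);           # Unicode text -> octets
}

my $utf8_octets = transcode('ISO-8859-1', 'UTF-8', "\xE9");
print join(' ', map { sprintf('%02X', ord($_)) } split(//, $utf8_octets)), "\n";
```

For the latin1 octet 0xE9 this prints C3 A9, the UTF-8 encoding of U+00E9.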
-- 
Met vriendelijke groet,  Kind regards,  Korajn salutojn,

  Juerd Waalboer:  Perl hacker  <#####@juerd.nl>  <http://juerd.nl/sig>
  Convolution:     ICT solutions and consultancy <sales@convolution.nl>
1;