Re: on broken manpages, trolling, inconsistent implementation and the difficulty to fix bugs

From: demerphq
Date: May 20, 2008 09:03
Subject: Re: on broken manpages, trolling, inconsistent implementation and the difficulty to fix bugs
Message ID: 9b18b3110805200903k35ae8d6dp6dec9d8f6822f830@mail.gmail.com

2008/5/20 Juerd Waalboer <juerd@convolution.nl>:
> demerphq skribis 2008-05-20 14:10 (+0200):
>> Where this gets confusing is that Perl does in fact assume Latin-1
>> semantics for its octet based strings in a number of common cases,
>
> I think you mean "ASCII semantics" there. In these cases, the second
> half of latin1 is ignored and left alone.

Yeah, something like that. English (ASCII-only) capitalization rules as
applied to latin1 data.

> Latin1 was (re)defined as a Unicode encoding in 1998, which means that
> 0xE9 is no longer just something that looks like é, but defined as
> U+00E9 LATIN SMALL LETTER E WITH ACUTE. This does, of course, have the
> implication that the letter now has an uppercase variant in U+00C9 LATIN
> CAPITAL LETTER E WITH ACUTE which is encoded as the 0xC9 byte in latin1.
> Perl ignores this part of the specification, and that's why I think it's
> incorrect to call what Perl does "Latin-1 semantics".
>
> In fact, latin1 semantics are pretty hard to describe because uc("\xff")
> (\xFF is U+00FF LATIN SMALL LETTER Y WITH DIAERESIS) cannot be expressed
> in latin1, because the uppercase of U+00FF is U+0178 which has no
> representation in latin1.

Aren't charset encoding issues fun?
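
A minimal sketch of the case Juerd describes, assuming neither "use
locale" nor the later unicode_strings feature is in effect. The helper
name hex_codes is made up for the illustration. The same two characters
are uppercased once with SvUTF8 off and once with it on:

```perl
use strict;
use warnings;

# Assumption: no "use locale" and no "unicode_strings" feature in effect.
my $octets = "\xE9\xFF";        # SvUTF8 off: U+00E9 and U+00FF as latin1 octets

my $upgraded = $octets;
utf8::upgrade($upgraded);       # same two characters, SvUTF8 on

# Hypothetical helper: code points of a string as space-separated hex.
sub hex_codes {
    my ($string) = @_;
    return join ' ', map { sprintf('%X', ord($_)) } split //, $string;
}

my $uc_octets   = hex_codes(uc $octets);
my $uc_upgraded = hex_codes(uc $upgraded);

print "equal before uc: ", ($octets eq $upgraded ? "yes" : "no"), "\n";
print "uc, flag off:    $uc_octets\n";
print "uc, flag on:     $uc_upgraded\n";
```

With the flag off, uc leaves both octets alone (ASCII rules). With the
flag on, the result is U+00C9 followed by U+0178, and the second of
those has no representation in latin1.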

>> I think Marc is right, the utf8 flag being off doesn't say "this data
>> is latin1" and the utf8 flag being on doesn't say "this data is
>> Unicode". The flag instead says (when off) "this is array of
>> characters" or "this is an array of integers encoded as utf8" (when
>> on).
>
> You're making a distinction between "characters" (SvUTF8 off) and
> "integers" (SvUTF8 on) that I don't understand. Could you explain why
> there is a difference and what that is?

Sorry, I slipped into C-speak there. I meant to say that the SvUTF8
flag just tells us whether we have an array of octets (values 0..255)
or a "stream" of integers (0..N, for some large and hypothetically
unbounded N). I say "stream" here because it's not really an array at
that point: the integers are stored as variable-length utf8 sequences,
so you can't index into them directly.
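
To make the octets-versus-integers distinction concrete, here is a small
sketch. It assumes the core utf8::is_utf8 function and bytes::length
(loaded via "use bytes ()") as ways to peek at the flag and at the
storage size; the variable names are illustrative only.

```perl
use strict;
use warnings;
use bytes ();   # load bytes::length without enabling byte semantics

my $octets   = "\xE9";            # flag off: one octet, value in 0..255
my $integers = "\xE9\x{263A}";    # flag on: two integers, one above 255

for my $s ($octets, $integers) {
    printf "SvUTF8=%d chars=%d stored_bytes=%d values=%s\n",
        utf8::is_utf8($s) ? 1 : 0,
        length($s),
        bytes::length($s),
        join ',', map { ord($_) } split //, $s;
}
```

The second string holds two integers but occupies five bytes of storage,
which is why it behaves like a stream rather than an indexable array.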

>
>> Latin-1 is a character set.
>
> Latin-1 is both a character set and an encoding. The character set is
> defined as equal to the first 256 characters in Unicode (U+0000 ..
> U+00FF), and the encoding is defined as a straightforward 8 bit
> encoding: U+0000 => 0x00 .. U+00FF => 0xFF. They even went as far as
> describing how the individual bits are to be laid out in the byte. Not
> surprisingly, the 8 bits have weights from 128 to 1, where each
> subsequent bit is half the value of the one before it :)
>
> The specification uses the term "coded representation" rather than
> "encoding".

OK, fine. It specifies both. Latin-1 was probably a bad example. I
meant that if you had some arbitrary stream of octets, you could encode
those octets as utf8 without losing information. But you wouldn't
(except in the case of latin1) have converted it to Unicode.
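
A sketch of that round trip, using the eight-octet PNG signature purely
as sample binary data (any octets would do). It assumes utf8::upgrade
and utf8::downgrade as the way to switch the storage format:

```perl
use strict;
use warnings;
use bytes ();

my $octets = "\x89PNG\x0D\x0A\x1A\x0A";   # arbitrary binary octets, not text

my $stored_as_utf8 = $octets;
utf8::upgrade($stored_as_utf8);           # re-encode the storage as utf8

# Same values as far as Perl code can tell, but the storage grew.
my $same_values = $stored_as_utf8 eq $octets;
my $grew        = bytes::length($stored_as_utf8) - bytes::length($octets);

my $round_trip = $stored_as_utf8;
utf8::downgrade($round_trip);             # back to one octet per value
my $lossless = !utf8::is_utf8($round_trip) && $round_trip eq $octets;

print "same values: ", ($same_values ? 1 : 0), "\n";
print "extra storage bytes: $grew\n";
print "lossless round trip: ", ($lossless ? 1 : 0), "\n";
```

Nothing is lost, yet at no point did the data become Unicode text: the
utf8 storage is just a different container for the same integers.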

>> The issues i see are this:
>> 1. We don't have a binary data type.
>
> I intend to release a module that handles this in Perl space in a way
> that is backward compatible to 5.000. Its name is BLOB.
>
> One thing that it doesn't do, is avoid concatenation with non-BLOBs. I'd
> like to learn if this can be done at all.

Sounds interesting.
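
For reference, this is the concatenation problem in plain Perl. The
sketch does not use the BLOB module (its interface isn't shown in this
thread); it only illustrates, under that caveat, what happens today when
raw octets meet a flagged string:

```perl
use strict;
use warnings;

my $binary = "\xFF\xD8";      # two raw octets, flag off
my $text   = "\x{263A}";      # character data, flag on

my $joined = $binary . $text; # no warning; the octets are silently upgraded

my $wire = $joined;
utf8::encode($wire);          # the bytes a UTF-8 output layer would emit

printf "joined: %d chars, SvUTF8=%d\n",
    length($joined), utf8::is_utf8($joined) ? 1 : 0;
printf "wire:   %s\n",
    join ' ', map { sprintf('%02X', ord($_)) } split //, $wire;
```

The two original octets come out as four bytes (C3 BF C3 98), so the
binary data is no longer intact on the wire.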

>
>> 3. We use the name of an encoding of Unicode as the name for the
>> encoding of a string, causing confusion.
>
> Indeed. Maybe it would be wise to start calling the internal
> representation SvUTF8 encoding, rather than UTF8 encoding. Or maybe a
> wholly different name.
>
>> Maybe by making PV's store more information about their character set.
>
> The Encode suite treats character sets as properties of encodings;

Given how perl works internally, does Encode have any other choice?

> the user only has to deal with a single character set, namely Unicode.

Except, er, they don't. As we've been discussing for ages now.

> I think that's the only sane approach. Information about the
> charset/encoding does not have to be in the string, but belongs to
> operations as Marc aptly describes in the first post carrying this subject.

I don't really get you. If you don't know what type of data is
contained in a string, how can you know what the correct behaviour is
for a given operation on it?
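
A sketch of what is at stake, assuming the Encode module's decode
function; the helper name code_points is made up for the illustration.
The same two octets yield different characters depending on which
encoding the operation is told to use, and nothing in the string itself
says which one is right:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $octets = "\xC3\xA9";   # two octets with no inherent meaning

# Hypothetical helper: decode with the given encoding, list the code points.
sub code_points {
    my ($encoding, $bytes) = @_;             # $bytes is a private copy
    my $chars = decode($encoding, $bytes);
    return join ' ', map { sprintf('U+%04X', ord($_)) } split //, $chars;
}

my $as_latin1 = code_points('ISO-8859-1', $octets);
my $as_utf8   = code_points('UTF-8',      $octets);

print "as ISO-8859-1: $as_latin1\n";
print "as UTF-8:      $as_utf8\n";
```

Read as ISO-8859-1 the octets are two characters (U+00C3 U+00A9); read
as UTF-8 they are one (U+00E9). The caller has to supply that knowledge
from somewhere.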

Yves


-- 
perl -Mre=debug -e "/just|another|perl|hacker/"


