
Re: on broken manpages, trolling, inconsistent implementation and the difficulty to fix bugs

From: Tom Christiansen
Date: May 19, 2008 23:08
Subject: Re: on broken manpages, trolling, inconsistent implementation and the difficulty to fix bugs
Message ID: 13963.1211263659@chthon

In his epistle of "Tue, 20 May 2008 06:56:10 +0200." 
    <20080520045610.GB16896@schmorp.de> 
Marc Lehmann <schmorp@schmorp.de> graciously explained:

> On Mon, May 19, 2008 at 06:54:39PM -0600, 
> Tom Christiansen <tchrist@perl.com> wrote:

> [Thanks for summarising all the possible fears :]

Oh, there might be more, you know.  Haven't thought much on it.  
Those were just the ones that came to mind, both the relevant 
and the ir-.

>>> part of the codebase) assuming this is just an internal flag,
>>> as originally designed, will kill perl in the long run.

>> The end of the world is near, eh?

> I meet a lot of people who would like to use Unicode in perl, but fail
> to do so because they run into the problems mentioned and claim it
> should be much easier (yes, it should, and certainly less random).

> But almost all of the issues they run into, *iff* they really want to
> use Unicode and are open to learning a bit about it first originate in
> the utf-8 flag that they cannot see in their perl sources yet that
> affects so many things.

I see you've been talking with Phil Harvey again. :-)

> Basically, if you don't have it set right, stuff breaks everywhere,
> whether it's perl core functions or XS modules, but the brakage [SIC:
> probably meant to read "breakage" unless one's foot hits the brakes
> instead of the accelerator --tchrist] is not universal (if it were, it
> would simply be a different model; the problem is the inconsistency).

But is it a foolish one for little minds to worry about, or a great
one for bigger minds to mull over?

I believe that Phil, for example, due perhaps to such things as you
allude to, tries quite hard to be Unicode-agnostic.  By that I mean he
insistently uses byte-interfaces only, even though he sometimes has to
encode or decode byte-data into Unicodepoints.

That said, I always feel there's something *WRONG* if I find
myself having to resort to Encode's encode or decode functions.
Can't quite say why.

>>> Regarding the perlunicode manpage, it is basically a helpless case.

>> "Helpless"?

>> More rhetorical hyperbole, I presume, since if it is incorrectly
>> worded, the road to helping it is obvious: just send patches.

> I certainly won't send patches if people tell me before I submit
> them that the current manpage is correct. I can waste my time in
> better ways.

Wise lesson, that.  However, it never stopped me from doing so.
It's like how all change to the world comes from unreasonable people.

> I would probably submit patches if the process to do so would be
> easier, and the first step would be an agreement of the existing perl5-
> porters on how strings are to be interpreted.

That might require further instruction so that we can all be on the
same, um, code page.

> Note I am not asking for agreement on how it *should* be done or what
> would be better, but an agreement on which semantics will be acceptable
> and which are not.

>>> This is completely untrue: in earlier releases, "use utf8/use
>>> bytes" switched between interpreting the strings as utf-8 vs.
>>> bytes, and did nothing about Unicode-awareness.

>> *That*, I think, is more a matter of casuistry than of correctness.

> Maybe, but I still expect *one* manpage to be consistent to itself - if it
> defines operation as one thing and contrasts it with some other meaning of
> operation then they better should be the same thing - if you compare, then
> apples to apples and oranges to oranges.

First, those are hardly category errors.  After all, would you not
agree that:

    * Both are fruit.
    * Both are juicy.
    * Both are often served juiced with breakfast.
    * Both start green and then usually fall somewhere into the red-orange
      area of the visible spectrum.
    * Both are usually of a similar size.
    * Both are of topologically equivalent shape.

So for more punch, you might sometimes consider trying the alternate 
aphorism of comparing apples with avarice, or oranges with oratory.

Just an idea. :-)

>>> Perl currently implements a model where encoding is *not*
>>> attached to perl scalars,

Second, I am in need of deeper understanding, or more sleep,
to see how my statement regarding casuistry does not apply.

>> Right: it's supposed to be attached to the I/O Layer alone.

> More correctly to the interfaces communicating with the outside world,
> which includes other things as well (for example filenames, or XS
> modules).

Well...

At one level, nearly all meaningful communication "with the outside
world" falls within the category of being I/O, with signals and exit
status being the most common exceptions.  And timing attacks don't
count. :-)

But I still think that you are asking a lot if you want to make the
claim that filenames as used to access the system's underlying files 
VIA ITS OWN INTERFACES are data rather than metadata.  And I don't think
that filesystem metadata is reliably treated as anything but bytes, at
least on systems with which I am conversant.

Sure, their contents are certainly data, but even that has its limits.
The restrictions under the BUGS section of the perl(1) manpage still
apply: you are for the most part at your system's mercy.  If it provides
byte-access seeks only, not variable-width utf-8 encoded positions, you
can't do much about that.  Well, not much that I'd care to do, at least.

> But the basic theme is indeed "I/O" here - one way to treat characters
> is to encode/decode them when they leave/enter perl and use Unicode
> semantics within.

That sounds sane.
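
As a minimal sketch of that decode-on-entry, encode-on-exit model, the
following uses I/O layers on in-memory filehandles.  The sample data and
the choice of UTF-8 as the external encoding are assumptions made only
for illustration.

```perl
use strict;
use warnings;

# Octets as they might arrive from outside: "café\n" encoded as UTF-8.
my $bytes_in = "caf\xC3\xA9\n";

# The input layer decodes while reading, so the program sees characters.
open(my $in, '<:encoding(UTF-8)', \$bytes_in) or die "open for read: $!";
my $line = <$in>;
close $in;
chomp $line;
my $char_count = length $line;    # counts characters, not octets

# Inside the program, operate on characters.
my $upper = uc $line;

# The output layer encodes while writing, so octets leave the program.
my $bytes_out = '';
open(my $out, '>:encoding(UTF-8)', \$bytes_out) or die "open for write: $!";
print {$out} $upper;
close $out;
my $byte_count = length $bytes_out;

print "$char_count characters in, $byte_count octets out\n";
```

The encoding is named only on the two open calls; nothing in between
needs to know which encoding the outside world uses.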

>>> and neither is *Unicodeness* attached to perl scalars.

>> And now you've lost me.

>> If that is universally true, then might you gently explain how 
>> an SV set to chr(500) always has its UTF8 flag turned on?

> Easy, it is the only way for perl to internally represent characters
> with a value of 500.

Of course.
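
For anyone who wants to watch the flag being discussed, here is a small
sketch using utf8::is_utf8, which reports the internal flag on a scalar.
The value 200 is my own choice, added as a contrast to the 500 above.

```perl
use strict;
use warnings;

my $low  = chr(200);    # fits in one octet
my $high = chr(500);    # cannot fit in one octet

# utf8::is_utf8 is available without loading the utf8 pragma.
my $low_flag  = utf8::is_utf8($low)  ? 1 : 0;
my $high_flag = utf8::is_utf8($high) ? 1 : 0;

printf "chr(200): length=%d flag=%d\n", length $low,  $low_flag;
printf "chr(500): length=%d flag=%d\n", length $high, $high_flag;
```

Both strings are one character long; only the second one needs the
flagged internal representation.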

> If that 500 is for example the second character in v5.500, then this
> might simply be the perl 5.500 version string (or part of an ip
> address in the game "uplink" stored in a compact string form, e.g.
> v478.321.571.277).

I'm a little bit queasy about v-strings, thank you very much.

> No Unicode anywhere in sight.

Did I say there was?

> Note also that I can have the Unicode character 32 stored with or
> without the UTF8 flag, which doesn't change the fact that it is
> still the Unicode character 32.

Oh, now that I'm not sure I agree with.  But I fear we may be back
to casuistry again.
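
The mechanical part of that claim, at least, is easy to check.  The
sketch below stores character 32 twice, once as-is and once after
utf8::upgrade, and compares the two.  It shows only what the
interpreter reports, and takes no side on what the flag ought to mean.

```perl
use strict;
use warnings;

my $plain    = chr(32);
my $upgraded = chr(32);
utf8::upgrade($upgraded);    # same character, flagged representation

my $plain_flag    = utf8::is_utf8($plain)    ? 1 : 0;
my $upgraded_flag = utf8::is_utf8($upgraded) ? 1 : 0;
my $equal         = $plain eq $upgraded      ? 1 : 0;

print "flags: $plain_flag vs $upgraded_flag, equal: $equal, ",
      "ord: ", ord($plain), " vs ", ord($upgraded), "\n";
```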

> This is the insane part - I wouldn't expect even an expert perl
> programmer to predict how $s gets interpreted here.

No, neither would I.
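
The example under discussion is not reproduced in this message.  As a
separate illustration of the same kind of surprise, and not necessarily
the case Marc had in mind, the sketch below takes two strings that
compare equal and asks whether each matches \w.  It assumes default
settings: no locale, no "unicode_strings" feature, and no explicit
regex modifiers.

```perl
use strict;
use warnings;

my $native   = "\xE9";    # e-acute as a single octet
my $upgraded = "\xE9";
utf8::upgrade($upgraded); # same character, flagged representation

my $same          = $native eq $upgraded ? 1 : 0;
my $native_word   = $native   =~ /\w/    ? 1 : 0;
my $upgraded_word = $upgraded =~ /\w/    ? 1 : 0;

print "equal: $same\n";
print "native matches \\w: $native_word\n";
print "upgraded matches \\w: $upgraded_word\n";
```

Under those assumptions the two strings are equal according to eq, yet
only the upgraded one is treated as a word character.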

> So, my "end of the world" in a more verbose way and less drastic,
> would be "as long as perl has this totally unpredictable rules on
> character interpretation it will not gain wide acceptance for
> Unicode usage".

>> Armed with that output, it seems to me that if you are correct,
>> then UTF8 is not "Unicodeness", but that if it is, then you are
>> not correct.

> I think the examples above made it clear that the UTF8
> flag is not "Unicodeness".

I think I'd better sign off.  Perhaps sleep will make your statement
obviously true to me.  It isn't now.

>>  * Start by explaining just what it is that you are calling
>>    Unicodeness.

> Not sure what you mean - Unicodeness is the state of being Unicode.

That's either a trivial tautology of no significance, or something
deeper than I can now fathom.

> More explicitly, a perl string contains Unicode characters when it
> contains Unicode characters

Now *THAT* belongs in the formerly mentioned set.

> - the interpreter itself does not know this currently, nor do I
>   see a way for it to do so except by forcing the user to make
>   this explicit.

Ever read in a string, or grabbed something from @ARGV or %ENV, that
you had to do this to:

    $num = oct($num) if $num =~ /^0/;

And if you did, did this "bother" you?
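
For readers who have not met that idiom: a string that arrives from
outside carries no marker saying "I am octal" or "I am hex", so the
program has to apply the interpretation itself.  A small sketch, where
to_number is a hypothetical helper name and the inputs are invented:

```perl
use strict;
use warnings;

# Hypothetical helper: convert text to a number, honouring a leading
# 0 (octal), 0x (hexadecimal) or 0b (binary) prefix via oct().
sub to_number {
    my ($text) = @_;
    return $text =~ /^0/ ? oct($text) : $text + 0;
}

my @results = map { to_number($_) } qw(42 0755 0x1F 0b101);
print "@results\n";    # prints "42 493 31 5"
```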

> For example, "open" (at least on Unix) cannot support Unicodeness,
> because the system interfaces do not allow for Unicode to be used -
> you have to encode it, and it is not clear which encoding is the
> right one.

> So open would be an operation that enforces octet semantics, because
> that's what the system interface relies on.

Well, there you have it then, don't you?
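
Concretely, a program that holds a filename as characters has to pick
an encoding and hand octets to open.  In the sketch below the choice of
UTF-8 is purely an assumption, which is exactly the unclear part, and
the filename is invented.  Behaviour may differ on filesystems that
normalize or restrict names.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);    # scratch directory, removed at exit

# Filename as characters: "résumé.txt", 10 characters.
my $name_chars = "r\x{E9}sum\x{E9}.txt";

# Assumption: the filesystem expects UTF-8.  Nothing tells us so.
my $name_bytes = encode('UTF-8', $name_chars);

open(my $fh, '>', "$dir/$name_bytes") or die "cannot create file: $!";
close $fh;

my $exists = -e "$dir/$name_bytes" ? 1 : 0;
printf "%d characters, %d octets, exists=%d\n",
    length $name_chars, length $name_bytes, $exists;
```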

Good night.

/* HIC JACENT VERBA DELETA */

>> Thank you.

> Nice to have you around again.

Oh sure, *now* you say that.  Just wait. :-)

Anyway, it's SUMMER, and this is a fluke; I shouldn't even be here.

--tom


