Re: on broken manpages, trolling, inconsistent implementation and the difficulty to fix bugs
From: Tom Christiansen
Date: May 19, 2008 23:08
Subject: Re: on broken manpages, trolling, inconsistent implementation and the difficulty to fix bugs
Message ID: 13963.1211263659@chthon
In his epistle of "Tue, 20 May 2008 06:56:10 +0200."
<20080520045610.GB16896@schmorp.de>
Marc Lehmann <schmorp@schmorp.de> graciously explained:
> On Mon, May 19, 2008 at 06:54:39PM -0600,
> Tom Christiansen <tchrist@perl.com> wrote:
> [Thanks for summarising all the possible fears :]
Oh, there might be more, you know. Haven't thought much on it.
Those were just the ones that came to mind, both the relevant
and the ir-.
>>> part of the codebase) assuming this is just an internal flag,
>>> as originally designed, will kill perl in the long run.
>> The end of the world is near, eh?
> I meet a lot of people who would like to use Unicode in perl, but fail
> to do so because they run into the problems mentioned and claim it
> should be much easier (yes, it should, and certainly less random).
> But almost all of the issues they run into, *iff* they really want to
> use Unicode and are open to learning a bit about it first, originate in
> the utf-8 flag that they cannot see in their perl sources yet that
> affects so many things.
I see you've been talking with Phil Harvey again. :-)
> Basically, if you don't have it set right, stuff breaks everywhere,
> whether it's perl core functions or XS modules, but the brakage [SIC:
> probably meant to read "breakage" unless one's foot hits the brakes
> instead of the accelerator --tchrist] is not universal (if it were, it
> would simply be a different model; the problem is the inconsistency).
But is it a foolish one for little minds to worry about, or a great
one for bigger minds to mull over?
I believe that Phil, for example, due perhaps to such things as you
allude to, tries quite hard to be Unicode-agnostic. By that I mean he
insistently uses byte-interfaces only, even though he sometimes has to
encode or decode byte-data into Unicodepoints.
That said, I always feel there's something *WRONG* if I find
myself having to resort to Encode's encode or decode functions.
Can't quite say why.
>>> Regarding the perlunicode manpage, it is basically a helpless case.
>> "Helpless"?
>> More rhetorical hyperbole, I presume, since if it is incorrectly
>> worded, the road to helping it is obvious: just send patches.
> I certainly won't send patches if people tell me before I submit
> them that the current manpage is correct. I can waste my time in
> better ways.
Wise lesson, that. However, it never stopped me from doing so.
It's like how all change to the world comes from unreasonable people.
> I would probably submit patches if the process to do so would be
> easier, and the first step would be an agreement of the existing perl5-
> porters on how strings are to be interpreted.
That might require further instruction so that we can all be on the
same, um, code page.
> Note I am not asking for agreement on how it *should* be done or what
> would be better, but an agreement on which semantics will be acceptable
> and which are not.
>>> This is completely untrue: in earlier releases, "use utf8/use
>>> bytes" switched between interpreting the strings as utf-8 vs.
>>> bytes, and did nothing about Unicode-awareness.
>> *That*, I think, is more a matter of casuistry than of correctness.
> Maybe, but I still expect *one* manpage to be consistent with itself - if it
> defines operation as one thing and contrasts it with some other meaning of
> operation then they had better be the same thing - if you compare, then
> apples to apples and oranges to oranges.
First, those are hardly category errors. After all, would you not
agree that:
* Both are fruit.
* Both are juicy.
* Both are often served juiced with breakfast.
* Both start green and then usually fall somewhere into the red-orange
  area of the visible spectrum.
* Both are usually of a similar size.
* Both are of topologically equivalent shape.
So for more punch, you might sometimes consider trying the alternate
aphorism of comparing apples with avarice, or oranges with oratory.
Just an idea. :-)
>>> Perl currently implements a model where encoding is *not*
>>> attached to perl scalars,
Second, I am in need of deeper understanding, or more sleep,
to see how my statement regarding casuistry does not apply.
>> Right: it's supposed to be attached to the I/O Layer alone.
> More correctly to the interfaces communicating with the outside world,
> which includes other things as well (for example filenames, or XS
> modules).
Well...
At one level, nearly all meaningful communication "with the outside
world" falls within the category of being I/O, with signals and exit
status being the most common exceptions. And timing attacks don't
count. :-)
But I still think that you are asking a lot if you want to make the
claim that filenames as used to access the system's underlying files
VIA ITS OWN INTERFACES are data rather than metadata. And I don't think
that filesystem metadata is reliably treated as anything but bytes, at
least on systems with which I am conversant.
Sure, their contents are certainly data, but even that has its limits.
The restrictions under the BUGS section of the perl(1) manpage still
apply: you are for the most part at your system's mercy. If it provides
byte-access seeks only, not variable-width utf-8 encoded positions, you
can't do much about that. Well, not much that I'd care to do, at least.
> But the basic theme is indeed "I/O" here - one way to treat characters
> is to encode/decode them when they leave/enter perl and use Unicode
> semantics within.
That sounds sane.
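
A minimal sketch of that decode-on-entry, encode-on-exit pattern, using the Encode module's decode and encode functions. The sample bytes and the choice of UTF-8 as the external encoding are assumptions made for illustration; nothing in the thread fixes either.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Octets as they might arrive from outside perl: "café" encoded as UTF-8.
# (Sample data chosen for illustration only.)
my $bytes_in = "caf\xC3\xA9";

# Decode at the boundary: octets in, characters out.
my $chars = decode('UTF-8', $bytes_in);

# Inside perl, operate on characters rather than octets.
my $char_count = length $chars;      # 4 characters
my $byte_count = length $bytes_in;   # 5 octets
my $shouted    = uc $chars;          # case-mapping applied to characters

# Encode again at the boundary on the way out.
my $bytes_out = encode('UTF-8', $shouted);

printf "%d characters, %d octets in, %d octets out\n",
    $char_count, $byte_count, length $bytes_out;
```

The point of the sketch is only that the two lengths differ: the octet string and the character string are different values, and the conversion happens at the edge rather than in the middle of the program.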
>>> and neither is *Unicodeness* attached to perl scalars.
>> And now you've lost me.
>> If that is universally true, then might you gently explain how
>> an SV set to chr(500) always has its UTF8 flag turned on?
> Easy, it is the only way for perl to internally represent characters
> with a value of 500.
Of course.
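
For anyone who wants to see the flag being discussed, here is a small sketch using utf8::is_utf8. The comparison value chr(200) is an addition for contrast and assumes no `use encoding`, locale, or similar pragma is in effect.

```perl
use strict;
use warnings;

my $low  = chr(200);   # code point fits in a single byte
my $high = chr(500);   # code point does not fit in a single byte

# utf8::is_utf8 reports the internal UTF8 flag of a scalar.
my $low_flag  = utf8::is_utf8($low)  ? 1 : 0;
my $high_flag = utf8::is_utf8($high) ? 1 : 0;

printf "chr(200): flag=%d length=%d\n", $low_flag,  length $low;
printf "chr(500): flag=%d length=%d\n", $high_flag, length $high;
```

Both strings have length 1; only the internal representation differs, which is the sense in which the flag is forced for chr(500).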
> If that 500 is for example the second character in v5.500, then this
> might simply be the perl 5.500 version string (or part of an ip
> address in the game "uplink" stored in a compact string form, e.g.
> v478.321.571.277).
I'm a little bit queasy about v-strings, thank you very much.
> No Unicode anywhere in sight.
Did I say there was?
> Note also that I can have the Unicode character 32 stored with or
> without the UTF8 flag, which doesn't change the fact that it is
> still the Unicode character 32.
Oh, now that I'm not sure I agree with. But I fear we may be back
to casuistry again.
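
Whatever one concludes about the casuistry, what perl itself reports can be checked. This sketch stores character 32 twice, once as created and once after utf8::upgrade, and compares the two; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $plain    = chr(32);   # a space, stored without the UTF8 flag
my $upgraded = chr(32);   # the same character ...
utf8::upgrade($upgraded); # ... with the UTF8 flag switched on

my $plain_flag    = utf8::is_utf8($plain)    ? 1 : 0;
my $upgraded_flag = utf8::is_utf8($upgraded) ? 1 : 0;

# String comparison looks at characters, not at the internal flag.
my $same = ($plain eq $upgraded) ? 1 : 0;

print "flags: $plain_flag vs $upgraded_flag, eq: $same\n";
```

This shows only that `eq` treats the two scalars as the same string while the flags differ; it does not settle whether either one "is" Unicode.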
> This is the insane part - I wouldn't expect even an expert perl
> programmer to predict how $s gets interpreted here.
No, neither would I.
> So, my "end of the world" in a more verbose way and less drastic,
> would be "as long as perl has these totally unpredictable rules on
> character interpretation it will not gain wide acceptance for
> Unicode usage".
>> Armed with that output, it seems to me that if you are correct,
>> then UTF8 is not "Unicodeness", but that if it is, then you are
>> not correct.
> I think the examples above made it clear that the UTF8
> flag is not "Unicodeness".
I think I'd better sign off. Perhaps sleep will make your statement
obviously true to me. It isn't now.
>> * Start by explaining just what it is that you are calling
>> Unicodeness.
> Not sure what you mean - Unicodeness is the state of being Unicode.
That's either a trivial tautology of no significance, or something
deeper than I can now fathom.
> More explicitly, a perl string contains Unicode characters when it
> contains Unicode characters
Now *THAT* belongs in the formerly mentioned set.
> - the interpreter itself does not know this currently, nor do I
> see a way for it to do so except by forcing the user to make
> this explicit.
Ever read in a string, or grabbed something from @ARGV or %ENV, that
you had to do this to:
    $num = oct($num) if $num =~ /^0/;
And if you did, did this "bother" you?
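
For reference, a short runnable sketch of that idiom with plain assignment. The sample inputs are made up; the point is that the string alone does not say whether it is octal, hex, or decimal until the programmer makes the interpretation explicit.

```perl
use strict;
use warnings;

# Hypothetical inputs, as they might arrive via @ARGV or %ENV.
my @inputs = ('0755', '0x1F', '42');

my @values;
for my $input (@inputs) {
    my $num = $input;
    # A leading zero marks octal (or 0x hex, 0b binary), which oct() converts.
    $num = oct($num) if $num =~ /^0/;
    push @values, $num;
    print "$input -> $num\n";
}
```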
> For example, "open" (at least on Unix) cannot support Unicodeness,
> because the system interfaces do not allow for Unicode to be used -
> you have to encode it, and it is not clear which encoding is the
> right one.
> So open would be an operation that enforces octet semantics, because
> that's what the system interface relies on.
Well, there you have it then, don't you?
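
A hedged sketch of what "you have to encode it" can look like in practice on a Unix-like system. It assumes the filesystem convention is UTF-8, which is exactly the part that is "not clear" in general; the filename and the temporary directory are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempdir);

# Scratch directory so the sketch leaves nothing behind.
my $dir = tempdir(CLEANUP => 1);

# A filename held as characters, including one above 255.
my $name = "r\x{E9}sum\x{E9}-\x{1F4}.txt";

# ASSUMPTION: this system expects UTF-8 encoded filenames.
# Encode only the character part; $dir is already octets.
my $path = "$dir/" . encode('UTF-8', $name);

open(my $fh, '>', $path) or die "cannot open: $!";
print {$fh} "hello\n";
close($fh) or die "cannot close: $!";

print "created ", (-e $path ? "ok" : "nothing"), "\n";
```

The explicit encode step is the caller choosing the octets that go to the system interface, rather than leaving the choice to perl's internal representation.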
Good night.
/* HIC JACENT VERBA DELETA */
>> Thank you.
> Nice to have you around again.
Oh sure, *now* you say that. Just wait. :-)
Anyway, it's SUMMER, and this is a fluke; I shouldn't even be here.
--tom