
Re: on broken manpages, trolling, inconsistent implementation and the difficulty to fix bugs

From: Marc Lehmann
Date: May 19, 2008 21:56
Subject: Re: on broken manpages, trolling, inconsistent implementation and the difficulty to fix bugs
Message ID: 20080520045610.GB16896@schmorp.de

On Mon, May 19, 2008 at 06:54:39PM -0600, Tom Christiansen <tchrist@perl.com> wrote:
[Thanks for summarising all the possible fears :]

> > part of the codebase) assuming this is just an internal flag,
> > as originally designed, will kill perl in the long run.
> 
> The end of the world is near, eh?

I meet a lot of people who would like to use unicode in perl, but fail to do
so because they run into the problems mentioned and claim it should be much
easier (yes, it should, and certainly less random).

But almost all of the issues they run into, *iff* they really want to use
unicode and are open to learning a bit about it first, originate in the utf-8
flag, which they cannot see in their perl sources yet which affects so many
things.

Basically, if you don't have it set right, stuff breaks everywhere, whether
it's perl core functions or xs modules, but the breakage is not universal
(if it were, it would simply be a different model; the problem is the
inconsistency).

> > Regarding the perlunicode manpage, it is basically a 
> > helpless case. 
> 
> "Helpless"?
> 
> More rhetorical hyperbole, I presume, since if it is incorrectly
> worded, the road to helping it is obvious: just send patches.

I certainly won't send patches if people tell me before I submit them that
the current manpage is correct. I can waste my time in better ways.

I would probably submit patches if the process to do so were easier, and
the first step would be an agreement among the existing perl5-porters on how
strings are to be interpreted.

Note I am not asking for agreement on how it *should* be done or what
would be better, but an agreement on which semantics will be acceptable
and which are not.

> > This is completely untrue: in earlier releases, "use utf8/use
> > bytes" switched between interpreting the strings as utf-8 vs.
> > bytes, and did nothing about Unicode-awareness.
> 
> *That*, I think, is more a matter of casuistry than of correctness.

Maybe, but I still expect *one* manpage to be consistent with itself - if it
defines operation as one thing and contrasts it with some other meaning of
operation, then they had better be the same thing - if you compare, then
apples to apples and oranges to oranges.

> > Perl currently implements a model where encoding is *not*
> > attached to perl scalars,
> 
> Right: it's supposed to be attached to the I/O Layer alone.

More correctly to the interfaces communicating with the outside world,
which includes other things as well (for example filenames, or XS
modules).

But the basic theme is indeed "I/O" here - one way to treat characters is
to encode/decode them when they leave/enter perl and use unicode semantics
within.
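
A minimal sketch of that "decode on entry, encode on exit" discipline, using
the core Encode module. The sample octets and variable names are made up for
illustration, and this is only one way to arrange it:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Octets as they might arrive from a file or socket: "für" encoded as UTF-8.
my $octets_in = "f\xC3\xBCr";

# Decode on entry: from here on the string holds characters, not octets.
my $text       = decode('UTF-8', $octets_in);
my $char_count = length $text;    # 3 characters; the input had 4 octets

# Work with character semantics inside the program.
my $shouted = uc $text;

# Encode on exit: turn the characters back into octets for the outside world.
my $octets_out = encode('UTF-8', $shouted);

printf "%d characters in, %d octets out\n", $char_count, length $octets_out;
```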

> > and neither is *unicodeness* attached to perl scalars.
> 
> And now you've lost me.
> 
> If that is universally true, then might you gently explain how 
> an SV set to chr(500) always has its UTF8 flag turned on?

Easy, it is the only way for perl to internally represent characters with a
value of 500.

If that 500 is for example the second character in v5.500, then this might
simply be the perl 5.500 version string (or part of an ip address in the
game "uplink" stored in a compact string form, e.g. v478.321.571.277).

No unicode anywhere in sight.

Note also that I can have the unicode character 32 stored with or without the
UTF8 flag, which doesn't change the fact that it is still the unicode
character 32.
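
Both points can be checked with the built-in utf8:: helper functions. The
following is a rough sketch only; the variable names are arbitrary:

```perl
use strict;
use warnings;

# A version string with an element above 255: perl has to store it in its
# internal UTF-8 form, although nothing about it is unicode text.
my $v        = v5.500;
my $v_flag   = utf8::is_utf8($v) ? 1 : 0;
my $v_second = ord(substr($v, 1, 1));    # the element 500

# Character 32 (a space), once without and once with the flag.
my $plain    = chr 32;
my $upgraded = chr 32;
utf8::upgrade($upgraded);

my $plain_flag    = utf8::is_utf8($plain)    ? 1 : 0;
my $upgraded_flag = utf8::is_utf8($upgraded) ? 1 : 0;
my $same          = $plain eq $upgraded      ? 1 : 0;

print "v5.500: flag=$v_flag second element=$v_second\n";
print "chr 32: flags $plain_flag/$upgraded_flag, equal=$same\n";
```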

Summary: the UTF8 flag says very little about whether the string contains
unicode or not. However, when I do this:

   my $s = v5.500;
   $s =~ /ü/;

Then one could expect that $s indeed contains a unicode string, because =~
forces the interpretation of the string to "characters" and in this very case
to "unicode".

Of course, this gets you in trouble:

   my $s = chr 200; # not unicode, but native 8-bit(??)
   substr $s, 0, 0, chr 500;
   $s =~ /ü/; # now interpreted as unicode

This is the insane part - I wouldn't expect even an expert perl programmer
to predict how $s gets interpreted here.
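
To make the flip observable without depending on how the source file itself
is encoded, the sketch below uses \w as a stand-in for "is this a letter?".
It assumes perl's default regex semantics, that is, no "use locale", no
unicode_strings feature and no explicit /u or /a modifier. Under those
assumptions the same character 200 should fail to match before the substr
and match after it:

```perl
use strict;
use warnings;

# chr 200 would be a capital E with grave accent if read as Latin-1/unicode.
my $s      = chr 200;
my $before = $s =~ /\w$/ ? 1 : 0;    # 8-bit rules: 200 is not a word character

# Prepending a character above 255 upgrades the whole string internally.
substr $s, 0, 0, chr 500;
my $after = $s =~ /\w$/ ? 1 : 0;     # unicode rules: 200 is now a letter

print "before: $before, after: $after\n";
```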

So, my "end of the world", put in a more verbose and less drastic way, would
be "as long as perl has these totally unpredictable rules on character
interpretation, it will not gain wide acceptance for unicode usage".

> Armed with that output, it seems to me that if you are correct,
> then UTF8 is not "unicodeness", but that if it is, then you are
> not correct.

I think the examples above made it clear that the UTF8 flag is not
"unicodeness".

A different example you might like even more :) is this: mark some filehandle
as utf-8 encoded, then print downgraded and upgraded data to it. In both
cases it will be interpreted as unicode, so the utf-8 flag again is no
indicator for "unicodeness".
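
A small sketch of that experiment, writing to an in-memory filehandle rather
than a real file so it has no side effects. The :encoding(UTF-8) layer and
the sample character 233 are choices made for this illustration:

```perl
use strict;
use warnings;

my $downgraded = chr 233;      # one character, stored as a single octet
my $upgraded   = chr 233;
utf8::upgrade($upgraded);      # same character, stored in internal UTF-8 form

my %written;
for my $case ([down => $downgraded], [up => $upgraded]) {
    my ($name, $string) = @$case;
    my $buffer = '';
    open my $fh, '>:encoding(UTF-8)', \$buffer or die "open failed: $!";
    print {$fh} $string;
    close $fh or die "close failed: $!";
    $written{$name} = $buffer;
}

# Both writes are treated as the character 233 and produce the same octets.
my $identical = $written{down} eq $written{up} ? 1 : 0;
print "identical output: $identical\n";
```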

> I politely request that you kindly explain three things:
> 
>  * Start by explain just what it is that you are calling unicodeness.

Not sure what you mean - unicodeness is the state of being unicode.

More explicitly, a perl string contains unicode characters when it contains
unicode characters - the interpreter itself does not know this currently, nor
do I see a way for it to do so except by forcing the user to make this
explicit.

As for operations, "unicodeness" would be how certain operations would
interpret character data.

For example, "open" (at least on unix) cannot support unicodeness, because
the system interfaces do not allow for unicode to be used - you have to
encode it, and it is not clear which encoding is the right one.

So open would be an operation that enforces octet semantics, because that's
what the system interface relies on.

A clearer example would be crypt: crypt has to force octet semantics
because the C interface is only defined in terms of octets (i.e. the
salt).
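
As a rough illustration of what forcing octet semantics looks like from the
caller's side: the sketch below assumes a platform whose crypt(3) accepts a
traditional two-character salt (the salt 'ab' and the sample strings are
arbitrary). The actual hash values are platform-dependent, so only their
equality is compared:

```perl
use strict;
use warnings;

my $salt = 'ab';    # arbitrary two-character salt, an assumption of this sketch

my $downgraded = "caf" . chr 233;
my $upgraded   = $downgraded;
utf8::upgrade($upgraded);

# Both forms hold the same characters, all below 256, so crypt can be handed
# the same octets either way.
my $same_hash = crypt($downgraded, $salt) eq crypt($upgraded, $salt) ? 1 : 0;

# A character above 255 has no octet form, so crypt refuses the string.
my $error = '';
eval { my $hash = crypt("caf" . chr 500, $salt); 1 } or $error = $@;

print "same hash: $same_hash\n";
print "error for wide character: $error";
```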

regex matching would also by default (maybe) apply unicode semantics, but it
would be somewhat important to be able to apply locale interpretation to it.

>  * Now explain the presence and purpose of the UTF8 flag on an SV.  

The UTF8 flag on an SV specifies whether the codepoints stored in the
string are stored in octet form (only possible if all are <256 of course)
or in perl's variant of utf-8 (which is very similar to, but not the same as,
the utf-8 defined by the unicode consortium for example).

To put it differently, the UTF8 flag only states how the character values are
encoded internally. It does not say anything about whether the string contains
unicode data or not.
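
One way to look at the internal storage is bytes::length, which reports the
number of octets perl uses internally, next to the ordinary length, which
counts characters. This is a sketch for inspection only; the bytes module is
loaded without enabling the pragma:

```perl
use strict;
use warnings;
use bytes ();    # load only, for bytes::length; does not enable the pragma

my $s = chr 233;    # one character with the value 233

# Octet form: one character, one octet of storage.
my @downgraded = (length($s), bytes::length($s));

utf8::upgrade($s);

# Internal UTF-8 form: still one character, now two octets of storage.
my @upgraded = (length($s), bytes::length($s));

my $value = ord $s;    # the character value is unaffected by the upgrade

print "downgraded: @downgraded, upgraded: @upgraded, value: $value\n";
```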

(This is what is mostly implemented right now in perl, regexes are the
notable exception).

>  * Finally, please demonstrate, especially in light of my Dump 
>    above and the two answers you just gave, how your last-quoted
>    statement can in its second half be deemed all of reasonable,
>    accurate, and correct.

I think I demonstrated that in my examples already. It is me, the
programmer, who defines which strings contain unicode and which do not.

And in a lot of important cases, this unicodeness of a string only has a very
superficial correspondence to the UTF8 flag - they are not totally
independent. For example, if all my unicode data happens to consist of
latin1 characters that I store as such, then perl *might*, using wondrous
optimisations I don't care about as long as it is fast, never hit a string
with the UTF8 flag set.

On the other hand, if I store my unicode data directly as codepoints, and
some of those happen to be >255, then the corresponding scalar will have the
UTF8 flag set. The converse is not true, however: not every scalar having the
UTF8 flag set contains unicode characters.

And lastly, if I store my unicode data in utf-8, then it would still be
unicode data. It would be reasonable to call a string containing utf-8 data
a unicode string (encoded, however), in which case, again, the UTF8 flag
could be set or not (usually not), which has nothing to say about the
unicodeness of the data stored in the string.
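
A short sketch of that last case, using the core Encode module. The sample
codepoint U+263A and the variable names are arbitrary choices for this
illustration:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $chars  = "\x{263A}";                 # one unicode codepoint, stored directly
my $octets = encode('UTF-8', $chars);    # the same data, utf-8 encoded

my $chars_flag  = utf8::is_utf8($chars)  ? 1 : 0;    # set, the value is above 255
my $octets_flag = utf8::is_utf8($octets) ? 1 : 0;    # usually not set

# The encoded string can be upgraded without changing what it contains.
my $octets_up = $octets;
utf8::upgrade($octets_up);
my $up_flag    = utf8::is_utf8($octets_up) ? 1 : 0;
my $still_same = $octets eq $octets_up     ? 1 : 0;

# Decoding recovers the original codepoint.
my $round_trip = decode('UTF-8', $octets) eq $chars ? 1 : 0;

print "flags: chars=$chars_flag octets=$octets_flag upgraded=$up_flag\n";
print "lengths: ", length($chars), " vs ", length($octets), "\n";
```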

> Thank you.

Nice to have you around again.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      pcg@goof.com
      -=====/_/_//_/\_,_/ /_/\_\


