
Re: on broken manpages, trolling, inconsistent implementation and the difficulty to fix bugs

Tom Christiansen
May 19, 2008 17:55
On "Mon, 19 May 2008 22:03:29 +0200.", Marc Lehmann <>
in <> flamed:

> In case you didn't notice, perl has an extremely bad reputation
> for Unicode handling: most users fear Unicode, because it is so
> complicated in perl.


> The reason it is so complicated is because there are so many bugs, and
> it takes insanely long discussions to fix those bugs.

"Insanely long"?  I'll charitably take that as rhetorical hyperbole rather
than some statement of your mental health--and invite correction should the
contrary case apply.

I shouldn't say that *that* is at all the common concern of those who "fear"
using Unicode.  Perhaps for XS writers it may be, but my experience has
been that users have other concerns than what you've just stated.  Instead,
one or more of these four problems dominate, listed from most to least
pressing, though all are important.

    (0) Knowing *what* all Perl documentation they should be reading about
        Unicode, doing that reading, and then making the last bit of sense
        out of what they've just read.

        Why?  Because they can't tell what is and is not important, or even
        applicable.  Part of this problem is the density of information as
        presented, part derives from the important information unevenly
        scattered across many documents, and, loth though I am to say this,
        there may also be a language-barrier creating interference between
        reader and writer.

    (1) Understanding I/O Layers and encodings:

        eg:  encoding vs Encode, ::via::, binmode(), -C,
             envariables, triadic open, etc

    (2) The troubles of getting Unicodish action on codepoints
        in the U+0080 .. U+00FF range.

        eg:  % perl -E 'say chr(0xdf)'
             % perl -E 'say ucfirst chr(0xdf)'
             % perl -E 'use utf8; say ucfirst chr(0xdf)'
             % perl -E 'use encoding "latin1"; say ucfirst chr(0xdf)'

    (3) Difficulties comparing and matching Unicode data that
        hasn't been first laundered, er, normalized, into a
        standard canonical form, generally due to combining
        characters, ordering, and pre-combined characters.
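
A minimal sketch of the comparison problem in item (3), assuming the
Unicode::Normalize module that ships with Perl.  The variable names are
illustrative only and do not come from the thread.  It compares two
spellings of the same accented letter, first raw and then after
normalizing both sides to the same form:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Two spellings of the same visible character (illustrative names):
my $precomposed = "\x{E9}";     # U+00E9 LATIN SMALL LETTER E WITH ACUTE
my $decomposed  = "e\x{301}";   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

# A raw comparison looks at code points, so the two strings differ.
my $raw_equal = $precomposed eq $decomposed ? 1 : 0;

# Normalizing both sides to one form makes them compare equal.
my $nfc_equal = NFC($precomposed) eq NFC($decomposed) ? 1 : 0;
my $nfd_equal = NFD($precomposed) eq NFD($decomposed) ? 1 : 0;

printf "raw: %d  NFC: %d  NFD: %d\n", $raw_equal, $nfc_equal, $nfd_equal;
```

Whether NFC or NFD is the better choice depends on the application; what
matters for comparison is only that both operands use the same form.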

Issues I don't number as lying within the province of legit Perl-
related Unicode troubles, but which certainly do occur, include:

    * font troubles
    * troubles with native system support  (LC_* envars)
    * problems finding, learning, and using Unicode-aware
      editors & tools
    * confusing I18N issues with those of G11N

Some fears may come from these, but there's not much we can 
do about almost any of them.

Most frequent of all is that they've simply never been
consciously exposed to Unicode, whether at all or whether in a
sufficiently intimate fashion as to get their fingers dirty.

Once these things we can't do anything about (native system
stuff) are discounted, almost always I find any Unicode phobia is
because they are at least one, and usually all, of these things:

    * monoglot speakers of English alone

    * not usually all that educated in the Humanities
      and/or of limited travel experience

    * overly accustomed to the very impoverished character
      repertoire found in 7-bit ASCII codes, mistakenly
      believing it sufficient for writing even English correctly

Such people therefore feel no need to "bother" learning about
Unicode, whether in Perl or anywhere.  So *their* fears, if any,
might be more closely related to ethnocentrism, xenophobia, or
neophobia than to what you've described or which I've myself listed.

Anybody who fears Unicode because of Perl's internals is clearly
in the minority of an already-minority set.  I mean, come on,
many people fear Perl's externals, and most fear its internals--
and that's not even letting Unicode into the picture yet.

And while there *are* people troubled by UTF8 != "utf-8", I
suspect there to be next to none such outside this list--and 
even within it, still very few.

> Having part of the codebase assuming that the utf-8 flag means
> it is utf-8 encoded and not having it set means
> ANSI/locale/latin1/random garbage, and the perl core (the other
> part of the codebase) assuming this is just an internal flag,
> as originally designed, will kill perl in the long run.

The end of the world is near, eh?

> Regarding the perlunicode manpage, it is basically a 
> helpless case. 


More rhetorical hyperbole, I presume, since if it is incorrectly
worded, the road to helping it is obvious: just send patches.

> For example, it says:

>   In earlier releases the "utf8" pragma was used to declare
>   that operations in the current block or file would be Unicode-
>   aware.  This model was found to be wrong, or at least clumsy:
>   the "Unicodeness" is now carried with the data, instead of
>   being attached to the operations.

> This is completely untrue: in earlier releases, "use utf8/use
> bytes" switched between interpreting the strings as utf-8 vs.
> bytes, and did nothing about Unicode-awareness.

*That*, I think, is more a matter of casuistry than of correctness.

> Unicodeness is *not* carried with the data currently, as the
> manpage wrongly claims, and that is absolutely the correct way.


> Perl currently implements a model where encoding is *not*
> attached to perl scalars,

Right: it's supposed to be attached to the I/O Layer alone.

> and neither is *unicodeness* attached to perl scalars.

And now you've lost me.

If that is universally true, then might you gently explain how 
an SV set to chr(500) always has its UTF8 flag turned on?

    DB<1> p $]
    DB<2> use Devel::Peek
    DB<3> $x = "string"
    DB<4> p Dump ($x)
	SV = PV(0x3c267378) at 0x3c367680
	PV = 0x3c022e10 "string"\0
	CUR = 6
	LEN = 8
    DB<5> $y = chr(500)
    DB<6> p Dump($y)
	SV = PV(0x3c2f0aa0) at 0x3c367ac0
	PV = 0x3c03d180 "\307\264"\0 [UTF8 "\x{1f4}"]
	CUR = 2
	LEN = 4

Armed with that output, it seems to me that if you are correct,
then UTF8 is not "unicodeness", but that if it is, then you are
not correct.  And I am unable to discern which of those two here
obtains: either, both, or neither, nor in what degree.  In both
scenarios, the issue remains unclear, or you do, or I am--and
quite possibly more than one of these may apply.
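
The flag visible in the Dump output above can also be inspected from plain
Perl.  What follows is only a sketch, using the built-in utf8::is_utf8 and
utf8::upgrade functions; the variable names are illustrative.  It takes no
side on what the flag *means*: it merely shows when it is set, and that
setting it does not change the value of a string as seen by eq.

```perl
use strict;
use warnings;

my $x = "string";    # every code point is below 256
my $y = chr(500);    # U+01F4 does not fit in a single byte

# utf8::is_utf8 reports the internal flag that Dump shows as [UTF8 ...].
my $x_flag = utf8::is_utf8($x) ? 1 : 0;
my $y_flag = utf8::is_utf8($y) ? 1 : 0;

# utf8::upgrade switches the internal representation of a copy of "string".
my $z = "string";
utf8::upgrade($z);
my $z_flag = utf8::is_utf8($z) ? 1 : 0;

# The upgraded copy still compares equal to the original.
my $same = ($x eq $z) ? 1 : 0;

printf "x=%d y=%d z=%d same=%d\n", $x_flag, $y_flag, $z_flag, $same;
```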

So let's fix that, shall we?

I politely request that you kindly explain three things:

 * Start by explaining just what it is that you are calling unicodeness.

 * Now explain the presence and purpose of the UTF8 flag on an SV.  

 * Finally, please demonstrate, especially in light of my Dump 
   above and the two answers you just gave, how your last-quoted
   statement can in its second half be deemed all of reasonable,
   accurate, and correct.

Thank you.


   "Those who know more than me will correct me if I'm wrong.
    Those who know less than me will correct me if I'm right."
