develooper Front page | perl.perl5.porters | Postings from May 2008

on broken manpages, trolling, inconsistent implementation and the difficulty to fix bugs

Marc Lehmann
May 19, 2008 13:03
On Mon, May 19, 2008 at 11:45:49AM -0700, Jan Dubois <> wrote:
> point out problems in the actual implementation clueless.  Are you
> just trolling?

Also, from past discussion, you should know that I know what I am talking
about when talking about unicode in perl.

I am trying to point out real issues and educate people on what is the right
way to tackle problems.

Note that you completely ignored the *real* issues I brought up.

Asking me whether I troll is so endlessly stupid and insulting. What are *you*
doing to fix the unicode problems in perl? You butt in with totally idiotic
plans based on totally wrong assumptions of the perl core string handling.

Go and do something useful instead, even commenting on the issues I bring
up would be more useful than showing off your lack of knowledge regarding
perl internals (and the language).

This is of course symptomatic of perl5-porters regarding unicode handling.
Note how difficult it was for me to get a simple bugfix w.r.t. unpack into
the core (in the meantime, unpack "H*" has also been fixed - very nice).

It took me ages to explain why it's a bug to those people who simply lack
the experience regarding string handling in perl (w.r.t. wide chars).

I simply don't have the stamina to explain it again and again. Just research
a bit :(

If it is so extremely hard to get even simple bugfixes into perl, how hard is
it to get more complicated things fixed (such as the Win32 module)?

I think this is a very bad attitude.

In case you didn't notice, perl has an extremely bad reputation for unicode
handling: most users fear unicode, because it is so complicated in perl.

The reason it is so complicated is because there are so many bugs, and it
takes insanely long discussions to fix those bugs.

If you are in disagreement with me (and also sarathy, who, as I found
out, has exactly the same model in mind as me, or actually vice versa, as he
is the principal architect afaics), then perl5-porters should, as quickly as
possible, find out how they want to implement unicode.

Having part of the codebase assuming that the utf-8 flag means it is utf-8
encoded and not having it set means ANSI/locale/latin1/random garbage, and
the perl core (the other part of the codebase) assuming this is just an
internal flag, as originally designed, will kill perl in the long run.

There is the "correct" model, where encoding is attached to operations
(because in most cases it already is, and perl cannot change this, despite
what garbage perluniintro claims), and the utf-8 flag is only used to
change the internal interpretation of the codepoint encoding.
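
A minimal sketch of what this model means in practice, assuming a perl
new enough to provide utf8::upgrade and utf8::is_utf8 (the variable names
are made up for illustration): the same one-character string is held in
both internal representations, and nothing visible at the Perl level
changes.

```perl
use strict;
use warnings;

# The same one-character string, U+00FC, in two internal representations.
my $plain    = "\xfc";       # stored internally as a single octet
my $upgraded = "\xfc";
utf8::upgrade($upgraded);    # same character, now stored internally as UTF-8

# The flag differs between the two scalars ...
my $flag_plain    = utf8::is_utf8($plain)    ? 1 : 0;
my $flag_upgraded = utf8::is_utf8($upgraded) ? 1 : 0;

# ... but the strings are the same at the Perl level.
my $same_string = ($plain eq $upgraded)                 ? 1 : 0;
my $same_length = (length($plain) == length($upgraded)) ? 1 : 0;
my $same_ord    = (ord($plain) == ord($upgraded))       ? 1 : 0;

print "flag: $flag_plain vs $flag_upgraded\n";
print "eq: $same_string, length: $same_length, ord: $same_ord\n";
```

Under this reading the flag says nothing about whether the string is
"text" or "binary"; it only says how the codepoints are stored.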

And there is the "wrong" model, where perl silently upgrades data at
undocumented points and also corrupts your string data while at it
(because a "ü" character might suddenly become a "ѿ") and the user has
to track these undocumented encoding changes. I call this quite openly
"wrong" because it is insanely complicated.
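
One way to see where the tracking burden comes from, as a hedged sketch
(variable names are hypothetical, and `use bytes` is used here only to
peek at the internal storage): joining an octet string with a string
containing a wide character changes how the octets are stored internally,
while the characters stay the same. Code that looks at the internal bytes
and assumes they are the user's data sees something different after the
join.

```perl
use strict;
use warnings;

my $octets = "\xc3\xbc";     # two octets (the UTF-8 encoding of U+00FC)
my $wide   = "\x{20ac}";     # a codepoint above 255, stored as UTF-8 internally

# Concatenation upgrades the octets' internal storage without being asked.
my $joined = $octets . $wide;
my $head   = substr($joined, 0, 2);

# Character level: the first two characters are unchanged.
my $chars_equal = ($head eq $octets) ? 1 : 0;

# Internal storage: the same two characters now occupy four octets.
my $original_len = do { use bytes; length($octets) };
my $internal_len = do { use bytes; length($head) };

print "chars equal: $chars_equal\n";
print "internal octets before: $original_len, after: $internal_len\n";
```

Code that only works on characters never notices the difference; code
that treats the flag as "this is UTF-8 encoded data" and reads the
internal buffer does.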

Currently most of the perl core implements the "correct" model.

Regarding the perlunicode manpage, it is basically a hopeless case. For
example, it says:

   In earlier releases the "utf8" pragma was used to declare that
   operations in the current block or file would be Unicode-aware.  This
   model was found to be wrong, or at least clumsy: the "Unicodeness" is
   now carried with the data, instead of being attached to the operations.

This is completely untrue: in earlier releases, "use utf8/use bytes" switched
between interpreting the strings as utf-8 vs. bytes, and did nothing about

Unicodeness is *not* carried with the data currently, as the manpage
wrongly claims, and that is absolutely the correct way.

Encoding *is* a question of operations. Of course, not all operations are
equal: open on unix for example enforces interpretation of the string as
locale-dependent, regardless of whether the data is "unicode" or not: the encoding is
tied to the operation, inherently. The perlunicode manpage is wrong.
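
A small sketch of "encoding belongs to the operation", using the core
Encode module. The file name and the two target encodings are assumptions
picked for illustration; the point is only that the same character string
yields different octets depending on which operation it is handed to, and
that the program has to pick the encoding at that boundary.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A character string: eight characters, the first one is U+00FC.
my $name = "\xfcber.txt";

# The octets handed to an operation such as open depend on the encoding
# that operation expects, not on anything stored in the scalar.
my $for_utf8_locale   = encode('UTF-8',      $name);
my $for_latin1_locale = encode('ISO-8859-1', $name);

my $chars         = length($name);
my $utf8_octets   = length($for_utf8_locale);
my $latin1_octets = length($for_latin1_locale);

print "characters: $chars\n";
print "octets for a UTF-8 locale: $utf8_octets\n";
print "octets for a Latin-1 locale: $latin1_octets\n";
```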

I work with users daily, and I lecture people about unicode in perl a lot.
And having *bad* documentation that clashes with the *implementation* is bad.

Perl currently implements a model where encoding is *not* attached to perl
scalars, and neither is *unicodeness* attached to perl scalars.

The fact that some people and some manpages claim otherwise is the source of
the confusion.

Now, even implementing the "wrong" model, where the encoding of a string
changes in undocumented ways during the lifetime of a program would be an
advantage, if it was done fully.

But it isn't: neither the correct nor the wrong model is implemented. The
wrong model isn't, because it is basically unimplementable, and the correct
model isn't because nobody cares enough, and people actively disagree with
it, leading to broken XS modules and worse.

This is the problem with perl and unicode: it is buggy no matter how you
put it, because some parts use different models than others.

                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_    
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /
      -=====/_/_//_/\_,_/ /_/\_\
