
Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)

Marc Lehmann
March 30, 2007 11:00
On Fri, Mar 30, 2007 at 01:31:22PM +0100, Nicholas Clark wrote:
> > 
> > So fix it. It is easy to do, and I documented it years ago (during 5.6).
> "this one" that I was confident is a bug is the change of meaning on SvPV()
> And in turn what I'm not confident about is the fix.

Sorry. I can understand that it might be difficult as perl itself likely
relies on the current meaning of SvPV.

However, one of the obvious fixes would be to change ExtUtils/typemap so
that stuff such as "const char *" no longer boils down to random bytes.

SV *compress (const char *data);

The right thing here is to use SvPVbyte, at least in the majority of
cases.  The reason is that existing users either have to call downgrade
explicitly themselves or suffer from random problems.


> > Besides, without any doubt, the code that relies on pseudo-random
> > behaviour is certainly in the minority. The amount of code in the wild
> > that relies on "C" having 5.5 semantics is much larger. I doubt _anybody_
> > except me (or at least not very many people) understands that he has to
> > downgrade scalars before passing them into unpack to decode structures.
> I don't know enough about "C" in pack offhand to know what the right thing to
> do is.

The right thing to do is to follow the documentation and existing code.

Could you tell me why almost every other 5.6 bug was fixed in 5.8, but
gratuitous breakage of large parts of CPAN is accepted with this change?
What's the rationale behind keeping this 5.6 bug, while fixing the rest?

For example, take a network protocol that sends packets prefixed with a
2-byte length header, a type, and data. There is currently no unpack format
available to do this, as:

   unpack "Cn", $data

gives different results depending on the history of the string in $data.

If there were a pack type that gave me 5.005 behaviour of returning a
single character, I could use it:

   unpack "Wn", $data;

but there simply isn't. Besides, all code does use "C", so the right thing is
to move the new pack type to a different modifier.
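
To illustrate the workaround mentioned earlier in this thread (downgrading
a scalar before decoding it), here is a minimal sketch. The packet layout,
the helper name decode_header and the sample values are made up for
illustration; only pack, unpack, utf8::upgrade and utf8::downgrade are real
Perl functions. It assumes the data contains no characters above 255.

```perl
use strict;
use warnings;

# Hypothetical packet: one type byte (0xE9), a 16-bit big-endian
# length (5), then the payload.
my $packet = pack("Cn", 0xE9, 5) . "hello";

# Same characters, but stored internally as UTF-8. In real code this
# can happen silently, depending on the history of the string.
my $upgraded = $packet;
utf8::upgrade($upgraded);

# decode_header is a hypothetical helper: it forces the byte
# representation before unpacking, so the result does not depend
# on how the string is stored internally.
sub decode_header {
    my ($data) = @_;           # work on a copy of the caller's scalar
    utf8::downgrade($data);    # dies if $data has characters above 255
    return unpack "Cn", $data;
}

my @a = decode_header($packet);
my @b = decode_header($upgraded);

print "@a\n";    # 233 5
print "@b\n";    # 233 5
```

With the explicit downgrade, both calls return the same type and length
regardless of how the string was stored.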

(In my personal opinion, of course, pack should not expose internal
encoding at all. Use Devel::Peek or so, or one of the functions in the
utf8:: module.  The first one who shows me code that would need the
peculiar nondeterministic behaviour of unpack "C" gets a prize).
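
As a small sketch of what "use the functions in the utf8:: module" could
look like: the variable names and the sample string below are made up, and
the only calls used are utf8::upgrade and utf8::is_utf8. The point is that
the two scalars are the same string at the Perl level and differ only in
the internal flag, which these functions expose explicitly.

```perl
use strict;
use warnings;

my $native   = "caf\xe9";     # 4 characters, stored as 4 bytes
my $upgraded = $native;
utf8::upgrade($upgraded);     # same 4 characters, stored as UTF-8

# At the Perl level the two scalars are indistinguishable.
my $same_string = ($native eq $upgraded)                 ? 1 : 0;
my $same_length = (length($native) == length($upgraded)) ? 1 : 0;

# Only the explicit introspection functions show the difference.
my $flag_native   = utf8::is_utf8($native)   ? 1 : 0;
my $flag_upgraded = utf8::is_utf8($upgraded) ? 1 : 0;

print "equal: $same_string, same length: $same_length\n";   # equal: 1, same length: 1
print "flags: $flag_native $flag_upgraded\n";               # flags: 0 1
```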

> I don't like anything in Perl space that lets the abstraction leak, and "C" is
> one of them.

So why not fix it? Nobody made such a fuss when they fixed the remaining bugs
from 5.6. For example, PApp, one of my older modules using unicode, is full
of code such as this:

   Convert::Scalar::utf8_on($_); # DEVEL7952 bug workaround #d# #FIXME#

For various values of DEVEL and workaround. Some of that code broke in 5.8
because 5.8 did the right thing (not 5.8.0, mind you, as this fixing went
on during 5.8.x).

*Nobody* argued my case of "it breaks existing code", not even me, because
it's clearly a bugfix that lets perl code just work, both old code and new
code (which is the beauty of the perl unicode model).

> The third thing that you didn't mention which I consider distinct from the two
> behaviours you did is that the encoding affects how regexps match, and
> lc/uc/lcfirst/ucfirst.

The difference is that I haven't seen code break so badly because of
that. I see lots of code break because of the incompatible change in the
meaning of "C", though.

(In fact, I haven't even seen a difference, apart from when use locale is
active, which is a rare case).

The other difference to that case is that those bugs are getting fixed,
while in the case of "C", people just ignore the problem, which increases
over time, saying they don't know why to fix this bug.

And as I said, there is no pack-type that gives me the old meaning of
"C" that every structure-decoding program relies on. That's gratuitous
undocumented breakage. (It really is undocumented because all of the perl
documentation tells me that the internal encoding doesn't surface, and the
small hint in the pack description for "C" seems to reinforce this as it
tells me it works "even in the presence of Unicode"!).

In any case, could you please tell me why you accept obvious breakage
of old code in this case? I really wanna know.

The only argument in favour I have heard so far is that the camelbook
documents it in some obscure way. But that cannot be a reason to keep a
bug.  If the camelbook describes buggy behaviour, it needs a fix. It is
insane to force every existing perl program that uses that feature to
be changed in a way that contradicts the rest of the documentation, is
unintuitive and generally useless (again, show me a useful application for
unpack "C" with 5.8 semantics).

                The choice of a
      -----==-     _GNU_
      ----==-- _       generation     Marc Lehmann
      ---==---(_)__  __ ____  __
      --==---/ / _ \/ // /\ \/ /
      -=====/_/_//_/\_,_/ /_/\_\      XX11-RIPE
