develooper Front page | perl.perl5.porters | Postings from March 2007

Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)

Marc Lehmann
March 30, 2007 15:41
On Fri, Mar 30, 2007 at 07:46:41PM +0100, Nicholas Clark wrote:
> This seems a sane idea. However, I'm not going to change it for 5.8.9


> 5.10 is a different matter, but also not my call.


I know all that...

> > Could you tell me why almost every other 5.6 bug was fixed in 5.8, but
> > gratuitous breakage of large parts of CPAN is accepted with this change?
> > What's the rationale behind keeping this 5.6 bug, while fixing the rest?
> No, I can't.
> 5.8.0 and 5.8.1 were not my releases, *and* I wasn't aware that 'C' was a
> problem at that time.

Yes, you can. You control 5.8, and you said it isn't going to happen. So
either you have a reason and can tell me what it is, or you don't.

I ask because I need to know what to tell people. Either it is "your code
is broken, unpack 'C' without downgrade is a bug in your code" or "it is a
bug in perl, you can work around it by enabling ->shrink for the time
being".

> I *think* that the reason may have been because "it is documented in
> Programming Perl" that it behaves the 5.6.0 way.

I would argue it doesn't behave the 5.6 way, though: 5.6 had a completely
broken unicode implementation, and lots of bugs. In 5.6 it would give me one
"character", because 5.6 often exposed the utf-8 encoding explicitly, so one
character in the 5.6 model often was a single internal byte.

Also, I still think it is a mistake to break working code without giving
an alternative(!) for unpack that isn't "you have to downgrade and keep
your fingers crossed".
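
For readers who want to see what that downgrade workaround looks like, here is
a minimal sketch. The helper name bytes_of is hypothetical (it does not appear
in the thread); utf8::downgrade and utf8::upgrade are the core functions. With
a true second argument, utf8::downgrade returns false instead of dying when
the string holds characters above 255, which is the "fingers crossed" part.

```perl
use strict;
use warnings;

# Hypothetical helper: return the byte values of $str, or an empty list
# if $str contains characters that do not fit into a single byte.
sub bytes_of {
    my ($str) = @_;    # work on a copy, leave the caller's string alone
    # A true second argument makes utf8::downgrade return false on
    # failure instead of dying.
    utf8::downgrade($str, 1) or return;
    return unpack "C*", $str;
}

my $s = "\xfc";        # one character, code point 252
utf8::upgrade($s);     # same string, now stored internally as UTF-8

my @bytes = bytes_of($s);            # downgraded first, so one value: 252
my @none  = bytes_of("\x{263a}");    # cannot be downgraded: empty list
```

The helper copies its argument, so the caller's string keeps whatever
internal representation it had.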

> I went looking, and the closest I can find to an assertion about how it works
> is:
> * the pack/unpack letters "c" and "C" do /not/ change, since they're often
>   used for byte-orientated formats. (Again, think "char" in the C language.)
>   However, there is a new "U" specifier that will convert between UTF-8
>   characters and integers:
>     pack("U*", 1, 20, 300, 4000) eq v1.20.300.4000

Exactly. But "C" somehow works on UTF-8, while it shouldn't. It should
work on characters, as documented (just like in C, char array[]; array[i]
is one character, regardless of how many bits a character in C has, or how
it is encoded).
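
The following sketch illustrates the distinction being argued here. It builds
the same one-character string twice, once stored as a plain byte and once
upgraded to the internal UTF-8 representation. The variable names are made up
for illustration. What unpack "C*" returns for the upgraded copy is exactly
the point under dispute in this thread and depends on the perl version, so the
script only prints that result and makes no claim about it.

```perl
use strict;
use warnings;

my $plain    = "\xfc";    # one character, code point 252
my $upgraded = "\xfc";
utf8::upgrade($upgraded); # same character, different internal storage

# At the character level the two strings are indistinguishable.
print "equal\n" if $plain eq $upgraded;
print "lengths: ", length($plain), " ", length($upgraded), "\n";

# Only the internal flag differs.
print "plain:    ", (utf8::is_utf8($plain)    ? "flagged" : "unflagged"), "\n";
print "upgraded: ", (utf8::is_utf8($upgraded) ? "flagged" : "unflagged"), "\n";

# Character semantics would give the same list for both strings.
# The result for $upgraded is version dependent, so it is only printed.
print "C* plain:    ", join(",", unpack "C*", $plain), "\n";
print "C* upgraded: ", join(",", unpack "C*", $upgraded), "\n";
```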

> * The chr and ord functions work on characters
>     chr(1).chr(20).chr(300).chr(4000) eq v1.20.300.4000
>   In other words, chr and ord are like pack("U") and unpack("U"), not like
>   pack("C") and unpack("C"). In fact, the latter two are how you now emulate
>   byte-orientated chr and ord if you're too lazy to use bytes.

So due to that documentation insanity it is now suggested that all code that
used "C" before must use "U" now to get the same effect as in earlier perl.

Then why was "use feature" introduced in the first place? Just document
existing programs to be broken. I am quite convinced (whatever that means
to you :) that that would result in less breakage, and less silent
breakage, than renaming "C" to "U".

Besides, perl 5.8 does not follow that description:

   perl -e '$x = "\xc3\xbc"; die unpack "U*", $x'

This prints 195188, i.e. the two values 195 and 188, although $x holds a
single UTF-8-encoded character, so why does it wrongly give me two? $x
certainly is UTF-8-encoded (try Encode::encode_utf8 chr 252, it results in
the above string).
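
The encoding claim in that parenthesis can be checked with a short script. This
is only a sketch using the Encode module's encode_utf8 and decode_utf8; the
variable names are illustrative and not taken from the thread.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# Encoding character 252 as UTF-8 yields the two bytes from the one-liner.
my $bytes = encode_utf8(chr 252);
print "same as \\xc3\\xbc\n" if $bytes eq "\xc3\xbc";

# Taken byte by byte, the encoded string has the values 195 and 188.
my @per_byte = map { ord } split //, $bytes;
print "per byte: @per_byte\n";

# Decoding gives back one character with code point 252.
my $char = decode_utf8($bytes);
print "decoded length: ", length($char), ", ord: ", ord($char), "\n";
```

So 195 and 188 are the values of the individual bytes of the encoding, while
the single character they encode has code point 252.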

Whoever wrote that part, simply said, was completely confused about Unicode.
That's fine, Sarathy had to hammer it into me too, and then made a mistake
himself after he did so. And it took me years to understand how it should be.
It is hard to do from an implementor's standpoint because you are so near the
actual code.

But that doesn't mean it is right. Fact is, the above documentation is
simply wrong, both with regard to how it should be and with regard to how
it is implemented.

> [3rd edition, page 408]

(Thanks for digging it out, btw, I haven't seen that yet).

> > So why not fix it? Nobody made such a fuss when they fixed the remaining bugs
> > from 5.6. For example, PApp, one of my older modules using unicode, is full
> I'm not going to change anything this late in 5.8.x.
> Whether 5.10 changes is not something I have the final say on.

Ok, so I will tell people to replace "C" by "U" in their code then.

Thanks! (And go on with your good work, btw., it seems that wasn't quite
clear to some people, so again: you are doing tremendously good work! :).

> "unchanged" actually means to me that it would produce the same output.
> The only thing that seems to define the current 5.6 behaviour is the
> comparison of unpack("C") with ord under use bytes in the paragraph on chr
> and ord.

Right, while the documentation on unpack "U" disagrees with it, as it talks
about UTF-8. The documentation clearly does not apply to current perls, it
clearly applies to the 5.005_5x model where perl had no UTF-X flag.

                The choice of a
      -----==-     _GNU_
      ----==-- _       generation     Marc Lehmann
      ---==---(_)__  __ ____  __
      --==---/ / _ \/ // /\ \/ /
      -=====/_/_//_/\_,_/ /_/\_\      XX11-RIPE
