develooper Front page | perl.perl5.porters | Postings from March 2007

Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)

Nicholas Clark
March 30, 2007 11:46
On Fri, Mar 30, 2007 at 08:00:36PM +0200, Marc Lehmann wrote:
> On Fri, Mar 30, 2007 at 01:31:22PM +0100, Nicholas Clark wrote:

> However, some of the obvious fixes would be to change ExtUtils/typemap so
> that stuff such as "const char *" no longer boils down to random bytes.
> Example:
> SV *compress (const char *data);
> the right thing here is to use SvPVbyte, at least in the majority of
> cases.  The reason is that existing users either have to call downgrade
> explicitly themselves or suffer from random problems.

This seems a sane idea. However, I'm not going to change it for 5.8.9.

5.10 is a different matter, but also not my call.
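
To make the "call downgrade explicitly" point concrete, here is a minimal
Perl-level sketch of what a caller currently has to do before handing a
string to an XS function that reads the raw buffer. The variable names are
illustrative only, and the raw lengths stand in for what a hypothetical
compress() would see through a plain "const char *" typemap:

```perl
use strict;
use warnings;

my $bytes    = "caf\xe9";      # four characters, all below 256
my $upgraded = $bytes;
utf8::upgrade($upgraded);      # same characters, UTF-8 internal representation

# At the Perl level the two strings are indistinguishable...
print "equal as strings\n" if $bytes eq $upgraded;

# ...but the raw buffers differ (4 bytes versus 5).
require bytes;
printf "raw length: %d vs %d\n",
    bytes::length($bytes), bytes::length($upgraded);

# The explicit workaround: downgrade before passing the string on.
# utf8::downgrade dies if the string holds a character above 255.
utf8::downgrade($upgraded);
printf "raw length after downgrade: %d\n", bytes::length($upgraded);
```

A typemap based on SvPVbyte would, in effect, do that last step on the
caller's behalf.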

> Could you tell me why almost every other 5.6 bug was fixed in 5.8, but
> gratuitous breakage of large parts of CPAN is accepted with this change?
> What's the rationale behind keeping this 5.6 bug, while fixing the rest?

No, I can't.
5.8.0 and 5.8.1 were not my releases, *and* I wasn't aware that 'C' was a
problem at that time.

I *think* that the reason may have been because "it is documented in
Programming Perl" that it behaves the 5.6.0 way.


I went looking, and the closest I can find to an assertion about how it works
is this:

* the pack/unpack letters "c" and "C" do /not/ change, since they're often
  used for byte-orientated formats. (Again, think "char" in the C language.)
  However, there is a new "U" specifier that will convert between UTF-8
  characters and integers:

    pack("U*", 1, 20, 300, 4000) eq v1.20.300.4000

* The chr and ord functions work on characters

    chr(1).chr(20).chr(300).chr(4000) eq v1.20.300.4000

  In other words, chr and ord are like pack("U") and unpack("U"), not like
  pack("C") and unpack("C"). In fact, the latter two are how you now emulate
  byte-orientated chr and ord if you're too lazy to use bytes.

[3rd edition, page 408]
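
The two equalities quoted from the book can be checked directly. A small
sketch, assuming only core Perl (the variable names are mine, not the
book's):

```perl
use strict;
use warnings;

my @code_points = (1, 20, 300, 4000);

# pack "U*" builds a string from code points...
my $packed = pack("U*", @code_points);

# ...and so does concatenating chr() of each value.
my $joined = join "", map { chr($_) } @code_points;

print "pack U* matches the v-string\n" if $packed eq v1.20.300.4000;
print "chr matches the v-string\n"     if $joined eq v1.20.300.4000;

# unpack "U*" and ord both recover the original code points.
my @from_unpack = unpack("U*", $packed);
my @from_ord    = map { ord($_) } split //, $packed;
print "@from_unpack\n";
print "@from_ord\n";
```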

> > I don't like anything Perl space that lets the abstraction leak, and "C" is
> > one of them.
> So why not fix it? Nobody made such a fuss when they fixed the remaining bugs
> from 5.6. For example, PApp, one of my older modules using unicode, is full

I'm not going to change anything this late in 5.8.x.
Whether 5.10 changes is not something I have the final say on.

> And as I said, there is no pack-type that gives me the old meaning of
> "C" that every structure-decoding program relies on. That's gratuitous
> undocumented breakage. (It really is undocumented because all of the perl
> documentation tells me that the internal encoding doesn't surface, and the
> small hint in the pack description for "C" seems to reinforce this as it
> tells me it works "even in the presence of Unicode"!).
> In any case, could you please tell me why you accept obvious breakage
> of old code in this case? I really wanna know.

> The only argument in favour I have heard so far is that the camelbook
> documents it in some obscure way. But that cannot be a reason to keep a
> bug.  If the camelbook describes buggy behaviour, it needs a fix. It is
> insane to force every existing perl program that uses that feature to
> be changed in a way that contradicts the rest of the documentation, is
> unintuitive and generally useless (again, show me a useful application for
> unpack "C" with 5.8 semantics).

I agree with "obscure" now.

Reading the wording of the Camel book carefully, this behaviour

$ perl5.00503 -le 'print unpack "c", chr (256+78)' 
$ perl5.00503 -le 'print unpack "C", chr (256+78)'

"unchanged" actually means to me that it would produce the same output.

The only thing that seems to define the current 5.6 behaviour is the
comparison of unpack("C") with ord under use bytes in the paragraph on chr
and ord.
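
For reference, the comparison that paragraph draws can be sketched like
this, using the same chr (256+78) value as above. The names are
illustrative. I have deliberately not stated an expected value for
unpack "C", since what it should return for such a string is exactly what
is in dispute and may differ between perl versions:

```perl
use strict;
use warnings;

my $char = chr(256 + 78);    # one character, too large for a single byte

# Character semantics: ord sees the code point.
my $as_char = ord($char);

# Under "use bytes", ord sees the first byte of the internal UTF-8 form.
my $as_byte = do { use bytes; ord($char) };

print "ord:                 $as_char\n";    # 334
print "ord under use bytes: $as_byte\n";    # 197

{
    no warnings;    # some perls warn about the character not fitting "C"
    print 'unpack "C":          ', unpack("C", $char), "\n";
}
```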

Nicholas Clark
