
Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)

From: Marc Lehmann
Date: March 30, 2007 18:00
Subject: Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)
Message ID: 20070331010024.GO18872@schmorp.de

Ok, last mail, because this is a different topic :)

On Sat, Mar 31, 2007 at 01:08:21AM +0200, Juerd Waalboer <juerd@convolution.nl> wrote:
> Marc Lehmann skribis 2007-03-31  0:25 (+0200):
> > If you send a compressed string over the network using JSON and decompress
> > it, you need to know that. 
> 
> Does JSON compress arbitrary data?

No.

> If so, then the user must do the decoding and encoding,
   
No, compression is something completely orthogonal to encoding. Neither
forces me to do the other.
   
> because arbitrary data only exists in byte form

That seems completely wrong to me.

> Once you dictate any specific encoding, it's no longer arbitrary.

JSON dictates Unicode for the JSON text, and strongly hints at the use of
UTF-8 for interchange purposes.

> On the other hand, if JSON does text data only,
   
No, it does support binary data just as well. It is used a lot, too.

It works just like perl without the bugs: You have a string type that can
store bytes. It is up to the user to interpret them as she wants.
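
A minimal sketch of that, assuming JSON::PP as a stand-in for JSON::XS
(the new/ascii/encode/decode calls are the same in both; the variable
names are made up for illustration). It pushes all 256 possible octets
through a JSON string and gets them back unchanged:

```perl
use strict;
use warnings;
use JSON::PP;   # assumption: stand-in for JSON::XS, same calls for this purpose

# All 256 possible octets, held in a Perl string (one character per octet).
my $binary = join '', map { chr } 0 .. 255;

# ascii() escapes every character above 127, so the JSON text is pure ASCII.
my $json = JSON::PP->new->ascii->encode([$binary]);

# Decoding gives back a string with the same 256 characters.
my $back = JSON::PP->new->decode($json)->[0];

print +(length($back) == 256 && $back eq $binary)
    ? "round trip ok\n"
    : "round trip FAILED\n";
```

How the receiver interprets those 256 characters (compressed data, an
image, text in some encoding) is up to the user, not to JSON.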

> it can just use any UTF encoding on both sides, and document it like
> that.

It is a bit complicated, but you can safely assume that 99% of all JSON
is UTF-8 encoded. You can also recode any JSON document into pure ASCII
by escaping the non-ASCII characters; JSON::XS offers that. By default
JSON::XS encodes to and decodes from UTF-8, but it also lets the user do
the encoding and decoding himself. JSON text is composed of Unicode
characters, and in Perl some JSON modules store them as a simple Perl
string.
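
The three variants can be sketched like this, again assuming JSON::PP as
a stand-in with the same utf8/ascii options as JSON::XS (the sample text
and variable names are illustrative only):

```perl
use strict;
use warnings;
use JSON::PP;   # assumption: stand-in for JSON::XS

my $text = "caf\x{e9} \x{263a}";   # Unicode characters, not octets

# utf8: the JSON text comes out as UTF-8 encoded octets.
my $utf8_json  = JSON::PP->new->utf8->encode([$text]);
# ascii: the JSON text is pure ASCII, non-ASCII characters become \uXXXX.
my $ascii_json = JSON::PP->new->ascii->encode([$text]);
# neither: the JSON text is a Perl character string; the caller encodes it.
my $char_json  = JSON::PP->new->encode([$text]);

# All three decode back to the same characters.
my $from_utf8  = JSON::PP->new->utf8->decode($utf8_json)->[0];
my $from_ascii = JSON::PP->new->decode($ascii_json)->[0];
my $from_char  = JSON::PP->new->decode($char_json)->[0];

my $same = $from_utf8 eq $text && $from_ascii eq $text && $from_char eq $text;
print $same ? "all variants agree\n" : "MISMATCH\n";
```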

All that is not well-supported by most JSON modules, though: for example,
JSON::XS is the only module for perl that correctly decodes escaped
surrogate pairs.
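
What "correctly decodes escaped surrogate pairs" means can be shown with
a small sketch. It assumes JSON::PP, a later module with the same decode
interface, rather than JSON::XS itself. The two escapes \ud834\udd1e
together denote the single character U+1D11E:

```perl
use strict;
use warnings;
use JSON::PP;   # assumption: stand-in for JSON::XS

# Single quotes keep the backslashes, so this is JSON text with two \u escapes.
my $json = '["\ud834\udd1e"]';

my $str = JSON::PP->new->decode($json)->[0];

# A correct decoder combines the pair into one character.
printf "length=%d codepoint=U+%04X\n", length($str), ord($str);
# prints: length=1 codepoint=U+1D11E
```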

> Unless both sides are exactly the same platform (e.g. both Perl), you
> need to establish a protocol for sending data anyway. And that protocol
> should also describe encoding. If sender and receiver don't agree, you
> have a problem.

No, it doesn't have anything to do with the platform. Even when both sides
use Perl I need to decide on a common encoding. That's strictly outside the
JSON definition, though.

> > I am really frustrated at that. It makes perl as a whole rather
> > questionable for unicode use, as you constantly have to think about
> > the internals.  And yes, that simply shouldn't be the case.
> 
> I maintain that it isn't the case, for almost any programming job,
> unless you're indeed doing things with internals.

Well, the JSON::XS module certainly does things with the internals: it
has to flag some strings as UTF-X, and in fact flags all strings that
way unless you enable the shrink option, which is documented to try to
shrink the memory used in various ways (one way is to try to downgrade the
scalar).

Certainly, the user who reported the bug also didn't look at the
internals.  Compress::Zlib called unpack "CCCV" or somesuch, though, which
unfortunately treats V very differently from C: "C" looks at the
internals, while "V" does not and treats the string as an octet string.

The user suggested that JSON::XS corrupts binary data because it happens to
be returned upgraded unless you set the shrink option.

However, Perl does not expose the internals elsewhere: the upgraded
version is semantically equivalent to the downgraded one unless you use
an XS module using SvPV directly or indirectly (considered a bug in Perl,
if I understood Nick correctly), or unpack "C", which has a different
meaning in perl 5.6 than in perl 5.005 and has confusing documentation.
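
The equivalence can be checked at the Perl level with a short sketch
(variable names are illustrative; utf8::upgrade and utf8::is_utf8 are
built in and need no "use utf8"):

```perl
use strict;
use warnings;

my $down = "caf\xe9";    # four characters, stored one octet each
my $up   = $down;
utf8::upgrade($up);      # same characters, internal representation changed

# Only the internal flag differs ...
printf "flag: down=%d up=%d\n",
    (utf8::is_utf8($down) ? 1 : 0), (utf8::is_utf8($up) ? 1 : 0);

# ... while comparison, length and character values are identical.
printf "eq=%d length=%d/%d last=0x%02x/0x%02x\n",
    ($down eq $up ? 1 : 0),
    length($down), length($up),
    ord(substr($down, -1)), ord(substr($up, -1));
```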

The right thing for Compress::Zlib is not to use unpack "CCCV" but unpack
"UUUV", which seems completely weird to me, as no Unicode was ever
involved *on the perl level*.
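
For comparison, here is a sketch of a defensive alternative at the Perl
level: downgrade a copy of the string before unpacking, so the result
does not depend on how the scalar happens to be stored. This is not what
Compress::Zlib does; the header layout and values below are hypothetical
and only mirror the "CCCV" template mentioned above.

```perl
use strict;
use warnings;

# Hypothetical 7-octet header: three single octets and a 32-bit little-endian value.
my $header = pack "CCCV", 0x1f, 0x8b, 8, 1_000_000;

# Simulate what the reporter saw: the same data arrives upgraded.
my $upgraded = $header;
utf8::upgrade($upgraded);

# Work on a downgraded copy; downgrade returns false if a character is above 255.
my $octets = $upgraded;
utf8::downgrade($octets, 1) or die "string contains characters above 255";

my @fields = unpack "CCCV", $octets;
print join(",", @fields), "\n";   # prints: 31,139,8,1000000
```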

-- 
                The choice of a
      -----==-     _GNU_
      ----==-- _       generation     Marc Lehmann
      ---==---(_)__  __ ____  __      pcg@goof.com
      --==---/ / _ \/ // /\ \/ /      http://schmorp.de/
      -=====/_/_//_/\_,_/ /_/\_\      XX11-RIPE
