perl.perl5.porters | Postings from July 2014

Re: Encode vs. JSON

From: David E. Wheeler
Date: July 18, 2014 05:19
Subject: Re: Encode vs. JSON
Message ID: 986D6A88-C9BF-425C-9E11-5EB262429A5D@justatheory.com

On Jul 17, 2014, at 8:00 PM, Aristotle Pagaltzis <pagaltzis@gmx.de> wrote:

> Hi David,

Hey Aristotle, many thanks for your reply. Super helpful.

>> It does not die on 5.14, which I assume is due to the addition of
>> Unicode 6 support.
> 
> Why do you assume that? As far as I can tell, Unicode 6 has no changes
> of any kind WRT U+FFFF.

It was a guess.

> Sounds to me like it’s the behaviour of JSON that changes between 5.12
> and 5.14 rather than that of Encode?

Yes.

> What I can say is that U+FFFF is a non-character, but EF BF BF is the
> correct encoding of that codepoint. Using decode_utf8(...) is short for
> decode("utf8", ...), which is completely permissive. As long as it can
> decode the octet sequence according to the UTF-8 encoding, it will not
> complain. In contrast, if you do decode("UTF-8", ...) then you will get
> charset checking too. And *that* *will* reject your attempt to smuggle
> a U+FFFF into the string.

Ah, yes, quite right. I keep forgetting that utf8 is so permissive.

> So that’s why Encode behaves as it does.

So this data came from a Java app, which serialized the string "HOLIDAYBOLDI\xEF\xBF\xBFALIC" into JSON. This tells me that our Java app needs to be a little more careful about what it considers UTF-8, and perhaps replace bogus characters/bytes. But I am unable to get it to choke on \uFFFF at all on Java 6 or 7. This does not throw an exception:

    "\uFFFF".getBytes("UTF-8");

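To see what the Java side does with U+FFFF in both directions, a small check along these lines may help. It is only a sketch: the class name `NoncharRoundTrip` and its helpers are made up for illustration, and it assumes Java 7 or later for `StandardCharsets`. It encodes the string, then decodes the bytes with a decoder told to report malformed input, which is about as strict as the stock UTF-8 charset can be configured.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
class NoncharRoundTrip {
    // Render bytes as space-separated uppercase hex, e.g. "EF BF BF".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // True if the bytes decode as UTF-8 with error reporting switched on
    // (REPORT makes the decoder throw instead of substituting U+FFFD).
    static boolean decodesStrictly(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Encode the noncharacter U+FFFF and show the resulting octets.
        byte[] encoded = "\uFFFF".getBytes(StandardCharsets.UTF_8);
        System.out.println(hex(encoded));
        // Check whether a reporting decoder accepts those octets.
        System.out.println(decodesStrictly(encoded));
    }
}
```

This should print `EF BF BF` followed by `true`: the noncharacter passes straight through the standard charset on both encode and decode, so any filtering would have to be done explicitly.
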
I Googled around a bit, and found this SO answer:

  http://stackoverflow.com/a/16619933/79202

Which suggests that, according to [Corrigendum 9](http://www.unicode.org/versions/corrigendum9.html), reserved non-characters now *are* allowed to appear in a UTF-8 string. Which makes me think I will never be able to get the Java server to clean up its act. Should Perl, Encode, and JSON relax things a bit with regard to these characters, then?
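
If the Java side did want to scrub these before serializing, it would have to do so by hand, since the stock encoder does not flag them. A rough sketch of that idea follows. The names `NoncharScrubber`, `isNoncharacter`, and `scrub` are invented for this example, and replacing with U+FFFD is an assumption about what "cleaning up" should mean. The ranges are the 66 noncharacters Unicode defines: U+FDD0..U+FDEF, plus the last two code points of each of the 17 planes.

```java
// Hypothetical helper class, not part of any library.
class NoncharScrubber {
    // True for the 66 Unicode noncharacters: U+FDD0..U+FDEF, and any
    // code point whose low 16 bits are FFFE or FFFF.
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    // Replace every noncharacter with U+FFFD (assumed replacement policy).
    static String scrub(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            // Walk by code point so supplementary-plane characters stay intact.
            int cp = input.codePointAt(i);
            out.appendCodePoint(isNoncharacter(cp) ? 0xFFFD : cp);
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The string from the JSON payload, with U+FFFF in the middle.
        String scrubbed = scrub("HOLIDAYBOLDI\uFFFFALIC");
        System.out.println(scrubbed.equals("HOLIDAYBOLDI\uFFFDALIC"));
    }
}
```

Whether this is desirable at all depends on the answer to the question above: if noncharacters are legitimately interchangeable, scrubbing on the Java side would be discarding valid data.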

> Why does JSON go from rejecting to accepting the string if you go from
> 5.12 to 5.14? That, I have no idea about. (Or maybe it goes from one
> to the other based on the version of JSON; you haven’t specified whether
> you have the same version of it installed in your 5.12 vs 5.14 perls.)

I used JSON 2.90 and JSON::XS 3.01 in all my tests.

Best,

David


