Re: Encode vs. JSON
From: David E. Wheeler
July 18, 2014 05:19
Message ID: 986D6A88-C9BF-425C-9E11-5EB262429A5D@justatheory.com
On Jul 17, 2014, at 8:00 PM, Aristotle Pagaltzis <firstname.lastname@example.org> wrote:
> Hi David,
Hey Aristotle, many thanks for your reply. Super helpful.
>> It does not die on 5.14, which I assume is due to the addition of
>> Unicode 6 support.
> why do you assume that? As far as I can tell, Unicode 6 has no changes
> of any kind WRT U+FFFF.
It was a guess.
> Sounds to me like it’s the behaviour of JSON that changes between 5.12
> and 5.14 rather than that of Encode?
> What I can say is that U+FFFF is a non-character, but EF BF BF is the
> correct encoding of that codepoint. Using decode_utf8(...) is short for
> decode("utf8", ...), which is completely permissive. As long as it can
> decode the octet sequence according to the UTF-8 encoding, it will not
> complain. In contrast, if you do decode("UTF-8", ...) then you will get
> charset checking too. And *that* *will* reject your attempt to smuggle
> a U+FFFF into the string.
Ah, yes, quite right. I keep forgetting that utf8 is so permissive.
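For my own reference, a minimal sketch of that difference, assuming only the stock Encode module. The helper name first_codepoint is made up for illustration and is not part of Encode. It passes FB_CROAK so that a rejection shows up as an exception, and it only reports what each encoding name does with the octets rather than asserting an outcome, since the strict behaviour may depend on the Encode version installed:

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper (not part of Encode): decode $octets with the named
# encoding and return the first code point, or undef if decode() croaked.
sub first_codepoint {
    my ($encoding, $octets) = @_;    # $octets is a copy; CHECK may modify it
    my $chars = eval { decode($encoding, $octets, FB_CROAK) };
    return defined $chars && length $chars ? ord $chars : undef;
}

my $octets = "\xEF\xBF\xBF";         # UTF-8 encoding of U+FFFF

# 'utf8' is the permissive encoding; 'UTF-8' is the strict one.
for my $encoding ('utf8', 'UTF-8') {
    my $cp = first_codepoint($encoding, $octets);
    printf "%-5s => %s\n", $encoding,
        defined $cp ? sprintf('accepted as U+%04X', $cp) : 'rejected';
}
```

Without a CHECK argument, decode() does not croak on input it dislikes, so FB_CROAK is what makes a strict rejection visible here.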
> So that’s why Encode behaves as it does.
So this data came from a Java app, which serialized the string "HOLIDAYBOLDI\xEF\xBF\xBFALIC" into JSON. This tells me that our Java app needs to be a little more careful about what it considers UTF-8, and perhaps replace bogus characters/bytes. But I am unable to get it to choke on \uFFFF at all on Java 6 or 7. This does not throw an exception:
I Googled around a bit, and found this SO answer:
Which suggests that, according to [Corrigendum 9](http://www.unicode.org/versions/corrigendum9.html), reserved non-characters now *are* allowed to appear in a UTF-8 string. Which makes me think I will never be able to get the Java server to clean up its act. Should Perl, Encode, and JSON relax things a bit with regard to these characters, then?
> Why does JSON go from rejecting to accepting the string if you go from
> 5.12 to 5.14? That, I have no idea about. (Or maybe it goes from one
> to the other based on the version of JSON; you haven’t specified whether
> you have the same version of it installed in your 5.12 vs 5.14 perls.)
I used JSON 2.90 and JSON::XS 3.01 in all my tests.