
Re: Encode vs. JSON

From: Aristotle Pagaltzis
Date: July 18, 2014 03:01
Subject: Re: Encode vs. JSON
Message ID: 20140718030055.GA29580@plasmasturm.org

Hi David,

* David E. Wheeler <david@justatheory.com> [2014-07-17 00:05]:
> I have a script:
>
>     use v5.10;
>     use warnings;
>     use JSON;
>     use Encode qw(encode_utf8 decode_utf8);
>
>     my $json = qq{{"FFONTS":"HOLIDAYBOLDI\xEF\xBF\xBFALIC"}};
>     my $parser = JSON->new->utf8;
>
>     my $data = $parser->decode($json);
>     say encode_utf8 $data->{FFONTS};
>
> On Perl 5.12 and earlier, this dies:
>
>     malformed UTF-8 character in JSON string, at character offset 23 (before "\x{ffff}ALIC"}")
>
> It does not die on 5.14, which I assume is due to the addition of
> Unicode 6 support.

Why do you assume that? As far as I can tell, Unicode 6 has no changes
of any kind WRT U+FFFF.

> But oddly, while JSON complains on 5.12 and earlier, Encode does not:
>
>     use v5.10;
>     use warnings;
>     use JSON;
>     use Encode qw(encode_utf8 decode_utf8);
>
>     my $json = qq{{"FFONTS":"HOLIDAYBOLDI\xEF\xBF\xBFALIC"}};
>     $json = decode_utf8 $json, Encode::FB_CROAK;
>
>     my $parser = JSON->new;
>
>     my $data = $parser->decode($json);
>     say encode_utf8 $data->{FFONTS};
>
> This dies with the same error from JSON.pm, but note that the call to
> decode_utf8() worked. I’m left wondering why JSON and Encode seem to
> disagree on the validity of those bytes as UTF-8 in Perl 5.12. Ideas?

Sounds to me like it’s the behaviour of JSON that changes between 5.12
and 5.14 rather than that of Encode?

What I can say is that U+FFFF is a non-character, but EF BF BF is the
correct encoding of that codepoint. Using decode_utf8(...) is short for
decode("utf8", ...), which is completely permissive. As long as it can
decode the octet sequence according to the UTF-8 encoding, it will not
complain. In contrast, if you do decode("UTF-8", ...) then you will get
charset checking too. And *that* *will* reject your attempt to smuggle
a U+FFFF into the string.

So that’s why Encode behaves as it does.
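
To check this on a given installation, here is a minimal sketch that
decodes the same three octets under both encoding names. The helper
try_decode() is a hypothetical name of mine, not part of Encode. I have
deliberately left out any expected output for the strict case: whether
"UTF-8" rejects a non-character such as U+FFFF may depend on the Encode
version installed, so treat the result as something to observe rather
than assume.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# The three octets from the thread: the UTF-8 encoding of U+FFFF.
my $octets = "\xEF\xBF\xBF";

# Hypothetical helper (not part of Encode): decode $bytes with the named
# encoding and report whether it was accepted.
sub try_decode {
    my ($name, $bytes) = @_;
    # FB_CROAK makes decode() die on input it considers invalid.
    # decode() may modify its source argument when a CHECK value is given,
    # so it is handed the local copy in $bytes.
    my $chars = eval { decode($name, $bytes, Encode::FB_CROAK()) };
    return defined $chars
        ? sprintf('%s: ok, U+%04X', $name, ord $chars)
        : sprintf('%s: rejected', $name);
}

# Show which encoding object each name resolves to, then try both.
printf "%s resolves to %s\n", $_, find_encoding($_)->name for qw(utf8 UTF-8);
print try_decode($_, $octets), "\n" for qw(utf8 UTF-8);
```

The find_encoding() lines make the distinction visible: "utf8" is the
lax encoding, while "UTF-8" resolves to "utf-8-strict".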

Why does JSON go from rejecting to accepting the string if you go from
5.12 to 5.14? That, I have no idea about. (Or maybe it goes from one
to the other based on the version of JSON; you haven’t specified whether
you have the same version of it installed in your 5.12 vs 5.14 perls.)
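
One way to rule that out is to print the module versions under each
perl and compare. A rough sketch, assuming nothing beyond core Perl;
module_version() is a hypothetical helper name, and the JSON->backend
call is guarded because I am not assuming every JSON.pm release has it:

```perl
use strict;
use warnings;

# Hypothetical helper: return the installed version of a module,
# or undef if it cannot be loaded under this perl.
sub module_version {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;
    return eval { require $file; $module->VERSION };
}

printf "perl %vd\n", $^V;
for my $module (qw(JSON JSON::XS JSON::PP Encode)) {
    my $version = module_version($module);
    printf "%-8s %s\n", $module, defined $version ? $version : 'not installed';
}

# JSON.pm is a front end to a backend module; report which one is in use.
if (defined module_version('JSON') && JSON->can('backend')) {
    printf "JSON backend: %s\n", JSON->backend;
}
```

Run it once with each perl binary and diff the output.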

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>
