perl.perl5.porters

Re: Encode vs. JSON

David E. Wheeler
July 23, 2014 05:38
On Jul 22, 2014, at 8:55 PM, Karl Williamson wrote:

> We have a backwards compatibility problem here.  Corrigendum 9 is controversial, and the wording has not been incorporated into the text of Unicode 7.0 because that hasn't been published yet (the data has, but not the text of the standard).
> Noncharacters are still supposed to be used only for internal purposes.  The genesis of #9 was that ICU and CLDR were having trouble with off-the-shelf editors and version control systems rejecting their code that used them legitimately (though it appears that there are some poor design decisions involving their use).
> I sent a query about things to the Unicode mailing list some months ago, and it stirred up quite a bit of resentment about the #9 decision.  It was made without public input, and during a single meeting, so there wasn't time to consider all the ramifications.

Huh. So much tempest!

> One of my points was that we have a gatekeeper that has kept non-characters out of input.  Code that uses non-characters internally has relied on that gatekeeper to prevent conflicts.  If we change the gatekeeper to allow noncharacters, there is a potential security hole. Even the people on the Unicode list that were the promulgators of the change given by #9 agree that any existing code that excludes noncharacters should not be changed to allow them.

Well, for now, for my purposes, I put this into our code:

    use constant PERL514 => $] >= 5.014;
    # ... later in that same file…
        unless (PERL514) {
            # Replace the UTF-8 byte sequences for the noncharacters U+FFFE,
            # U+FFFF, and U+FDD0..U+FDEF with U+FFFD REPLACEMENT CHARACTER.
            $json =~ s/\xEF(?:\xBF[\xBF\xBE]|\xB7[\x90-\xAF])/\xEF\xBF\xBD/g;
        }

Which fixes the immediate issue for us on 5.10.1 (Thanks RedHat!) and should allow it to keep working once we get on a more modern Perl. This is because JSON(::XS)? on 5.14 and higher is okay with noncharacters, even if `decode("UTF-8", $json)` isn’t.
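
For anyone who wants to try the substitution in isolation, here is a minimal, self-contained sketch. The helper name `replace_bmp_noncharacters` and the sample input are made up for illustration; only the regex comes from the snippet above. It assumes the input is a string of raw UTF-8 bytes that has not yet been decoded, and it only covers the noncharacters that encode as three bytes (U+FDD0..U+FDEF, U+FFFE, U+FFFF), not the ones at the end of each supplementary plane.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original code): takes a string of raw
# UTF-8 bytes and returns a copy in which the three-byte noncharacter
# sequences are replaced by the bytes of U+FFFD REPLACEMENT CHARACTER.
sub replace_bmp_noncharacters {
    my ($bytes) = @_;
    # EF BF BE / EF BF BF   encode U+FFFE / U+FFFF
    # EF B7 90 .. EF B7 AF  encode U+FDD0 .. U+FDEF
    # EF BF BD              encodes U+FFFD
    $bytes =~ s/\xEF(?:\xBF[\xBF\xBE]|\xB7[\x90-\xAF])/\xEF\xBF\xBD/g;
    return $bytes;
}

# Made-up sample: a JSON document, as bytes, containing U+FFFF.
my $json  = qq({"k":"a\xEF\xBF\xBFb"});
my $clean = replace_bmp_noncharacters($json);

# Show the result as hex bytes so the replacement is visible.
print join(' ', map { sprintf '%02X', ord } split //, $clean), "\n";
```

Because the pattern is anchored on the lead byte EF, which never appears as a continuation byte, it should not match in the middle of some other character in well-formed UTF-8.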

As for where the “EF BF BF” is coming from: JavaScript and Flash running in a browser. Cool, right? FWIW, neither Java, JavaScript, nor Postgres complain about this noncharacter. I guess they tend to behave more like `decode_utf8`.

As I’ve solved my immediate problem, I’m fine to let you guys decide whether or not to change UTF8_DISALLOW_ILLEGAL_INTERCHANGE to exclude UTF8_DISALLOW_NONCHAR. Do you want a ticket to track the issue, or is sufficient? (I can add a comment there if you’d like, access controls allowing.)

Thanks for the detailed reply.


