Front page | perl.perl5.porters |
Postings from July 2014
Re: Encode vs. JSON
From: Karl Williamson
July 23, 2014 03:55
Message ID: 53CF321B.firstname.lastname@example.org
On 07/21/2014 12:22 PM, David E. Wheeler wrote:
> On Jul 19, 2014, at 9:58 PM, David E. Wheeler <email@example.com> wrote:
>>> there is a ticket about that:
>> Ah, interesting. I had not run into that warning. What I ran into with Encode I now think should be changed:
>> perl -MEncode -E 'say Encode::decode("UTF-8", "\xEF\xBF\xBF", Encode::FB_CROAK)'
>> utf8 "\xFFFF" does not map to Unicode at /usr/local/lib/perl5/site_perl/5.20.0/darwin-thread-multi-2level/Encode.pm line 175.
>> In fact it *does* map to Unicode, IIUC Corrigendum 9 correctly. I’ll file a bug with Dan.
> I did so, here:
> Dan replied to report that it’s UTF8_DISALLOW_ILLEGAL_INTERCHANGE from the Perl core that’s at fault:
>> If it were a bug, it belongs to perl core because the strictness of UTF8 is #defined in the value of UTF8_DISALLOW_ILLEGAL_INTERCHANGE which is defined in perl core:
>> In other words, Encode faithfully believes perl core with that respect. And I want to leave Encode that way. If it is to be fixed, it should be fixed by redefining UTF8_DISALLOW_ILLEGAL_INTERCHANGE to exclude UTF8_DISALLOW_NONCHAR in perl core.
> ISTM that, given the change in Corrigendum 9, UTF8_DISALLOW_ILLEGAL_INTERCHANGE should exclude UTF8_DISALLOW_NONCHAR.
> Is this part of the same issue as that described in RT-97358? Or should I start a new issue?
We have a backwards compatibility problem here. Corrigendum 9 is
controversial, and the wording has not been incorporated into the text
of Unicode 7.0 because that hasn't been published yet (the data has, but
not the text of the standard).
Noncharacters are still supposed to be used only for internal purposes.
The genesis of #9 was that ICU and CLDR were having trouble with
off-the-shelf editors and version control systems rejecting their code
that used them legitimately (though it appears that there are some poor
design decisions involving their use).
I sent a query about things to the Unicode mailing list some months ago,
and it stirred up quite a bit of resentment about the #9 decision. It
was made without public input, and during a single meeting, so there
wasn't time to consider all the ramifications.
One of my points was that we have a gatekeeper that has kept
noncharacters out of input. Code that uses noncharacters internally
has relied on that gatekeeper to prevent conflicts. If we change the
gatekeeper to allow noncharacters, there is a potential security hole.
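To make the gatekeeper idea concrete, here is a minimal sketch in Perl. It is not the core's implementation, and it assumes the input has already been decoded into a character string. The names is_noncharacter and reject_noncharacters are hypothetical, chosen only for illustration. The sketch relies on the Unicode definition of the 66 noncharacters: U+FDD0 through U+FDEF, plus the last two code points of each of the 17 planes.

```perl
use strict;
use warnings;

# True if $cp is one of the 66 Unicode noncharacters:
# U+FDD0..U+FDEF, or a code point ending in FFFE or FFFF
# in any plane up to U+10FFFF.
sub is_noncharacter {
    my ($cp) = @_;
    return 0 if $cp > 0x10FFFF;
    return 1 if $cp >= 0xFDD0 && $cp <= 0xFDEF;
    return ( $cp & 0xFFFE ) == 0xFFFE ? 1 : 0;
}

# Hypothetical gatekeeper: takes an already-decoded character string
# and dies if it contains a noncharacter, so that code further in can
# keep using noncharacters as private sentinels.
sub reject_noncharacters {
    my ($string) = @_;
    for my $char ( split //, $string ) {
        my $cp = ord $char;
        die sprintf( "noncharacter U+%04X in input\n", $cp )
            if is_noncharacter($cp);
    }
    return $string;
}

# Usage: ordinary text passes through, a string containing U+FFFF is refused.
print "accepted\n" if eval { reject_noncharacters("caf\x{E9}"); 1 };
print "rejected: $@" unless eval { reject_noncharacters("a\x{FFFF}b"); 1 };
```

Code behind such a check can treat a noncharacter as an in-band marker that no outside input will ever contain. If the check is relaxed, that assumption no longer holds, and that is where the potential hole comes from.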
Even the people on the Unicode list who were the promulgators of the
change given by #9 agree that any existing code that excludes
noncharacters should not be changed to allow them.