develooper Front page | perl.perl5.porters | Postings from December 2010

Re: RFC: Summary of proposed handling of surrogates, non-characters,etc for 5.14. Note some backward incompatibility

From: Eric Brine
Date: December 20, 2010 13:52
Subject: Re: RFC: Summary of proposed handling of surrogates, non-characters,etc for 5.14. Note some backward incompatibility
Message ID: AANLkTi=o4VsENYrhSWopm8e+GDWwXrpyTraG6d+KWKqw@mail.gmail.com

On Mon, Dec 20, 2010 at 1:13 PM, demerphq <demerphq@gmail.com> wrote:

> And the point is that certain codepoints are illegal in general, and
> so can be treated as essentially mapping to themselves.
>
> Others are legal ONLY in UTF16,


You are calling both the encoded form and the decoded form "code point", and
you are using them interchangeably. I can't respond to your post if you
don't clear that up.
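
To make the distinction concrete, here is a minimal sketch in Python (chosen only for illustration; it says nothing about how perl stores strings). It shows one code point, U+10400, next to two of its encoded forms. The helper name `utf16_units` is hypothetical.

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units (as ints) that encode `text`."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

cp = "\U00010400"              # decoded form: a single code point
units = utf16_units(cp)        # encoded form: two UTF-16 code units
utf8 = cp.encode("utf-8")      # encoded form: four UTF-8 bytes

print(len(cp), [hex(u) for u in units], utf8.hex())
```

The values 0xD801 and 0xDC00 are code units of the UTF-16 encoding of U+10400, not two separate characters in the decoded string.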

> So for an example. Consider we have the codepoint U+10400 which case
> folds to U+10428. When represented in UTF-16 the codepoint U+10400
> ends up as the surrogate pair U+D801,U+DC00. Now, if somebody naively
> converts this UTF-16 sequence to UTF8 by converting code point by
> codepoint, the end result will be that OUR code does NOT see codepoint
> U+10400,


No, the result will be a warning when you decode the bad UTF-8 you produced
this way. We all agree the decoder should warn (with an option to disable)
when it sees invalid UTF-8 or Unicode.
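
As a rough illustration of that point, the sketch below (Python, used only as a stand-in; Python's strict decoder raises an exception where the decoder discussed here would warn) performs the naive code-unit-by-code-unit conversion and then tries to decode the result. The function name `naive_utf16_to_utf8` is hypothetical.

```python
def naive_utf16_to_utf8(units):
    """Encode each UTF-16 code unit as if it were a code point (the mistake)."""
    # "surrogatepass" forces Python to emit the ill-formed 3-byte sequences.
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)

bad = naive_utf16_to_utf8([0xD801, 0xDC00])
print(bad.hex())               # six bytes that are not well-formed UTF-8

try:
    bad.decode("utf-8")        # strict decoding
    rejected = False
except UnicodeDecodeError:
    rejected = True

# Decoding the UTF-16 first and then encoding gives the proper UTF-8 bytes.
good = b"\xd8\x01\xdc\x00".decode("utf-16-be").encode("utf-8")
print(rejected, good.hex())
```

The strict decoder never yields U+10400 from the naive bytes; it rejects them, so the problem surfaces at decode time.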

- Eric

