develooper Front page | perl.perl5.porters | Postings from December 2010

Re: RFC: Processing Unicode non-characters and code points beyond Unicode's

karl williamson
December 8, 2010 21:53
SADAHIRO Tomoyuki wrote:
> On Thu, 25 Nov 2010 10:23:17 -0700
> karl williamson wrote:
>> I think we have agreement here, but let me sum up to be sure.
>> 1) The current API will change (because it doesn't really have the 
>> capability to do things properly) so that by default the internal utf8 
>> encoding/decoding functions will allow non-character code points and 
>> above-Unicode code points.  The default for surrogates will continue to 
>> be that they are not allowed.  It will be possible to specify 
>> disallowing non-characters and beyond-Unicode characters by appropriate 
>> flags.  (Actually, the current API for utf8n_to_uvuni() always allows 
>> above-Unicode code points; I would extend it to allow excluding these.) 
>>   Existing macros that match subsets of the non-character code points 
>> will be removed and replaced by a single macro with a new name that 
>> matches all of them.
> Though I don't object to defining a new flag macro that makes
> utf8n_to_uvuni() disallow beyond-Unicode code points (uv >= 0x110000)
> and, if necessary, changing the flags passed to utf8n_to_uvuni()
> where it is called in the perl core,
> I think removing any existing macro that has been in place
> since perl 5.7.x or around 5.8.0 poses a backward-compatibility problem.
> Removing an existing macro means that any XS code using it
> can no longer be built.
> The API doc for utf8n_to_uvuni() in perl 5.12.2 (latest maint)
> states:
>      If s does not point to a well-formed UTF-8 character,
>      the behaviour is dependent on the value of flags:
>      [snip]
>      The flags can also contain various flags to allow
>      deviations from the strict UTF-8 encoding (see utf8.h).
>
>      UV utf8n_to_uvuni(const U8 *s, STRLEN curlen,
>                        STRLEN *retlen, U32 flags)
> so this document appears to permit perl users to pass the macros
> defined in utf8.h as flags to utf8n_to_uvuni().
> Regards,
> SADAHIRO Tomoyuki

I have delayed responding to this while I did some research.  It is true 
that it breaks backward compatibility; I thought that the previous 
discussions on this list had established the necessity of this. 
Basically Perl's handling of these is so badly broken that I don't know 
how to fix it without breaking backwards compatibility.  We strive not 
to do that, but it is my reluctant opinion that this situation qualifies 
for an exception.

I looked on CPAN and Google code search, and the only uses of these (not 
counting the forked Kurila) found outside the core are in three CPAN 
packages.  Two I already knew about, Encode and Unicode::Normalize, but 
the third is another one you support, Unicode::Transform.

Code that uses the existing flags has likely adopted the incorrect model 
that Perl has promulgated.  That may not be the case for certain of the 
ALLOW flags, such as UTF8_ALLOW_FFFF, which is the only one I found used 
outside the core.  I could continue to accept those but raise a 
deprecation warning.  On the other hand, I don't know for sure whether 
the author fully appreciated the issues with these flags when s/he wrote 
the code, since the model reflects the Unicode standard so poorly, and a 
compiler error is more likely to get someone's attention and make clear 
that they need to think about this.

There is some sentiment, which I tend to agree with, that this be 
extended to the surrogates: once a string is stored inside a 
variable, it ought to be able to represent any character representable 
in the machine word.
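
To make the three classes under discussion concrete, here is a minimal 
sketch in C of the policy described above: everything is accepted by 
default, and a caller opts in to rejecting surrogates, non-characters, or 
above-Unicode code points.  The SKETCH_* flag names and the function names 
are hypothetical, invented for this illustration; they are not the macros 
in utf8.h and not the eventual core API.  The code point ranges themselves 
come from the Unicode standard (66 non-characters: U+FDD0..U+FDEF plus the 
last two code points of each of the 17 planes).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flags for illustration only; NOT the names in utf8.h. */
#define SKETCH_DISALLOW_SURROGATE 0x01u
#define SKETCH_DISALLOW_NONCHAR   0x02u
#define SKETCH_DISALLOW_SUPER     0x04u  /* above U+10FFFF */

/* Surrogates: U+D800..U+DFFF. */
static bool is_surrogate(uint64_t cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}

/* Code points beyond the Unicode maximum of U+10FFFF. */
static bool is_above_unicode(uint64_t cp)
{
    return cp > 0x10FFFF;
}

/* Non-characters: U+FDD0..U+FDEF, plus U+xxFFFE and U+xxFFFF in each
 * of the 17 planes.  Nothing above U+10FFFF counts as a non-character. */
static bool is_nonchar(uint64_t cp)
{
    if (cp > 0x10FFFF)
        return false;
    if (cp >= 0xFDD0 && cp <= 0xFDEF)
        return true;
    return (cp & 0xFFFE) == 0xFFFE;
}

/* Returns true if cp is acceptable under the given flags.
 * With flags == 0, every code point is accepted. */
static bool code_point_ok(uint64_t cp, unsigned flags)
{
    if ((flags & SKETCH_DISALLOW_SURROGATE) && is_surrogate(cp))
        return false;
    if ((flags & SKETCH_DISALLOW_NONCHAR) && is_nonchar(cp))
        return false;
    if ((flags & SKETCH_DISALLOW_SUPER) && is_above_unicode(cp))
        return false;
    return true;
}
```

Under this sketch, code_point_ok(0xD800, 0) is true, while 
code_point_ok(0xD800, SKETCH_DISALLOW_SURROGATE) is false.  A single 
predicate covering all 66 non-characters is the kind of replacement 
proposed for the existing macros that each match only a subset.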

If it is decided to go ahead and do this, I am happy to offer up the 
essentially trivial patches for both your packages.
