SADAHIRO Tomoyuki wrote:
> On Thu, 25 Nov 2010 10:23:17 -0700
> karl williamson wrote:
>
>> I think we have agreement here, but let me sum up to be sure.
>>
>> 1) The current API will change (because it doesn't really have the
>> capability to do things properly) so that by default the internal utf8
>> encoding/decoding functions will allow non-character code points and
>> above-Unicode code points. The default for surrogates will continue to
>> be that they are not allowed. It will be possible to specify
>> disallowing non-characters and beyond-Unicode characters by appropriate
>> flags. (Actually, the current API for utf8n_to_uvuni() always allows
>> above-Unicode code points; I would extend it to allow excluding these.)
>> Existing macros that match subsets of the non-character code points
>> will be removed and replaced by a single macro with a new name that
>> matches all of them.
>
> Though I don't object to defining a new flag macro that makes
> utf8n_to_uvuni() disallow beyond-Unicode (uv >= 0x110000)
> and, if necessary, changing the flags passed to utf8n_to_uvuni()
> called in perl core,
> I guess removal of any existing macro, that has been long-standing
> since perl 5.7.x or around 5.8.0, has a problem of backward compatibility.
>
> The removal of an existing macro means any XS code using the macro
> can't be built.
>
> The API doc for utf8n_to_uvuni() in perl 5.12.2 (latest maint)
> states (see http://perldoc.perl.org/perlapi.html#utf8n_to_uvuni )
>
>   If s does not point to a well-formed UTF-8 character,
>   the behaviour is dependent on the value of flags:
>   [snip]
>   The flags can also contain various flags to allow
>   deviations from the strict UTF-8 encoding (see utf8.h).
>
>     UV utf8n_to_uvuni(const U8 *s, STRLEN curlen,
>                       STRLEN *retlen, U32 flags)
>
> and then this document seems to allow for perl users to use the macros
> defined in utf8.h as flags passed to utf8n_to_uvuni().
>
> Regards,
> SADAHIRO Tomoyuki
>

I have delayed responding to this while I did some research. It is true that it breaks backward compatibility; I thought that the previous discussions on this list had established the necessity of this. Basically Perl's handling of these is so badly broken that I don't know how to fix it without breaking backwards compatibility. We strive not to do that, but it is my reluctant opinion that this situation qualifies for an exception.

I looked on CPAN and Google code search, and the only uses of these macros (not counting the forked Kurila) found outside the core are in three CPAN packages. Two I already knew about, Encode and Normalize, but the third is another one you support, Unicode::Transform.

Code that uses the existing flags likely has the incorrect model that Perl has promulgated. That may not be the case for certain of the ALLOW flags, such as UTF8_ALLOW_FFFF, which is the only one I found used outside the core. I could accept those, but raise a deprecation warning. On the other hand, I don't know for sure whether the author fully appreciated the issues these flags had when s/he wrote the code, since the model so poorly reflects the Unicode standard, and a compiler error is more likely to get someone's attention and point out that they need to think about this.

There is some sentiment, which I tend to agree with, that this be extended to the surrogates: that once a string is stored inside a variable, it ought to be able to represent any character representable in the machine word.

If it is decided to go ahead and do this, I am happy to offer up the essentially trivial patches for both your packages.
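For readers following the thread, the flag-based policy summarized in point 1 can be sketched in standalone C. This is only an illustration under stated assumptions: the DEMO_* flag names and demo_* functions are hypothetical, are not the macros defined in utf8.h, and say nothing about what the final utf8n_to_uvuni() interface looks like. The sketch assumes the proposed default (surrogates rejected, non-characters and above-Unicode code points accepted) and a single predicate covering every non-character code point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag names for illustration; NOT the macros in utf8.h. */
#define DEMO_DISALLOW_SURROGATE 0x1u  /* reject U+D800..U+DFFF */
#define DEMO_DISALLOW_NONCHAR   0x2u  /* reject Unicode non-characters */
#define DEMO_DISALLOW_SUPER     0x4u  /* reject code points above U+10FFFF */

/* Assumed default from the summary: only surrogates are rejected. */
#define DEMO_DEFAULT_FLAGS DEMO_DISALLOW_SURROGATE

static bool demo_is_surrogate(uint32_t cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}

static bool demo_is_super(uint32_t cp)
{
    return cp > 0x10FFFF;
}

/* One predicate for all non-characters: U+FDD0..U+FDEF plus the
 * last two code points of each plane (U+xxFFFE and U+xxFFFF). */
static bool demo_is_nonchar(uint32_t cp)
{
    if (demo_is_super(cp))
        return false;
    return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
}

/* Return true if the code point passes the policy selected by flags. */
static bool demo_cp_allowed(uint32_t cp, unsigned flags)
{
    if ((flags & DEMO_DISALLOW_SURROGATE) && demo_is_surrogate(cp))
        return false;
    if ((flags & DEMO_DISALLOW_NONCHAR) && demo_is_nonchar(cp))
        return false;
    if ((flags & DEMO_DISALLOW_SUPER) && demo_is_super(cp))
        return false;
    return true;
}

/* Small self-check of the default and the opt-in restrictions. */
void demo_self_check(void)
{
    assert(demo_cp_allowed(0xFFFF, DEMO_DEFAULT_FLAGS));
    assert(demo_cp_allowed(0x110000, DEMO_DEFAULT_FLAGS));
    assert(!demo_cp_allowed(0xD800, DEMO_DEFAULT_FLAGS));
    assert(!demo_cp_allowed(0x1FFFE, DEMO_DISALLOW_NONCHAR));
    assert(!demo_cp_allowed(0x110000, DEMO_DISALLOW_SUPER));
}
```

Passing 0 as the flags in this sketch would also admit surrogates, which corresponds to the sentiment mentioned above that a stored string should be able to hold any value representable in the machine word.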