
RFC: Summary of proposed handling of surrogates, non-characters,etc for 5.14. Note some backward incompatibility

From: karl williamson
Date: December 18, 2010 21:56
Subject: RFC: Summary of proposed handling of surrogates, non-characters,etc for 5.14. Note some backward incompatibility
Message ID: 4D0D9E0F.5090609@khwilliamson.com

This summarizes my understanding of what has come out of the latest 
round of discussions on this topic, one of many such rounds over the 
years.  Hopefully, now that there is someone with the tuits (me) to put 
a plan into action, we can move forward.

My claim has been that the way Perl currently handles non-character code 
points is so broken that fixing it means backward compatibility can't be 
fully preserved.  The basic plan is to start allowing these code points 
by default in the APIs instead of forbidding them.  Based on feedback, 
I've expanded this to include the code points above the Unicode legal 
maximum (called 'supers' internally) and, in the latest change, Unicode 
surrogates as well.  This means that any UV can be converted to/from 
utf8 without fuss.  I also plan to remove the distinction between supers 
that don't fit into 31 bits and other supers.  I believe the size of a 
UV should be the determining factor in what is allowed.  A search of the 
p5p archives and git log did not turn up any reason for this distinction.
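
To make the three categories concrete, here is a minimal C sketch of how 
a code point could be classified.  The enum and function names are 
hypothetical, chosen only for illustration, and are not Perl's internal 
macros; the ranges come from the Unicode definitions (surrogates 
U+D800..U+DFFF, non-characters U+FDD0..U+FDEF plus the last two code 
points of every plane, supers above 0x10FFFF).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for illustration; not Perl's internal macros. */
typedef enum { CP_ORDINARY, CP_SURROGATE, CP_NONCHAR, CP_SUPER } cp_class;

/* Classify a code point into one of the categories discussed above. */
static cp_class classify_code_point(uint64_t uv)
{
    if (uv > 0x10FFFF)
        return CP_SUPER;        /* above the Unicode legal maximum */
    if (uv >= 0xD800 && uv <= 0xDFFF)
        return CP_SURROGATE;    /* UTF-16 surrogate range */
    if ((uv >= 0xFDD0 && uv <= 0xFDEF) || (uv & 0xFFFE) == 0xFFFE)
        return CP_NONCHAR;      /* U+FDD0..U+FDEF, U+nFFFE and U+nFFFF */
    return CP_ORDINARY;         /* includes private use and unassigned */
}

/* Usage example: a few spot checks, one per category. */
void classify_self_check(void)
{
    assert(classify_code_point(0x41)     == CP_ORDINARY);
    assert(classify_code_point(0xD800)   == CP_SURROGATE);
    assert(classify_code_point(0xFFFE)   == CP_NONCHAR);
    assert(classify_code_point(0x110000) == CP_SUPER);
}
```

Under the proposal, none of these categories would be rejected by 
default; the classification only matters when a caller asks for 
warnings or strictness.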

Certain flags in the API are no longer applicable, since they specified 
what is now the default behavior.  I originally proposed to remove them 
to force module authors to realize that there has been a change.  But 
based on feedback by Tatsuhiko and a search of both CPAN and a Google 
code search, I propose to retain them, as no-ops.

Given this, the only code I found outside blead-upstream that needs to 
change is Encode.  The changes there are minimal, essentially calling 
the API with new flags, if available.  Even if we were to remove the 
no-op flags, there are only two CPAN modules that would need to change, 
and the changes to both are minimal; both are maintained by Tatsuhiko. 
I would submit the patch to Encode, which is upstream undef.

Basically, the proposal comes down to this: once inside Perl, any UV can 
be used freely.  I know that Yves thinks that surrogates should warn 
when certain operations are performed on them.  I'm not very opposed to 
that, but I don't think it's necessary: given the Unicode definitions of 
the properties applied to surrogates, they are essentially inert, upper- 
and lower-casing to themselves, etc.

Encode needs to ensure that these code points are not accepted on input 
by default.  Similarly, outputting them raises a warning, much like 
"Wide character in print".  A separate proposal for hardening the :utf8 
layer is also being negotiated, but its final disposition is not 
necessary for moving forward on this.  In any event there will be a way 
to input these code points; I just don't think it should be the 
defaultish method.

I also intend to subclass the UTF8 warning so that it is possible to 
turn it off separately for surrogates, non-chars, and supers.

So here are the proposed APIs for the two changed functions.  (They 
don't reflect subclassing the warnings):

uvuni_to_utf8_flags()

        Adds the UTF-8 representation of the code point "uv" to the end
        of the string "d"; "d" should have at least "UTF8_MAXBYTES+1"
        free bytes available. The return value is the pointer to the
        byte after the end of the new character. In other words,

            d = uvuni_to_utf8_flags(d, uv, flags);

        or, in most cases,

            d = uvuni_to_utf8(d, uv);

        (which is equivalent to)

            d = uvuni_to_utf8_flags(d, uv, 0);

        is the recommended Unicode-aware way of saying

            *(d++) = uv;

        The default is to not warn for code points that are illegal or
        problematic in Unicode, but if UTF8 warnings are enabled, this
        can be overridden by setting any combination of the following
        flags in "flags": UNICODE_WARN_SURROGATE (for code points that
        are UTF-16 surrogates in Unicode), UNICODE_WARN_NONCHAR (for
        code points that Unicode considers ok internally in an
        application but illegal for interchange), UNICODE_WARN_SUPER
        (for code points that are above the Unicode maximum legal code
        point of 0x10FFFF), and UNICODE_WARN_NON_STRICT_UNICODE (meaning
        any of the three types of problematic code points).
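
The behaviour described above can be illustrated with a standalone C 
sketch.  This is not the proposed implementation: the names sk_encode 
and SK_WARN_* are hypothetical stand-ins, the "warning" is reported 
through an out-parameter rather than Perl's warning machinery, and as a 
simplifying assumption only code points up to 0x7FFFFFFF (the original 
six-byte UTF-8 scheme) are handled, not the full range of a UV.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag names for this sketch; the real ones are in utf8.h. */
#define SK_WARN_SURROGATE  0x1u
#define SK_WARN_NONCHAR    0x2u
#define SK_WARN_SUPER      0x4u
#define SK_WARN_NON_STRICT (SK_WARN_SURROGATE | SK_WARN_NONCHAR | SK_WARN_SUPER)

/* WARN bit for the problematic category cp falls into, or 0. */
static unsigned sk_problem_bits(uint32_t cp)
{
    if (cp > 0x10FFFF) return SK_WARN_SUPER;
    if (cp >= 0xD800 && cp <= 0xDFFF) return SK_WARN_SURROGATE;
    if ((cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE)
        return SK_WARN_NONCHAR;
    return 0;
}

/* Append cp to d and return the pointer just past the bytes written.
 * d needs at least 6 free bytes.  If warned is non-NULL it receives the
 * categories that were both requested in flags and matched by cp (a
 * stand-in for raising a warning).  The code point is always encoded. */
static uint8_t *sk_encode(uint8_t *d, uint32_t cp, unsigned flags,
                          unsigned *warned)
{
    assert(d != NULL && cp <= 0x7FFFFFFF);
    if (warned)
        *warned = sk_problem_bits(cp) & flags;

    if (cp < 0x80) {            /* ASCII: one byte, stored as is */
        *d++ = (uint8_t)cp;
        return d;
    }

    int len = cp < 0x800 ? 2 : cp < 0x10000 ? 3 : cp < 0x200000 ? 4
            : cp < 0x4000000 ? 5 : 6;

    /* Start byte: len high bits set, then the top payload bits. */
    d[0] = (uint8_t)((0xFF00u >> len) | (cp >> (6 * (len - 1))));
    /* Continuation bytes: 10xxxxxx, six payload bits each. */
    for (int i = 1; i < len; i++)
        d[i] = (uint8_t)(0x80 | ((cp >> (6 * (len - 1 - i))) & 0x3F));
    return d + len;
}
```

With flags of 0 the surrogate 0xD800 is written silently as the bytes 
ED A0 80; passing SK_WARN_SURROGATE (or SK_WARN_NON_STRICT) produces the 
same bytes and additionally reports the category through "warned".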

utf8n_to_uvuni()

        Bottom level UTF-8 decode routine.  Returns the code point value
        of the first character in the string "s" which is assumed to be
        in UTF-8 (or UTF-EBCDIC) encoding and no longer than "curlen";
        "retlen" will be set to the length, in bytes, of that character.

        If "s" does not point to a well-formed UTF-8 character, the
        behaviour is dependent on the value of "flags".  The flags can
        be set to allow any desired set of deviations from this set
        and/or warn on certain subsets.

        If a malformation is found, the default behavior is to raise a
        warning, set "retlen" to the expected length of the UTF-8
        character in bytes, and return zero.  See below for ways of
        overriding the default.

        Code points corresponding to Unicode surrogates and
        non-characters, and code points above the Unicode maximum of
        0x10FFFF (up to the limit of what is storable in a UV) are by
        default allowed and do not raise warnings.  But if "flags"
        contains UTF8_ALLOW_ONLY_UNICODE_STRICT, all of these are
        treated as malformations.  The flags UTF8_DISALLOW_SURROGATE,
        UTF8_DISALLOW_NONCHAR, and UTF8_DISALLOW_SUPER (meaning above
        the legal Unicode maximum) can be set to disallow these
        categories individually.  (Note that in spite of the name,
        UTF8_ALLOW_ONLY_UNICODE_STRICT, non-character code points are ok
        in Unicode except when interchanging with other applications.)

        The flags UTF8_WARN_NON_STRICT_UNICODE, UTF8_WARN_SURROGATES,
        UTF8_WARN_NONCHAR, and UTF8_WARN_SUPER will return the code
        point values for their respective categories (unless the
        corresponding DISALLOW flag is also set), but will raise a
        warning in doing so.

        All other code points corresponding to Unicode characters,
        including private use and those yet to be assigned, are never
        considered malformed and never warn.

        Various ALLOW flags can be set to allow and not warn on
        individual types of malformations, such as a continuation byte
        where a start byte was expected.  See utf8.h for the list of
        such flags.  Of course, the value returned by this function
        under such conditions is not reliable.

        Warnings are likewise not raised if UTF8 warnings have been
        turned off lexically.  Turning them off overrides any WARN flags
        specified, and causes any DISALLOW flags to not warn, but the
        DISALLOW flags will always force their respective code points to
        be treated as malformations.

        The UTF8_CHECK_ONLY flag overrides the behavior when a
        malformation is found.  If this flag is set, the routine assumes
        that the caller will raise a warning, and this function will
        silently just set "retlen" to "-1" and return zero.

        Most code should use utf8_to_uvchr() rather than call this directly.
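
The decoding rules above can be sketched in standalone C as follows. 
Again this is an illustration under stated assumptions, not the 
proposed implementation: sk_decode and the SK_* flags are hypothetical 
names, "retlen" is a ptrdiff_t so that -1 can be stored directly, only 
sequences of up to six bytes are handled, the warning is a plain 
fprintf to stderr, and for a byte that cannot start a character the 
"expected length" is assumed to be 1.  The WARN and ALLOW flag families 
are left out to keep the sketch short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical flag names for this sketch; see utf8.h for the real ones. */
#define SK_DISALLOW_SURROGATE 0x1u
#define SK_DISALLOW_NONCHAR   0x2u
#define SK_DISALLOW_SUPER     0x4u
#define SK_CHECK_ONLY         0x8u

/* DISALLOW bit for the problematic category cp falls into, or 0. */
static unsigned sk_category(uint32_t cp)
{
    if (cp > 0x10FFFF) return SK_DISALLOW_SUPER;
    if (cp >= 0xD800 && cp <= 0xDFFF) return SK_DISALLOW_SURROGATE;
    if ((cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE)
        return SK_DISALLOW_NONCHAR;
    return 0;
}

/* Length implied by a start byte; 0 if b cannot start a character. */
static size_t sk_expected_len(uint8_t b)
{
    if (b < 0x80) return 1;
    if (b < 0xC0) return 0;     /* continuation byte */
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    if (b < 0xF8) return 4;
    if (b < 0xFC) return 5;
    if (b < 0xFE) return 6;
    return 0;                   /* 0xFE and 0xFF: outside this sketch */
}

/* Decode the first character of s (at most curlen bytes).  On success,
 * returns the code point and stores its byte length in *retlen.  On a
 * malformation, or a code point whose category is disallowed by flags,
 * returns 0; *retlen is the expected length after a warning, or -1 with
 * no warning when SK_CHECK_ONLY is set. */
static uint32_t sk_decode(const uint8_t *s, size_t curlen,
                          ptrdiff_t *retlen, unsigned flags)
{
    /* Smallest code point that needs each length (rejects overlongs). */
    static const uint32_t min_cp[7] =
        { 0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };

    assert(s != NULL && retlen != NULL && curlen > 0);

    size_t len = sk_expected_len(s[0]);
    int ok = (len != 0 && len <= curlen);
    uint32_t cp = 0;

    if (ok) {
        cp = (len == 1) ? s[0] : (uint32_t)(s[0] & (0x7Fu >> len));
        for (size_t i = 1; i < len && ok; i++) {
            if ((s[i] & 0xC0) != 0x80)
                ok = 0;         /* not a continuation byte */
            else
                cp = (cp << 6) | (s[i] & 0x3Fu);
        }
    }
    if (ok && len > 1 && cp < min_cp[len])
        ok = 0;                 /* overlong encoding */
    if (ok && (sk_category(cp) & flags))
        ok = 0;                 /* category disallowed by the caller */

    if (ok) {
        *retlen = (ptrdiff_t)len;
        return cp;
    }
    if (flags & SK_CHECK_ONLY) {
        *retlen = -1;           /* caller is expected to warn */
        return 0;
    }
    fprintf(stderr, "Malformed or disallowed UTF-8 character\n");
    *retlen = (ptrdiff_t)(len ? len : 1);
    return 0;
}
```

Decoding the bytes ED A0 80 with flags of 0 yields 0xD800 with a length 
of 3; adding SK_DISALLOW_SURROGATE turns the same input into a 
malformation that returns zero.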



