perl.perl5.porters | Postings from November 2013

Re: deprecation of utf8n_to_uvuni etc

Karl Williamson
November 28, 2013 17:07
On 10/17/2013 06:28 AM, SADAHIRO Tomoyuki wrote:
> Hello.
> Now perl5194delta includes the following statement:
>      Certain rarely used functions and macros available to XS code are
>      now, or are planned to be, deprecated. These are: utf8n_to_uvuni
>     (use utf8_to_uvchr_buf instead), utf8_to_uni_buf (use utf8_to_uvchr_buf
>      instead), valid_utf8_to_uvuni (use utf8_to_uvchr_buf instead),
>      uvuni_to_utf8 (use uvchr_to_utf8 instead), NATIVE_TO_NEED (this
>      did not work properly anyway), and ASCII_TO_NEED (this did not
>      work properly anyway).
>      Starting in this release, almost never does application code need
>      to distinguish between the platform's character set and Latin1,
>      on which the lowest 256 characters of Unicode are based.
> Dual-lived modules using these deprecated functions are only three:
> Encode, Unicode-Normalize, and Unicode-Collate. Certain rare.
> But replacement of uvuni_to_utf8 with uvchr_to_utf8 etc. should
> break the code on EBCDIC platforms.
> In Unicode, the value of 'A' (U+0041) must be always 65 (0x41).
> The deprecated functions consider the value of 'A' is always 65
> (0x41), even on EBCDIC platforms.
> Replacements consider the value of latin capital 'A' is 193 (0xC1)
> on EBCDIC platforms. There is no compatibility.
> From Perl's XS API, the interface for conversion between characters
> and Unicode code points will be removed. It's very sad.
> For the conversion, pack('U') and unpack('U') in pure perl still remain.
> Though it's possible that Unicode-Normalize and Unicode-Collate go back
> to pure perl immediately, since their CPAN releases include both of
> XS and pure perl, XS of which is incorporated in the perl distribution
> currently, of course pure perl is quite less efficient.

First, be assured that these functions will not be marked as deprecated,
much less removed, until blead contains the patches I'm writing that
make the affected modules work on EBCDIC platforms, or we've decided to
remove EBCDIC support.

(Also, there are undocumented functions that do the conversion, 
utf8::native_to_unicode() and utf8::unicode_to_native().  I'm working on 
a patch to document them.)
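
To make the idea of such a conversion concrete, here is a minimal C sketch.
It is not Perl's implementation: the names sketch_native_to_unicode() and
sketch_unicode_to_native() are hypothetical, and it covers only A-Z, a-z
and 0-9, whose EBCDIC positions are the same in the common code pages.
Everything else returns -1.

```c
#include <stddef.h>

/* Hypothetical helper table: the three runs of capital letters in EBCDIC. */
struct letter_run {
    int native_start;   /* EBCDIC value of the first capital in the run */
    int alpha_index;    /* 0-based position of that letter within A..Z  */
    int length;         /* number of letters in the run                 */
};

static const struct letter_run RUNS[] = {
    { 0xC1,  0, 9 },    /* A-I */
    { 0xD1,  9, 9 },    /* J-R */
    { 0xE2, 18, 8 },    /* S-Z */
};
#define N_RUNS   (sizeof RUNS / sizeof RUNS[0])
#define CASE_GAP 0x40   /* EBCDIC lowercase sits 0x40 below its capital */

/* EBCDIC value -> Unicode code point, or -1 if outside this sketch. */
int sketch_native_to_unicode(int native)
{
    size_t i;
    if (native >= 0xF0 && native <= 0xF9)       /* digits 0-9 */
        return 0x30 + (native - 0xF0);
    for (i = 0; i < N_RUNS; i++) {
        int upper_off = native - RUNS[i].native_start;
        int lower_off = native - (RUNS[i].native_start - CASE_GAP);
        if (upper_off >= 0 && upper_off < RUNS[i].length)
            return 0x41 + RUNS[i].alpha_index + upper_off;
        if (lower_off >= 0 && lower_off < RUNS[i].length)
            return 0x61 + RUNS[i].alpha_index + lower_off;
    }
    return -1;
}

/* Unicode code point -> EBCDIC value, or -1 if outside this sketch. */
int sketch_unicode_to_native(int cp)
{
    size_t i;
    int idx, is_lower = 0;
    if (cp >= 0x30 && cp <= 0x39)               /* digits 0-9 */
        return 0xF0 + (cp - 0x30);
    if (cp >= 0x41 && cp <= 0x5A) {
        idx = cp - 0x41;
    } else if (cp >= 0x61 && cp <= 0x7A) {
        idx = cp - 0x61;
        is_lower = 1;
    } else {
        return -1;
    }
    for (i = 0; i < N_RUNS; i++) {
        int off = idx - RUNS[i].alpha_index;
        if (off >= 0 && off < RUNS[i].length)
            return RUNS[i].native_start + off - (is_lower ? CASE_GAP : 0);
    }
    return -1;
}
```

With this sketch, U+0041 maps to 0xC1 (193) and back, matching the values
quoted above.  On an ASCII platform the two orderings coincide, so a real
conversion there has nothing to do.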

Blead now has the Unicode tables stored in native order.  What that 
means is the vast majority of programs that haven't bothered with these 
functions will now work without having to bother.  (This is not true, 
however, if a program has hard-coded a numerical code point constant. 
There's nothing I can think of to fix that, except adding a pragma to 
indicate whether such constants should be automatically translated, or not.)
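
The hard-coded constant problem has a direct analogue in C, shown in the
sketch below.  The function names are made up for illustration; the only
assumption is that the compiler's execution character set matches the
platform's, so that 'A' compiles to 0x41 on ASCII and 0xC1 on EBCDIC.

```c
/* Fragile: 0x41 is capital A only where the native character set
 * agrees with Unicode/ASCII.  On EBCDIC, 0x41 is some other character. */
int is_capital_a_hardcoded(int c)
{
    return c == 0x41;
}

/* Portable: the character literal takes the platform's native value,
 * so the comparison is right on both ASCII and EBCDIC. */
int is_capital_a_native(int c)
{
    return c == 'A';
}
```

The two functions agree on an ASCII platform and disagree on an EBCDIC
one, which is exactly the case that storing the tables in native order
cannot fix for you.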

This all started several years ago as I looked at the core code, and saw 
that a bunch of it that dealt with Unicode and EBCDIC had special casing 
just to translate to/from Unicode-order.  It occurred to me that if the 
Unicode tables were in terms of EBCDIC, this special casing could be 
removed.  As I recall, I ran this by Jarkko, and he didn't see a problem 
with it.  It turns out that it was fairly easy to change the tables.
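
As a rough illustration of the special casing that goes away, the sketch
below looks up one property ("is a decimal digit") two ways.  The tables
and function names are hypothetical, not the core's; only the ten digit
entries are filled in, and the translation table's other entries are left
0, which is not their real mapping.

```c
/* Property table indexed by Unicode code point: digits are 0x30-0x39. */
static const unsigned char digit_unicode_order[256] = {
    [0x30] = 1, [0x31] = 1, [0x32] = 1, [0x33] = 1, [0x34] = 1,
    [0x35] = 1, [0x36] = 1, [0x37] = 1, [0x38] = 1, [0x39] = 1,
};

/* The same property indexed by EBCDIC value: digits are 0xF0-0xF9. */
static const unsigned char digit_native_order[256] = {
    [0xF0] = 1, [0xF1] = 1, [0xF2] = 1, [0xF3] = 1, [0xF4] = 1,
    [0xF5] = 1, [0xF6] = 1, [0xF7] = 1, [0xF8] = 1, [0xF9] = 1,
};

/* Partial EBCDIC -> Unicode translation (digits only in this sketch). */
static const unsigned char native_to_unicode_digits[256] = {
    [0xF0] = 0x30, [0xF1] = 0x31, [0xF2] = 0x32, [0xF3] = 0x33,
    [0xF4] = 0x34, [0xF5] = 0x35, [0xF6] = 0x36, [0xF7] = 0x37,
    [0xF8] = 0x38, [0xF9] = 0x39,
};

/* Before: every lookup on EBCDIC first translates to Unicode order. */
int is_digit_via_translation(unsigned char native)
{
    return digit_unicode_order[native_to_unicode_digits[native]];
}

/* After: the table is already in native order, so no translation step. */
int is_digit_native_table(unsigned char native)
{
    return digit_native_order[native];
}
```

Both functions give the same answers for an EBCDIC input; the second just
has no platform-specific step left in the lookup path.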

We have a version of v5.19 whose core mostly works on z/OS EBCDIC (93%
of tests passing).  I have not yet gotten to the cpan-upstream modules,
including the ones you mentioned.

EBCDIC support is very tenuous, and will end up being removed unless we 
get something that passes 100% of the tests (skipping some may be 
allowed), and runs smoke reports regularly on blead.

My goal was to save EBCDIC, not destroy these modules.  I have gotten
rid of most cases in the core where EBCDIC has different code paths than
ASCII platforms.  The remaining ones are mostly for performance, or for
handling the fact that A-Z is not strictly sequential; most of those are
in toke.c (concerning tr///), and one is in regcomp.c.  The differences
are almost entirely isolated in two headers and one .c file.  This makes
things easier to maintain, and also makes it easier to argue for letting
EBCDIC support stay in somewhat longer even without a smoke platform, as
there just aren't that many places where it diverges from ASCII.  It
also makes EBCDIC support easier to remove if and when we decide to do
so, which was my other goal.  The changes involve removing the extra
layer of indirection that these functions (the ones to be deprecated)
can involve on ASCII platforms.
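
For readers who haven't met the A-Z issue, here is a small C sketch of
it.  The function names are hypothetical and this is not the code in
toke.c or regcomp.c; it only shows why a single range test is not enough
on EBCDIC, where the capitals fall into three separate runs.

```c
/* ASCII: the capitals are one contiguous range. */
int ascii_is_upper(unsigned char c)
{
    return c >= 0x41 && c <= 0x5A;
}

/* EBCDIC: the capitals come in three runs with gaps between them,
 * so a range like A-Z has to be handled piecewise. */
int ebcdic_is_upper(unsigned char c)
{
    return (c >= 0xC1 && c <= 0xC9)     /* A-I */
        || (c >= 0xD1 && c <= 0xD9)     /* J-R */
        || (c >= 0xE2 && c <= 0xE9);    /* S-Z */
}
```

A naive test of 0xC1 <= c <= 0xE9 would wrongly accept the values in the
gaps (0xCA-0xD0 and 0xDA-0xE1), which is the kind of thing tr/A-Z//
has to get right.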

It may be that getting Normalize to work on EBCDIC will involve me 
changing the tables that it uses back to Unicode order even on those 
platforms.  But again, if EBCDIC stays, I will make sure to submit 
patches to you so that it passes its tests on EBCDIC.

You do bring up a point that I had not thought of, which involves 
Collate.  I believe it uses its own Unicode-ordered files, and doesn't 
use the ones the core generates, unlike Normalize and Encode.  There may 
be other distributions out there, or to come, that do the same.  Unicode 
publishes other files that are not part of the Unicode Character 
Database, but which refer to Unicode characters nonetheless.  Those 
distributions may need functions like the ones that are to be 
deprecated.  However, the vast majority of distributions don't need
such functions, and the few that use the current ones (you found 3 in
all of CPAN) should be forced to convert by our removing them.  We can
easily provide this functionality under functions with new names for
Collate's benefit.

Thus, pack('U') needs to handle both Unicode-ordered code points and
EBCDIC-ordered ones.  Most applications will want the latter; Collate,
and perhaps Normalize and Encode, the former.  pack() formats already
allow the '!' modifier to specify "native" behavior, and I'm thinking
that should be added to pack('U').  The question then becomes whether
'!' here means native, with Unicode as the default, or vice versa.
Consistency with current practice would argue the former, but the
quantity of applications that would have to change argues the latter, as
it is rare for applications to have to use the base Unicode tables.
Your search yielded just the three mentioned.  I've looked at the code
for Normalize and Collate, and there is a subroutine pack_U() through
which the pack('U') calls go.  Hence it is a one-line change to add the
! to get those to work properly no matter what the platform.

Re-reading the above, I'm not sure I've been very clear, so feel free to 
ask for clarification.
