Re: deprecation of utf8n_to_uvuni etc
From: Karl Williamson
November 28, 2013 17:07
Message ID: 5297784F.email@example.com
On 10/17/2013 06:28 AM, SADAHIRO Tomoyuki wrote:
> Now perl5194delta includes the following statement:
> Certain rarely used functions and macros available to XS code are
> now, or are planned to be, deprecated. These are: utf8n_to_uvuni
> (use utf8_to_uvchr_buf instead), utf8_to_uni_buf (use utf8_to_uvchr_buf
> instead), valid_utf8_to_uvuni (use utf8_to_uvchr_buf instead),
> uvuni_to_utf8 (use uvchr_to_utf8 instead), NATIVE_TO_NEED (this
> did not work properly anyway), and ASCII_TO_NEED (this did not
> work properly anyway).
> Starting in this release, almost never does application code need
> to distinguish between the platform's character set and Latin1,
> on which the lowest 256 characters of Unicode are based.
> Dual-lived modules using these deprecated functions are only three:
> Encode, Unicode-Normalize, and Unicode-Collate. Certain rare.
> But replacement of uvuni_to_utf8 with uvchr_to_utf8 etc. should
> break the code on EBCDIC platforms.
> In Unicode, the value of 'A' (U+0041) must be always 65 (0x41).
> The deprecated functions consider the value of 'A' is always 65
> (0x41), even on EBCDIC platforms.
> Replacements consider the value of latin capital 'A' is 193 (0xC1)
> on EBCDIC platforms. There is no compatibility.
> From Perl's XS API, the interface for conversion between characters
> and Unicode code points will be removed. It's very sad.
> For the conversion, pack('U') and unpack('U') in pure perl still remain.
> Though it's possible that Unicode-Normalize and Unicode-Collate go back
> to pure perl immediately, since their CPAN releases include both of
> XS and pure perl, XS of which is incorporated in the perl distribution
> currently, of course pure perl is quite less efficient.
First, be assured that these functions will not be marked as deprecated,
much less removed, until blead contains the patches I'm writing that
make the affected modules work on EBCDIC platforms, or we've decided to
remove EBCDIC support.
(Also, there are undocumented functions that do the conversion,
utf8::native_to_unicode() and utf8::unicode_to_native(). I'm working on
a patch to document them.)
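
To make the native/Unicode distinction concrete, here is a toy model in
standalone C. It is not Perl's API and not how the utf8:: functions are
implemented; the function names are hypothetical, it assumes an EBCDIC
platform (code page 1047 or 037), and it covers only the letters A-Z,
returning -1 for anything else. On an ASCII platform the real conversions
are the identity for these characters.

```c
/* Toy model only: hypothetical names, EBCDIC (code page 1047/037) assumed,
 * and only 'A'..'Z' are modelled.  Anything else yields -1. */

/* Native (EBCDIC) code point -> Unicode code point. */
int toy_native_to_unicode(int native)
{
    if (native >= 0xC1 && native <= 0xC9) return 0x41 + (native - 0xC1); /* A-I */
    if (native >= 0xD1 && native <= 0xD9) return 0x4A + (native - 0xD1); /* J-R */
    if (native >= 0xE2 && native <= 0xE9) return 0x53 + (native - 0xE2); /* S-Z */
    return -1; /* not modelled in this sketch */
}

/* Unicode code point -> native (EBCDIC) code point. */
int toy_unicode_to_native(int uni)
{
    if (uni >= 0x41 && uni <= 0x49) return 0xC1 + (uni - 0x41); /* A-I */
    if (uni >= 0x4A && uni <= 0x52) return 0xD1 + (uni - 0x4A); /* J-R */
    if (uni >= 0x53 && uni <= 0x5A) return 0xE2 + (uni - 0x53); /* S-Z */
    return -1; /* not modelled in this sketch */
}
```

In this model toy_unicode_to_native(0x41) gives 0xC1 (193) and
toy_native_to_unicode(0xC1) gives 0x41 (65), which is the 'A' discrepancy
you describe.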
Blead now has the Unicode tables stored in native order. What that
means is that the vast majority of programs, which have never bothered with
these functions, will now work without having to bother. (This is not true,
however, if a program has hard-coded a numerical code point constant.
There's nothing I can think of to fix that, except adding a pragma to
indicate whether such constants should be automatically translated, or not.)
This all started several years ago as I looked at the core code, and saw
that a bunch of it that dealt with Unicode and EBCDIC had special casing
just to translate to/from Unicode-order. It occurred to me that if the
Unicode tables were in terms of EBCDIC, this special casing could be
removed. As I recall, I ran this by Jarkko, and he didn't see a problem
with it. It turns out that it was fairly easy to change the tables.
We have a version of v5.19, the core of which mostly (93% passing tests)
works on z/OS EBCDIC. I have not yet gotten to the cpan-upstream modules,
including the ones you mentioned.
EBCDIC support is very tenuous, and will end up being removed unless we
get something that passes 100% of the tests (skipping some may be
allowed), and runs smoke reports regularly on blead.
My goal was to save EBCDIC, not destroy these modules. I have gotten
rid of most cases in the core where EBCDIC has different code paths than
ASCII platforms. The remaining ones are mostly for performance, and
handling the fact that A-Z is not strictly sequential, mostly in toke.c
concerning tr///, and one in regcomp.c. The differences are almost
entirely isolated in two headers and one .c file. This makes things easier
to maintain, and makes it easier to argue for keeping EBCDIC support
somewhat longer even without a smoke platform, as there just aren't that
many places where it diverges from ASCII. It also makes EBCDIC support
easier to remove should we decide to do so, which was my other goal. The
changes involve removing the extra layer of indirection that the
to-be-deprecated functions can impose on ASCII platforms.
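
As an illustration of the "A-Z is not strictly sequential" problem: in
EBCDIC the uppercase letters sit in three separate runs (A-I at 0xC1-0xC9,
J-R at 0xD1-0xD9, S-Z at 0xE2-0xE9), so a single range test that is correct
on ASCII also accepts bytes in the gaps. The following is a standalone C
sketch, not code from toke.c or regcomp.c, and the function names are
hypothetical.

```c
#include <stdbool.h>

/* Hypothetical helpers; byte values are the EBCDIC (code page 1047/037)
 * positions of the uppercase Latin letters. */

/* Naive single-range test.  The equivalent test is fine on ASCII, but here
 * the range 0xC1..0xE9 also covers bytes in the gaps that are not A-Z. */
bool naive_ebcdic_is_upper(unsigned char c)
{
    return c >= 0xC1 && c <= 0xE9;
}

/* Gap-aware test: checks each of the three runs separately. */
bool ebcdic_is_upper(unsigned char c)
{
    return (c >= 0xC1 && c <= 0xC9)   /* A-I */
        || (c >= 0xD1 && c <= 0xD9)   /* J-R */
        || (c >= 0xE2 && c <= 0xE9);  /* S-Z */
}
```

For example, byte 0xD0 lies between I and J; the naive test accepts it and
the gap-aware test rejects it. A range such as tr/A-Z// needs the same
care.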
It may be that getting Normalize to work on EBCDIC will involve me
changing the tables that it uses back to Unicode order even on those
platforms. But again, if EBCDIC stays, I will make sure to submit
patches to you so that it passes its tests on EBCDIC.
You do bring up a point that I had not thought of, which involves
Collate. I believe it uses its own Unicode-ordered files, and doesn't
use the ones the core generates, unlike Normalize and Encode. There may
be other distributions out there, or to come, that do the same. Unicode
publishes other files that are not part of the Unicode Character
Database, but which refer to Unicode characters nonetheless. Those
distributions may need functions like the ones that are to be
deprecated. However, the vast majority of distributions don't need
those, and the few that are using them (you saw 3 in all of CPAN)
should be forced to convert by our removing the old functions. We can
easily provide this functionality under new function names for Collate's
benefit.
Thus, pack('U') needs to handle both Unicode-ordered code points, and
EBCDIC-ordered ones. Most applications will want the latter. Collate
and perhaps Normalize and Encode, the former. pack() formats already
allow the '!' modifier to specify "native" behavior. I'm thinking that
should be added to pack('U'). The question then becomes whether ! here
should mean native, with Unicode as the default, or vice versa. Consistency
with current practice would argue the former, but the quantity of
applications that would have to change argues the latter, as it is rare
for applications to have to use the base Unicode tables. Your search
yielded just the three mentioned. I've looked at the code for Normalize
and Collate, and there is a subroutine pack_U() through which the
pack('U') calls go. Hence it is a one-line change to add the ! to get
those to work properly no matter what the platform.
Re-reading the above, I'm not sure I've been very clear, so feel free to
ask for clarification.