David Nicol wrote:
> On Tue, Dec 23, 2008 at 5:51 AM, Nicholas Clark <nick@ccl4.org> wrote:
>> Jarkko alerted me to this
>> http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html
>>
>> which references our very own Aristotle Pagaltzis.
>>
>> Is anyone interested in experimenting with his bit-smashing approach and
>> seeing whether it can be used in Perl_utf8_length(), and what sort of a
>> speedup it gives? It's not the world's largest function:
>
> Well, it can't be used, because perl's arbitrary-length extension
> apparently does not follow the
> all-continuation-bytes-and-only-continuation-bytes-are-in-the-0x80-0xBF-range
> convention. Well maybe it could be used by checking for and
> special-casing 0xFF before using the walk-then-run approach in the
> patch. Although there are then scary alignment issues, and the
> 0b100..... bytes that are invariants in I8 for the EBCDIC port.
>
> What can be done, or what I have done, is to use masking to accelerate
> skipping aligned runs of invariants. Haven't benchmarked it, but the
> attached patch passes all tests in t/uni both with and without
> -DNEVERMINDABOUTTHEWORDSKIPPINGTHING which disables the accelerated
> skipping while keeping the subtractive rather than additive approach
> to length calculation.
>
>
> Happy 2009.
>

I'm pretty certain that this won't work with UTF-EBCDIC. I'm sorry if I
wasn't clear earlier. I8 encoding (which has nice properties like UTF-8)
is INTERMEDIATE only. UTF-EBCDIC is formed by a byte-by-byte transform of
I8 into something else, so that the 160 EBCDIC invariants are actually
invariant. This mapping depends on the particular flavor of EBCDIC. Perl
is supposed to recognize 3 such flavors. Most operations that want to do
hopping, etc. first transform UTF-EBCDIC into I8, or use one of the three
compiled in skip tables that have been pre-computed to avoid the
transformation when all that is needed is to know the skip value.

For example, here is a definition from utfebcdic.h:

    #define UTF8_IS_CONTINUED(c) (NATIVE_TO_UTF(c) >= 0xA0)

The NATIVE_TO_UTF macro transforms c into I8 using the appropriate table,
and then the comparison is done. I don't see how the transform can be
parallelized, except on a multi-processor system.

So I think the answer for EBCDIC is to just #ifdef the new code for
non-EBCDIC only. My understanding is that we shouldn't go out of our way
to support EBCDIC, but I don't think we should deliberately break it
either. I can say that there are a number of places in the code where the
ASCII (Latin1) character ordinal is hard-coded in, so these don't
currently work with EBCDIC, and we don't get complaints. (The German
Sharp S is one where \xdf is often used, whereas in all the EBCDIC
variants that Perl is supposed to support it should be \x59.) But it
seems that if we have some basic code that supposedly used to work with
EBCDIC we should leave it alone.
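
To make the #ifdef suggestion concrete, here is a minimal sketch in C. It is not the attached patch and not code from the perl source. The names toy_native_to_i8, init_toy_table, TOY_IS_CONTINUED and the leading_invariants functions are hypothetical, and the table holds a toy permutation rather than a real EBCDIC code page. It assumes EBCDIC is the macro the build defines on EBCDIC platforms. The word-at-a-time test only works where invariants are exactly the bytes with the high bit clear; the table version has to map each byte before comparing, which is the step that does not parallelize.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a native-to-I8 table. The contents are a toy
 * permutation for illustration only, NOT a real EBCDIC code page. */
static unsigned char toy_native_to_i8[256];

static void init_toy_table(void)
{
    for (int i = 0; i < 256; i++)
        toy_native_to_i8[i] = (unsigned char)(255 - i);
}

/* Same shape as the utfebcdic.h macro: map the byte first, then compare. */
#define TOY_IS_CONTINUED(c) (toy_native_to_i8[(unsigned char)(c)] >= 0xA0)

/* ASCII-based UTF-8 only: a byte is invariant iff its high bit is clear,
 * so eight bytes can be tested with a single mask. */
static size_t leading_invariants_ascii(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (len - i >= sizeof(uint64_t)) {
        uint64_t w;
        memcpy(&w, s + i, sizeof w);      /* avoids alignment problems */
        if (w & UINT64_C(0x8080808080808080))
            break;                        /* some byte has the high bit set */
        i += sizeof w;
    }
    while (i < len && s[i] < 0x80)        /* finish byte by byte */
        i++;
    return i;
}

/* Table-based platforms: every byte needs its own lookup before the
 * comparison, so there is no single mask to apply to a whole word. */
static size_t leading_invariants_table(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len && !TOY_IS_CONTINUED(s[i]))
        i++;
    return i;
}

/* The new fast path is compiled only where it is known to be valid. */
static size_t leading_invariants(const unsigned char *s, size_t len)
{
#ifdef EBCDIC
    return leading_invariants_table(s, len);
#else
    return leading_invariants_ascii(s, len);
#endif
}
```

With this arrangement the EBCDIC build keeps the byte-at-a-time behaviour it has today, and only the non-EBCDIC build gets the accelerated skip.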