develooper Front page | perl.perl5.porters | Postings from January 2009

Re: Even faster Unicode character counting

karl williamson
January 3, 2009 11:21
Re: Even faster Unicode character counting
David Nicol wrote:
> On Tue, Dec 23, 2008 at 5:51 AM, Nicholas Clark <> wrote:
>> Jarkko alerted me to this
>> which references our very own Aristotle Pagaltzis.
>> Is anyone interested in experimenting with his bit-smashing approach and
>> seeing whether it can be used in Perl_utf8_length(), and what sort of a
>> speedup it gives? It's not the world's largest function:
> Well, it can't be used, because perl's arbitrary-length extension
> apparently does not follow the
> all-continuation-bytes-and-only-continuation-bytes-are-in-the-0x80-0xBF-range
> convention.  Well maybe it could be used by checking for and
> special-casing 0xFF before using the walk-then-run approach in the
> patch.  Although there are then scary alignment issues, and the
> 0b100..... bytes that are invariants in I8 for the EBCDIC port.
> What can be done, or what I have done, is to use masking to accelerate
> skipping aligned runs of invariants.  Haven't benchmarked it, but the
> attached patch passes all tests in t/uni both with and without
> skipping while keeping the subtractive rather than additive approach
> to length calculation.
> Happy 2009.
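[The masking technique David describes, accelerating the subtractive length calculation by skipping word-aligned runs of invariant (ASCII) bytes, can be sketched roughly as follows.  This is an illustrative stand-in, not the attached patch or Perl's actual Perl_utf8_length(); the function name is hypothetical.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of the subtractive approach with word-at-a-time
 * skipping: start from the byte length and subtract one for every
 * continuation byte (10xxxxxx).  When the read pointer is word-aligned
 * and a whole word has no high bit set, every byte in it is an ASCII
 * invariant and the word contributes nothing to the subtraction. */
static size_t utf8_char_count(const unsigned char *s, size_t len)
{
    const unsigned char *p = s;
    const unsigned char *end = s + len;
    size_t count = len;              /* subtractive: begin at byte length */
    const uintptr_t himask = (uintptr_t)0x8080808080808080ULL; /* truncates safely on 32-bit */

    while (p < end) {
        /* Fast path: skip aligned words that are all invariants. */
        if (((uintptr_t)p & (sizeof(uintptr_t) - 1)) == 0) {
            while ((size_t)(end - p) >= sizeof(uintptr_t)) {
                uintptr_t w;
                memcpy(&w, p, sizeof w);     /* alignment-safe load */
                if (w & himask)
                    break;                   /* word contains a variant byte */
                p += sizeof w;               /* all ASCII: no adjustment */
            }
            if (p >= end)
                break;
        }
        /* Slow path: one byte at a time until we realign. */
        if ((*p & 0xC0) == 0x80)
            count--;                         /* continuation byte */
        p++;
    }
    return count;
}
```

As noted above, this relies on the all-and-only-continuation-bytes-in-0x80..0xBF convention, so it is only valid for strict UTF-8, not for Perl's 0xFF extension bytes or for UTF-EBCDIC.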
I'm pretty certain that this won't work with UTF-EBCDIC.  I'm sorry if I 
wasn't clear earlier.  I8 encoding (which has nice properties like 
UTF-8) is INTERMEDIATE only.  UTF-EBCDIC is formed by a byte-by-byte 
transform of I8 into something else, so that the 160 EBCDIC invariants 
are actually invariant.  This mapping depends on the particular flavor 
of EBCDIC; Perl is supposed to recognize three such flavors.  Most 
operations that want to do hopping, etc., first transform UTF-EBCDIC into 
I8, or use one of the three compiled-in skip tables that have been 
pre-computed to avoid the transformation when all that is needed is the 
skip value.  For example, here is a definition from utfebcdic.h:
#define UTF8_IS_CONTINUED(c) 		(NATIVE_TO_UTF(c) >= 0xA0)

The NATIVE_TO_UTF macro transforms c into I8 using the appropriate 
table, and then the comparison is done.
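[A rough stand-in for such a pre-computed skip table might look like the following.  Note this table is filled in for plain UTF-8 rather than any EBCDIC flavor, purely to illustrate the technique; the names are made up, and Perl's real tables (one per supported flavor) are generated at build time, not at runtime.]

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical skip table: for a given first byte, how many bytes the
 * encoded character occupies.  Pre-computing this per flavor lets
 * length/hop code avoid the NATIVE_TO_UTF transform entirely when the
 * skip value is all it needs. */
static unsigned char skip_table[256];

static void init_skip_table(void)
{
    int i;
    for (i = 0; i < 256; i++) {
        if (i < 0xC0)      skip_table[i] = 1;  /* invariants, plus stray
                                                  continuations treated as 1 */
        else if (i < 0xE0) skip_table[i] = 2;
        else if (i < 0xF0) skip_table[i] = 3;
        else if (i < 0xF8) skip_table[i] = 4;
        else if (i < 0xFC) skip_table[i] = 5;
        else if (i < 0xFE) skip_table[i] = 6;
        else               skip_table[i] = 7;  /* Perl's 0xFF extension
                                                  actually skips further */
    }
}

/* Length by hopping: one table lookup per character, no transform. */
static size_t hop_length(const unsigned char *s, const unsigned char *e)
{
    size_t len = 0;
    while (s < e) {
        s += skip_table[*s];
        len++;
    }
    return len;
}
```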

I don't see how the transform can be parallelized, except on a 
multi-processor system.  So I think the answer for EBCDIC is to just 
#ifdef the new code for non-EBCDIC only.
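[That #ifdef split could be sketched as below.  This is illustrative, not Perl's actual source: the function name is made up, though PL_utf8skip is the real name of Perl's compiled-in skip table.]

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of confining the new fast path to ASCII
 * platforms.  The bit-masking trick assumes continuation bytes are
 * exactly 0x80..0xBF, which the UTF-EBCDIC transform breaks, so
 * EBCDIC builds would keep the existing walk untouched. */
static size_t sketch_utf8_length(const unsigned char *s, const unsigned char *e)
{
    size_t len = 0;
#ifdef EBCDIC
    /* Unchanged path: hop with the flavor's compiled-in skip table. */
    while (s < e) {
        s += PL_utf8skip[*s];
        len++;
    }
#else
    /* New path: count start bytes.  Anything that is not 10xxxxxx
     * begins a character; this test is what the masking code widens
     * to a word at a time. */
    while (s < e) {
        if ((*s & 0xC0) != 0x80)
            len++;
        s++;
    }
#endif
    return len;
}
```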

My understanding is that we shouldn't go out of our way to support 
EBCDIC, but I don't think we should deliberately break it either.  I can 
say that there are a number of places in the code where the ASCII 
(Latin1) character ordinal is hard-coded, so those places don't 
currently work with EBCDIC, and we don't get complaints.  (The German 
sharp s is one example: \xdf is often used, whereas in all the EBCDIC 
variants that Perl is supposed to support it should be \x59.)

But it seems that if we have some basic code that supposedly used to 
work with EBCDIC, we should leave it alone.
