
Re: Even faster Unicode character counting

David Nicol
January 4, 2009 14:02
On Sat, Jan 3, 2009 at 1:21 PM, karl williamson <> wrote:
> I'm pretty certain that this won't work with UTF-EBCDIC.  I'm sorry if I
> wasn't clear earlier.  I8 encoding (which has nice properties like UTF-8) is
> INTERMEDIATE only.  UTF-EBCDIC is formed by a byte-by-byte transform of I8
> into something else, so that the 160 EBCDIC invariants are actually
> invariant.  This mapping depends on the particular flavor of EBCDIC.  Perl
> is supposed to recognize 3 such flavors.  Most operations that want to do
> hopping, etc. first transform UTF-EBCDIC into I8, or use one of the three
> compiled in skip tables that have been pre-computed to avoid the
> transformation when all that is needed is to know the skip value.  For
> example, here is a definition from utfebcdic.h:
> #define UTF8_IS_CONTINUED(c)            (NATIVE_TO_UTF(c) >= 0xA0)
> The NATIVE_TO_UTF macro transforms c into I8 using the appropriate table,
> and then the comparison is done.
> I don't see how the transform can be paralleled, except on a multi-processor
> system.  So I think the answer for EBCDIC is to just #ifdef the new code for
> non-ebcdic only.
> My understanding is that we shouldn't go out of our way to support EBCDIC,
> but I don't think we should deliberately break it either.  I can say that
> there are a number of places in the code where the ASCII (Latin1) character
> ordinal is hard-coded in, and so these right now don't work with EBCDIC, and
> we don't get complaints.  (The German Sharp S is one where \xdf is often
> used, whereas in all the EBCDIC variants that Perl is supposed to support it
> should be \x59.)
> But it seems that if we have some basic code that supposedly used to work
> with EBCDIC we should leave it alone.

The last patch does two things.  One is to replace the UTF8SKIP table,
which returns the expected length of a character based on its first
byte, with another table that gives (UTF8SKIP(c) - 1) instead.
The other is to zip over runs of bytes that are all less than 0x80 when
they are in aligned words.  As soon as a high bit is detected, we return
to business as usual.  This should give exactly the same results as the
current method.  No continuation bytes are examined.
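The two-speed loop described above can be sketched roughly as follows. This is not the patch itself: count_chars, skip_len and HIGHMASK are names made up for illustration, skip_len is a simplified stand-in for UTF8SKIP that only knows standard 1 to 4 byte UTF-8, and the whole thing assumes an ASCII (non-EBCDIC) platform.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 0x80 repeated in every byte of a word (hypothetical name). */
#define HIGHMASK ((uintptr_t)-1 / 0xFF * 0x80)

/* Simplified stand-in for UTF8SKIP: expected character length from the
 * first byte.  Anything that is not a valid lead byte counts as 1. */
static size_t skip_len(unsigned char c)
{
    if (c >= 0xC0 && c < 0xE0) return 2;
    if (c >= 0xE0 && c < 0xF0) return 3;
    if (c >= 0xF0 && c < 0xF8) return 4;
    return 1;
}

/* Count characters in s[0..len): whole aligned words with no high bit
 * set are counted in one step, everything else goes bytewise. */
size_t count_chars(const unsigned char *s, size_t len)
{
    const unsigned char *p = s;
    const unsigned char *end = s + len;
    size_t count = 0;

    while (p < end) {
        if ((uintptr_t)p % sizeof(uintptr_t) == 0) {
            /* Fast path: every byte of an all-ASCII word is one character. */
            while ((size_t)(end - p) >= sizeof(uintptr_t)) {
                uintptr_t w;
                memcpy(&w, p, sizeof w);   /* aligned word read */
                if (w & HIGHMASK)
                    break;                 /* high bit seen: go bytewise */
                p += sizeof w;
                count += sizeof w;
            }
            if (p >= end)
                break;
        }
        /* Business as usual: hop by the length the first byte implies;
         * continuation bytes are never examined. */
        size_t n = skip_len(*p);
        if (n > (size_t)(end - p))
            n = (size_t)(end - p);         /* truncated final character */
        p += n;
        count++;
    }
    return count;
}
```

Because the fast path only ever consumes bytes below 0x80, each of which is a complete character, the result should match a purely bytewise count regardless of where the buffer happens to be aligned.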

I figured out how to do a bit operation to detect a 0xFF byte, and was
considering using Percival's method until one of those appears, but as
Percival's method won't work with EBCDIC I'll leave that for
someone else to experiment with.

The test, which will catch 0xFF and false-positive on 0xFE (which is
not a problem), is

   if ((((utmp & ~ONESMASK) >> 1) + ONESMASK) & (ONESMASK << 7)) {
       /* sum up counts in current u; */
       /* set bptr to address where we found utmp; */
       /* resume bytewise examination */
   }
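For anyone who wants to try the test in isolation, here is a small sketch. It assumes a 64-bit word with ONESMASK equal to 0x01 in every byte (the message does not define it), and has_fe_or_ff is a made-up name. The last operator is written as & here, since OR-ing with (ONESMASK << 7) gives a nonzero value for every input.

```c
#include <stdint.h>

/* Assumed definition: 0x01 repeated in each byte of a 64-bit word. */
#define ONESMASK ((uint64_t)0x0101010101010101ULL)

/* Returns 1 if any byte of w is 0xFE or 0xFF, else 0.
 *
 * Per byte: clearing the low bit and shifting right by one maps the
 * byte to 0x00..0x7F, and nothing leaks in from the neighbouring byte
 * because its low bit was cleared too.  Adding 1 then reaches 0x80
 * only when the shifted byte was 0x7F, i.e. the original byte was
 * 0xFE or 0xFF, and the sum never carries into the next byte. */
int has_fe_or_ff(uint64_t w)
{
    return ((((w & ~ONESMASK) >> 1) + ONESMASK) & (ONESMASK << 7)) != 0;
}
```

The false positive on 0xFE comes from the low bit being thrown away in the first step, which is why the two values cannot be told apart by this test.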
