perl.perl5.porters | Postings from January 2009

Re: Even faster Unicode character counting

From: karl williamson
Date: January 29, 2009 18:32
Subject: Re: Even faster Unicode character counting
Message ID: 498266BA.7090206@khwilliamson.com

David Nicol wrote:
> On Thu, Jan 29, 2009 at 3:17 PM, Jan Dubois <jand@activestate.com> wrote:
>> On Thu, 29 Jan 2009, karl williamson wrote:
>>> The C standard guarantees that one can address one more than
>>> the maximum size of a pointer's data,
>> You can have a pointer that points "just beyond" the allocated size of
>> an array, but you are not allowed to dereference it. It is only valid
>> for comparison with other pointer values, or for subtracting another
>> pointer from it that also points either inside or just beyond the same
>> array object.
> 
> 
> This particular issue is handled by only looking for 8skips when we're
> before the final 8 bytes in the block; we check for sptr < eptr before
> taking *sptr++.  Oh, no we don't.  eptr might be at the end of a
> non-aligned block (is there an architecture which would allocate such
> a block?  Can perl data be such a block?)  Decrementing eptr after
> calculating it would let us switch the comparison and the
> postincrementing dereference, then we wouldn't need the preincrement
> at the bottom when assigning back to s.
> 
> --- utf8_c_8skips.patch 2009-01-28 14:19:48.767165800 -0600
> +++ utf8_c_8skips.patch_modified        2009-01-29 15:52:52.589057500 -0600
> @@ -25,7 +25,7 @@
>        * the bitops (especially ~) can create illegal UTF-8.
>        * In other words: in Perl UTF-8 is not just for Unicode. */
> 
> -+    eptr = (const U64* ) ( (uintptr_t)(e) & 0xFFFFFFFFFFFFFFF8);
> ++    eptr = (const U64* ) ( (uintptr_t)(e) & 0xFFFFFFFFFFFFFFF8) - 1;
>       if (e < s)
>         goto warn_and_return;
>       while (s < e) {
> @@ -34,9 +34,8 @@
>  +            register U64 manybits;
>  +            /* skip bytes less than 128 eight at a time */
>  +            sptr = (const U64*)(s);
> -+            for(;;){
> ++            while ( sptr < eptr ) {
>  +                manybits = *sptr++;
> -+                if ( sptr >= eptr) break;
> ++
>  +                if ((manybits       & 0x8080808080808080) == 0 ){
>  +                    /* warn( "Successful 8skip at len %i",(int)(len)); */
>  +                    len += 8;
> @@ -82,7 +81,7 @@
>  +                    goto found_high_bit;
>  +                };
>  +            };/* end of the loop by 8 bytes */
> -+            s = (U8*)(--sptr);
> ++            s = (U8*)(sptr);
>  +        };
>  +        found_high_bit:
>         t = UTF8SKIP(s);
> 
> 
> 
> I don't know about the endianness issues, the patch uses the U64 macro
> which should be an appropriate size even if it has to be char[8] or
> such.

Why, then, does the code have a HAS_QUAD macro to say whether the machine 
even supports 64-bit integers, and other macros to declare a constant 
with or without an L suffix, for example?
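
To make the concern concrete, here is a minimal sketch (not the patch, and
not Perl's actual code) of skipping high-bit-clear bytes eight at a time
only when a 64-bit type is known to exist. The helper name
ascii_prefix_len is hypothetical, and standard C's UINT64_MAX is used as
a stand-in for a HAS_QUAD-style check; memcpy is used so that neither
alignment nor reading past the end is an issue.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns how many leading bytes of buf[0..len)
 * have their high bit clear. */
static size_t ascii_prefix_len(const unsigned char *buf, size_t len)
{
    size_t i = 0;

#ifdef UINT64_MAX   /* assumption: stand-in for a HAS_QUAD-style check */
    /* Eight bytes at a time while a full word remains. The mask is the
     * same in every byte, so byte order does not affect the test. */
    while (len - i >= 8) {
        uint64_t word;
        memcpy(&word, buf + i, sizeof word);
        if (word & UINT64_C(0x8080808080808080))
            break;              /* some byte in this word is >= 0x80 */
        i += 8;
    }
#endif

    /* Byte-at-a-time tail; this is the whole loop when no 64-bit type
     * is available. */
    while (i < len && buf[i] < 0x80)
        i++;
    return i;
}
```

The UINT64_C macro plays the role of the constant-suffix macros: it
appends whatever suffix the platform needs for a 64-bit constant.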
> 
> EBCDIC high invariants aren't a correctness problem as neither of the
> last two proposals count continuation characters any more, they're
> just concerned with identifying sequences of characters that would
> have a UTF8SKIP table look-up value of 1.
> 
You're right about this.  Sorry.  I hadn't realized that the high order 
byte of a variant UTF-EBCDIC character has to have its high bit set.  I 
don't remember reading that, and I found it through experimentation.

However, I still think you should do it the old way for EBCDIC.  The 
reason is that your method is slower on any real input for EBCDIC, 
because only punctuation doesn't have its high bit set.  That is, any 
text will be full of high-bit-set characters, which, as you said, are 
slower with your patch.
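
As a rough illustration of that point, the sketch below counts high-bit
bytes in the same short string encoded both ways. The byte values are
hand-encoded and are an assumption of this example: they are the usual
EBCDIC positions for letters, space, comma and period (for instance 'a'
is 0x81 and space is 0x40), and the function name is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: number of bytes in buf[0..len) with bit 7 set. */
static size_t count_high_bit_bytes(const unsigned char *buf, size_t len)
{
    size_t n = 0, i;
    for (i = 0; i < len; i++)
        if (buf[i] & 0x80)
            n++;
    return n;
}

/* "Hello, world." in ASCII: every byte is below 0x80. */
static const unsigned char hello_ascii[] = {
    0x48, 0x65, 0x6C, 0x6C, 0x6F, 0x2C, 0x20,
    0x77, 0x6F, 0x72, 0x6C, 0x64, 0x2E
};

/* The same text in EBCDIC (assumed common code-page values): only the
 * comma (0x6B), space (0x40) and period (0x4B) are below 0x80. */
static const unsigned char hello_ebcdic[] = {
    0xC8, 0x85, 0x93, 0x93, 0x96, 0x6B, 0x40,
    0xA6, 0x96, 0x99, 0x93, 0x84, 0x4B
};
```

With these tables, none of the 13 ASCII bytes has its high bit set, while
10 of the 13 EBCDIC bytes do, so an eight-byte skip would rarely succeed
on EBCDIC text.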

And, having said that, I doubt that EBCDIC is actually being used right 
now.  In my experimentation, I took the EBCDIC macros and compiled them.  
There is a bug: they don't cast the result of a left shift to 8 bits, so 
it exceeds the array bounds.  I noticed that problem a couple of months 
ago when I was reading code, but thought that as long as the result was 
8 bits, it didn't matter.  But it does matter.
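
The general mechanism can be sketched as follows. This is not the actual
EBCDIC macro; the function names, the shift amount and the 256-entry
table size are assumptions chosen for illustration, and CHAR_BIT is
assumed to be 8. In C, an unsigned char operand is promoted to int before
the shift, so the shifted value keeps its upper bits unless it is cast
back down.

```c
#include <stddef.h>

#define TABLE_SIZE 256   /* assumed size of a byte-indexed lookup table */

/* Without a cast: byte is promoted to int, so the result can be far
 * larger than 255 and would index past a 256-entry table. */
static size_t shifted_index_uncast(unsigned char byte, unsigned shift)
{
    return (size_t)(byte << shift);
}

/* With a cast back to 8 bits: the upper bits are discarded and the
 * result is always a valid index into a 256-entry table. */
static size_t shifted_index_cast(unsigned char byte, unsigned shift)
{
    return (size_t)(unsigned char)(byte << shift);
}
```

For example, 0xC1 shifted left by 2 is 0x304 (772) without the cast, but
0x04 with it.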
