Re: Even faster Unicode character counting

From: David Nicol
Date: January 29, 2009 13:57
Subject: Re: Even faster Unicode character counting
Message ID: 934f64a20901291357j3cb65922u9def28c47fdf5099@mail.gmail.com

On Thu, Jan 29, 2009 at 3:17 PM, Jan Dubois <jand@activestate.com> wrote:
> On Thu, 29 Jan 2009, karl williamson wrote:
>> The C standard guarantees that one can address one more than
>> the maximum size of a pointer's data,
>
> You can have a pointer that points "just beyond" the allocated size of
> an array, but you are not allowed to dereference it. It is only valid
> for comparison with other pointer values, or for subtracting another
> pointer from it that also points either inside or just beyond the same
> array object.


This particular issue is supposed to be handled by only looking for
8skips while we're before the final 8 bytes in the block: we check for
sptr < eptr before taking *sptr++.  Oh, no we don't; the posted patch
dereferences first and compares afterwards.  eptr might also be at the
end of a non-aligned block (is there an architecture which would
allocate such a block?  Can Perl data be such a block?).  Decrementing
eptr after calculating it lets us swap the order of the comparison and
the postincrementing dereference, and then we no longer need the
predecrement at the bottom when assigning back to s.
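
To illustrate the compare-before-dereference ordering outside the patch,
here is a minimal standalone sketch. It is not the patch itself:
skip_ascii_by_8 is a hypothetical name, and it loads each word with
memcpy rather than casting the pointer as the patch does, so it makes no
assumption about alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of the patch: returns how many leading
 * bytes of [s, e) can be skipped in 8-byte steps because every byte in
 * each step is below 128. */
static size_t skip_ascii_by_8(const unsigned char *s, const unsigned char *e)
{
    const unsigned char *p = s;
    assert(s <= e);

    /* Test the bound first, then load: a word is only read when all
     * 8 of its bytes lie inside [s, e). */
    while ((size_t)(e - p) >= 8) {
        uint64_t manybits;
        memcpy(&manybits, p, sizeof manybits);  /* alignment-safe load */
        if (manybits & UINT64_C(0x8080808080808080))
            break;                              /* some byte has its high bit set */
        p += 8;
    }
    return (size_t)(p - s);
}
```

Because the bound is tested before the load, the pointer never has to be
stepped back after the loop; the caller can continue byte by byte from
s + skip_ascii_by_8(s, e).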

--- utf8_c_8skips.patch 2009-01-28 14:19:48.767165800 -0600
+++ utf8_c_8skips.patch_modified        2009-01-29 15:52:52.589057500 -0600
@@ -25,7 +25,7 @@
       * the bitops (especially ~) can create illegal UTF-8.
       * In other words: in Perl UTF-8 is not just for Unicode. */

-+    eptr = (const U64* ) ( (uintptr_t)(e) & 0xFFFFFFFFFFFFFFF8);
++    eptr = (const U64* ) ( (uintptr_t)(e) & 0xFFFFFFFFFFFFFFF8) - 1;
      if (e < s)
        goto warn_and_return;
      while (s < e) {
@@ -34,9 +34,8 @@
 +            register U64 manybits;
 +            /* skip bytes less than 128 eight at a time */
 +            sptr = (const U64*)(s);
-+            for(;;){
++            while ( sptr < eptr ) {
 +                manybits = *sptr++;
-+                if ( sptr >= eptr) break;
++
 +                if ((manybits       & 0x8080808080808080) == 0 ){
 +                    /* warn( "Successful 8skip at len %i",(int)(len)); */
 +                    len += 8;
@@ -82,7 +81,7 @@
 +                    goto found_high_bit;
 +                };
 +            };/* end of the loop by 8 bytes */
-+            s = (U8*)(--sptr);
++            s = (U8*)(sptr);
 +        };
 +        found_high_bit:
        t = UTF8SKIP(s);



I don't know about the endianness issues.  The patch uses the U64 macro,
which should be an appropriate size even if it has to be char[8] or
some such.
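
On the narrower question of whether the high-bit test itself cares about
byte order, a small sketch suggests it does not: the mask is the same
byte repeated eight times, so reversing the bytes of the word cannot
change whether the AND is zero. The helper names below are hypothetical
and a plain 64-bit integer type is assumed; this says nothing about
other parts of the patch that might look at individual bytes of the
word.

```c
#include <assert.h>
#include <stdint.h>

#define HIGH_BITS UINT64_C(0x8080808080808080)

/* Hypothetical helper: nonzero if any of the 8 bytes in w is >= 128. */
static int any_high_bit(uint64_t w)
{
    return (w & HIGH_BITS) != 0;
}

/* Reverse the byte order of w, i.e. the value the same 8 bytes of
 * memory would produce when loaded with the opposite endianness. */
static uint64_t swap_bytes(uint64_t w)
{
    uint64_t r = 0;
    int i;
    for (i = 0; i < 8; i++) {
        r = (r << 8) | (w & 0xFF);
        w >>= 8;
    }
    return r;
}

/* The test gives the same answer for a word and its byte-swapped twin. */
static void check_order_independent(uint64_t w)
{
    assert(any_high_bit(w) == any_high_bit(swap_bytes(w)));
}
```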

EBCDIC high invariants aren't a correctness problem, as neither of the
last two proposals counts continuation characters any more; they're
only concerned with identifying runs of characters that would
have a UTF8SKIP table look-up value of 1.
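
For comparison, counting by skip length instead of by continuation bytes
can be sketched as below. This is an illustration under assumptions, not
Perl's code: skip_for_lead is a hypothetical stand-in for the UTF8SKIP
look-up, it covers only standard UTF-8 on an ASCII platform (no EBCDIC,
no Perl-specific longer sequences), and stepping over a stray
continuation byte one byte at a time is a choice made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a skip-table look-up: the sequence length
 * implied by a lead byte in standard UTF-8 on an ASCII platform. */
static size_t skip_for_lead(unsigned char c)
{
    if (c < 0x80) return 1;   /* invariant, one byte */
    if (c < 0xC0) return 1;   /* stray continuation byte: step over it */
    if (c < 0xE0) return 2;
    if (c < 0xF0) return 3;
    return 4;
}

/* Count characters in [s, e) by hopping from lead byte to lead byte;
 * continuation bytes are never examined or counted. */
static size_t count_chars(const unsigned char *s, const unsigned char *e)
{
    size_t len = 0;
    assert(s <= e);
    while (s < e) {
        size_t t = skip_for_lead(*s);
        if ((size_t)(e - s) < t)
            break;            /* truncated final sequence: stop counting */
        s += t;
        len++;
    }
    return len;
}
```

A run of bytes whose skip value is 1 contributes one character per byte,
which is why skipping such runs eight at a time only needs len += 8.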




-- 
"The only thing that separates us from the animals is tattoos" -- Drea Smith
