On Thu, Jan 29, 2009 at 3:17 PM, Jan Dubois <jand@activestate.com> wrote:
> On Thu, 29 Jan 2009, karl williamson wrote:
>> The C standard guarantees that one can address one more than
>> the maximum size of a pointer's data,
>
> You can have a pointer that points "just beyond" the allocated size of
> an array, but you are not allowed to dereference it. It is only valid
> for comparison with other pointer values, or for subtracting another
> pointer from it that also points either inside or just beyond the same
> array object.

This particular issue is handled by only looking for 8skips when
we're before the final 8 bytes in the block; we check for sptr < eptr
before taking *sptr++.

Oh, no we don't. eptr might be at the end of a non-aligned block (is
there an architecture which would allocate such a block? Can perl data
be such a block?)

Decrementing eptr after calculating it would let us switch the
comparison and the postincrementing dereference; then we wouldn't need
the predecrement at the bottom when assigning back to s.

--- utf8_c_8skips.patch 2009-01-28 14:19:48.767165800 -0600
+++ utf8_c_8skips.patch_modified 2009-01-29 15:52:52.589057500 -0600
@@ -25,7 +25,7 @@
   * the bitops (especially ~) can create illegal UTF-8.
   * In other words: in Perl UTF-8 is not just for Unicode.
   */
-+    eptr = (const U64* ) ( (uintptr_t)(e) & 0xFFFFFFFFFFFFFFF8);
++    eptr = (const U64* ) ( (uintptr_t)(e) & 0xFFFFFFFFFFFFFFF8) - 1;
      if (e < s)
          goto warn_and_return;
      while (s < e) {
@@ -34,9 +34,8 @@
 +        register U64 manybits;
 +        /* skip bytes less than 128 eight at a time */
 +        sptr = (const U64*)(s);
-+        for(;;){
++        while ( sptr < eptr ) {
 +            manybits = *sptr++;
-+            if ( sptr >= eptr) break;
++
 +            if ((manybits & 0x8080808080808080) == 0 ){
 +                /* warn( "Successful 8skip at len %i",(int)(len)); */
 +                len += 8;
@@ -82,7 +81,7 @@
 +                goto found_high_bit;
 +            };
 +        };/* end of the loop by 8 bytes */
-+        s = (U8*)(--sptr);
++        s = (U8*)(sptr);
 +    };
 +    found_high_bit:
          t = UTF8SKIP(s);

I don't know about the endianness issues; the patch uses the U64 macro,
which should be an appropriate size even if it has to be char[8] or
such.

EBCDIC high invariants aren't a correctness problem, as neither of the
last two proposals counts continuation characters any more; they're
just concerned with identifying sequences of characters that would
have a UTF8SKIP table look-up value of 1.

--
"The only thing that separates us from the animals is tattoos" -- Drea Smith
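
P.S. For what it's worth, here is a minimal standalone sketch of the bounds idea discussed above. It is not the patch itself. The name skip_ascii is hypothetical, it assumes s <= e (the patch checks e < s earlier), and it swaps in memcpy for the U64 pointer cast so that alignment never comes into it. Because it only tests the high-bit mask, byte order should not matter either.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of the patch: count the leading bytes
 * in [s, e) whose high bit is clear. Assumes s <= e. */
static size_t skip_ascii(const unsigned char *s, const unsigned char *e)
{
    const unsigned char *p = s;

    /* Take 8 bytes at a time only while at least 8 remain, so no load
     * ever touches memory at or beyond e. */
    while ((size_t)(e - p) >= 8) {
        uint64_t manybits;
        memcpy(&manybits, p, 8);        /* fine for unaligned p */
        if (manybits & UINT64_C(0x8080808080808080))
            break;                      /* some byte here is >= 128 */
        p += 8;
    }

    /* Finish bytewise: handles the short tail and pins down the exact
     * position of the first high-bit byte. */
    while (p < e && *p < 0x80)
        p++;

    return (size_t)(p - s);
}
```

The "e - p >= 8" test plays the role of the adjusted eptr: the comparison comes before the load, so nothing needs undoing after the loop.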