On Thu, Jan 29, 2009 at 3:17 PM, Jan Dubois <jand@activestate.com> wrote:
> On Thu, 29 Jan 2009, karl williamson wrote:
>> The C standard guarantees that one can address one more than
>> the maximum size of a pointer's data,
>
> You can have a pointer that points "just beyond" the allocated size of
> an array, but you are not allowed to dereference it. It is only valid
> for comparison with other pointer values, or for subtracting another
> pointer from it that also points either inside or just beyond the same
> array object.
This particular issue is meant to be handled by only looking for 8skips
when we're before the final 8 bytes in the block; we check for
sptr < eptr before taking *sptr++. Oh, no we don't: the patch
dereferences first and only then compares. And eptr might be at the end
of a non-aligned block (is there an architecture which would allocate
such a block? Can perl data be such a block?). Decrementing eptr after
calculating it would let us swap the order of the comparison and the
postincrementing dereference, and then we wouldn't need the
predecrement at the bottom when assigning back to s.
--- utf8_c_8skips.patch 2009-01-28 14:19:48.767165800 -0600
+++ utf8_c_8skips.patch_modified 2009-01-29 15:52:52.589057500 -0600
@@ -25,7 +25,7 @@
* the bitops (especially ~) can create illegal UTF-8.
* In other words: in Perl UTF-8 is not just for Unicode. */
-+ eptr = (const U64* ) ( (uintptr_t)(e) & 0xFFFFFFFFFFFFFFF8);
++ eptr = (const U64* ) ( (uintptr_t)(e) & 0xFFFFFFFFFFFFFFF8) - 1;
if (e < s)
goto warn_and_return;
while (s < e) {
@@ -34,9 +34,8 @@
+ register U64 manybits;
+ /* skip bytes less than 128 eight at a time */
+ sptr = (const U64*)(s);
-+ for(;;){
++ while ( sptr < eptr ) {
+ manybits = *sptr++;
-+ if ( sptr >= eptr) break;
++
+ if ((manybits & 0x8080808080808080) == 0 ){
+ /* warn( "Successful 8skip at len %i",(int)(len)); */
+ len += 8;
@@ -82,7 +81,7 @@
+ goto found_high_bit;
+ };
+ };/* end of the loop by 8 bytes */
-+ s = (U8*)(--sptr);
++ s = (U8*)(sptr);
+ };
+ found_high_bit:
t = UTF8SKIP(s);
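For anyone reading the diff without the rest of utf8.c in front of them,
here is a minimal standalone sketch of the "check the bound, then load"
ordering. It is not the patch: skip_low_bytes is a hypothetical helper,
it uses uint64_t rather than perl's U64, it tests the remaining length
instead of decrementing an end pointer, and it loads through memcpy so
that an unaligned s is not a concern.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not from the patch: returns the number of leading
 * bytes in [s, e) that are below 0x80, testing eight bytes per step
 * while at least eight bytes remain. */
static size_t skip_low_bytes(const unsigned char *s, const unsigned char *e)
{
    const unsigned char *p = s;
    uint64_t manybits;

    assert(s <= e);

    /* The bound is checked before the load, so nothing past e is read. */
    while ((size_t)(e - p) >= sizeof manybits) {
        memcpy(&manybits, p, sizeof manybits);   /* alignment-safe load */
        if (manybits & UINT64_C(0x8080808080808080))
            break;                   /* some byte has its high bit set */
        p += sizeof manybits;        /* successful 8skip */
    }

    /* Finish byte by byte: either the short tail, or the word in which
     * a high bit was seen. */
    while (p < e && *p < 0x80)
        p++;

    return (size_t)(p - s);
}
```

Because the comparison happens before the load, nothing needs to be
undone after the loop, which is the same effect the modified patch gets
by dropping the --sptr.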
I don't know about the endianness issues; the patch uses the U64 macro,
which should be an appropriate size even if it has to be char[8] or
such.
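On the narrower question of the mask test itself, a small sketch
suggests byte order should not matter: 0x8080808080808080 has the same
value in every byte position, so the test gives the same answer
whichever way the eight bytes land in the word. The helpers below
(pack_le, pack_be, all_below_128, check_order_independent) are
hypothetical names for illustration, and this says nothing about any
later step that needs to know which byte had the high bit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only. */

/* Pack eight bytes with b[0] as the least significant byte. */
static uint64_t pack_le(const unsigned char b[8])
{
    uint64_t w = 0;
    int i;
    for (i = 7; i >= 0; i--)
        w = (w << 8) | b[i];
    return w;
}

/* Pack eight bytes with b[0] as the most significant byte. */
static uint64_t pack_be(const unsigned char b[8])
{
    uint64_t w = 0;
    int i;
    for (i = 0; i < 8; i++)
        w = (w << 8) | b[i];
    return w;
}

/* True when no byte of w has its high bit set. */
static int all_below_128(uint64_t w)
{
    return (w & UINT64_C(0x8080808080808080)) == 0;
}

/* The test result is the same for both byte orders. */
static void check_order_independent(const unsigned char b[8])
{
    assert(all_below_128(pack_le(b)) == all_below_128(pack_be(b)));
}
```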
EBCDIC high invariants aren't a correctness problem, as neither of the
last two proposals counts continuation characters any more; they're
just concerned with identifying sequences of characters that would
have a UTF8SKIP table look-up value of 1.
--
"The only thing that separates us from the animals is tattoos" -- Drea Smith