First, my apologies to anyone who tried to do anything with yesterday's patch. I am now testing against the whole suite, instead of just one section, before sending this list any serious patches!

It seems that the whole "count the continuation bytes" concept just plain won't work, because Perl supports malformed UTF-8 for various purposes. The following is the most that can be done with the idea of vectorized examination. It should make finding the length of mostly 7-bit data faster. The cost, for each character in wide data, is a second condition and, 1/4 of the time (when s happens to sit on a four-byte boundary), a third condition and four additions.

Regardless of the optimization goal, the word "inclusive" in the apidoc was confusing, so the patch rewords that as well.

diff --git a/utf8.c b/utf8.c
index 8243793..ffce7a0 100644
--- a/utf8.c
+++ b/utf8.c
@@ -671,8 +671,8 @@ Perl_utf8_to_uvuni(pTHX_ const U8 *s, STRLEN *retlen)
 =for apidoc utf8_length
 
 Return the length of the UTF-8 char encoded string C<s> in characters.
-Stops at C<e> (inclusive). If C<e E<lt> s> or if the scan would end
-up past C<e>, croaks.
+Stops at C<e> without examining it. If C<e E<lt> s> or if the scan would
+end past C<e>, croaks.
 
 =cut
 */
@@ -693,6 +693,18 @@ Perl_utf8_length(pTHX_ const U8 *s, const U8 *e)
     if (e < s)
         goto warn_and_return;
     while (s < e) {
+        if (((uintptr_t)(s) & 0x00000003) == 0){
+            register const U32 *sptr;
+            sptr = (const U32*)(s);
+            /* skip bytes less than 128 four at a time */
+            do {
+                len += 4;
+                if (*sptr++ & 0x80808080)
+                    break;
+            } while ((U8*)(sptr) < e);
+            len -= 4;
+            s = (U8*)(--sptr);
+        };
         t = UTF8SKIP(s);
         if (e - s < t) {
             warn_and_return:
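
For anyone who wants to play with the idea outside of the Perl source, here is a minimal standalone sketch of the same "skip four 7-bit bytes at a time" trick. It is not the patch itself: `skip_len` is a hypothetical, simplified stand-in for `UTF8SKIP`, `sketch_utf8_length` is a made-up name, and a truncated trailing sequence simply stops the count instead of warning. Unlike the `do`/`while` in the patch, this version only loads a word when at least four bytes remain before `e`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for Perl's UTF8SKIP: the length of a
 * sequence judged from its first byte. Stray continuation bytes count as 1. */
static size_t skip_len(unsigned char c)
{
    if (c < 0xC0) return 1;   /* 7-bit byte or stray continuation byte */
    if (c < 0xE0) return 2;
    if (c < 0xF0) return 3;
    return 4;                 /* longer legacy forms are ignored here */
}

/* Count the characters in [s, e); e itself is never examined. */
static size_t sketch_utf8_length(const unsigned char *s, const unsigned char *e)
{
    size_t len = 0;

    while (s < e) {
        /* Fast path: s is on a 4-byte boundary and a whole word is left. */
        while (((uintptr_t)s & 3) == 0 && (size_t)(e - s) >= 4) {
            uint32_t w;
            memcpy(&w, s, sizeof w);      /* aliasing-safe word load */
            if (w & 0x80808080u)
                break;                    /* some byte is >= 128 */
            s   += 4;                     /* four 7-bit characters */
            len += 4;
        }
        assert(s <= e);
        if (s >= e)
            break;

        /* Slow path: one character, as in the unpatched loop. */
        size_t t = skip_len(*s);
        if ((size_t)(e - s) < t)
            break;                        /* truncated; real code warns */
        s += t;
        len++;
    }
    return len;
}
```

The mask test is independent of byte order, since it only asks whether any of the four bytes has its high bit set. The `memcpy` is there to sidestep strict-aliasing questions; compilers normally turn it into a single aligned load.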