First, my apologies to anyone who tried to do anything with
yesterday's patch. From now on I am testing against the whole suite,
not just one section, before sending this list any serious patches.
It seems that the whole count-the-continuation-bytes concept just
plain won't work, because Perl supports malformed utf8 for various
purposes, so the following is the most that can be done with the
concept of vectorized examination.
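To illustrate why the two counting strategies can disagree, here is a minimal sketch, not code from utf8.c. The helper names (skip_for, length_by_skip, length_by_noncontinuation) are hypothetical, and skip_for is a simplified stand-in for a lead-byte skip table that ignores the longer historical encodings.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a lead-byte skip table:
 * how many bytes the lead byte claims the character occupies. */
static size_t skip_for(unsigned char b)
{
    if (b < 0xC0) return 1;   /* ASCII, or a stray continuation byte */
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    return 4;                 /* longer forms ignored in this sketch */
}

/* Length found by trusting each lead byte and skipping ahead. */
static size_t length_by_skip(const unsigned char *s, const unsigned char *e)
{
    size_t len = 0;
    while (s < e) {
        size_t t = skip_for(*s);
        if ((size_t)(e - s) < t)
            break;            /* truncated character: stop counting here */
        s += t;
        len++;
    }
    return len;
}

/* Length found by counting every byte that is not 10xxxxxx. */
static size_t length_by_noncontinuation(const unsigned char *s,
                                        const unsigned char *e)
{
    size_t len = 0;
    for (; s < e; s++)
        if ((*s & 0xC0) != 0x80)
            len++;
    return len;
}
```

On well-formed input the two agree. On the three bytes C3 41 62, where a two-byte lead is followed by an ASCII byte, the skip-based walk reports 2 while the non-continuation count reports 3, so the second cannot simply replace the first.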
It should make finding the length of mostly 7-bit data faster, in
exchange for adding a second condition to every character in wide data
and, roughly 1/4 of the time (when the pointer is four-byte aligned),
a third condition and four additions.
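As a standalone illustration of the skip-four-bytes-at-a-time idea, here is a minimal sketch under my own assumptions, not the code in the patch. The function name ascii_prefix_len is hypothetical. It uses memcpy instead of an alignment test plus pointer cast, and it only loads a word when all four bytes lie before e. It assumes e >= s.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns the number of leading bytes in [s, e) that are below 128,
 * examining four bytes per step where a full word is available. */
static size_t ascii_prefix_len(const unsigned char *s, const unsigned char *e)
{
    const unsigned char *p = s;

    /* Word phase: run only while a whole four-byte word fits before e. */
    while ((size_t)(e - p) >= 4) {
        uint32_t w;
        memcpy(&w, p, sizeof w);        /* sidesteps alignment and aliasing */
        if (w & UINT32_C(0x80808080))   /* some byte has its high bit set */
            break;
        p += 4;
    }

    /* Byte phase: finish the tail, or find the exact high-bit byte. */
    while (p < e && *p < 0x80)
        p++;

    return (size_t)(p - s);
}
```

A caller could add the returned count to its running length, advance s by the same amount, and fall back to the per-character path from there. The mask test does not depend on byte order, since it checks the high bit of every byte in the word.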
Regardless of what the optimization goal is, the word "inclusive" in
the apidoc was confusing.
diff --git a/utf8.c b/utf8.c
index 8243793..ffce7a0 100644
--- a/utf8.c
+++ b/utf8.c
@@ -671,8 +671,8 @@ Perl_utf8_to_uvuni(pTHX_ const U8 *s, STRLEN *retlen)
=for apidoc utf8_length
Return the length of the UTF-8 char encoded string C<s> in characters.
-Stops at C<e> (inclusive). If C<e E<lt> s> or if the scan would end
-up past C<e>, croaks.
+Stops at C<e> without examining it. If C<e E<lt> s> or if the scan would
+end past C<e>, croaks.
=cut
*/
@@ -693,6 +693,18 @@ Perl_utf8_length(pTHX_ const U8 *s, const U8 *e)
if (e < s)
goto warn_and_return;
while (s < e) {
+ if (((uintptr_t)(s) & 0x00000003) == 0){
+ register const U32 *sptr;
+ sptr = (const U32*)(s);
+ /* skip bytes less than 128 four at a time */
+ do {
+ len += 4;
+ if (*sptr++ & 0x80808080)
+ break;
+ } while ((U8*)(sptr) < e);
+ len -= 4;
+ s = (U8*)(--sptr);
+ };
t = UTF8SKIP(s);
if (e - s < t) {
warn_and_return: