
Re: Even faster Unicode character counting

From: David Nicol
Date: January 13, 2009 09:26
Subject: Re: Even faster Unicode character counting
Message ID: 934f64a20901130926o21733aa8sfc4eb2350abf10f@mail.gmail.com

First, my apologies to anyone who tried to do anything with
yesterday's patch.  From now on I will test against the whole suite,
not just one section, before sending this list any serious patches.

It seems that the whole count-the-continuation-bytes concept just
plain won't work, because Perl supports malformed UTF-8 for various
purposes, so the following is the most that can be done with the
idea of vectorized examination.
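
To illustrate why the two counting strategies can disagree, here is a
minimal standalone sketch (not code from utf8.c).  skip_from_lead() is a
hypothetical, simplified stand-in for UTF8SKIP: it is assumed to treat a
stray continuation byte as a one-byte character and it ignores sequences
longer than four bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Strategy 1: count every byte that is not a continuation byte (10xxxxxx). */
static size_t count_non_continuation(const unsigned char *s,
                                     const unsigned char *e)
{
    size_t len = 0;
    assert(s <= e);
    for (; s < e; s++)
        if ((*s & 0xC0) != 0x80)
            len++;
    return len;
}

/* Hypothetical stand-in for UTF8SKIP: the sequence length implied by the
 * first byte alone.  Assumption: a stray continuation byte counts as 1. */
static size_t skip_from_lead(unsigned char c)
{
    if (c < 0xC0) return 1;   /* ASCII, or a stray continuation byte */
    if (c < 0xE0) return 2;
    if (c < 0xF0) return 3;
    return 4;                 /* longer forms are ignored in this sketch */
}

/* Strategy 2: trust the first byte of each sequence and hop over it. */
static size_t count_by_skipping(const unsigned char *s,
                                const unsigned char *e)
{
    size_t len = 0;
    assert(s <= e);
    while (s < e) {
        size_t t = skip_from_lead(*s);
        if ((size_t)(e - s) < t)
            break;            /* truncated sequence: stop counting */
        s += t;
        len++;
    }
    return len;
}
```

On the well-formed bytes C3 A9 61 both functions return 2.  On the
malformed bytes 61 80 62 (a stray continuation byte) the first returns 2
and the second returns 3, so one cannot silently replace the other.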

It should make finding the length of mostly 7-bit data faster, in
exchange for adding a second condition per character in wide data and,
about 1/4 of the time (whenever s happens to be 4-byte aligned), a third
condition and four additions.
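
The word-at-a-time test can be sketched on its own as follows.  This is
an illustration under stated assumptions, not the patch itself: the
names skip_ascii_words(), lead_skip() and sketch_utf8_length() are
hypothetical, lead_skip() is a simplified stand-in for UTF8SKIP, and the
loads go through memcpy() so that the sketch needs no alignment check
and never reads at or past e.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Advance over whole 4-byte groups that contain only bytes below 128,
 * adding 4 to *len for each group.  Never reads at or beyond e. */
static const unsigned char *
skip_ascii_words(const unsigned char *s, const unsigned char *e, size_t *len)
{
    assert(s <= e);
    while ((size_t)(e - s) >= 4) {
        uint32_t w;
        memcpy(&w, s, sizeof w);        /* alignment-safe 32-bit load */
        if (w & UINT32_C(0x80808080))   /* some byte has its high bit set */
            break;
        s += 4;
        *len += 4;
    }
    return s;
}

/* Hypothetical, simplified stand-in for UTF8SKIP. */
static size_t lead_skip(unsigned char c)
{
    if (c < 0xC0) return 1;
    if (c < 0xE0) return 2;
    if (c < 0xF0) return 3;
    return 4;
}

/* Count characters in [s, e): fast path for ASCII runs, then fall back
 * to one character at a time. */
static size_t sketch_utf8_length(const unsigned char *s,
                                 const unsigned char *e)
{
    size_t len = 0;
    assert(s <= e);
    while (s < e) {
        size_t t;
        s = skip_ascii_words(s, e, &len);
        if (s >= e)
            break;
        t = lead_skip(*s);
        if ((size_t)(e - s) < t)
            break;                      /* truncated; real code would warn */
        s += t;
        len++;
    }
    return len;
}
```

Because the mask 0x80808080 is the same in every byte position, the test
does not depend on byte order.  For the 12 bytes "abcd", C3 A9, "efghij"
the sketch counts 11 characters: one word of four, the two-byte
character, another word of four, and two trailing single bytes.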

Regardless of what the optimization goal is, the word "inclusive" in
the apidoc was confusing.


diff --git a/utf8.c b/utf8.c
index 8243793..ffce7a0 100644
--- a/utf8.c
+++ b/utf8.c
@@ -671,8 +671,8 @@ Perl_utf8_to_uvuni(pTHX_ const U8 *s, STRLEN *retlen)
 =for apidoc utf8_length

 Return the length of the UTF-8 char encoded string C<s> in characters.
-Stops at C<e> (inclusive).  If C<e E<lt> s> or if the scan would end
-up past C<e>, croaks.
+Stops at C<e> without examining it.  If C<e E<lt> s> or if the scan would
+end past C<e>, croaks.

 =cut
 */
@@ -693,6 +693,18 @@ Perl_utf8_length(pTHX_ const U8 *s, const U8 *e)
     if (e < s)
        goto warn_and_return;
     while (s < e) {
+        if (((uintptr_t)(s) & 0x00000003) == 0){
+            register const U32 *sptr;
+            sptr = (const U32*)(s);
+            /* skip bytes less than 128 four at a time */
+            do {
+                len += 4;
+                if (*sptr++ & 0x80808080)
+                    break;
+            } while ((U8*)(sptr) < e);
+            len -= 4;
+            s = (U8*)(--sptr);
+        };
        t = UTF8SKIP(s);
        if (e - s < t) {
            warn_and_return:
