Even faster Unicode character counting

From: Nicholas Clark
Date: December 23, 2008 03:51
Subject: Even faster Unicode character counting
Message ID: 20081223115102.GB15435@plum.flirble.org

Jarkko alerted me to this:
    http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html

which references our very own Aristotle Pagaltzis.

Is anyone interested in experimenting with his bit-smashing approach and
seeing whether it can be used in Perl_utf8_length(), and what sort of a
speedup it gives? It's not the world's largest function:

/*
=for apidoc utf8_length

Return the length of the UTF-8 char encoded string C<s> in characters.
Stops at C<e> (inclusive).  If C<e E<lt> s> or if the scan would end
up past C<e>, croaks.

=cut
*/

STRLEN
Perl_utf8_length(pTHX_ const U8 *s, const U8 *e)
{
    dVAR;
    STRLEN len = 0;
    U8 t = 0;

    PERL_ARGS_ASSERT_UTF8_LENGTH;

    /* Note: cannot use UTF8_IS_...() too eagerly here since e.g.
     * the bitops (especially ~) can create illegal UTF-8.
     * In other words: in Perl UTF-8 is not just for Unicode. */

    if (e < s)
	goto warn_and_return;
    while (s < e) {
	t = UTF8SKIP(s);
	if (e - s < t) {
	    warn_and_return:
	    if (ckWARN_d(WARN_UTF8)) {
	        if (PL_op)
		    Perl_warner(aTHX_ packWARN(WARN_UTF8),
			    "%s in %s", unees, OP_DESC(PL_op));
		else
		    Perl_warner(aTHX_ packWARN(WARN_UTF8), unees);
	    }
	    return len;
	}
	s += t;
	len++;
    }

    return len;
}

Note that you can't use his code directly: it is a strlen() work-alike that
terminates on the first \0, whereas our routine counts the characters in a
block of memory delimited by s and e.
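
To make the difference concrete, here is a minimal sketch of how the
word-at-a-time counting idea might be adapted to a bounded range. The
function name count_utf8_chars is hypothetical and not part of the Perl API.
It counts the bytes that are not continuation bytes (10xxxxxx), so it only
agrees with the UTF8SKIP()-based loop for well-formed input on an ASCII
platform, and it leaves out the malformed-character warning entirely. It
also assumes a 64-bit word and has not been benchmarked.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of the Perl API: count the UTF-8 characters
 * in [s, e) by counting the bytes that are NOT continuation bytes.
 * Assumes well-formed input; a truncated final character is not detected. */
static size_t
count_utf8_chars(const unsigned char *s, const unsigned char *e)
{
    const uint64_t ones = UINT64_C(0x0101010101010101);
    size_t continuation = 0;
    const unsigned char *p = s;

    if (e < s)
        return 0;

    /* Main loop: eight bytes at a time, bounded by e rather than by a \0. */
    while ((size_t)(e - p) >= sizeof(uint64_t)) {
        uint64_t w;
        memcpy(&w, p, sizeof w);    /* sidesteps alignment requirements */
        /* Bit 0 of each byte becomes 1 iff that byte is 10xxxxxx. */
        w = ((w >> 7) & (~w >> 6)) & ones;
        /* Sum the eight per-byte flags into the top byte. */
        continuation += (size_t)((w * ones) >> 56);
        p += sizeof w;
    }

    /* Tail: fewer than eight bytes remain, so handle them one by one. */
    for (; p < e; p++)
        continuation += ((*p & 0xC0) == 0x80);

    return (size_t)(e - s) - continuation;
}
```

Because the loop is driven by the end pointer, an embedded \0 is counted like
any other character. A real replacement for Perl_utf8_length() would still
need some way to notice a final character that runs past e.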

Nicholas Clark
