Jarkko alerted me to this
http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html
which references our very own Aristotle Pagaltzis.
Is anyone interested in experimenting with his bit-smashing approach and
seeing whether it can be used in Perl_utf8_length(), and what sort of a
speedup it gives? It's not the world's largest function:
/*
=for apidoc utf8_length

Return the length of the UTF-8 char encoded string C<s> in characters.
Stops at C<e> (inclusive). If C<e E<lt> s> or if the scan would end
up past C<e>, croaks.

=cut
*/

STRLEN
Perl_utf8_length(pTHX_ const U8 *s, const U8 *e)
{
    dVAR;
    STRLEN len = 0;
    U8 t = 0;

    PERL_ARGS_ASSERT_UTF8_LENGTH;

    /* Note: cannot use UTF8_IS_...() too eagerly here since e.g.
     * the bitops (especially ~) can create illegal UTF-8.
     * In other words: in Perl UTF-8 is not just for Unicode. */

    if (e < s)
        goto warn_and_return;
    while (s < e) {
        t = UTF8SKIP(s);
        if (e - s < t) {
            warn_and_return:
            if (ckWARN_d(WARN_UTF8)) {
                if (PL_op)
                    Perl_warner(aTHX_ packWARN(WARN_UTF8),
                                "%s in %s", unees, OP_DESC(PL_op));
                else
                    Perl_warner(aTHX_ packWARN(WARN_UTF8), unees);
            }
            return len;
        }
        s += t;
        len++;
    }

    return len;
}

Note, you can't use his code directly, as his is a strlen()-a-like that will
terminate on the first \0, whereas we have a routine that counts a block of
memory.
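
To make the difference concrete, here is a rough sketch of what a bounded, word-at-a-time counter might look like. It is not the code from that blog post and not a proposed patch: count_utf8_chars is a made-up name, not a Perl API function. It assumes well-formed UTF-8 on an ASCII platform, and it counts bytes that are not continuation bytes (10xxxxxx) instead of following UTF8SKIP(), so it can disagree with Perl_utf8_length() on malformed input and it emits no warnings.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of the Perl API. Counts characters in the
 * byte range [s, e) by subtracting the number of UTF-8 continuation bytes
 * (10xxxxxx) from the total byte count. Assumes well-formed input. */
static size_t
count_utf8_chars(const unsigned char *s, const unsigned char *e)
{
    const uint64_t ones = 0x0101010101010101ULL; /* 0x01 in every byte */
    size_t continuation = 0;
    size_t total;

    if (e < s)
        return 0;   /* the real function warns here; this sketch does not */
    total = (size_t)(e - s);

    /* Eight bytes per step; memcpy sidesteps alignment restrictions. */
    while ((size_t)(e - s) >= sizeof(uint64_t)) {
        uint64_t w;
        memcpy(&w, s, sizeof w);
        /* Low bit of each byte becomes 1 iff bit 7 is set and bit 6 clear. */
        w = (w >> 7) & (~w >> 6) & ones;
        /* Multiplying by ones sums the eight 0/1 bytes into the top byte. */
        continuation += (size_t)((w * ones) >> 56);
        s += sizeof w;
    }

    /* Remaining tail, one byte at a time. */
    while (s < e) {
        continuation += ((*s & 0xC0) == 0x80);
        s++;
    }

    return total - continuation;
}
```

Because the range is delimited by e and not by a terminator, an embedded \0 is counted like any other one-byte character. Whether this is any faster than the UTF8SKIP() loop is exactly what would need measuring.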
Nicholas Clark