perl.perl5.porters

Even faster Unicode character counting

Nicholas Clark
December 23, 2008 03:51
Jarkko alerted me to this

which references our very own Aristotle Pagaltzis.

Is anyone interested in experimenting with his bit-smashing approach and
seeing whether it can be used in Perl_utf8_length(), and what sort of a
speedup it gives? It's not the world's largest function:

/*
=for apidoc utf8_length

Return the length of the UTF-8 char encoded string C<s> in characters.
Stops at C<e> (inclusive).  If C<e E<lt> s> or if the scan would end
up past C<e>, croaks.

=cut
*/

STRLEN
Perl_utf8_length(pTHX_ const U8 *s, const U8 *e)
{
    STRLEN len = 0;
    U8 t = 0;

    /* Note: cannot use UTF8_IS_...() too eagerly here since e.g.
     * the bitops (especially ~) can create illegal UTF-8.
     * In other words: in Perl UTF-8 is not just for Unicode. */

    if (e < s)
	goto warn_and_return;
    while (s < e) {
	t = UTF8SKIP(s);
	if (e - s < t) {
	    warn_and_return:
	    if (ckWARN_d(WARN_UTF8)) {
	        if (PL_op)
		    Perl_warner(aTHX_ packWARN(WARN_UTF8),
			    "%s in %s", unees, OP_DESC(PL_op));
		else
		    Perl_warner(aTHX_ packWARN(WARN_UTF8), unees);
	    }
	    return len;
	}
	s += t;
	len++;
    }

    return len;
}

Note, you can't use his code directly, as his is a strlen()-a-like that will
terminate on the first \0, whereas we have a routine that counts a block of
bytes delimited by s and e.
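
For anyone wanting to experiment, here is a minimal sketch of what a block-counting variant of the word-at-a-time idea might look like. It is not the code from the referenced post and not a drop-in replacement for Perl_utf8_length(): the name utf8_count_block is hypothetical, it uses plain C types instead of U8 and STRLEN, and it assumes well-formed input, so it has none of the truncation checks or warnings of the existing function. It counts continuation bytes (10xxxxxx) eight at a time and subtracts them from the byte length, so an embedded \0 is counted like any other byte.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 0x01 repeated in every byte of a 64-bit word. */
#define ONES ((uint64_t)0x0101010101010101ULL)

/* Hypothetical helper: number of characters in the block [s, e).
 * Assumes well-formed UTF-8; malformed or truncated sequences are
 * not detected, unlike the UTF8SKIP-based loop. */
size_t
utf8_count_block(const unsigned char *s, const unsigned char *e)
{
    const unsigned char *p = s;
    size_t continuation = 0;

    if (e < s)
        return 0;

    /* Eight bytes per iteration. memcpy sidesteps alignment issues
     * and is normally compiled down to a single load. */
    while ((size_t)(e - p) >= sizeof(uint64_t)) {
        uint64_t w;
        memcpy(&w, p, sizeof w);
        /* Low bit of each byte is set iff bit 7 is set and bit 6 is
         * clear, i.e. the byte is a continuation byte. */
        w = (w >> 7) & (~w >> 6) & ONES;
        /* The multiply sums the eight per-byte flags into the top byte. */
        continuation += (size_t)((w * ONES) >> 56);
        p += sizeof w;
    }

    /* Remaining 0..7 bytes, one at a time. */
    while (p < e) {
        continuation += ((*p & 0xC0) == 0x80);
        p++;
    }

    return (size_t)(e - s) - continuation;
}
```

Whether this beats the UTF8SKIP() loop in practice would need measuring, and the warning path for a sequence that runs past e would still have to be handled separately.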

Nicholas Clark
