Re: Even faster Unicode character counting

From: David Nicol
Date: January 2, 2009 15:16
Subject: Re: Even faster Unicode character counting
Message ID: 934f64a20901021516l5e29dd1et748446c382cabd46@mail.gmail.com

On Tue, Dec 23, 2008 at 5:51 AM, Nicholas Clark <nick@ccl4.org> wrote:
> Jarkko alerted me to this
>    http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html
>
> which references our very own Aristotle Pagaltzis.
>
> Is anyone interested in experimenting with his bit-smashing approach and
> seeing whether it can be used in Perl_utf8_length(), and what sort of a
> speedup it gives? It's not the world's largest function:

Well, it can't be used as is, because perl's arbitrary-length extension
apparently does not follow the convention that all continuation bytes,
and only continuation bytes, are in the 0x80-0xBF range.  Maybe it could
be used by checking for and special-casing 0xFF before using the
walk-then-run approach in the patch, although there are then scary
alignment issues, plus the 0b100..... bytes that are invariants in I8
for the EBCDIC port.
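
For reference, here is a minimal sketch of the kind of word-at-a-time
continuation-byte counting the blog post describes.  It is not the blog's
code and not anything in perl; the function name is made up for
illustration, and it assumes standard UTF-8 only, where a byte is a
continuation byte exactly when it matches 10xxxxxx.  That assumption is
the one that perl's extended encoding and the EBCDIC port put in doubt.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: characters = bytes - continuation bytes.
 * Assumes well-formed standard UTF-8 (continuation bytes are 0x80-0xBF). */
static size_t count_chars_by_continuation(const unsigned char *s, size_t n)
{
    const size_t ones  = (size_t)-1 / 0xFF;   /* 0x01 in every byte */
    const size_t highs = ones * 0x80;         /* 0x80 in every byte */
    size_t cont = 0, i = 0;

    /* Whole words: memcpy sidesteps the alignment question entirely. */
    for (; i + sizeof(size_t) <= n; i += sizeof(size_t)) {
        size_t w, u;
        memcpy(&w, s + i, sizeof w);
        /* Low bit of each byte of u is set iff bit 7 is set and bit 6 is clear. */
        u = ((w & highs) >> 7) & (~w >> 6);
        /* Multiplying by ones sums the per-byte flags into the top byte. */
        cont += (u * ones) >> ((sizeof(size_t) - 1) * 8);
    }

    /* Leftover bytes, one at a time. */
    for (; i < n; i++)
        cont += (s[i] & 0xC0) == 0x80;

    return n - cont;
}
```

Any start byte followed by bytes outside 10xxxxxx, or any invariant that
happens to look like 10xxxxxx, makes this count come out wrong, which is
why it cannot simply be dropped in.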

What can be done, and what I have done, is to use masking to accelerate
skipping aligned runs of invariants.  I haven't benchmarked it, but the
attached patch passes all tests in t/uni both with and without
-DNEVERMINDABOUTTHEWORDSKIPPINGTHING, which disables the accelerated
skipping while keeping the subtractive rather than additive approach
to length calculation.
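
The general shape of the idea, as a rough sketch rather than the attached
patch: start with the byte count, subtract the extra bytes of each
multi-byte character, and when the pointer is word-aligned, skip whole
words whose bytes all have the high bit clear.  The names skip_of and
sketch_utf8_length are invented for this example, and the skip table
covers standard UTF-8 only (no perl extended forms, no EBCDIC), whereas
perl itself would use UTF8SKIP.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for UTF8SKIP: bytes in the character whose first
 * byte is b.  Standard UTF-8 only; anything unexpected counts as 1. */
static size_t skip_of(unsigned char b)
{
    if (b < 0x80)           return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 1;
}

static size_t sketch_utf8_length(const unsigned char *s, const unsigned char *e)
{
    /* Subtractive approach: assume one character per byte, then correct. */
    size_t len = (size_t)(e - s);
    const size_t highs = (size_t)-1 / 0xFF * 0x80;   /* 0x80 in every byte */

    while (s < e) {
        if (*s >= 0x80) {
            size_t n = skip_of(*s);
            if (n > (size_t)(e - s))
                n = (size_t)(e - s);      /* truncated final character */
            len -= n - 1;                 /* extra bytes are not characters */
            s += n;
            continue;
        }

        /* Invariant byte: if aligned, skip whole words of invariants.
         * Nothing needs subtracting for them. */
        const unsigned char *start = s;
        if ((uintptr_t)s % sizeof(size_t) == 0) {
            while ((size_t)(e - s) >= sizeof(size_t)) {
                size_t w;
                memcpy(&w, s, sizeof w);
                if (w & highs)
                    break;                /* some byte has its high bit set */
                s += sizeof w;
            }
        }
        if (s == start)
            s++;                          /* no word skipped: step one byte */
    }
    return len;
}
```

Leaving out the aligned-word loop gives the plain subtractive version,
which is roughly what the -D switch above selects.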


Happy 2009.

-- 
Lucky Cap'n Rabbit King Nuggets: For the Irish seafaring nobleman in YOU!
