Re: Even faster Unicode character counting

December 30, 2008 17:03
2008/12/31 karl williamson <>:
> David Nicol wrote:
>> On Tue, Dec 30, 2008 at 12:39 PM, David Nicol <>
>> wrote:
>>> Here's an ifdeffed version that attempts to use vectorization to skip
>>> runs of invariants quickly while still using the information from the
>>> start bytes to skip ahead, for the EBCDIC channel.  And I corrected
>>> the off-by-one in the comment.  This compiles, but I have neither
>>> tested nor benchmarked.
>> open note to self:
>> Sent a patch without even testing?  What kind of idiot are you trying
>> to pass yourself off as, anyway?
>> I tested your patch and discovered that you had an off-by-one -- your
>> initializing count to -1 was based on a misunderstanding of how the
>> function is used!
>> end open note to self
>> But there's a thing I don't quite get -- the UTFSKIP lookup table
>> assigns a length of 13 to a 0xFF start byte.  Could someone explain
>> how that works?  The tests pertaining to length now work, but tests
>> presumably based on the functionality involved in this 13-long thing
>> are now failing, as that uses a different mechanism than counting
>> continuation bytes can handle.
> 0xff is an illegal start byte.  Here's some info:

Don't forget that Perl's utf8 != UTF-8. The latter is a subset of the former.

You have to read the comments in utf8.h to see what I mean, and I'm not
sure whether it affects your statement, but some aspects of true UTF-8 don't
apply to Perl's internal implementation.
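
For what it's worth, here is a rough sketch of the two counting mechanisms
being compared upthread: hopping by the length the start byte implies, versus
counting every byte that is not a continuation byte. This is not the patch and
not Perl's real table. skip_len() is a hypothetical stand-in; only the 13 for
0xFF comes from this thread, and the other lengths are assumed from the usual
leading-ones pattern.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the skip table: sequence length implied by a
 * start byte.  Only the 13 for 0xFF is taken from the thread; the other
 * values are an assumption based on the usual leading-ones pattern. */
static size_t skip_len(unsigned char b)
{
    if (b < 0xC0) return 1;    /* invariant, or a stray continuation byte */
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    if (b < 0xF8) return 4;
    if (b < 0xFC) return 5;
    if (b < 0xFE) return 6;
    if (b == 0xFE) return 7;
    return 13;                 /* 0xFF */
}

/* Count characters by hopping from start byte to start byte. */
static size_t count_by_skip(const unsigned char *s, size_t len)
{
    size_t i = 0, n = 0;
    assert(len == 0 || s != NULL);
    while (i < len) {
        i += skip_len(s[i]);   /* may step past len on truncated input */
        n++;
    }
    return n;
}

/* Count characters as the bytes that are NOT continuation bytes (10xxxxxx). */
static size_t count_by_noncont(const unsigned char *s, size_t len)
{
    size_t i, n = 0;
    assert(len == 0 || s != NULL);
    for (i = 0; i < len; i++)
        n += (s[i] & 0xC0) != 0x80;
    return n;
}
```

In this sketch the two functions agree whenever every start byte is followed
by the number of continuation bytes it promises, including a 0xFF followed by
twelve of them. They disagree on input like 0xFF followed directly by "ab":
the skip version swallows everything as one character, while the
continuation-byte version counts three. Whether that is what the failing
tests exercise, I can't tell from here.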


perl -Mre=debug -e "/just|another|perl|hacker/"
