develooper Front page | perl.perl5.porters | Postings from December 2008

Re: Even faster Unicode character counting

From: karl williamson
Date: December 31, 2008 21:19
Subject: Re: Even faster Unicode character counting
Message ID: 495C5256.4040801@khwilliamson.com

demerphq wrote:
> 2008/12/31 karl williamson <public@khwilliamson.com>:
>> David Nicol wrote:
>>> On Tue, Dec 30, 2008 at 12:39 PM, David Nicol <davidnicol@gmail.com>
>>> wrote:
>>>> Here's an ifdeffed version that attempts to use vectorization to skip
>>>> runs of invariants quickly while still using the information from the
>>>> start bytes to skip ahead, for the EBCDIC channel.  And I corrected
>>>> the off-by-one in the comment.  This compiles, but I have neither
>>>> tested nor benchmarked.
>>> open note to self:
>>> Sent a patch without even testing?  What kind of idiot are you trying
>>> to pass yourself off as, anyway?
>>> I tested your patch and discovered that you had an off-by-one -- your
>>> initializing count to -1 was based on a misunderstanding of how the
>>> function is used!
>>> end open note to self
>>>
>>> But there's a thing I don't quite get -- the UTFSKIP lookup table
>>> assigns a length of 13 to a 0xFF start byte.  Could someone explain
>>> how that works?  The tests pertaining to length now work, but tests
>>> presumably based on the functionality involved in this 13-long thing
>>> are now failing, as that uses a different mechanism than counting
>>> continuation bytes can handle.
>>>
>>>
>> 0xff is an illegal start byte.  Here's some info:
>> http://en.wikipedia.org/wiki/UTF-8
> 
> Don't forget that perl's utf8 != UTF-8. The latter is a subset of the former.
> 
> You have to read the comments in utf8.h to see what I mean, and I'm not
> sure if it impacts your statement, but some aspects of true UTF-8 don't
> apply to perl's internal implementation.
> 
> Yves

OK, I see now.  But how would someone use this 13-byte sequence?  And
does anyone actually use it, given that Unicode itself needs at most 4
bytes in UTF-8 (6 under the older, pre-RFC 3629 definition)?  I would
hate to see the algorithm go unused or be slowed down by an unused
feature.
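
To make the two counting strategies in this thread concrete, here is a
minimal sketch in C.  It is not the patch under discussion and not the
code in utf8.c.  The names skip_for, count_by_skip and
count_by_non_continuation are hypothetical.  Only the 0xFF -> 13 entry
comes from this thread; the other lengths follow the classic UTF-8
start-byte pattern and are an assumption on my part, as is the idea
that the 12 bytes after a 0xFF start byte are ordinary continuation
bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: sequence length implied by a start byte.
 * Only 0xFF -> 13 is taken from the thread; the other entries are
 * assumed from the classic UTF-8 start-byte pattern. */
static size_t skip_for(unsigned char b)
{
    if (b < 0xC0) return 1;   /* ASCII, or a stray continuation byte */
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    if (b < 0xF8) return 4;
    if (b < 0xFC) return 5;
    if (b < 0xFE) return 6;
    if (b == 0xFE) return 7;
    return 13;                /* 0xFF: perl's extended form */
}

/* Strategy 1: hop from start byte to start byte using the skip length.
 * count starts at 0, one increment per character visited. */
static size_t count_by_skip(const unsigned char *s, size_t len)
{
    size_t i = 0, count = 0;
    assert(s != NULL || len == 0);
    while (i < len) {
        i += skip_for(s[i]);
        count++;
    }
    return count;
}

/* Strategy 2: look at every byte and count those that are not
 * continuation bytes (bit pattern 10xxxxxx). */
static size_t count_by_non_continuation(const unsigned char *s, size_t len)
{
    size_t i, count = 0;
    assert(s != NULL || len == 0);
    for (i = 0; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

On well-formed input both functions should return the same count.  The
second touches every byte but has no data-dependent jumps, which is
what makes it a candidate for vectorization.  The two can disagree on
malformed input, for example when a start byte is not followed by as
many continuation bytes as the table promises.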


