demerphq wrote:
> 2008/12/31 karl williamson <public@khwilliamson.com>:
>> David Nicol wrote:
>>> On Tue, Dec 30, 2008 at 12:39 PM, David Nicol <davidnicol@gmail.com>
>>> wrote:
>>>> Here's an ifdeffed version that attempts to use vectorization to skip
>>>> runs of invariants quickly while still using the information from the
>>>> start bytes to skip ahead, for the EBCDIC channel. And I corrected
>>>> the off-by-one in the comment. This compiles, but I have neither
>>>> tested nor benchmarked.
>>> open note to self:
>>> Sent a patch without even testing? What kind of idiot are you trying
>>> to pass yourself off as, anyway?
>>> I tested your patch and discovered that you had an off-by-one -- your
>>> initializing count to -1 was based on a misunderstanding of how the
>>> function is used!
>>> end open note to self
>>>
>>> But there's a thing I don't quite get -- the UTFSKIP lookup table
>>> assigns a length of 13 to a 0xFF start byte. Could someone explain
>>> how that works? The tests pertaining to length now work, but tests
>>> presumably based on the functionality involved in this 13-long thing
>>> are now failing, as that uses a different mechanism than counting
>>> continuation bytes can handle.
>>>
>>>
>> 0xff is an illegal start byte. Here's some info:
>> http://en.wikipedia.org/wiki/UTF-8
>
> Dont forget that perls utf8 != UTF-8. The latter is subset of the former.
>
> You have to read the comments in utf8.h to see what i mean, and im not
> sure if it impacts your statement, but some aspects of true UTF-8 dont
> apply to perls internal implementation.
>
> Yves

OK, I see now. But how would someone use this 13 byte sequence? And
does anyone, given that Unicode only goes to 6 bytes? I would hate to
see the algorithm not used or slowed down by an unused feature.
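
To make the two mechanisms in this thread concrete, here is a rough sketch of counting characters by skipping ahead on the start byte versus counting every byte that is not a continuation byte. This is not the code from the patch or from utf8.h. The names skip_len(), count_by_skip() and count_by_start_bytes() are made up for illustration, the length of 13 for 0xFF is taken from the discussion above, and the length of 7 for 0xFE is my assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a UTF8SKIP-style table: the sequence length
 * implied by a start byte. The 13 for 0xFF comes from the thread; the 7
 * for 0xFE is an assumption, not checked against utf8.h. */
size_t skip_len(unsigned char b)
{
    if (b < 0xC0) return 1;   /* invariant, or a stray continuation byte */
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    if (b < 0xF8) return 4;
    if (b < 0xFC) return 5;
    if (b < 0xFE) return 6;
    if (b == 0xFE) return 7;  /* assumed value */
    return 13;                /* 0xFF, as discussed in the thread */
}

/* Count characters by jumping from start byte to start byte. */
size_t count_by_skip(const unsigned char *s, size_t len)
{
    size_t count = 0, i = 0;
    assert(s != NULL || len == 0);
    while (i < len) {
        i += skip_len(s[i]);  /* may step past len on a truncated tail */
        count++;
    }
    return count;
}

/* Count characters by counting bytes that are not 10xxxxxx. */
size_t count_by_start_bytes(const unsigned char *s, size_t len)
{
    size_t count = 0, i;
    assert(s != NULL || len == 0);
    for (i = 0; i < len; i++)
        if ((s[i] & 0xC0) != 0x80)
            count++;
    return count;
}
```

If, as I assume, the 12 bytes following a 0xFF start byte are ordinary continuation bytes, both functions report one character for such a sequence. They can disagree on malformed or truncated input, because the skip version trusts the start byte and never looks at the bytes it jumps over.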