On Wed, Dec 31, 2008 at 10:19:18PM -0700, karl williamson wrote:
> demerphq wrote:
>> 2008/12/31 karl williamson <public@khwilliamson.com>:
>>> David Nicol wrote:
>>>
>>> 0xff is an illegal start byte. Here's some info:
>>> http://en.wikipedia.org/wiki/UTF-8
>>
>> Dont forget that perls utf8 != UTF-8. The latter is subset of the former.
>>
>> You have to read the comments in utf8.h to see what i mean, and im not
>> sure if it impacts your statement, but some aspects of true UTF-8 dont
>> apply to perls internal implementation.
>
> OK, I see now. But how would someone use this 13 byte sequence? And does
> anyone, given that Unicode only goes to 6 bytes. I would hate to see the
> algorithm not used or slowed down by an unused feature.

Any data that can be encoded as integers can also be encoded as "utf8"
and searched or manipulated efficiently using regexes[1].

I believe one reason perl supports longer sequences is that it makes
this more practical by greatly raising an otherwise 'arbitrary' limit on
the largest integer value.

Tim.

[1] I know one company using code points to encode keyword ids.
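
To make the "keyword ids as code points" idea concrete, here is a minimal sketch. It is written in Python rather than Perl, and the keyword table, the ids, and the helper name `encode` are all hypothetical, invented for illustration. It is not the code of the company mentioned in the footnote. Each keyword id becomes one character, so a sequence of keywords becomes a string that an ordinary regex can search.

```python
import re

# Hypothetical keyword ids (an assumption, not from the original post).
# Starting at 0x100 keeps them clear of ASCII characters that have a
# special meaning in regex syntax.
KEYWORD_IDS = {"perl": 0x100, "regex": 0x101, "utf8": 0x102, "unicode": 0x103}


def encode(keywords):
    """Return a string holding one code point per keyword id."""
    return "".join(chr(KEYWORD_IDS[k]) for k in keywords)


# A "document" reduced to its sequence of keyword ids.
doc = encode(["perl", "utf8", "regex", "unicode", "perl", "regex"])

# "utf8" immediately followed by "regex".
adjacent = re.compile(re.escape(encode(["utf8", "regex"])))

# "perl" followed by "regex" with at most one other keyword in between.
near = re.compile(
    re.escape(encode(["perl"])) + ".?" + re.escape(encode(["regex"])),
    re.DOTALL,
)

# Match offsets are positions in the keyword sequence, not byte offsets.
adjacent_at = [m.start() for m in adjacent.finditer(doc)]
near_at = [m.start() for m in near.finditer(doc)]
```

In Python the largest id usable this way is 0x10FFFF, because `chr()` rejects anything above it. That is the kind of ceiling on the largest integer value that the post says perl's longer sequences raise.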