perl.perl5.porters | Postings from January 2009

Re: Even faster Unicode character counting

From: Tim Bunce
Date: January 1, 2009 02:21
Subject: Re: Even faster Unicode character counting
Message ID: 20090101102106.GB14728@timac.local

On Wed, Dec 31, 2008 at 10:19:18PM -0700, karl williamson wrote:
> demerphq wrote:
>> 2008/12/31 karl williamson <public@khwilliamson.com>:
>>> David Nicol wrote:
>>>
>>> 0xff is an illegal start byte.  Here's some info:
>>> http://en.wikipedia.org/wiki/UTF-8
>>
>> Don't forget that perl's utf8 != UTF-8. The latter is a subset of the former.
>>
>> You have to read the comments in utf8.h to see what I mean, and I'm not
>> sure if it impacts your statement, but some aspects of true UTF-8 don't
>> apply to perl's internal implementation.
>
> OK, I see now.  But how would someone use this 13 byte sequence?  And does
> anyone, given that Unicode only goes to 6 bytes?  I would hate to see the
> algorithm not used or slowed down by an unused feature.

Any data that can be encoded as integers can also be encoded as "utf8"
and searched or manipulated efficiently using regexes[1]. I believe one
reason perl supports longer sequences is that it makes this more
practical by greatly raising an otherwise 'arbitrary' limit on the
largest integer value.

Tim.

[1] I know one company using code points to encode keyword ids.
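
The idea above, integers stored as code points and searched with regexes, can be sketched roughly as follows. This is an illustration in Python rather than Perl, and not the code used by the company mentioned in [1]. The keyword table, the BASE offset into a private-use plane, and the helper names are all assumptions made for the example. Python strings stop at code point 0x10FFFF, which is the kind of ceiling on the largest integer that perl's longer utf8 sequences raise.

```python
import re

# Hypothetical keyword table: names and ids are invented for illustration.
KEYWORD_IDS = {"perl": 1, "unicode": 2, "regex": 3, "utf8": 4}

# Assumed offset into Supplementary Private Use Area-A (U+F0000 and up),
# so that encoded ids cannot be confused with ordinary text.
BASE = 0xF0000


def encode_ids(ids):
    """Encode a sequence of integer ids as a string, one code point per id."""
    return "".join(chr(BASE + i) for i in ids)


def decode_ids(text):
    """Recover the integer ids from a string produced by encode_ids."""
    return [ord(ch) - BASE for ch in text]


# A "document" stored as a string of keyword ids: 1, 3, 2, 4, 3.
words = ["perl", "regex", "unicode", "utf8", "regex"]
doc = encode_ids(KEYWORD_IDS[w] for w in words)

# Search for id 3 ("regex") immediately followed by id 2 or id 4,
# using an ordinary character class over the encoded ids.
pattern = re.compile(
    chr(BASE + KEYWORD_IDS["regex"])
    + "["
    + chr(BASE + KEYWORD_IDS["unicode"])
    + chr(BASE + KEYWORD_IDS["utf8"])
    + "]"
)
match = pattern.search(doc)

if match:
    print("matched ids", decode_ids(match.group()), "at offset", match.start())
```

Because each id occupies exactly one character, regex constructs such as character classes and quantifiers operate on whole ids, and match offsets are counted in ids rather than bytes.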
