develooper Front page | perl.perl5.porters | Postings from December 2017

Re: Even faster Unicode character counting

Thread Previous | Thread Next
Karl Williamson
December 8, 2017 04:16
Re: Even faster Unicode character counting
Message ID:
On 01/30/2009 02:33 PM, karl williamson wrote:
> David Nicol wrote:
>> On Thu, Jan 29, 2009 at 8:32 PM, karl williamson
>> <> wrote:
>>> David Nicol wrote:
>>>> I don't know about the endianness issues, the patch uses the U64 macro
>>>> which should be an appropriate size even if it has to be char[8] or
>>>> such.
>>> Why then does the code have a HAS_QUAD macro to say whether the 
>>> machine even
>>> accepts 64 bits or not, and other macros to declare a constant 
>>> suffixed with
>>> an L, for example, or not.
>> to determine what U64 is, on any particular platform.  I'm not
>> entirely certain that it's available everywhere; and I'm hoping
>> someone else will be pleased to take this little optimization project
>> over.  Without the massive speed gain of counting continuation
>> characters in parallel and subtracting them available, and the fact
>> that the additional tests will slow down operations on
>> all-extended-character data, I'm not sure this is even worthwhile at
>> all; better to do deeper reengineering to create a "pure utf8" data
>> type that would be guaranteed to hold valid utf8, or similarly grand
>> unfunded mandate.
> I'm not quite ready to give up on speeding this up.  Let's not just 
> throw in the towel because the solution isn't 100%.
> I think you said earlier that a previous version of the patch failed 
> because the input wasn't always well-formed.  If so, one possibility is 
> to see if there are significant cases where the input should be 
> well-formed, and where a speed-up would be helpful.  I think that an 
> example of this would be the length function.  Then this routine can be 
> changed to take a flag argument indicating if the input should be 
> well-formed, and to use a sped up version if so, and the old slow way if 
> not.  Then create a macro replacement for the current interface that 
> calls the new one with the flag set to non-well-formed.  Then go through 
> the Perl source, and change the calls where the input should be 
> well-formed to use the new flag.
> It turns out that U64 is only defined when the system has 64 bit 
> capability, and Configure has set it so.  I tried compiling this patch 
> on a work space without this, and it failed.
> I wonder if the reason one of the tests didn't do all of the ones 
> expected was because of a segmentation fault.  It would be interesting 
> to run the latest version (which attempted to not exceed the address 
> space) to see.  I don't have a suitable 64-bit workspace at the moment 
> available to try.
> I do think that the patch as written does not work on machines with a 
> different endianess.  But I've been wrong before, and recently.  This 
> could be solved by appropriate #ifs.
> And I agree that it might not be worth putting in as currently written, 
> since it is slower on strings containing enough characters outside the 
> ASCII range,

I have revisited this thread from 2009.  The answer is doing 
vectorization can speed things up significantly.

I don't know why David's patch did not work entirely; I haven't examined 
it closely.  I do know that there have been a bunch of fixes to UTF-8 in 
the intervening years.

The issue with the FF start byte is a red herring.  Vectorization counts 
continuation bytes, and it "just works".

Attached are outputs for 1) a 32-bit platform, courtesy 
Nicholas; 2) my 64-bit one.

The results are bytes-per-character dependent, as the old method starts 
at the beginning and counts starter bytes, skipping the continuation 
ones.  Thus the more continuation bytes per character, the more get 
skipped in one fell swoop.  The new method counts the continuation bytes 
word-by-word; no skipping.  If your word length is 8 bytes, and your 
code points require 13 bytes, you are skipping longer than a word at a 
time.  But the maximum legal Unicode code point requires only 4 bytes, 
and most in commercial use are 1, 2 or 3, except for emojis.  Extremely 
large code points should not factor into performance considerations.

I'm close to being ready to push this.

Thread Previous | Thread Next Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About