develooper Front page | perl.perl5.porters | Postings from January 2009

Re: Even faster Unicode character counting

Thread Previous | Thread Next
karl williamson
January 30, 2009 13:33
Re: Even faster Unicode character counting
Message ID:
David Nicol wrote:
> On Thu, Jan 29, 2009 at 8:32 PM, karl williamson
> <> wrote:
>> David Nicol wrote:
>>> I don't know about the endianness issues, the patch uses the U64 macro
>>> which should be an appropriate size even if it has to be char[8] or
>>> such.
>> Why then does the code have a HAS_QUAD macro to say whether the machine even
>> accepts 64 bits or not, and other macros to declare a constant suffixed with
>> an L, for example, or not.
> to determine what U64 is, on any particular platform.  I'm not
> entirely certain that it's available everywhere; and I'm hoping
> someone else will be pleased to take this little optimization project
> over.  Without the massive speed gain of counting continuation
> characters in parallel and subtracting them available, and the fact
> that the additional tests will slow down operations on
> all-extended-character data, I'm not sure this is even worthwhile at
> all; better to do deeper reengineering to create a "pure utf8" data
> type that would be guaranteed to hold valid utf8, or similarly grand
> unfunded mandate.
I'm not quite ready to give up on speeding this up.  Let's not just 
throw in the towel because the solution isn't 100%.

I think you said earlier that a previous version of the patch failed 
because the input wasn't always well-formed.  If so, one possibility is 
to see if there are significant cases where the input should be 
well-formed, and where a speed-up would be helpful.  I think that an 
example of this would be the length function.  Then this routine can be 
changed to take a flag argument indicating if the input should be 
well-formed, and to use a sped up version if so, and the old slow way if 
not.  Then create a macro replacement for the current interface that 
calls the new one with the flag set to non-well-formed.  Then go through 
the Perl source, and change the calls where the input should be 
well-formed to use the new flag.

It turns out that U64 is only defined when the system has 64 bit 
capability, and Configure has set it so.  I tried compiling this patch 
on a work space without this, and it failed.

I wonder if the reason one of the tests didn't do all of the ones 
expected was because of a segmentation fault.  It would be interesting 
to run the latest version (which attempted to not exceed the address 
space) to see.  I don't have a suitable 64-bit workspace at the moment 
available to try.

I do think that the patch as written does not work on machines with a 
different endianess.  But I've been wrong before, and recently.  This 
could be solved by appropriate #ifs.

And I agree that it might not be worth putting in as currently written, 
since it is slower on strings containing enough characters outside the 
ASCII range,

Thread Previous | Thread Next Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About