develooper Front page | perl.perl5.porters | Postings from January 2009

Re: Even faster Unicode character counting

karl williamson
January 29, 2009 12:58
Re: Even faster Unicode character counting
David Nicol wrote:
> With this in place, the same tests as an unpatched blead pass, which
> is, all of them except the one that passes 38 tests even though it
> thinks it's supposed to only do 37 tests.
> Acceleration is supposed to be provided by skipping bytes less than
> 128 eight at a time, by examining them as a U64.  When there is a high
> bit, three tests are performed to determine which of the eight bytes
> has the high bit set and normal counting proceeds from that point,
> instead of starting over at s=alignment-point regardless of which byte
> had the high bit.
> The benefit is:
>    (1) faster passing over runs of bytes all less than or equal to
> 127, which all have lengths of 1 in the lookup table.
> The costs are
>    (1) one extra comparison each "walking" go-around to check if we
> are at a 64-bit alignment
>    (2) extra examinations in the "running" loop will consume extra
> ticks when the string is mostly characters greater than 127
>    (3) unknown performance of bitmasking a U64 on a 32-bit architecture
> cost #1 could be mitigated by unrolling the "walking" test into an
> 8-wide Duff's device, but hopefully I can restrain myself.
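As an editorial aside, the loop David describes above might be sketched roughly as follows. This is a guess at the structure from his description, not code from the patch; the function name, the "walking"/"running" split, and the bytewise tail are my own choices:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the described technique: scan bytes until the pointer is
 * 8-byte aligned ("walking"), then examine eight bytes at a time as a
 * U64 ("running"), and on seeing a high bit finish bytewise to locate
 * the exact offending byte rather than restarting from the alignment
 * point.  Returns the offset of the first byte >= 128, or len. */
static size_t first_high_bit(const unsigned char *s, size_t len)
{
    size_t i = 0;

    /* "Walking": advance one byte at a time until s + i is aligned. */
    while (i < len && ((uintptr_t)(s + i) & 7) != 0) {
        if (s[i] & 0x80)
            return i;
        i++;
    }

    /* "Running": test eight bytes per iteration via a high-bit mask. */
    while (i + 8 <= len) {
        uint64_t chunk;
        memcpy(&chunk, s + i, 8);               /* aligned load */
        if (chunk & UINT64_C(0x8080808080808080))
            break;                              /* some byte >= 128 */
        i += 8;
    }

    /* Finish bytewise, pinpointing the byte with the high bit set. */
    while (i < len) {
        if (s[i] & 0x80)
            return i;
        i++;
    }
    return len;
}
```

The bytewise finish here stands in for the patch's "three tests"; it trades a few extra comparisons for simplicity.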

I have looked at this some, but not experimented with it, and I don't 
fully understand it, but I think I have the gist of the patch.  Also, I 
have never had to think much about portability in C.  I've learned some 
things from the Perl source about portability, but maybe some of it is 
out-dated.  Given those caveats, here are some comments.

First, I applaud your efforts to speed this up.

Second, as I've said before, I don't think it works on UTF-EBCDIC.  The 
reason, briefly, is that in UTF-EBCDIC the high bit being set or cleared 
doesn't say anything about whether the character is variant or not, and 
I don't see this patch doing anything but looking at those high bits.

Third, and again I'm just going from the Perl source, it looks like 
there are some other portability issues involving the use of 64-bit 
values.  The source has things like #define HAS_QUAD from Configure to 
tell the program whether a 64-bit quantity is ok on the architecture, 
and there are macros like U64_CONST() to define such constants.  Maybe 
these are no longer necessary; I don't know.
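For illustration, guarding the fast path the way the rest of the source does might look something like this (a sketch following those Configure conventions, not code from the patch):

```c
/* Sketch only: take the quad path when Configure says 64-bit integers
 * are available; otherwise fall back to the plain bytewise loop. */
#ifdef HAS_QUAD
#  define HIGH_BIT_MASK  U64_CONST(0x8080808080808080)
    /* ... eight-bytes-at-a-time loop using a U64 ... */
#else
    /* ... unpatched bytewise counting, as in current blead ... */
#endif
```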

Also, does this assume a particular byte order and packing?  Configure 
goes out and calculates a BYTEORDER definition, and not all machines 
are even big- or little-endian.
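To make the byte-order concern concrete: the mask test itself is endian-neutral (it only asks whether any high bit is set anywhere in the quad), but mapping a set bit back to a byte *offset* is not. One portable way to sidestep BYTEORDER entirely is to re-examine the eight bytes through the original char pointer; this is my sketch, not the patch's "three tests":

```c
#include <assert.h>

/* Given a pointer to an 8-byte chunk known (or suspected) to contain
 * a byte >= 128, find its offset by looking at the bytes themselves,
 * which works identically on any byte order.  Returns -1 if the chunk
 * is all ASCII. */
static int offending_byte(const unsigned char *p)
{
    int i;
    for (i = 0; i < 8; i++)
        if (p[i] & 0x80)
            return i;
    return -1;
}
```

A BYTEORDER-aware shift-and-mask version would be faster but needs separate big- and little-endian cases.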

And it occurs to me that there might be problems with addressing more 
than a word at a time.  The underlying data structure is going to be 
declared as a character or unsigned character.  It will obviously be 
part of a full word, so one can safely address it by word.  And the Perl 
internals support 2-, 4-, or 8-byte words.  But it seems to me that 
there might be issues with, say, going outside the address space and 
getting a segmentation fault.  This could happen, I think, if the word 
the character is in is the last in the address space, and one addresses 
it as a double or quadruple word (it would be quad on 16-bit words). 
The C standard guarantees that one can form a pointer to one element 
past the end of an array, but not that it can be dereferenced, and I 
don't know if that really helps here, since the pointer is to an 8-bit 
quantity.  Maybe more of a C expert can chime in.  There may be other 
glitches like this that I'm not aware of.
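One conservative way to rule out the out-of-bounds worry is to attempt a quad load only when all eight bytes provably lie inside the string, and handle the trailing 0..7 bytes one at a time. A minimal sketch of that guard (my code, not the patch's; alignment is ignored for brevity, and memcpy keeps the unaligned load well-defined):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Count how many whole 8-byte chunks at the front of the buffer are
 * all ASCII.  The condition i + 8 <= len guarantees the load never
 * touches memory past s + len, so the end-of-address-space case
 * cannot arise; callers would finish the last 0..7 bytes bytewise. */
static size_t count_ascii_chunks(const unsigned char *s, size_t len)
{
    size_t i = 0, chunks = 0;

    while (i + 8 <= len) {                      /* never reads past end */
        uint64_t q;
        memcpy(&q, s + i, 8);
        if (q & UINT64_C(0x8080808080808080))
            break;                              /* chunk has a high bit */
        chunks++;
        i += 8;
    }
    return chunks;
}
```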
