perl.perl5.porters | Postings from January 2009
Re: Even faster Unicode character counting
From: karl williamson
Date: January 29, 2009 12:58
Subject: Re: Even faster Unicode character counting
Message ID: 49821842.7070805@khwilliamson.com
David Nicol wrote:
> With this in place, the same tests pass as with an unpatched blead,
> which is, all of them except the one that passes 38 tests even though
> it thinks it's supposed to run only 37.
>
> Acceleration is supposed to be provided by skipping bytes less than
> 128 eight at a time, by examining them as a U64. When a high bit is
> found, three tests are performed to determine which of the eight bytes
> has it set, and normal counting proceeds from that point, instead of
> starting over at s=alignment-point regardless of which byte had the
> high bit.
>
>
> The benefit is:
> (1) faster passing over runs of bytes all less than or equal to
> 127, which all have lengths of 1 in the lookup table.
>
> The costs are
> (1) one extra comparison in each "walking" iteration, to check
> whether we are at a 64-bit alignment boundary
> (2) extra examinations in the "running" loop, which consume extra
> ticks when the string is mostly characters greater than 127
> (3) unknown performance of bitmasking a U64 on a 32-bit architecture
>
> cost #1 could be mitigated by unrolling the "walking" test into an
> 8-wide Duff's device, but hopefully I can restrain myself.
I have looked at this some, though I haven't experimented with it and
don't fully understand it, but I think I have the gist of the patch.
Also, I have never had to think much about portability in C. I've
learned some things about portability from the Perl source, but maybe
some of that is outdated. Given those caveats, here are some comments.
First, I applaud your efforts to speed this up.
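For concreteness, here is roughly how I read the approach. This is a
simplified sketch with illustrative names, not the patch itself; unlike
the patch, it simply falls back to byte-wise counting after the first
high bit instead of locating the offending byte and resuming the word
loop:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HIGH_BITS UINT64_C(0x8080808080808080)

/* Count the characters in a well-formed UTF-8 buffer.  A character
 * begins at every byte that is not a continuation byte (0x80..0xBF).
 * ASCII-only stretches are skipped eight bytes at a time. */
static size_t
utf8_char_count(const unsigned char *s, size_t len)
{
    const unsigned char *p = s;
    const unsigned char *end = s + len;
    size_t count = 0;

    /* "Walking": advance byte by byte until p is 8-byte aligned. */
    while (p < end && ((uintptr_t)p & 7) != 0) {
        count += (*p & 0xC0) != 0x80;   /* start byte, not continuation */
        p++;
    }

    /* "Running": while a full word remains, test 8 bytes at once. */
    while ((size_t)(end - p) >= 8) {
        uint64_t w;
        memcpy(&w, p, 8);               /* memcpy sidesteps aliasing rules */
        if (w & HIGH_BITS)
            break;                      /* some byte is >= 0x80 */
        count += 8;                     /* eight ASCII chars, length 1 each */
        p += 8;
    }

    /* Tail, and anything after the first non-ASCII byte. */
    while (p < end) {
        count += (*p & 0xC0) != 0x80;
        p++;
    }
    return count;
}
```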
Second, as I've said before, I don't think it works on UTF-EBCDIC. The
reason, briefly, is that in UTF-EBCDIC the high bit being set or cleared
says nothing about whether the character is variant or not, and I don't
see this patch doing anything but looking at those high bits.
Third, and again I'm just going from the Perl source, it looks like
there are some other portability issues involving the use of 64-bit
values. The source has things like #define HAS_QUAD from Configure to
tell the program whether a 64-bit quantity is OK on the architecture,
and macros like U64_CONST() to define such constants. Maybe these are
no longer necessary; I don't know.
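A sketch of how such a guard might look, with hypothetical type and
macro names standing in for the real ones from Configure and perl.h:

```c
#include <stdint.h>

/* Hypothetical guard modeled on the HAS_QUAD idiom: compile the
 * 64-bit fast path only when Configure found a 64-bit type, and
 * otherwise fall back to a 32-bit word (or to byte-wise counting). */
#ifdef HAS_QUAD
typedef uint64_t scan_word;             /* the real code would use U64 */
#  define SCAN_HIGH_BITS UINT64_C(0x8080808080808080)
#else
typedef uint32_t scan_word;
#  define SCAN_HIGH_BITS UINT32_C(0x80808080)
#endif
```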
Also, does this assume a particular byte order and packing? Configure
goes out and calculates a BYTEORDER definition. Not all machines are
even big- or little-endian.
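For instance, finding which of the eight bytes tripped the test with
bit arithmetic on the loaded word (count trailing vs. leading zeros)
gives different answers on little- and big-endian machines. Re-reading
the bytes in memory order is slower but byte-order independent; an
illustrative sketch, not from the patch:

```c
/* Index (in memory order) of the first byte of the n examined that
 * has its high bit set, or -1 if none does.  Unlike bit tricks on a
 * loaded word, this does not depend on byte order. */
static int
first_high_byte(const unsigned char *p, int n)
{
    for (int i = 0; i < n; i++)
        if (p[i] & 0x80)
            return i;
    return -1;
}
```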
And it occurs to me that there might be problems with addressing more
than a word at a time. The underlying data structure is going to be
declared as a character or unsigned character. It will obviously be
part of a full word, so one can safely address it by word, and the Perl
internals support 2-, 4-, or 8-byte words. But it seems to me that there
might be issues with, say, going outside the address space and getting a
segmentation fault. This could happen, I think, if the word the
character is in is the last in the address space, and one addresses it
as a double or quadruple word (it would be quad on 16-bit words). The C
standard guarantees that one may form a pointer to one element past the
end of an array, but I don't know whether that really applies here,
since the pointer is to an 8-bit quantity. Maybe more of a C expert can
chime in. There may be other glitches like this that I'm not aware of.
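One way to stay on the safe side of that rule is to phrase the loop
guard as a pointer difference rather than forming a possibly
out-of-range pointer; a sketch, not from the patch:

```c
#include <stddef.h>
#include <stdint.h>

/* `end - p` is always defined for two pointers into (or one past the
 * end of) the same buffer, whereas computing `p + 8` when fewer than
 * 8 bytes remain would form an out-of-range pointer, which the C
 * standard does not permit. */
static int
have_full_word(const unsigned char *p, const unsigned char *end)
{
    return (size_t)(end - p) >= sizeof(uint64_t);
}
```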