Re: Seeking guidance on Benchmarking and potentially removing swash caching
From: Karl Williamson
Date: August 21, 2012 13:27
Subject: Re: Seeking guidance on Benchmarking and potentially removing swash caching
Message ID: 5033EF02.40303@khwilliamson.com
On 08/19/2012 04:53 PM, Jarkko Hietaniemi wrote:
>>
>> I would prefer to benchmark more realistic real-world data, and wonder
>> if anyone has some ideas on such, or any other thoughts on this issue.
>
> Just a very random idea: pull in a few wikipedia pages in, say, French,
> Russian, and Japanese, and while (m{\pL}g) { $n++ } through them?
>
Thank you for this idea. I did it for Russian, and it showed the current
scheme has a 20-25% advantage over my proposed one, so I won't be
pursuing the proposal as-is.
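A harness along the lines Jarkko suggested might look roughly like this
(the file name and the use of the core Benchmark module are assumptions
for illustration, not exactly what I ran):

    use strict;
    use warnings;
    use Benchmark qw(timethis);

    # Slurp a saved Wikipedia page (hypothetical file name) as UTF-8 text.
    open my $fh, '<:encoding(UTF-8)', 'ru_wikipedia_page.txt' or die "open: $!";
    my $text = do { local $/; <$fh> };

    # Time repeated scans that count \p{L} matches, as in the quoted suggestion.
    timethis(100, sub {
        my $n = 0;
        $n++ while $text =~ m{\p{L}}g;
    });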
It got me thinking more realistically about what real-world applications
look like. Most things are written in just one language, or at most a
few, so processing touches only a relatively small set of code points.
The hash implementation shines in this regard. Modern Cyrillic has 32
characters IIRC, times 2 for upper/lower case. Swash hashes will handle
these in just a couple of keys. Chinese has many more characters, so I
tried it on a Chinese Wikipedia page, and got similar results. So even
though there are more keys in the hash, it's not enough to degrade hash
performance.
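To make the point about key counts concrete, here is a rough user-level
illustration of the caching idea. This is not perl's actual swash code;
the per-64-code-point granularity and all the names are assumptions:

    # Look-ups are cached per block of 64 code points, so text confined
    # to one small alphabet touches only a handful of hash keys.
    my %cache;    # block number => 64-bit membership string

    sub cached_match {
        my ($code_point, $compute_block) = @_;   # $compute_block: hypothetical slow lookup
        my $block = $code_point >> 6;            # 64 code points per key
        $cache{$block} //= $compute_block->($block);   # fill only on first miss
        return vec($cache{$block}, $code_point & 63, 1);
    }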
I did an experiment with a property for Korean that has a small number
(4) of entries in its inversion list, and ran it on a Korean Wikipedia
entry. That showed the inversion list was about 8% faster than a swash.
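For comparison, an inversion-list membership test is just a binary
search over the sorted range-start values, with membership given by the
parity of the insertion point. A minimal sketch (the 4-entry list below
is illustrative only, not the actual property I tested):

    # Ranges alternate in/out: [0x1100, 0x1200) is in, [0x1200, 0xAC00) is out, etc.
    my @korean_invlist = (0x1100, 0x1200, 0xAC00, 0xD7A4);

    sub in_inversion_list {
        my ($cp, $list) = @_;
        my ($lo, $hi) = (0, scalar @$list);
        while ($lo < $hi) {                       # find count of entries <= $cp
            my $mid = int(($lo + $hi) / 2);
            if ($list->[$mid] <= $cp) { $lo = $mid + 1 } else { $hi = $mid }
        }
        return $lo % 2;                           # odd count => inside an "in" range
    }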
As I said at first, the swash implementation has the disadvantage that
it continues to use memory and add keys as different, not-yet-tried
code points are tested against it. In such a situation, that should
lead to performance degradation over time. (I'm saying this not knowing
anything about the Perl implementation, but from my general knowledge
of how hashes work, which is decades old.) But I don't think there are
many real-world situations where this is likely to happen. One
possibility is a long-running web server that serves up pages in many
different languages; perhaps Wikipedia's is structured this way. Other
situations could include linguistic analysis code, and code that is
analyzing Unicode itself. These last two possibilities are so
specialized that we shouldn't worry about them at all.
I'm now leaning towards a composite algorithm, chosen at regex compile
time: if the inversion list is sufficiently small, use it; otherwise
use the current method. If it's easy to detect when the hash
performance is suffering, we could check for that at every insertion,
and probably clear the hash completely to start over.
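As a sketch of that composite approach (the threshold and the "hash is
suffering" limit here are invented for illustration, not proposed
values):

    use constant SMALL_INVLIST_MAX => 8;      # assumed cut-off for "sufficiently small"
    use constant MAX_CACHED_KEYS   => 1024;   # assumed point at which the hash "suffers"

    # Chosen once, at regex compile time.
    sub choose_lookup {
        my ($invlist) = @_;
        return @$invlist <= SMALL_INVLIST_MAX ? 'inversion list' : 'swash hash';
    }

    # Called on each insertion into the swash cache; start over if it has grown too big.
    sub note_insertion {
        my ($cache) = @_;
        %$cache = () if keys %$cache > MAX_CACHED_KEYS;
    }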
In the long run, it would be best to get most or all of the standard
Unicode properties into memory, using the techniques that ICU uses;
then this wouldn't matter much.