develooper Front page | perl.perl5.porters | Postings from August 2012

Re: Seeking guidance on Benchmarking and potentially removing swashcaching

Thread Previous | Thread Next
Karl Williamson
August 21, 2012 13:27
Re: Seeking guidance on Benchmarking and potentially removing swashcaching
Message ID:
On 08/19/2012 04:53 PM, Jarkko Hietaniemi wrote:
>> I would prefer to benchmark more realistic real-world data, and wonder
>> if anyone has some ideas on such, or any other thoughts on this issue.
> Just a very random idea: pull in a few wikipedia pages in, say, French,
> Russian, and Japanese, and while (m{\pL}g) { $n++ } through them?

Thank you for this idea.  I did it for Russian, and it showed the 
current scheme had between 20-25% advantage over my proposed one, so I 
won't be pursuing the proposal as-is.

It got me to being more realistic about what real-world applications 
look like.  Most things are written in just one language, or at most a 
few.  And so processing will be of just a relatively few code points. 
The hash implementation shines in this regard.  Modern Cyrillic has 32 
characters IIRC, times 2 for upper/lower case.  Swash hashes will handle 
these in just a couple of keys.  Chinese has a lot more characters, so I 
tried it on a Chinese wikipedia page, and got similar results.  So even 
though there are more keys in the hash, it's not enough to degrade hash 

I did an experiment with a a property for Korean with a small (4) number 
of entries in the inversion list, and ran it on a Korean wikipedia 
entry.  That showed the inversion list was about 8% faster than a swash.

As I said at first, the swash implementation has the disadvantage that 
it continues to use memory and add keys as different, not already-tried 
code points are tested against it.  That should lead to a performance 
degradation over time, in such a situation.  (I'm saying this not 
knowing anything about the Perl implementation, but from my general 
knowledge of how hashes work, which is decades old.)  But I don't think 
there are that many real world situations where this likely happens. 
One possible one is a long-running web server that serves up pages in 
many different languages.  Perhaps Wikipedia's is structured this way. 
Other situations could include linguistics analysis code, and code that 
is analyzing Unicode itself.  These last two possibilities are so 
specialized that we shouldn't worry about them at all.

I'm now leaning towards a composite algorithm determined at regex 
compile time.  If the inversion list is sufficiently small, use it; 
otherwise use the current method.  If it's easy to detect when the hash 
performance is suffering, we could do that at every insertion, and 
probably clear the hash completely to start over.

In the long run, it would be best to get most or all of the standard 
Unicode properties into memory, using the techniques that ICU does; then 
this wouldn't much matter.

Thread Previous | Thread Next Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About