perl.perl5.porters | Postings from March 2006

UTF-8 caching code

Nicholas Clark
March 23, 2006 07:44
I have done battle with the UTF-8 offset caching code and, well, it came a
close second.

There is now a ${^UTF8CACHE} variable. By default, it's set to 1, which means
that the offset cache is enabled. Set it to 0, and the cache is disabled.
Set it to -1 and the cache is enabled, but every result is double checked
with the uncached version. Compiling with -DPERL_UTF8_CACHE_ASSERT makes
-1 the default, and all the regression tests pass with this.
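
To make the three settings concrete, here is a rough sketch in C. It is not the code in the perl source: utf8cache_mode, one_entry_cache and both function names are invented for this illustration, and the cache holds just one remembered position to keep the example short. It assumes valid UTF-8 input.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for ${^UTF8CACHE}:
 * 1 = cache on, 0 = cache off, -1 = cache on and every result verified. */
static int utf8cache_mode = 1;

/* Uncached reference: byte offset of character index `chars`,
 * found by a linear scan from the start of the buffer. */
static size_t char_to_byte_slow(const unsigned char *s, size_t len, size_t chars)
{
    size_t byte = 0;
    while (chars > 0 && byte < len) {
        byte++;
        /* Skip UTF-8 continuation bytes (10xxxxxx). */
        while (byte < len && (s[byte] & 0xC0) == 0x80)
            byte++;
        chars--;
    }
    return byte;
}

/* Illustrative one-entry cache: the last (chars, byte) answer we gave. */
typedef struct {
    int valid;
    size_t chars;
    size_t byte;
} one_entry_cache;

static size_t char_to_byte(const unsigned char *s, size_t len, size_t chars,
                           one_entry_cache *c)
{
    size_t result;

    if (utf8cache_mode == 0)            /* cache disabled */
        return char_to_byte_slow(s, len, chars);

    if (c->valid && c->chars <= chars)  /* resume from the cached position */
        result = c->byte + char_to_byte_slow(s + c->byte, len - c->byte,
                                             chars - c->chars);
    else
        result = char_to_byte_slow(s, len, chars);

    c->valid = 1;
    c->chars = chars;
    c->byte = result;

    if (utf8cache_mode < 0)             /* double check against the slow path */
        assert(result == char_to_byte_slow(s, len, chars));

    return result;
}
```

With the mode at -1, a cached answer that disagrees with the linear scan trips the assert at once, which is the point of that setting.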

The cache itself is tweaked slightly - it's still 4 STRLEN values, but
instead of a byte/UTF-8 pair for a position, and a second pair for a substr
offset from there, it's now 2 absolute byte/UTF-8 tuples. This should mean
that there's more information stored on average, and so a greater chance of a
cache hit (or at least near-miss).
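
One way to picture the new layout is the following C sketch. The type and function names are made up for this example, and the replacement policy (keep the two most recently stored positions) is only an assumption; the message does not say which pair gets evicted.

```c
#include <stddef.h>

/* Hypothetical layout: four size_t slots holding two absolute positions.
 * slot[0], slot[1] = character offset and byte offset of pair A
 * slot[2], slot[3] = character offset and byte offset of pair B
 * All zeros is a valid state, since (0 chars, 0 bytes) is always true. */
typedef struct {
    size_t slot[4];
} utf8_pos_cache;

static void cache_init(utf8_pos_cache *c)
{
    for (int i = 0; i < 4; i++)
        c->slot[i] = 0;
}

/* Remember a newly computed position. Assumed policy: the newest
 * position becomes pair A and the previous pair A moves to pair B. */
static void cache_store(utf8_pos_cache *c, size_t chars, size_t bytes)
{
    if (c->slot[0] == chars)
        return;                 /* already known */
    c->slot[2] = c->slot[0];
    c->slot[3] = c->slot[1];
    c->slot[0] = chars;
    c->slot[1] = bytes;
}

/* Find the best known starting point at or before target_chars.
 * Returns its byte offset and writes its character offset to *start_chars.
 * Falls back to the start of the string, (0, 0). */
static size_t cache_best_start(const utf8_pos_cache *c, size_t target_chars,
                               size_t *start_chars)
{
    size_t best_chars = 0, best_bytes = 0;

    for (int i = 0; i < 4; i += 2) {
        if (c->slot[i] <= target_chars && c->slot[i] >= best_chars) {
            best_chars = c->slot[i];
            best_bytes = c->slot[i + 1];
        }
    }
    *start_chars = best_chars;
    return best_bytes;
}
```

Because both pairs are absolute, either one can serve as a starting point on its own, which is where the better chance of a hit or near-miss comes from.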

The code itself is tweaked somewhat - I believe that there were some
situations where the old code wasn't using all the information it had to
minimise the amount of linear searching it needed to perform. It should now
be using all the information possible to constrain the range of string that
needs to be searched linearly.
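
The idea of constraining the search can be sketched in C as below. This illustrates the general technique and is not the actual perl code: known_pos and char_to_byte_bounded are hypothetical names. It assumes a NUL-terminated buffer of valid UTF-8, a target that lies within the string, and known positions that are all correct.

```c
#include <stddef.h>

/* A position known to be correct: `chars` characters == `bytes` bytes. */
typedef struct {
    size_t chars;
    size_t bytes;
} known_pos;

static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;  /* UTF-8 continuation byte, 10xxxxxx */
}

/* Convert character index `target` to a byte offset, using every known
 * position to bracket the target and then walking from the nearer side. */
static size_t char_to_byte_bounded(const unsigned char *s, size_t target,
                                   const known_pos *known, size_t n_known)
{
    known_pos lo = {0, 0};      /* start of string is always known */
    known_pos hi = {0, 0};
    int have_hi = 0;

    for (size_t i = 0; i < n_known; i++) {
        if (known[i].chars <= target) {
            if (known[i].chars >= lo.chars)
                lo = known[i];  /* tighter lower bound */
        } else if (!have_hi || known[i].chars < hi.chars) {
            hi = known[i];      /* tighter upper bound */
            have_hi = 1;
        }
    }

    if (have_hi && hi.chars - target < target - lo.chars) {
        /* Closer to the upper bound: walk backwards. */
        size_t byte = hi.bytes;
        for (size_t n = hi.chars - target; n > 0; n--) {
            byte--;
            while (is_continuation(s[byte]))
                byte--;
        }
        return byte;
    }

    /* Otherwise walk forwards from the lower bound. */
    size_t byte = lo.bytes;
    for (size_t n = target - lo.chars; n > 0; n--) {
        byte++;
        while (is_continuation(s[byte]))
            byte++;
    }
    return byte;
}
```

If the total length of the string in both bytes and characters is known, it can be passed in as one more known position, so that lookups near the end walk backwards from there.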

I think that it would be useful to have a command line (sub)flag to set
${^UTF8CACHE} to -1, to allow easy debugging of scripts. Should this be an
option on -C, or -D? Any suggestions?

Nicholas Clark
