develooper Front page | perl.perl5.porters | Postings from November 2017

word-at-a-time searching for UTF-8 invariants

Karl Williamson
November 16, 2017 19:03
I have pushed a branch for review at

which changes is_utf8_invariant_string_loc() (and hence 
is_utf8_invariant_string(), which is defined as a special case of the 
former) to use word-at-a-time (instead of per-byte) parsing through the 
input string.

This is commonly used functionality for parsing strings to decide if 
they are UTF-8 and need to have the UTF-8 flag on.
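To illustrate the general idea (this is only a rough sketch, not the code on the branch), a word-at-a-time scan on an ASCII platform can test eight bytes at once by masking each word with 0x80 in every byte position. The function name sketch_is_invariant_loc() and its signature are hypothetical, and the sketch assumes a 64-bit word and ignores EBCDIC, where the set of invariants differs.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical illustration, NOT the perl core function.
 * Returns 1 if every byte of s[0..len-1] is below 0x80 (a UTF-8 invariant
 * on ASCII platforms).  Otherwise returns 0 and, if first_variant is
 * non-NULL, stores the offset of the first byte with its high bit set. */
static int
sketch_is_invariant_loc(const unsigned char *s, size_t len,
                        size_t *first_variant)
{
    const uint64_t high_bits = UINT64_C(0x8080808080808080);
    size_t i = 0;

    /* Word loop: examine 8 bytes per iteration. */
    while (len - i >= sizeof(uint64_t)) {
        uint64_t w;
        memcpy(&w, s + i, sizeof w);   /* sidesteps alignment concerns */
        if (w & high_bits)
            break;                     /* some byte in this word is variant */
        i += sizeof w;
    }

    /* Byte loop: handles the tail and pins down the exact variant byte. */
    for (; i < len; i++) {
        if (s[i] & 0x80) {
            if (first_variant)
                *first_variant = i;
            return 0;
        }
    }
    return 1;
}
```

The per-byte loop is still needed for the final partial word and for locating the offending byte once a word fails the mask test.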

On a 64-bit system, it yields the following data from Porting/

         byte   word
        ------ ------
     Ir 100.00 665.35
     Dr 100.00 797.03
     Dw 100.00 102.12
   COND 100.00 799.27
    IND 100.00  97.56

 COND_m 100.00 144.83
  IND_m 100.00  75.00

  Ir_m1 100.00 100.00
  Dr_m1 100.00 100.02
  Dw_m1 100.00 104.12

  Ir_mm 100.00 100.00
  Dr_mm 100.00 100.00
  Dw_mm 100.00 100.00

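This means, for example, that on the COND measurement (where the byte 
version is the 100.00 baseline and higher is better) the word version 
scores about 8 times better.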

On a 32-bit system, the gains would be roughly half this, since a word 
there holds 4 bytes rather than 8.

I intend to push this to blead in a week, depending on the comments 
received.  It may be that some code that currently checks for invariants 
locally should change to use this inline function for the speedup.

For example, bytes_to_utf8() might want to call this first to quickly 
dispose of any invariant head of the converted string.
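A rough sketch of that idea follows, assuming an ASCII platform and Latin-1 input. Both sketch_invariant_head() and sketch_bytes_to_utf8() are hypothetical stand-ins written for illustration. They are not the core functions, and the real bytes_to_utf8() has a different signature and memory handling.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: length of the leading run of bytes below 0x80,
 * found a 64-bit word at a time, then per byte for the remainder. */
static size_t
sketch_invariant_head(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (len - i >= sizeof(uint64_t)) {
        uint64_t w;
        memcpy(&w, s + i, sizeof w);
        if (w & UINT64_C(0x8080808080808080))
            break;
        i += sizeof w;
    }
    while (i < len && !(s[i] & 0x80))
        i++;
    return i;
}

/* Hypothetical stand-in for a Latin-1 to UTF-8 conversion.  Returns a
 * malloc'd, NUL-terminated buffer (caller frees) and sets *out_len. */
static unsigned char *
sketch_bytes_to_utf8(const unsigned char *s, size_t len, size_t *out_len)
{
    size_t head = sketch_invariant_head(s, len);
    /* Worst case: every byte after the head expands to two bytes. */
    unsigned char *d = (unsigned char *)malloc(head + 2 * (len - head) + 1);
    unsigned char *p;
    size_t i;

    if (!d)
        return NULL;

    memcpy(d, s, head);            /* invariant head is copied unchanged */
    p = d + head;
    for (i = head; i < len; i++) {
        if (s[i] < 0x80) {
            *p++ = s[i];
        } else {                   /* 0x80..0xFF becomes a 2-byte sequence */
            *p++ = (unsigned char)(0xC0 | (s[i] >> 6));
            *p++ = (unsigned char)(0x80 | (s[i] & 0x3F));
        }
    }
    *p = '\0';
    *out_len = (size_t)(p - d);
    return d;
}
```

When the whole string is invariant, the conversion reduces to one scan and one memcpy, and the output buffer needs no extra room for expansion.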
