word-at-a-time searching for UTF-8 invariants

From: Karl Williamson
Date: November 16, 2017 19:03
Subject: word-at-a-time searching for UTF-8 invariants
Message ID: 666a63de-a8b2-0573-fbfa-591fe7ccf09c@khwilliamson.com

I have pushed a branch for review at

https://perl5.git.perl.org/perl.git/shortlog/refs/heads/smoke-me/khw-invariant

which changes is_utf8_invariant_string_loc() (and hence 
is_utf8_invariant_string(), which is defined as a special case of the 
former) to use word-at-a-time (instead of per-byte) parsing through the 
input string.
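
To illustrate the idea only (this is not the code in the branch), here is a 
minimal sketch of a word-at-a-time invariant check.  It assumes an ASCII 
platform, where a byte is invariant exactly when its high bit is clear, and 
a 64-bit word; the name sketch_is_invariant() is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not the function in the branch.  Returns 1 if every
 * byte of s[0..len) has its high bit clear (ASCII-platform invariant),
 * else 0.  On failure, stores the offset of the first variant byte in
 * *first_variant when that pointer is non-NULL. */
static int
sketch_is_invariant(const unsigned char *s, size_t len, size_t *first_variant)
{
    const uint64_t high_bits = UINT64_C(0x8080808080808080);
    size_t i = 0;

    /* Whole words: one load, one AND and one branch per 8 bytes. */
    while (len - i >= sizeof(uint64_t)) {
        uint64_t word;
        memcpy(&word, s + i, sizeof word);  /* alignment-safe load */
        if (word & high_bits)
            break;      /* some byte in this word is a variant */
        i += sizeof word;
    }

    /* The tail, or the word that contained the variant: per byte. */
    for (; i < len; i++) {
        if (s[i] & 0x80) {
            if (first_variant)
                *first_variant = i;
            return 0;
        }
    }
    return 1;
}
```

The per-byte loop at the end both handles the leftover bytes and pins down 
the exact position of the first variant, so the word loop never needs to 
know which byte within a word tripped the test.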

This functionality is commonly used when parsing a string to decide 
whether it is UTF-8 and needs to have the UTF-8 flag turned on.

On a 64-bit system, it yields the following data from Porting/bench.pl

         byte   word
        ------ ------
     Ir 100.00 665.35
     Dr 100.00 797.03
     Dw 100.00 102.12
   COND 100.00 799.27
    IND 100.00  97.56

 COND_m 100.00 144.83
  IND_m 100.00  75.00

  Ir_m1 100.00 100.00
  Dr_m1 100.00 100.02
  Dw_m1 100.00 104.12

  Ir_mm 100.00 100.00
  Dr_mm 100.00 100.00
  Dw_mm 100.00 100.00


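Higher is better in this output, so, for example, the COND figure means 
the word version executes roughly one eighth as many conditional branches 
as the per-byte version (about 8 times better).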

On a 32-bit system, the gains would be roughly half this.

I intend to push this to blead in a week, depending on the comments 
received.  It may be that some code that currently checks for invariants 
locally should change to use this inline function for the speed-up.

For example, bytes_to_utf8() might want to call this first to quickly 
dispose of any invariant head of the converted string.
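
As a rough sketch of that suggestion, assuming an ASCII platform and 
Latin-1 input, the invariant head can be located a word at a time and 
copied in one go, with only the remainder converted byte by byte.  Both 
function names below are hypothetical stand-ins, not the real 
bytes_to_utf8() or its signature.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: length of the leading run of bytes whose high bit
 * is clear, found a word at a time. */
static size_t
sketch_invariant_head(const unsigned char *s, size_t len)
{
    const uint64_t high_bits = UINT64_C(0x8080808080808080);
    size_t i = 0;

    while (len - i >= sizeof(uint64_t)) {
        uint64_t word;
        memcpy(&word, s + i, sizeof word);  /* alignment-safe load */
        if (word & high_bits)
            break;
        i += sizeof word;
    }
    while (i < len && !(s[i] & 0x80))
        i++;
    return i;
}

/* Hypothetical stand-in for a bytes-to-UTF-8 conversion.  Returns a
 * malloc'ed, NUL-terminated buffer (caller frees), or NULL on allocation
 * failure; the converted length is stored in *out_len. */
static unsigned char *
sketch_bytes_to_utf8(const unsigned char *s, size_t len, size_t *out_len)
{
    size_t head = sketch_invariant_head(s, len);
    /* Worst case: every byte after the head expands to two. */
    unsigned char *out = (unsigned char *)malloc(head + 2 * (len - head) + 1);
    unsigned char *d;
    size_t i;

    if (!out)
        return NULL;

    memcpy(out, s, head);       /* invariant head needs no conversion */
    d = out + head;
    for (i = head; i < len; i++) {
        if (s[i] & 0x80) {      /* variant: two-byte UTF-8 sequence */
            *d++ = (unsigned char)(0xC0 | (s[i] >> 6));
            *d++ = (unsigned char)(0x80 | (s[i] & 0x3F));
        } else {
            *d++ = s[i];
        }
    }
    *d = '\0';
    *out_len = (size_t)(d - out);
    return out;
}
```

When the whole string is invariant, the conversion reduces to one scan 
plus one memcpy(), and the buffer is sized exactly.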
