develooper Front page | perl.perl5.porters | Postings from April 2003

[patch] probable Unicode speed trap solution

Thread Next
From:
Pradeep Hodigere
Date:
April 8, 2003 18:18
Subject:
[patch] probable Unicode speed trap solution
Message ID:
20030409011751.17703.qmail@web12303.mail.yahoo.com
Hi All,

 perldoc perlunicode's speed section mentions the
slowness of length(), substr() and index() functions
when handling UTF-8 encoded strings. A mail thread
titled 'Unicode speed trap' discusses probable
solutions to this problem.

   I have an implementation that might be a solution
to this issue and have attached a patch for the same.
The patch brought about a sizeable performance
improvement in length() and substr() functions. 

   Following are the results from benchmarking
length() and substr() with the modified perl build.

######################################################################################################

Benchmarking length() function

LENGTH_B = 10000 ASCII character string
LENGTH_U = 10000 unicode character string

perl 5.8.0:
Benchmark: running LENGTH_B, LENGTH_U for at least 5
CPU seconds...
  LENGTH_B:  5 wallclock secs ( 5.34 usr +  0.00 sys =
 5.34 CPU) @ 4747579.59/s (n=25352075)
  LENGTH_U:  5 wallclock secs ( 5.22 usr +  0.00 sys =
 5.22 CPU) @ 10010.73/s (n=52256)

modified perl 5.8.0:
Benchmark: running LENGTH_B, LENGTH_U for at least 5
CPU seconds...
  LENGTH_B:  5 wallclock secs ( 5.14 usr +  0.01 sys =
 5.15 CPU) @ 5396072.62/s (n=27789774)
  LENGTH_U:  5 wallclock secs ( 5.04 usr +  0.00 sys =
 5.04 CPU) @ 5322035.91/s (n=26823061)


#######################################################################################################

Benchmarking substr() function. 

SUBSTR_B: 100 ASCII character string
SUBSTR_U: 100 UTF-8 encoded character string

perl 5.8.0:
Benchmark: running SUBSTR_B, SUBSTR_U, each for at
least 5 CPU seconds...
  SUBSTR_B:  6 wallclock secs ( 5.23 usr +  0.00 sys =
 5.23 CPU) @ 977.25/s (n=5111)
  SUBSTR_U:  6 wallclock secs ( 5.18 usr +  0.00 sys =
 5.18 CPU) @ 62.55/s (n=324)

modified perl 5.8.0:
Benchmark: running SUBSTR_B, SUBSTR_U, each for at
least 5 CPU seconds...
  SUBSTR_B:  5 wallclock secs ( 5.40 usr +  0.00 sys =
 5.40 CPU) @ 928.89/s (n=5016)
  SUBSTR_U:  6 wallclock secs ( 5.30 usr +  0.00 sys =
 5.30 CPU) @ 176.98/s (n=938)

########################################################

   The result of benchmarking a few other string
functions are attached to the mail.

A brief description of the solution:

   The solution needs modifying struct STRUCT_SV in
sv.h by adding a length element to it. On the first
call of length(), for a UTF-8 encoded string,
sv_len_utf8() sets the string length to sv_length, an
element in struct sv. Following calls to sv_len_utf8()
by length() or substr() would return the 
value stored in sv_length.

   All string modifying functions reset sv_length to
0.

    The benchmark results would indicate that
functions  that operate on a single character at a
time don't see any improvements, for obvious reasons.
 
    On a side note, I believe the 'speed' section in
the current perlunicode document may not be reflecting
the comparison between split() and substr().

  perldoc perlunicode says: 

  "Even though the algorithm based on substr() is
faster than split() for byte-encoded data, it pales in
comparison to the speed of split() when used with
UTF-8 data. "

  Benchmarking split() and substr() gave me the
following results that indicate substr() is faster
than split().

  Please review the following benchmark results:
  
Since we need to benchmark subtr() alone, i store the
length of the string in advance.

% perl -e '
  use Benchmark;
  use strict;
  our $l = 10000;
  our $b; our $u;
  our $bl; our $ul;
 
  our $b = our $u = "x" x $l;
  substr($u,0, 1) = "\x{100}";
  my $bl = length($b) - 1;
  my $ul = length($u) - 1;
  timethese(-5,{
    SPLIT_B => q{ for my $c (split //, $b){}  },
    SPLIT_U => q{ for my $c (split //, $u){}  },
    SUBSTR_N_B => q{ for my $i (0..$bl){my $c =
substr($b,$i,1);} },
    SUBSTR_N_U => q{ for my $i (0..$ul){my $c =
substr($u,$i,1);} },
  });'
 
following are the results:
Benchmark: running SPLIT_B, SPLIT_U, SUBSTR_N_B,
SUBSTR_N_U, each for at least 5 CPU seconds...
   SPLIT_B:  6 wallclock secs ( 5.30 usr +  0.02 sys =
 5.32 CPU) @ 51.13/s (n=272)
   SPLIT_U:  5 wallclock secs ( 5.37 usr +  0.00 sys =
 5.37 CPU) @ 50.28/s (n=270)
SUBSTR_N_B:  5 wallclock secs ( 5.13 usr +  0.00 sys =
 5.13 CPU) @ 250791.62/s (n=1286561)
SUBSTR_N_U:  4 wallclock secs ( 5.31 usr +  0.00 sys =
 5.31 CPU) @ 10028.63/s (n=53252)

  as against the following script that's in
perlunicode man page:

%perl -e '
  use Benchmark;
  use strict;
  our $l = 10000;
  our $b; our $u;
  our $bl; our $ul;
 
  our $b = our $u = "x" x $l;
  substr($u,0, 1) = "\x{100}";
  timethese(-5,{
  SPLIT_B => q{ for my $c (split //, $b){}  },
  SPLIT_U => q{ for my $c (split //, $u){}  },
  SUBSTR_L_B => q{ for my $i (0..length($b)-1){my $c =
substr($b,$i,1);} },
  SUBSTR_L_U => q{ for my $i (0..length($u)-1){my $c =
substr($u,$i,1);} },
});'

  whose results are:
Benchmark: running SPLIT_B, SPLIT_U, SUBSTR_L_B,
SUBSTR_L_U, each for at least 5 CPU seconds...
   SPLIT_B:  5 wallclock secs ( 5.35 usr +  0.00 sys =
 5.35 CPU) @ 50.84/s (n=272)
   SPLIT_U:  5 wallclock secs ( 5.36 usr +  0.01 sys =
 5.37 CPU) @ 50.65/s (n=272)
SUBSTR_L_B:  6 wallclock secs ( 5.34 usr +  0.00 sys =
 5.34 CPU) @ 98.31/s (n=525)
SUBSTR_L_U:  6 wallclock secs ( 5.74 usr +  0.00 sys =
 5.74 CPU) @  0.70/s (n=4)


  The first set of results is faster than the second
as length() is called while benchmarking substr() in
the second test.

  So the actual comparison between split() and
substr() is in the first set of results. As is
apparent, substr() is faster than split() in those
tests.

  I would greatly appreciate advise/other ideas on the
patch.

thanks,
-Pradeep  

__________________________________________________
Do you Yahoo!?
Yahoo! Tax Center - File online, calculators, forms, and more
http://tax.yahoo.com
Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About