perl.perl5.porters | Postings from January 2004

[perl #24826] approx. 10 times faster utf8 string operations

From: roal@anet.at
Date: January 24, 2004 15:43
Subject: [perl #24826] approx. 10 times faster utf8 string operations
Message ID: m3smi4rjzd.wl_rspier@pobox.com

This is a bug report for perl from roal@anet.at,
generated with the help of perlbug 1.34 running under perl v5.8.2.

The perlunicode pod says

	In Perl 5.8.0 the slowness was often quite spectacular; 
	in Perl 5.8.1 a caching scheme was introduced which will hopefully make the 
	slowness somewhat less spectacular, at least for some operations. In general, 
	operations with UTF-8 encoded strings are still slower.

Regular expressions have always been what Perl is famous for, and they are certainly 
one reason for Perl's name, the Practical Extraction and Report Language.
Unfortunately, they are still not efficient on strings that are 
flagged as UTF-8.

I have investigated this and found a way to make regular expression operations
(including case-insensitive matching) as well as lower- and uppercasing on UTF-8 encoded strings 
about 10 times faster than before.

Below is the test script I used to measure the effect on a simple, pure
ASCII string. I got the following results:

On Perl 5.8.0, case-insensitive searches on utf8 strings are always extremely slow.
The patched files improve performance only slightly
(524 s -> 474 s on a Windows machine).

On Perl 5.8.2, performance is much better, but still very poor by default.
With the patched files, it improves dramatically
(77 s -> 9 s on the same Windows machine and 775 s -> 66 s on a slower BSD/OS system).
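
To put the "about 10 times" figure in context, the speed-up factors can be computed from the timings quoted above. This is only a small sketch; the `speedup` helper is a hypothetical name, and the timings come from time() with one-second resolution, so the factors are rough (about 1.1, 8.6 and 11.7).

```perl
use strict;
use warnings;

# Timings in seconds (before patch, after patch) as quoted in this report.
my %timings = (
    '5.8.0 MSWin32' => [524, 474],
    '5.8.2 MSWin32' => [ 77,   9],
    '5.8.2 bsdos'   => [775,  66],
);

# Hypothetical helper: time before the patch divided by time after it.
sub speedup {
    my ($before, $after) = @_;
    return $before / $after;
}

for my $label (sort keys %timings) {
    printf "%-14s %5.1fx\n", $label, speedup(@{ $timings{$label} });
}
```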

Save the code given below as "utf8.pl" and run it with

	perl utf8.pl

to get test results like those shown below. To build the test string with a different multiplier 
(the default is 5e4), pass it as an argument, for example 10000 (which is equal to 1e4):

	perl utf8.pl 1e4

My results were:

with default Perl 5.8.2:
=======================

UTF-8 Performance Test on MSWin32 with Perl 5.8.2
Tue Jan  6 16:50:55 2004: Test String created: pure ASCII 'A-Za-z' x 50000 (2.5 MB)

String is now treated as bytes
Tue Jan  6 16:50:55 2004: 100000 case-insensitive occurencies of 'abc' found in String
Tue Jan  6 16:50:59 2004: 1300000 lowercase and 1300000 uppercase characters found in String
Required time: 4 seconds

String is now treated as utf8
Tue Jan  6 16:51:55 2004: 100000 case-insensitive occurencies of 'abc' found in String
Tue Jan  6 16:52:16 2004: 1300000 lowercase and 1300000 uppercase characters found in String
Required time: 77 seconds

Switching to utf8 semantics required the following additional files to load:
        unicore/Canonical.pl
        unicore/Exact.pl
        unicore/To/Fold.pl
        unicore/To/Lower.pl
        unicore/To/Upper.pl
        unicore/lib/Word.pl
        utf8.pm
        utf8_heavy.pl
--

UTF-8 Performance Test on bsdos with Perl 5.8.2
Tue Jan  6 06:33:46 2004: Test String created: pure ASCII 'A-Za-z' x 50000 (2.5 MB)

String is now treated as bytes
Tue Jan  6 06:33:47 2004: 100000 case-insensitive occurencies of 'abc' found in String
Tue Jan  6 06:34:16 2004: 1300000 lowercase and 1300000 uppercase characters found in String
Required time: 30 seconds

String is now treated as utf8
Tue Jan  6 06:45:21 2004: 100000 case-insensitive occurencies of 'abc' found in String
Tue Jan  6 06:47:11 2004: 1300000 lowercase and 1300000 uppercase characters found in String
Required time: 775 seconds

with Perl 5.8.2, after the patch:
================================

UTF-8 Performance Test on MSWin32 with Perl 5.8.2
Tue Jan  6 16:54:24 2004: Test String created: pure ASCII 'A-Za-z' x 50000 (2.5 MB)

String is now treated as bytes
Tue Jan  6 16:54:24 2004: 100000 case-insensitive occurencies of 'abc' found in String
Tue Jan  6 16:54:28 2004: 1300000 lowercase and 1300000 uppercase characters found in String
Required time: 4 seconds

String is now treated as utf8
Tue Jan  6 16:54:30 2004: 100000 case-insensitive occurencies of 'abc' found in String
Tue Jan  6 16:54:37 2004: 1300000 lowercase and 1300000 uppercase characters found in String
Required time: 9 seconds
--

UTF-8 Performance Test on bsdos with Perl 5.8.2
Tue Jan  6 06:52:35 2004: Test String created: pure ASCII 'A-Za-z' x 50000 (2.5 MB)

String is now treated as bytes
Tue Jan  6 06:52:36 2004: 100000 case-insensitive occurencies of 'abc' found in String
Tue Jan  6 06:53:05 2004: 1300000 lowercase and 1300000 uppercase characters found in String
Required time: 30 seconds

String is now treated as utf8
Tue Jan  6 06:53:14 2004: 100000 case-insensitive occurencies of 'abc' found in String
Tue Jan  6 06:54:11 2004: 1300000 lowercase and 1300000 uppercase characters found in String
Required time: 66 seconds

with default Perl 5.8.0:
=======================

UTF-8 Performance Test on MSWin32 with Perl 5.8.0
Tue Jan  6 16:56:56 2004: Test String created: pure ASCII 'A-Za-z' x 50000 (2.5 MB)

String is now treated as bytes
Tue Jan  6 16:56:56 2004: 100000 case-insensitive occurencies of 'abc' found in String
Tue Jan  6 16:57:00 2004: 1300000 lowercase and 1300000 uppercase characters found in String
Required time: 4 seconds

String is now treated as utf8
Tue Jan  6 17:05:24 2004: 100000 case-insensitive occurencies of 'abc' found in String
Tue Jan  6 17:05:44 2004: 1300000 lowercase and 1300000 uppercase characters found in String
Required time: 524 seconds

with Perl 5.8.0, after the patch:
================================

UTF-8 Performance Test on MSWin32 with Perl 5.8.0
Tue Jan  6 17:07:14 2004: Test String created: pure ASCII 'A-Za-z' x 50000 (2.5 MB)

String is now treated as bytes
Tue Jan  6 17:07:14 2004: 100000 case-insensitive occurencies of 'abc' found in String
Tue Jan  6 17:07:19 2004: 1300000 lowercase and 1300000 uppercase characters found in String
Required time: 5 seconds

String is now treated as utf8
Tue Jan  6 17:15:05 2004: 100000 case-insensitive occurencies of 'abc' found in String
Tue Jan  6 17:15:13 2004: 1300000 lowercase and 1300000 uppercase characters found in String
Required time: 474 seconds


The Solution for the patch:
---------------------------

Entirely remove the '%utf8::ToSpecFUNCTION = (...)' definition from 'unicore/To/FUNCTION.pl',
where FUNCTION stands for 'Fold', 'Lower' or 'Upper'. That is:

from 'unicore/To/Fold.pl'  -> remove '%utf8::ToSpecFold  = (...)'
from 'unicore/To/Lower.pl' -> remove '%utf8::ToSpecLower = (...)'
from 'unicore/To/Upper.pl' -> remove '%utf8::ToSpecUpper = (...)'

As soon as the variable name '%utf8::ToSpecFUNCTION' is used anywhere in Perl code, some black magic
kicks in and causes the horrible slowdown on utf8 strings. %utf8::ToSpecFUNCTION does not even
need to be defined to "enable" that "black magic".
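
The removal described above could be scripted roughly as follows. This is a sketch under assumptions, not the exact procedure used for the measurements: `strip_tospec` is a hypothetical helper, the sample text is a heavily shortened stand-in for 'unicore/To/Lower.pl', and the regex assumes the assignment starts at the beginning of a line and ends with ');' on a line of its own. Try it on a copy of the file first.

```perl
use strict;
use warnings;

# Hypothetical helper: remove the '%utf8::ToSpecNAME = ( ... );' assignment
# from the text of a unicore/To/NAME.pl file and return the modified text.
# Assumption: the assignment starts a line and its closing ');' is on its own line.
sub strip_tospec {
    my ($text, $name) = @_;
    $text =~ s/^%utf8::ToSpec\Q$name\E\s*=\s*\(.*?^\);[ \t]*\n?//ms;
    return $text;
}

# Shortened, made-up stand-in for unicore/To/Lower.pl (layout is assumed).
my $sample = <<'SAMPLE';
# generated table (shortened for illustration)

%utf8::ToSpecLower =
(
'0130' => "\x{0069}\x{0307}",
);

return <<'END';
0041	005A	0061
END
SAMPLE

my $patched = strip_tospec($sample, 'Lower');
print $patched;
```

Only the hash assignment is dropped; the 'return' with the plain mapping table is left untouched.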

This is just a workaround, but a very effective one. I suspect that the real fix for this problem 
belongs somewhere in 'utf8.c', which contains

	The "special" is a string like "utf8::ToSpecLower", which means the
	hash %utf8::ToSpecLower.  The access to the hash is through
	Perl_to_utf8_case().

Further investigation of that C source may reveal the cause.
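
For background on what the "special" hashes are for: Unicode defines a few case mappings that turn one character into several, and as far as can be told from the generated files, those are the entries kept in %utf8::ToSpecLower and friends. The sketch below shows two such mappings; `codepoints` is a hypothetical helper. Whether results like these survive the workaround is an open question that the timings above do not cover.

```perl
use strict;
use warnings;

# Hypothetical helper: list the code points of a string as hex, space separated.
sub codepoints {
    my ($s) = @_;
    return join ' ', map { sprintf '%04X', ord($_) } split //, $s;
}

# U+0130 (capital I with dot above) lowercases to two characters,
# U+FB00 (ff ligature) uppercases to two characters.
printf "lc(U+0130) = %s\n", codepoints(lc "\x{0130}");
printf "uc(U+FB00) = %s\n", codepoints(uc "\x{FB00}");
```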

best,
rob.


=cut

######## start 'utf8.pl' test script ########
# Robert Allerstorfer 2004 01 06
# applies to Perl 5.8.2
#
$^W = 1;
use strict;
use 5.008;
require Encode;
# Encode is required because even in Perl 5.8.2 there is nothing like utf8::_utf8_on($octets)

my $multiply = @ARGV ? shift(@ARGV) + 0 : 0;
$multiply ||= 5e4;

printf "UTF-8 Performance Test on $^O with Perl %vd\n", $^V;
my $string = join ("", ('A'..'Z', 'a'..'z')) x $multiply;
($_) = &now;
print "$_: Test String created: pure ASCII 'A-Za-z' x $multiply (";
print int((length($string) / 1024**2 * 10) + .5) / 10, " MB)\n\n";

utf8::encode($string);
(undef, my $t0) = &now;
&search($string);
my @inckeys0 = keys %INC;
(undef, my $t1) = &now;
print "Required time: ", $t1 - $t0, " seconds\n\n";

Encode::_utf8_on($string);
&search($string);
(undef, my $t2) = &now;
print "Required time: ", $t2 - $t1, " seconds\n\n";

print "Switching to utf8 semantics required the following additional files to load:\n\t";
my %seen;
@seen{@inckeys0} = ();
my @newinckeys;
foreach (keys %INC) {
	push @newinckeys, $_ unless exists $seen{$_};
}
print "$_\n\t" foreach (sort @newinckeys);
print "\n";
exit 0;


sub now {
	my $t = time;
	return scalar localtime($t), $t;
}

sub search {
	my $string = shift;
	print "String is now treated as ";
	my $utf8_flag = $] < 5.008001 ? Encode::is_utf8($string) : utf8::is_utf8($string);
	print $utf8_flag ? "utf8" : "bytes", "\n";

	my $term = "abc";
	my $matches = $string =~ s/($term)/$1/gi;
	($_) = &now;
	print "$_: $matches case-insensitive occurencies of '$term' found in String\n";

	my ($lc, $uc) = (0, 0);
	while ($string =~ /(\w)/g) {
		if ($1 eq lc $1) {
			$lc ++;
		} elsif ($1 eq uc $1) {
			$uc ++;
		}
	}
	($_) = &now;
	print "$_: $lc lowercase and $uc uppercase characters found in String\n";
}


__END__
#
######## end of script ########
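
A note on why the script uses Encode::_utf8_on: for a pure ASCII string, utf8::decode leaves the UTF-8 flag off, so it cannot be used to switch the test string to utf8 semantics. The sketch below compares the options. `flag_after` is a hypothetical helper, utf8::is_utf8 needs Perl 5.8.1 or later, and utf8::upgrade is shown only as a possible alternative that was not used for the timings above.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: build a pure-ASCII string, apply one way of "switching"
# it, and report whether the UTF-8 flag is set afterwards (1) or not (0).
sub flag_after {
    my ($how) = @_;
    my $s = join '', 'A'..'Z', 'a'..'z';
    if    ($how eq 'decode')   { utf8::decode($s) }
    elsif ($how eq 'upgrade')  { utf8::upgrade($s) }
    elsif ($how eq '_utf8_on') { Encode::_utf8_on($s) }
    return utf8::is_utf8($s) ? 1 : 0;
}

printf "%-9s flag=%d\n", $_, flag_after($_) for qw(none decode upgrade _utf8_on);
```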

---
Flags:
    category=core
    severity=high
---
Site configuration information for perl v5.8.2:

Configured by roal at Fri Dec 19 04:41:37 EST 2003.

Summary of my perl5 (revision 5.0 version 8 subversion 2) configuration:
  Platform:
    osname=bsdos, osvers=4.2, archname=i386-bsdos
    uname='bsdos bsdos.anet.at 4.2 bsdi bsdos 4.2 kernel #0: wed oct 25 17:38:20 mdt 2000 polk@hephaestus.bsdi.com:mntproto4.2-i386usrsrcsyscompilegeneric i386 '
    config_args='-es -Duseshrplib -Adefine:libperl=p2x582.so -Dccdlflags=-Wl,-rpath,. -Ud_procselfexe -Uinstallusrbinperl -Dinstallprefix=~/perl -Dprefix=~/perl -Dcf_email=info@anet.at'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fno-strict-aliasing -I/usr/local/include',
    optimize='-O2',
    cppflags='-fno-strict-aliasing -I/usr/local/include'
    ccversion='', gccversion='2.95.2 19991024 (release)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='ld', ldflags =' -L/usr/X11/lib -L/usr/local/lib'
    libpth=~/usr/lib /usr/lib /usr/local/lib /usr/shlib /shlib /lib /usr/X11/lib
    libs=-lutil -lbind -ldl -lm -lc
    perllibs=-lutil -lbind -ldl -lm -lc
    libc=/shlib/libc.so, so=so, useshrplib=true, libperl=p2x582.so
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-rpath,. -Wl,-rpath,/usr/home/roal/perl/lib/5.8.2/i386-bsdos/CORE'
    cccdlflags='-fPIC', lddlflags='-shared -x  -L/usr/X11/lib -L/usr/local/lib'

Locally applied patches:
    ACTIVEPERL_LOCAL_PATCHES_ENTRY
    21846 Configure gets d_u32align wrong
    21739 [perl #24493] install.html not working
    21737 Ooops. left an XXX comment in, and worse still it's a // comment
    21735 utf8 keys now work for tied hashes
    21734 Accessing unicode keys in tie hashes via hv_exists was broken
    21733 ext/threads/t/problem.t
    21732 Config::myconfig() fails under ithreads
    21728 Update perlhist with 5.6.2
    21723 Include 'SCCS' in the list of dir names ignored by installperl
    21718 Empty subroutine as object method segfaults in 5.8.2 (sometimes)
    21714 Fix bug #24380: assigning list with duplicated keys to a hash
    21706 [perl #24460] [DOC PATCH] the begincheck program
    21693 must copy changes from win32/makeifle.mk to wince/makefile.ce
    21691 Update the list of pumpkings in perlhist.pod
    21687 [PATCH 5.6.2-RC1 pod/perlhist.pod]  Updated
    21677 OS/2 docu
    21676 Bug #24407: key for shared hash got stringified into wrong pool
    21673 Be sure to use -fPIC not -fpic on Linux/SPARC
    21672 extending the hash attack test
    21671 Benchmark.pm cmpthese segfault
    21662 'make minitest' fails for op/cproto and op/pat
    21586 Comment that this 'optimisation' is actually a necessary fixup
    21548 Sync with Pod::Perldoc 3.12
    21540 Fix backward-compatibility issues in if.pm

---
@INC for perl v5.8.2:
    /usr/home/roal/perl/lib/5.8.2/i386-bsdos
    /usr/home/roal/perl/lib/5.8.2
    /usr/home/roal/perl/lib/site_perl/5.8.2/i386-bsdos
    /usr/home/roal/perl/lib/site_perl/5.8.2
    /usr/home/roal/perl/lib/site_perl
    .

---
Environment for perl v5.8.2:
    HOME=/usr/home/roal
    LANG (unset)
    LANGUAGE (unset)
    LC_CTYPE=ISO8859-1
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/usr/home/roal/bin:/bin:/usr/bin:/usr/X11/bin:/usr/contrib/bin:/usr/contrib/mh/bin:/usr/games:/usr/local/bin
    PERL_BADLANG (unset)
    SHELL=/bin/bash





