develooper Front page | perl.perl5.porters | Postings from August 2012

utf8 pos cache: always keep the most recent value

Thread Next
From:
Dave Mitchell
Date:
August 21, 2012 03:52
Subject:
utf8 pos cache: always keep the most recent value
Message ID:
20120821105216.GC4366@iabyn.com
I've just pushed the following. Can anyone see any flaw in my logic?

commit 88621eaff54e5f5ea9adea13440d1750968643a6
Author:     David Mitchell <davem@iabyn.com>
AuthorDate: Tue Aug 21 10:55:00 2012 +0100
Commit:     David Mitchell <davem@iabyn.com>
CommitDate: Tue Aug 21 10:55:00 2012 +0100

    utf8 pos cache: always keep most recent value
    
    UTF-8 strings may have magic attached that caches up to two byte position
    to char position (or vice versa) mappings.
    
    When a third position has been calculated (e.g. via sv_pos_b2u()), the
    code has to decide how to update the cache: i.e. which value to discard.
    Currently for each of the three possibilities, it looks at what would be
    the remaining two values, and calculates the RMS sum of the three
    distances between ^ ... cache A .. cache B ... $. Whichever permutation
    gives the lowest result is picked. Note that this means that the most
    recently calculated value may be discarded.
    
    This makes sense if the next position request will be for a random part of
    the string; however in reality, the next request is more likely to be for
    the same position, or one a bit further along. Consider the following
    innocuous code:
    
        $_ = "\x{100}" x 1_000_000;
        $p = pos while /./g;
    
    This goes quadratic, and takes 150s on my system. The fix is is to always
    keep the newest value, and use the RMS calculation only to decide which of
    the two older values to discard. With this fix, the above code takes 0.4s.
    The test suite takes the same time in both cases, so there's no obvious
    slowdown elsewhere with this change.


Affected files ...
    
    M	sv.c
    M	t/op/utf8cache.t

Differences ...

diff --git a/sv.c b/sv.c
index 2004224..904f4bd 100644
--- a/sv.c
+++ b/sv.c
@@ -6928,7 +6928,6 @@ S_utf8_mg_pos_cache_update(pTHX_ SV *const sv, MAGIC **const mgp, const STRLEN b
 	   calculation in bytes simply because we always know the byte
 	   length.  squareroot has the same ordering as the positive value,
 	   so don't bother with the actual square root.  */
-	const float existing = THREEWAY_SQUARE(0, cache[3], cache[1], blen);
 	if (byte > cache[1]) {
 	    /* New position is after the existing pair of pairs.  */
 	    const float keep_earlier
@@ -6937,18 +6936,14 @@ S_utf8_mg_pos_cache_update(pTHX_ SV *const sv, MAGIC **const mgp, const STRLEN b
 		= THREEWAY_SQUARE(0, cache[1], byte, blen);
 
 	    if (keep_later < keep_earlier) {
-		if (keep_later < existing) {
-		    cache[2] = cache[0];
-		    cache[3] = cache[1];
-		    cache[0] = utf8;
-		    cache[1] = byte;
-		}
+                cache[2] = cache[0];
+                cache[3] = cache[1];
+                cache[0] = utf8;
+                cache[1] = byte;
 	    }
 	    else {
-		if (keep_earlier < existing) {
-		    cache[0] = utf8;
-		    cache[1] = byte;
-		}
+                cache[0] = utf8;
+                cache[1] = byte;
 	    }
 	}
 	else if (byte > cache[3]) {
@@ -6959,16 +6954,12 @@ S_utf8_mg_pos_cache_update(pTHX_ SV *const sv, MAGIC **const mgp, const STRLEN b
 		= THREEWAY_SQUARE(0, byte, cache[1], blen);
 
 	    if (keep_later < keep_earlier) {
-		if (keep_later < existing) {
-		    cache[2] = utf8;
-		    cache[3] = byte;
-		}
+                cache[2] = utf8;
+                cache[3] = byte;
 	    }
 	    else {
-		if (keep_earlier < existing) {
-		    cache[0] = utf8;
-		    cache[1] = byte;
-		}
+                cache[0] = utf8;
+                cache[1] = byte;
 	    }
 	}
 	else {
@@ -6979,18 +6970,14 @@ S_utf8_mg_pos_cache_update(pTHX_ SV *const sv, MAGIC **const mgp, const STRLEN b
 		= THREEWAY_SQUARE(0, byte, cache[1], blen);
 
 	    if (keep_later < keep_earlier) {
-		if (keep_later < existing) {
-		    cache[2] = utf8;
-		    cache[3] = byte;
-		}
+                cache[2] = utf8;
+                cache[3] = byte;
 	    }
 	    else {
-		if (keep_earlier < existing) {
-		    cache[0] = cache[2];
-		    cache[1] = cache[3];
-		    cache[2] = utf8;
-		    cache[3] = byte;
-		}
+                cache[0] = cache[2];
+                cache[1] = cache[3];
+                cache[2] = utf8;
+                cache[3] = byte;
 	    }
 	}
     }
diff --git a/t/op/utf8cache.t b/t/op/utf8cache.t
index 7ac0011..83ad4e8 100644
--- a/t/op/utf8cache.t
+++ b/t/op/utf8cache.t
@@ -10,7 +10,7 @@ BEGIN {
 
 use strict;
 
-plan(tests => 1);
+plan(tests => 2);
 
 my $pid = open CHILD, '-|';
 die "kablam: $!\n" unless defined $pid;
@@ -35,3 +35,15 @@ my $utf8magic = qr{ ^ \s+ MAGIC \s = .* \n
                       \s+ MG_LEN \s = .* \n }xm;
 
 unlike($_, qr{ $utf8magic $utf8magic }x);
+
+# With bad caching, this code used to go quadratic and take 10s of minutes.
+# The 'test' in this case is simply that it doesn't hang.
+
+{
+    local ${^UTF8CACHE} = 1; # enable cache, disable debugging
+    my $x = "\x{100}" x 1000000;
+    while ($x =~ /./g) {
+	my $p = pos($x);
+    }
+    pass("quadratic pos");
+}


-- 
"Procrastination grows to fill the available time"
    -- Mitchell's corollary to Parkinson's Law

Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About