develooper Front page | perl.perl5.porters | Postings from June 2002

Re: Another Unicode s/// buglet?

Thread Previous | Thread Next
From:
Hugo van der Sanden
Date:
June 26, 2002 09:42
Subject:
Re: Another Unicode s/// buglet?
Message ID:
200206261643.g5QGh7519204@crypt.compulink.co.uk
SADAHIRO Tomoyuki <bqw10602@nifty.com> wrote:
:With Perl 5.8.0 RC2 (or plus Change 17353),
:there is something strange.
:
:In $unicode =~ s/$regex/$bytes/,
:$bytes is not upgraded,
:and a malformed Unicode string is generated.
:
:$unicode =~ s/$regex/$bytes/e is ok, though.

As far as I can tell, this is missing code rather than buggy code:
coping with a non-utf8 replacement string does not seem to have
been catered for in this class of cases.

Attached patch passes all existing tests here, as well as some new ones.

Due to the current RC status, I've taken the simplest approach I could
see, but there may be higher performance alternatives: the upgrade is
done regardless of whether the replacement string is ever needed, and
since it is not done in place, the upgrade will be repeated each time
it is needed. That means if you expect to perform the same substitution
on many utf8 strings, it would probably be faster if you ensure that
the replacement string is utf8.

Hugo
--- pp_hot.c.old	Tue Jun 25 17:21:07 2002
+++ pp_hot.c	Wed Jun 26 17:32:55 2002
@@ -1983,8 +1983,16 @@
 
     /* known replacement string? */
     if (dstr) {
-        c = SvPV(dstr, clen);
 	doutf8 = DO_UTF8(dstr);
+	if (doutf8 || !PL_reg_match_utf8) {
+	    c = SvPV(dstr, clen);
+	} else {
+	    SV* sv = sv_newmortal();
+	    SvSetMagicSV(sv, dstr);
+	    sv_utf8_upgrade(sv);
+	    c = SvPV(sv, clen);
+	    doutf8 = TRUE;
+	}
     }
     else {
         c = Nullch;
--- t/op/subst.t.old	Tue Jun 25 17:21:07 2002
+++ t/op/subst.t	Wed Jun 26 17:34:25 2002
@@ -7,7 +7,7 @@
 }
 
 require './test.pl';
-plan( tests => 92 );
+plan( tests => 106 );
 
 $x = 'foo';
 $_ = "x";
@@ -401,3 +401,43 @@
     like($a, qr/ñ/, "use utf8 LHS and RHS");
 }
 
+{
+    # subst with mixed utf8/non-utf8 type
+    my($ua, $ub, $uc, $ud) = ("\x{101}", "\x{102}", "\x{103}", "\x{104}");
+    my($na, $nb) = ("\x{ff}", "\x{fe}");
+    my $a = "$ua--$ub";
+    my $b;
+    ($b = $a) =~ s/--/$na/;
+    is($b, "$ua$na$ub", "s///: replace non-utf8 into utf8");
+    ($b = $a) =~ s/--/--$na--/;
+    is($b, "$ua--$na--$ub", "s///: replace long non-utf8 into utf8");
+    ($b = $a) =~ s/--/$uc/;
+    is($b, "$ua$uc$ub", "s///: replace utf8 into utf8");
+    ($b = $a) =~ s/--/--$uc--/;
+    is($b, "$ua--$uc--$ub", "s///: replace long utf8 into utf8");
+    $a = "$na--$nb";
+    ($b = $a) =~ s/--/$ua/;
+    is($b, "$na$ua$nb", "s///: replace utf8 into non-utf8");
+    ($b = $a) =~ s/--/--$ua--/;
+    is($b, "$na--$ua--$nb", "s///: replace long utf8 into non-utf8");
+
+    # now with utf8 pattern
+    $a = "$ua--$ub";
+    ($b = $a) =~ s/-($ud)?-/$na/;
+    is($b, "$ua$na$ub", "s///: replace non-utf8 into utf8 (utf8 pattern)");
+    ($b = $a) =~ s/-($ud)?-/--$na--/;
+    is($b, "$ua--$na--$ub", "s///: replace long non-utf8 into utf8 (utf8 pattern)");
+    ($b = $a) =~ s/-($ud)?-/$uc/;
+    is($b, "$ua$uc$ub", "s///: replace utf8 into utf8 (utf8 pattern)");
+    ($b = $a) =~ s/-($ud)?-/--$uc--/;
+    is($b, "$ua--$uc--$ub", "s///: replace long utf8 into utf8 (utf8 pattern)");
+    $a = "$na--$nb";
+    ($b = $a) =~ s/-($ud)?-/$ua/;
+    is($b, "$na$ua$nb", "s///: replace utf8 into non-utf8 (utf8 pattern)");
+    ($b = $a) =~ s/-($ud)?-/--$ua--/;
+    is($b, "$na--$ua--$nb", "s///: replace long utf8 into non-utf8 (utf8 pattern)");
+    ($b = $a) =~ s/-($ud)?-/$na/;
+    is($b, "$na$na$nb", "s///: replace non-utf8 into non-utf8 (utf8 pattern)");
+    ($b = $a) =~ s/-($ud)?-/--$na--/;
+    is($b, "$na--$na--$nb", "s///: replace long non-utf8 into non-utf8 (utf8 pattern)");
+}

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About