perl.perl5.porters | Postings from April 2010

[perl #41530] RFC: internal string upgrade latin-1 => utf8 after s/// results in illegal utf8

From: Karl Williamson via RT
Date: April 12, 2010 00:11
Subject: [perl #41530] RFC: internal string upgrade latin-1 => utf8 after s/// results in illegal utf8
Message ID: rt-3.6.HEAD-6227-1271045227-538.41530-15-0@perl.org

I'm preparing a patch for this bug, and I'm uncertain about the best way
to do it.

First, the bug is caused by the code not realizing that when you have
two strings, each of which may independently be in utf8 or not, there
are 4 cases to take care of.  I mention this because the same error,
handling only 3 of the 4 cases, occurs in other places in the code as
well.
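
For illustration, the four combinations can be enumerated with a small
sketch.  The helper name utf8_case and the sample strings are made up
for this example; utf8::is_utf8 and utf8::upgrade are the core
functions for inspecting and changing the internal representation.

```perl
use strict;
use warnings;

# Hypothetical helper: report the internal representation of a
# (target, replacement) pair as "target/replacement".
sub utf8_case {
    my ($target, $replacement) = @_;
    return join '/',
        (utf8::is_utf8($target)      ? 'utf8' : 'bytes'),
        (utf8::is_utf8($replacement) ? 'utf8' : 'bytes');
}

my $native = "caf\xE9";       # stored as bytes internally
my $upgraded = "caf\xE9";
utf8::upgrade($upgraded);     # same characters, stored as UTF-8 internally

# Every pairing of the two representations: 2 x 2 = 4 cases.
my @cases;
for my $target ($native, $upgraded) {
    for my $replacement ($native, $upgraded) {
        push @cases, utf8_case($target, $replacement);
    }
}
print "$_\n" for @cases;
# prints bytes/bytes, bytes/utf8, utf8/bytes, utf8/utf8 (one per line)
```

The second line of output, bytes/utf8, is the combination this bug
report is about.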

The code does not consider the possibility that the replacement string
could be in utf8 when the source/target string isn't.  Thus 

$latin1 =~ s/latin1/utf8/;

fails.  The solution is to upgrade the variable to utf8.  My dilemma is
whether to always do the upgrade when the replacement string is in utf8,
or to do it only if the match succeeds.  The choice can lead to
different results later: without the upgrade, the scalar's characters
in the 128-255 range will have different semantics than they would
after the upgrade.
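
As a concrete instance of the failing combination, here is a minimal
sketch with a byte-string target and a replacement that can only be
stored as utf8.  The variable names and sample text are invented for
the example, and the explicit utf8::upgrade call stands in, by hand,
for the upgrade that the patch would have the substitution do itself.

```perl
use strict;
use warnings;

my $target      = "na\xEFve caf\xE9";   # chars 0xEF and 0xE9, stored as bytes
my $replacement = "\x{263A}";           # code point > 255, so stored as UTF-8

# Upgrade the target by hand before substituting.  This changes only
# the internal representation, not the characters in the string.
utf8::upgrade($target);

$target =~ s/caf\xE9/$replacement/;

# Print code points rather than the string itself, to avoid
# "Wide character" warnings on an unconfigured STDOUT.
print join(' ', map { sprintf '%04X', ord } split //, $target), "\n";
# prints: 006E 0061 00EF 0076 0065 0020 263A
```

Upgrading first means both operands are in utf8 by the time the
substitution runs, so the mixed case never arises.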

I'm leaning towards always doing the upgrade, as I think we can infer
from the replacement string being in utf8 that the programmer intended
that the target string have Unicode semantics, even if it isn't in
utf8.  Therefore, it's better to do the upgrade to force those
semantics.
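
The semantic difference for characters in the 128-255 range can be
observed directly.  The sketch below assumes default rules, that is,
no "use locale" and no "use feature 'unicode_strings'" in scope; the
variable names are invented for the example.

```perl
use strict;
use warnings;

my $bytes    = "\xE9";        # e-acute, stored as a single byte
my $upgraded = $bytes;
utf8::upgrade($upgraded);     # same character, stored as UTF-8

# The two scalars compare equal as strings...
my $same = ($bytes eq $upgraded) ? 1 : 0;

# ...but \w treats them differently under default rules.
my $word_bytes    = ($bytes    =~ /\w/) ? 1 : 0;
my $word_upgraded = ($upgraded =~ /\w/) ? 1 : 0;

print "same=$same bytes=$word_bytes upgraded=$word_upgraded\n";
# prints: same=1 bytes=0 upgraded=1
```

So whether the target gets upgraded when the match fails is visible to
later code, even though the string's characters are unchanged.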

Is there a contrary opinion?

--Karl Williamson
