Re: “strict” strings?

From: Dave Mitchell
Date: January 6, 2020 12:53
Subject: Re: “strict” strings?
Message ID: 20200106115747.GK9181@iabyn.com

On Sat, Jan 04, 2020 at 11:22:38PM -0500, Felipe Gasper wrote:
> `perldoc perlunitut` makes clear that a Perl program should not confuse
> text and byte strings.

There is no such thing as "text strings" and "byte strings" in perl, or
even "character strings". In perl, a string is just a sequence of
codepoints.

As a (mostly internal) implementation detail, perl currently chooses to
use the UTF-8 transformation format to store strings containing codepoints
greater than 255, and sets the SVf_UTF8 flag to indicate this. Strings not
containing chars > 255 may or may not have the flag set. For example, perl
regards these strings as identical, but only the first has SVf_UTF8 set:

    $s = "ab\x{100}"; chop($s);
    $t = "ab";
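
As a minimal sketch of that claim, the script below checks the flag with
utf8::is_utf8() (provided by the perl core, no "use utf8" needed) and
compares the two strings. The variable $u and the output format are
illustrative additions, not part of the original example.

```perl
use strict;
use warnings;

my $s = "ab\x{100}"; chop($s);   # "ab", still stored as UTF-8 internally
my $t = "ab";                    # "ab", stored one byte per codepoint

# Report the SVf_UTF8 flag of each string, then compare their values.
my $s_flag = utf8::is_utf8($s) ? 1 : 0;
my $t_flag = utf8::is_utf8($t) ? 1 : 0;
print "s: flag=$s_flag  t: flag=$t_flag\n";
print "s eq t: ", ($s eq $t ? "yes" : "no"), "\n";

# Switching the storage format does not change the string's value.
my $u = $t;
utf8::upgrade($u);               # same codepoints, now flagged
my $u_flag = utf8::is_utf8($u) ? 1 : 0;
print "u: flag=$u_flag  u eq t: ", ($u eq $t ? "yes" : "no"), "\n";
```

The flags differ while every comparison reports equality, which is the
sense in which the flag only describes storage.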

In principle a future perl could choose to use a completely different
internal representation to store strings, e.g. as an array of 32-bit
unsigned ints.

About the only valid use for inspecting the SVf_UTF8 flag is to determine
which storage format the string is using to hold that sequence of
codepoints.  Any other use is likely a bug. In fact, giving the flag
extra meaning is what caused what is known here as the Unicode Bug, and
we've spent the last 20 years gradually trying to eradicate it.
Specifically, the way perl assigned semantic meaning to codepoints
128..255 varied depending on whether SVf_UTF8 was set, which is wrong.
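
A small sketch of how the Unicode Bug can show up, assuming an ASCII
platform and no "use locale" in effect. It matches the same codepoint
(U+00E9) against \w in both storage formats, first under the legacy rules
and then under the unicode_strings feature. The variable names are
illustrative only.

```perl
use strict;
use warnings;

my $plain   = "\xe9";            # U+00E9, SVf_UTF8 off
my $flagged = "\xe9";
utf8::upgrade($flagged);         # same codepoint, SVf_UTF8 on

my (%legacy, %fixed);
{
    no feature 'unicode_strings';    # legacy rules: result depends on the flag
    $legacy{plain}   = $plain   =~ /\w/ ? 1 : 0;
    $legacy{flagged} = $flagged =~ /\w/ ? 1 : 0;
}
{
    use feature 'unicode_strings';   # Unicode rules regardless of the flag
    $fixed{plain}   = $plain   =~ /\w/ ? 1 : 0;
    $fixed{flagged} = $flagged =~ /\w/ ? 1 : 0;
}

print "strings equal: ", ($plain eq $flagged ? "yes" : "no"), "\n";
print "legacy rules:    plain=$legacy{plain} flagged=$legacy{flagged}\n";
print "unicode_strings: plain=$fixed{plain} flagged=$fixed{flagged}\n";
```

Under the legacy rules the two equal strings give different answers; with
unicode_strings enabled they agree.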

> - concatenating variable text and byte strings together
> ex.: perl -e'my $a = "\x{100}"; my $b = "\xff"; my $c = $a . $b'

There is absolutely nothing wrong with doing that, and I can't see any
valid reason for making that an error.

Which of the following $a.$b concatenations do you envisage being errors
under 'use strictstrings':

    $a = "\x{100}"; my $b = "\xff";
    $a = "\x{100}"; my $b = "\x41";
    $a = "\x{100}"; my $b = "A";
    $a = "\x{100}"; my $b = "A\xff"; chop($b);
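
For reference, a short sketch that runs all four concatenations and prints
the resulting codepoints. The names $lhs, $rhs, @cases and @results are
illustrative, chosen to avoid shadowing sort's $a and $b.

```perl
use strict;
use warnings;

my $lhs     = "\x{100}";
my $chopped = "A\xff"; chop($chopped);   # "A", built from a string that held \xff

# Label and right-hand operand for each of the four cases above.
my @cases = (
    [ '"\xff"',          "\xff"   ],
    [ '"\x41"',          "\x41"   ],
    [ '"A"',             "A"      ],
    [ '"A\xff" chopped', $chopped ],
);

my @results;
for my $case (@cases) {
    my ($label, $rhs) = @$case;
    my $joined = $lhs . $rhs;
    # Render the result as a list of codepoints.
    my $cps = join " ", map { sprintf("U+%04X", ord($_)) } split //, $joined;
    push @results, $cps;
    printf "%-18s -> %s\n", $label, $cps;
}
```

Each result is simply the two codepoints in order, and the last three
cases produce the same string.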

-- 
Justice is when you get what you deserve.
Law is when you get what you pay for.
