Front page | perl.perl5.porters |
Postings from January 2020
Re: “strict” strings?
From:
Felipe Gasper
Date:
January 6, 2020 16:06
Subject:
Re: “strict” strings?
Message ID:
91FF4C28-CFA9-441E-A04F-71D5F7F4EB75@felipegasper.com
>
> On Jan 6, 2020, at 6:57 AM, Dave Mitchell <davem@iabyn.com> wrote:
>
> In principle a future perl could choose to use a completely different
> internal representation to store strings, e.g. as an array of 32-bit
> unsigned ints.
“strictstrings” would not conflict with that. There’s still no presumption of how Perl internally encodes strings; in fact, that abstraction would be *strengthened* by a workflow that mandates explicit encode/decode for all encodings.
>
> About the only valid use for inspecting the SVf_UTF8 flag is to determine
> what storage format the string is using to store that array of small
> integer values. Any other use is likely a bug. In fact, this extra use of
> the flag caused what is known here as the Unicode Bug, and we've spent the
> last 20 years trying gradually to eradicate it. Specifically, the way perl
> assigned semantic meaning to codepoints 128..255 varied depending on
> whether SVf_UTF8 was set, which is wrong.
Is Sereal::Encoder wrong, then? It serializes Perl strings to a format that encodes binary and text as separate types, and the current implementation uses SVf_UTF8 to make that distinction.
And won’t code that follows `perlunitut`’s prescribed workflow already fit that perfectly well?
    undecode()d == binary    == !SVf_UTF8
    decode()d   == character == SVf_UTF8
It seems like the language’s public documentation already tells people to write code in a way where SVf_UTF8 can indeed be a reliable determinant between character and octet strings.
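As a rough illustration of that correspondence, here is a minimal sketch that inspects the flag with the core utf8::is_utf8() function. The variable names are invented for the example, and the result only holds for strings handled exactly as shown; other operations can upgrade or downgrade a string's storage without changing its value.

```perl
use strict;
use warnings;
use Encode ();

# Two raw bytes that were never decoded: an octet string.
my $octets = "\xc3\xa9";

# The same data after decoding: one character, U+00E9.
my $chars = Encode::decode('UTF-8', $octets);

printf "octets: length %d, flag %s\n",
    length($octets), utf8::is_utf8($octets) ? 'on' : 'off';
printf "chars:  length %d, flag %s\n",
    length($chars), utf8::is_utf8($chars) ? 'on' : 'off';
```

With non-ASCII input like this, the undecoded string reports the flag off and the decoded one reports it on, which is the distinction a serializer would be relying on.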
>
>> - concatenating variable text and byte strings together
>> ex.: perl -e'my $a = "\x{100}"; my $b = "\xff"; my $c = $a . $b'
>
> There is absolutely nothing wrong with doing that, and I can't see any
> valid reason for making that an error.
>
> Which of the following $a.$b concatenations do you envisage being errors
> under 'use strictstrings':
>
> $a = "\x{100}"; my $b = "\xff";
> $a = "\x{100}"; my $b = "\x41";
> $a = "\x{100}"; my $b = "A";
> $a = "\x{100}"; my $b = "A\xff"; chop($b);
They would all be errors.
Proper usage would require the binary->text conversion that `perlunitut` describes, e.g.:
    $a = "\x{100}";       # text, U+0100
    $b = "\xff";          # binary, 0xff
    utf8::upgrade($b);    # now it’s text, U+00FF
    $c = $a . $b;
… or:
    $b = "\xff";                           # binary, 0xff
    $b = Encode::decode('Latin-2', $b);    # text, U+02D9
    $c = $a . $b;
… or, if we want to go the other way:
    $a = "\x{100}";                        # text
    $a = Encode::encode('UTF-32', $a);     # binary, 0x00 0x00 0xfe 0xff 0x00 0x00 0x01 0x00
    $b = "\xff";                           # binary
    $c = $a . $b;
To me that seems much cleaner than treating byte 0xff as interchangeable with U+00FF, and much more in accord with what `perlunitut` says should happen.
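For comparison, this small sketch shows what the unguarded concatenation does in Perl as it stands today, with no hypothetical "strictstrings" pragma involved. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $text   = "\x{100}";    # character string, U+0100
my $bytes  = "\xff";       # byte 0xff, never decoded
my $joined = $text . $bytes;

# Print each code point of the result.
printf "U+%04X ", ord($_) for split //, $joined;
print "\n";
```

The byte 0xff comes out as code point U+00FF in the joined string, silently and without a warning; that implicit equivalence is what the explicit decode or upgrade step is meant to replace.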
-F