
Re: “strict” strings?

From: Felipe Gasper
Date: January 6, 2020 16:06
Subject: Re: “strict” strings?
Message ID: 91FF4C28-CFA9-441E-A04F-71D5F7F4EB75@felipegasper.com

> On Jan 6, 2020, at 6:57 AM, Dave Mitchell <davem@iabyn.com> wrote:
>
> In principle a future perl could choose to use a completely different
> internal representation to store strings, e.g. as an array of 32-bit
> unsigned ints.

“strictstrings” would not conflict with that. There’s still no presumption of how Perl internally encodes strings; in fact, that abstraction would be *strengthened* by a workflow that mandates explicit encode/decode for all encodings.

> 
> About the only valid use for inspecting the SVf_UTF8 flag is to determine
> what storage format the string is using to store that array of small
> integer values.  Any other use is likely a bug. In fact, this extra use of
> the flag caused what is known here as the Unicode Bug, and we've spent the
> last 20 years trying gradually to eradicate it. Specifically, the way perl
> assigned semantic meaning to codepoints 128..255 varied depending on
> whether SVf_UTF8 was set, which is wrong.

Is Sereal::Encoder wrong, then? It serializes Perl strings to a format that encodes binary and text as separate types, and the current implementation uses SVf_UTF8 to make that distinction.

And won’t code that follows `perlunitut`’s prescribed workflow already fit that perfectly well?

undecode()d == binary == !SVf_UTF8
decode()d == character == SVf_UTF8

It seems like the language’s public documentation already tells people to write code in a way where SVf_UTF8 can indeed be a reliable determinant between character and octet strings.
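
A minimal sketch of that correspondence, assuming the octets arrive as UTF-8 and using `utf8::is_utf8` as the Perl-level view of SVf_UTF8. The code point U+0100 is chosen because it cannot be stored without the flag; the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode ();

# Octets as they might arrive from a file or socket:
# the UTF-8 encoding of U+0100.
my $octets = "\xc4\x80";

# Decoding yields a character string. U+0100 does not fit in one byte,
# so Perl has to store it in its UTF-8 internal format (flag set).
my $chars = Encode::decode('UTF-8', $octets);

my $octets_flag = utf8::is_utf8($octets) ? 1 : 0;
my $chars_flag  = utf8::is_utf8($chars)  ? 1 : 0;

printf "octets: length=%d flag=%d\n", length($octets), $octets_flag;
printf "chars:  length=%d flag=%d\n", length($chars),  $chars_flag;
```

The undecoded string reports two elements and no flag; the decoded one reports a single character with the flag set.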

> 
>> - concatenating variable text and byte strings together
>> ex.: perl -e'my $a = "\x{100}"; my $b = "\xff"; my $c = $a . $b'
> 
> There is absolutely nothing wrong with doing that, and I can't see any
> valid reason for making that an error.
> 
> Which of the following $a.$b concatenations do you envisage being errors
> under 'use strictstrings':
> 
>   $a = "\x{100}"; my $b = "\xff";
>   $a = "\x{100}"; my $b = "\x41";
>   $a = "\x{100}"; my $b = "A";
>   $a = "\x{100}"; my $b = "A\xff"; chop($b);

They would all be errors.

Proper usage would require the binary->text conversion that `perlunitut` describes, e.g.:

$a = "\x{100}";    # text, U+0100
$b = "\xff";       # binary, 0xff
utf8::upgrade($b); # now it’s text, U+00FF
$c = $a . $b;

… or:

$b = "\xff";                         # binary, 0xff
$b = Encode::decode('Latin-2', $b);  # text, U+02D9
$c = $a . $b;

… or, if we want to go the other way:

$a = "\x{100}";                      # text
$a = Encode::encode('UTF-32', $a);   # binary, 0x00 0x00 0xfe 0xff 0x00 0x00 0x01 0x00
$b = "\xff";                         # binary
$c = $a . $b;

To me that seems much cleaner than treating byte 0xff as interchangeable with U+00FF, and much more in accord with what `perlunitut` says should happen.
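
For illustration only, the rule could be approximated in today’s Perl with a helper like the one below. `strict_concat` is a hypothetical name, not an existing function or pragma, and it assumes the UTF8 flag is used as the text/binary marker in the way described above.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Perl: refuse to concatenate when
# exactly one operand carries the UTF8 flag (text mixed with binary).
sub strict_concat {
    my ($left, $right) = @_;
    die "refusing to mix text and binary strings\n"
        if utf8::is_utf8($left) xor utf8::is_utf8($right);
    return $left . $right;
}

my $text  = "\x{100}";   # text, U+0100
my $bytes = "\xff";      # binary, 0xff

# Mixing the two directly is rejected.
my $mixed_ok = eval { strict_concat($text, $bytes); 1 } ? 1 : 0;

# After an explicit upgrade the same code point counts as text.
my $as_text = $bytes;
utf8::upgrade($as_text);
my $joined = strict_concat($text, $as_text);

printf "mixed allowed: %d, joined length: %d\n", $mixed_ok, length($joined);
```

A real pragma would have to live in the core rather than in a wrapper function; this only shows which operations the rule would accept and reject.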

-F