
Re: “strict” strings?

From: aw
Date: January 5, 2020 09:18
Subject: Re: “strict” strings?
Message ID: daa68481-2916-b593-0736-4c4e4b7a6cd1@ice-sa.com

On 05.01.2020 05:22, Felipe Gasper wrote:
> Hello,
> 
> 	Consider the following:
> 
> perl -MCpanel::JSON::XS -MCBOR::XS -e'print encode_cbor( Cpanel::JSON::XS->new()->encode(["\xc3\xa9"]))' | node -e 'var input = require("fs").readFileSync(0); var cbor = require("cbor"); console.log( JSON.parse(cbor.decodeAllSync(input)) )'
> [ 'Ã©' ]
> 
> 
> Note the mangling of our original string, "é".
> 
> This is a confluence of two coercions:
> 
> 1) Cpanel::JSON::XS, on receiving a byte string, accepts it and parses it as Latin-1. encode()’s output is the corresponding character string.
> 
> 2) Perl, when it sends a character string to a plain filehandle, encodes the text as Latin-1. (If the string contains any character above U+00FF, a “Wide character” warning is thrown, and the whole string is written as UTF-8 instead.)
> 
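
For anyone who wants to see the effect without installing the JSON and CBOR modules, here is a minimal sketch using only the core Encode module. It is an approximation, not what those modules do internally: decode('ISO-8859-1', ...) stands in for the encoder reading the bytes as Latin-1, and encode('UTF-8', ...) stands in for the later serialisation step. The variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The two bytes that encode "é" (U+00E9) in UTF-8.
my $bytes = "\xc3\xa9";

# Coercion: reading the bytes as Latin-1 text yields two characters,
# U+00C3 and U+00A9, instead of the single character U+00E9.
my $misread = decode('ISO-8859-1', $bytes);

# Serialising that text as UTF-8 then produces four bytes ("Ã©").
my $mangled = encode('UTF-8', $misread);

# Explicit handling: decode the input as UTF-8 first, encode on output.
my $text      = decode('UTF-8', $bytes);
my $roundtrip = encode('UTF-8', $text);

printf "coerced:  %d chars in, %d bytes out\n", length $misread, length $mangled;
printf "explicit: %d chars in, %d bytes out\n", length $text,    length $roundtrip;
```

With the explicit decode, the bytes that come out are the bytes that went in; with the coercion, two bytes become four.
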
> 
> Has it ever been considered to make such cases trigger exceptions rather than coercions? i.e., a JSON encoder would reject byte strings, and Perl would reject character strings when printing to filehandles that lack an encoding. This would force proper handling of encodings, which would, in turn, avoid “surprises” like the JSON/CBOR code above.
> 
> 
> Cpanel::JSON::XS is its own thing, but as to Perl itself, maybe a “strictstrings” pragma could enable behavior whereby (at least) the following would trigger errors:
> 
> - writing a text string to a filehandle that lacks an associated encoding
> ex.: perl -e'binmode *STDOUT, ":bytes"; my $str = "\xc2\xa9"; utf8::decode($str); print $str'
> 
> - writing a character to a filehandle whose encoding can’t accommodate the character (this is currently a warning)
> ex.: perl -e'binmode *STDOUT, ":encoding(Latin-1)"; my $str = "\x{100}"; print $str'
> 
> - concatenating variable text and byte strings together
> ex.: perl -e'my $a = "\x{100}"; my $b = "\xff"; my $c = $a . $b'
> 
> - most (?) pack() and unpack() operations on a text string
> ex.: perl -E'say for unpack "C*", "\x{100}"' (prints 256 … odd value for an “unsigned char”)
> 
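
As a rough illustration of how far one can get today, the sketch below promotes the existing 'utf8' warning category to fatal. This is only an approximation of the first two items, under the assumption that an in-memory filehandle behaves like any other handle without an encoding layer; it is not the proposed pragma, and the variable names are made up. It catches characters above U+00FF, but a decoded string that fits in Latin-1 is still written silently, which is exactly the gap a "strictstrings" mode would close.

```perl
use strict;
use warnings;
use warnings FATAL => 'utf8';   # promote "Wide character" warnings to exceptions

# An in-memory filehandle with no encoding layer.
open my $fh, '>', \my $buffer or die "open: $!";

# U+0100 does not fit in a byte. print() would normally warn and write
# the UTF-8 form; with the warning made fatal it dies instead.
my $wide_ok = eval { print {$fh} "\x{100}"; 1 };
my $error   = $wide_ok ? '' : $@;

# A decoded (character) string within the Latin-1 range is still
# coerced to a single byte without any complaint.
my $str = "\xc3\xa9";
utf8::decode($str);             # now the one character U+00E9
my $silent = eval { print {$fh} $str; 1 };

close $fh;

print "wide character: ", ($error ? "died: $error" : "no error\n");
print "Latin-1 range:  ", ($silent ? "printed silently\n" : "died\n");
```
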
> -----
> 
> `perldoc perlunitut` makes clear that a Perl program should not confuse text and byte strings. It just seems to me that Perl’s accommodation of workflows that violate that ideal creates a lot of confusion. Even if that can’t be changed as a default, hopefully adding a mode of operation that more strongly enforces correct handling of character encodings would help everyone.
> 
> Thank you for your consideration!
> 
> cheers,
> -Felipe Gasper
> 

+1 (or rather, +100)
