
“strict” strings?

Felipe Gasper
January 5, 2020 04:22

	Consider the following:

perl -MCpanel::JSON::XS -MCBOR::XS -e'print encode_cbor( Cpanel::JSON::XS->new()->encode(["\xc2\xa9"]))' | node -e 'var input = require("fs").readFileSync(0); var cbor = require("cbor"); console.log( JSON.parse(cbor.decodeAllSync(input)) )'
[ 'é' ]

Note the mangling of our original string, "é".

This is a confluence of two coercions:

1) Cpanel::JSON::XS, on receiving a byte string, accepts it and interprets each byte as a Latin-1 character. encode()’s output is the corresponding character string.

2) Perl, when it prints a character string to a filehandle that has no encoding layer, encodes the text as Latin-1. (If the string contains non-Latin-1 characters, Perl emits a “Wide character” warning and writes the string as UTF-8 instead.)
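
To make the second coercion concrete, here is a minimal sketch that uses only core Perl. The helper names bytes_written and hex_dump are hypothetical, chosen for illustration, and the in-memory filehandle stands in for a real file or pipe. It prints the same one-character string through a handle with no encoding layer and through one with an explicit UTF-8 layer:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: print $string to an in-memory filehandle opened
# with the given layer and return the raw bytes that were written.
sub bytes_written {
    my ($layer, $string) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer or die "open failed: $!";
    print {$fh} $string;
    close $fh or die "close failed: $!";
    return $buffer;
}

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# Decode the two UTF-8 bytes into a single character, U+00A9.
my $text = decode('UTF-8', "\xc2\xa9");

# No encoding layer: Perl quietly writes the character as one Latin-1 byte.
my $implicit = bytes_written('', $text);

# Explicit layer: the character is written as its two UTF-8 bytes.
my $explicit = bytes_written(':encoding(UTF-8)', $text);

print 'no layer:    ', hex_dump($implicit), "\n";
print 'UTF-8 layer: ', hex_dump($explicit), "\n";
```

Neither print raises a warning, which is why the difference is easy to miss.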

Has anyone ever considered making such cases trigger exceptions rather than coercions? That is, a JSON encoder would reject byte strings, and Perl would reject character strings when printing to filehandles that lack an encoding. This would force proper handling of encodings, which would, in turn, avoid “surprises” like the JSON/CBOR code above.

Cpanel::JSON::XS is its own thing, but as to Perl itself, maybe a “strictstrings” pragma could enable behavior whereby (at least) the following would trigger errors:

- writing a text string to a filehandle that lacks an associated encoding
ex.: perl -e'binmode *STDOUT, ":bytes"; my $str = "\xc2\xa9"; utf8::decode($str); print $str'

- writing a character to a filehandle whose encoding can’t accommodate the character (this is currently a warning)
ex.: perl -e'binmode *STDOUT, ":encoding(Latin-1)"; my $str = "\x{100}"; print $str'

- concatenating variable text and byte strings together
ex.: perl -e'my $a = "\x{100}"; my $b = "\xff"; my $c = $a . $b'

- most (?) pack() and unpack() operations on a text string
ex.: perl -E'say for unpack "C*", "\x{100}"' (prints 256 … odd value for an “unsigned char”)
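
For comparison, some of this strictness can already be approximated in today's Perl. The sketch below is not the proposed pragma. It only combines two existing facilities: `use warnings FATAL => 'utf8'`, which promotes the “Wide character” warning to an exception, and Encode's FB_CROAK check flag, which makes an unrepresentable character fatal. The helper name encode_or_die is hypothetical. The sketch does nothing for the concatenation or pack()/unpack() cases.

```perl
use strict;
use warnings;
use warnings FATAL => 'utf8';   # "Wide character in print" now dies
use Encode ();

# Hypothetical helper: encode $text, dying if any character cannot be
# represented in $encoding. LEAVE_SRC keeps Encode from modifying $text.
sub encode_or_die {
    my ($encoding, $text) = @_;
    return Encode::encode($encoding, $text,
        Encode::FB_CROAK | Encode::LEAVE_SRC);
}

# U+00E9 fits in Latin-1, so this succeeds and yields the byte 0xE9.
my $latin1_bytes = eval { encode_or_die('ISO-8859-1', "caf\x{e9}") };

# U+0100 does not fit in Latin-1, so Encode throws.
my $encode_died = !eval { encode_or_die('ISO-8859-1', "\x{100}"); 1 };

# Printing a wide character to a handle without an encoding layer
# now throws as well, instead of warning and writing UTF-8.
my $buffer = '';
open my $fh, '>', \$buffer or die "open failed: $!";
my $print_died = !eval { print {$fh} "\x{100}"; 1 };
close $fh;

print 'encode of U+0100 to Latin-1 died: ', ($encode_died ? 'yes' : 'no'), "\n";
print 'print of U+0100 without a layer died: ', ($print_died ? 'yes' : 'no'), "\n";
```

The first case in the list above, a Latin-1-range text string printed to a handle with no encoding, still passes silently here, because Perl emits no warning for it.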


`perldoc perlunitut` makes clear that a Perl program should not confuse text and byte strings. It just seems to me that Perl’s accommodation of workflows that violate that ideal creates a lot of confusion. Even if that can’t be changed as a default, hopefully adding a mode of operation that more strongly enforces correct handling of character encodings would help everyone.
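
As a rough illustration of the discipline perlunitut describes (decode bytes on input, work with text, encode on output), here is a sketch that uses the core JSON::PP module rather than Cpanel::JSON::XS, on the assumption that its utf8 option is suitable for the purpose. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode ();
use JSON::PP ();

# Input arrives as bytes: the UTF-8 encoding of U+00A9.
my $input_bytes = "\xc2\xa9";

# Decode on input, dying on malformed UTF-8. LEAVE_SRC keeps
# $input_bytes intact.
my $text = Encode::decode('UTF-8', $input_bytes,
    Encode::FB_CROAK | Encode::LEAVE_SRC);

# Encode on output: with the utf8 option, encode() returns UTF-8 bytes
# that are safe to print to a handle with no encoding layer.
my $json = JSON::PP->new->utf8;
my $json_bytes = $json->encode([$text]);

# Decoding those bytes again gives back the same one-character string.
my $round_trip = $json->decode($json_bytes)->[0];

print 'characters in text:  ', length($text), "\n";
print 'bytes in JSON output: ', length($json_bytes), "\n";
print 'round trip intact:    ', ($round_trip eq $text ? 'yes' : 'no'), "\n";
```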

Thank you for your consideration!

-Felipe Gasper