perl.perl5.porters
Postings from January 2020
Re: “strict” strings?
From: Felipe Gasper
January 6, 2020 03:43
Re: “strict” strings?
Message ID: 126DB2B6-85BB-4594-A946-EB4DE17C3B75@felipegasper.com
> On Jan 5, 2020, at 9:53 PM, Tony Cook <email@example.com> wrote:
> On Sun, Jan 05, 2020 at 09:11:37PM -0500, Felipe Gasper wrote:
>> Just to clarify: such a “character string” can only contain code points 0-255, right? Whereas a character string *with* the UTF8 flag may contain any code point?
> Certainly, and an attempt to add a code point over 0xff to a SV
> without the flag will upgrade the SV, enabling the flag.
FWIW, “strictstrings” would require the caller to decode() before adding that >0xff code point. So “no surprises”.
>> The paragraph labeled “String Types” clarifies that the specification indeed envisions “binary” versus “unicode” strings. It describes binary strings with the word “encodingless”.
> That's in conflict with the table of Tags.
Agreed. I filed an issue.
> I can see some value in a binary type, but I don't believe it would
> ever be implemented with the current SVfUTF8 flag, since that isn't
> what that flag is for.
I could well just be missing something, but I still don’t see anything that invalidates the idea of using SVfUTF8 for this assuming “strictstrings”. Yes, some existing code won’t work with it, but all code *could* be made to work with it, and most who learned from `perlunitut` probably already write code that works with it.
>> Follow-up question: does any binary/text-aware encoder (CBOR, Sereal, etc.) ever encode a non-UTF8-flag SV as text rather than binary?
> I don't know, and it's irrelevant.
> That isn't the purpose of the SVfUTF8 flag.
> Code that uses the SVfUTF8 flag to decide whether a string should be
> interpreted as octets or characters is broken.
It may be irrelevant to the language, but to an application it’s a game-changer. You’re basically saying that Sereal::Encoder, CBOR::XS, CBOR::Free, and CBOR::PP are all doing it wrong, and that to implement a correct distinction between binary and text strings those encoders need the application to provide a schema, similar to how Perl JSON encoders require the likes of JSON::Schema::Fit in order to achieve reliable strings vs. numbers. So to achieve reliable interoperation with, e.g., Python, not only do we need a “pre-schema” for strings/numbers/booleans, but also text/binary?
That kind of awkwardness would seem to make it harder to advocate for use of Perl in new code and hasten people’s desires to rewrite stuff in languages that behave more … straightforwardly? … in this regard.
Sereal::Encoder seems pretty popular. If both it and CBOR::XS are “wrong”, and if explicit encode/decode makes it all work anyway (i.e., no UTF8 == undecoded/bytes; UTF8 == decoded/text), isn’t it worthwhile to reconsider the problem?