Re: “strict” strings?

From: Felipe Gasper
Date: January 6, 2020 02:11
Subject: Re: “strict” strings?
Message ID: 8F41F89A-4E6C-4E27-9E67-5C454CFFCDA9@felipegasper.com

> On Jan 5, 2020, at 5:55 PM, Tony Cook <tony@develop-help.com> wrote:
> 
> On Sun, Jan 05, 2020 at 01:05:55PM -0500, Felipe Gasper wrote:
>> 
>>> On Jan 5, 2020, at 10:07 AM, Zefram <zefram@fysh.org> wrote:
>>> 
>>> Felipe Gasper wrote:
>>>> Is there any supported text-decode operation, as per the
>>>> input-decode-work-encode-output workflow described in `perlunitut`,
>>>> that doesn't set that flag?
>>> 
>>> Yes.  Anyone who knows they're decoding Latin-1 is free to do it by *not*
>>> calling any decoding function.  In general, anything that decodes to
>>> any subset of the Latin-1 character repertoire is free to represent its
>>> result in the internal Latin-1 encoding rather than the internal UTF-8.
>>> Plenty of code that never generates non-Latin-1 characters does yield
>>> downgraded results.
>> 
>> The workflow you’re describing--considering a non-decode as equivalent to decoding as Latin-1--violates the workflow that `perlunitut` prescribes. At least, I submit that *most* people who read `perlunitut` would think that document inconsistent with what you’re saying. Moreover, the “implicit” Latin-1 decode you describe does not match how Perl implements an explicit Latin-1 decode, which sets the UTF8 flag on the result.
> 
> The UTF8 flag doesn't mark a SV (with PV) as a character string.  A SV
> (with PV) without the UTF8 flag may be a character string.

Just to clarify: such a “character string” can only contain code points 0-255, right? Whereas a character string *with* the UTF8 flag may contain any code point?
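
To make the question concrete, here is a minimal sketch of the two cases. It is an illustration only: the variable names are made up, and it assumes that `Encode::decode` sets the UTF8 flag on a Latin-1 decode of a non-ASCII octet, as described above. `utf8::is_utf8` and `utf8::downgrade` are the core functions for inspecting and dropping the flag.

```perl
use strict;
use warnings;
use Encode ();

# One octet, 0xE9 ("e" with acute accent in Latin-1), as read from a file or socket.
my $octets = "\xE9";

# Explicit decode, per the perlunitut workflow.
my $decoded = Encode::decode('ISO-8859-1', $octets);

# "Implicit" decode: treat each octet directly as a code point in 0-255.
my $implicit = $octets;

# Report the internal flag and the character length of each string.
printf "decoded:  flag=%d length=%d\n",
    (utf8::is_utf8($decoded)  ? 1 : 0), length $decoded;
printf "implicit: flag=%d length=%d\n",
    (utf8::is_utf8($implicit) ? 1 : 0), length $implicit;

# String comparison looks at code points, not at the internal representation.
print "same characters: ", ($decoded eq $implicit ? "yes" : "no"), "\n";

# A code point above 255 has no flagless representation.
# With FAIL_OK (second argument) true, utf8::downgrade returns false
# instead of dying when the string cannot be downgraded.
my $wide = chr(0x100);
my $copy = $wide;
my $ok   = utf8::downgrade($copy, 1);
print "chr(0x100) downgradable: ", ($ok ? "yes" : "no"), "\n";
```

The two strings compare equal at the Perl level. They differ only in the flag that serializers such as Sereal may inspect.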

>> Sereal does indeed distinguish explicitly between character and octet strings:
>> 
>> https://github.com/Sereal/Sereal/blob/master/sereal_spec.pod
>> 
>> In this regard it’s consistent with the other such technologies that I mentioned above.
> 
> It calls the BYTES strings "binary/latin1 string", so to me it looks
> like it only distinguishes on the encoding, not on whether it's some
> special binary type.

The paragraph labeled “String Types” clarifies that the specification indeed envisions “binary” versus “unicode” strings. It describes binary strings with the word “encodingless”.

-FG