> On Jan 5, 2020, at 5:55 PM, Tony Cook <tony@develop-help.com> wrote:
>
> On Sun, Jan 05, 2020 at 01:05:55PM -0500, Felipe Gasper wrote:
>>
>>> On Jan 5, 2020, at 10:07 AM, Zefram <zefram@fysh.org> wrote:
>>>
>>> Felipe Gasper wrote:
>>>> Is there any supported text-decode operation, as per the
>>>> input-decode-work-encode-output workflow described in `perlunitut`,
>>>> that doesn't set that flag?
>>>
>>> Yes. Anyone who knows they're decoding Latin-1 is free to do it by *not*
>>> calling any decoding function. In general, anything that decodes to
>>> any subset of the Latin-1 character repertoire is free to represent its
>>> result in the internal Latin-1 encoding rather than the internal UTF-8.
>>> Plenty of code that never generates non-Latin-1 characters does yield
>>> downgraded results.
>>
>> The workflow you’re describing--considering a non-decode as equivalent to decoding as Latin-1--violates the workflow that `perlunitut` prescribes. At least, I submit that *most* people who read `perlunitut` would think that document inconsistent with what you’re saying. Moreover, the “implicit” Latin-1 decode you describe mismatches how Perl implements an explicit Latin-1 decode (i.e., adds the UTF8 flag).
>
> The UTF8 flag doesn't mark a SV (with PV) as a character string. A SV
> (with PV) without the UTF8 flag may be a character string.

Just to clarify: such a “character string” can only contain code points 0-255, right? Whereas a character string *with* the UTF8 flag may contain any code point?

>> Sereal does indeed distinguish explicitly between character and octet strings:
>>
>> https://github.com/Sereal/Sereal/blob/master/sereal_spec.pod
>>
>> In this regard it’s consistent with the other such technologies that I mentioned above.
>
> It calls the BYTES strings "binary/latin1 string", so to me it looks
> like it only distinguishes on the encoding, not on whether it's some
> special binary type.

The paragraph labeled “String Types” clarifies that the specification indeed envisions “binary” versus “unicode” strings. It describes binary strings with the word “encodingless”.

-FG