Re: “strict” strings?

From: Tony Cook
Date: January 6, 2020 02:53
Subject: Re: “strict” strings?
Message ID: 20200106025310.GB5228@mars.tony.develop-help.com

On Sun, Jan 05, 2020 at 09:11:37PM -0500, Felipe Gasper wrote:
> 
> > On Jan 5, 2020, at 5:55 PM, Tony Cook <tony@develop-help.com> wrote:
> > 
> > On Sun, Jan 05, 2020 at 01:05:55PM -0500, Felipe Gasper wrote:
> >> 
> >>> On Jan 5, 2020, at 10:07 AM, Zefram <zefram@fysh.org> wrote:
> >>> 
> >>> Felipe Gasper wrote:
> >>>> Is there any supported text-decode operation, as per the
> >>>> input-decode-work-encode-output workflow described in `perlunitut`,
> >>>> that doesn't set that flag?
> >>> 
> >>> Yes.  Anyone who knows they're decoding Latin-1 is free to do it by *not*
> >>> calling any decoding function.  In general, anything that decodes to
> >>> any subset of the Latin-1 character repertoire is free to represent its
> >>> result in the internal Latin-1 encoding rather than the internal UTF-8.
> >>> Plenty of code that never generates non-Latin-1 characters does yield
> >>> downgraded results.
> >> 
> >> The workflow you’re describing--considering a non-decode as equivalent to decoding as Latin-1--violates the workflow that `perlunitut` prescribes. At least, I submit that *most* people who read `perlunitut` would think that document inconsistent with what you’re saying. Moreover, the “implicit” Latin-1 decode you describe mismatches how Perl implements an explicit Latin-1 decode (i.e., adds the UTF8 flag).
> > 
> > The UTF8 flag doesn't mark a SV (with PV) as a character string.  A SV
> > (with PV) without the UTF8 flag may be a character string.
> 
> Just to clarify: such a “character string” can only contain code points 0-255, right? Whereas a character string *with* the UTF8 flag may contain any code point?

Certainly, and an attempt to add a code point over 0xff to a SV
without the flag will upgrade the SV, enabling the flag.
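
A minimal sketch of that upgrade, using only the core utf8::is_utf8()
function to peek at the flag. The variable names are illustrative, and
the flag is inspected here purely to show the internal change, not as a
test for "is this a character string":

```perl
use strict;
use warnings;

# Every code point fits in 0x00-0xff, so perl can keep the
# string in its downgraded (flag off) representation.
my $s = "caf" . chr(0xE9);
my $before = utf8::is_utf8($s) ? 1 : 0;

# Appending a code point above 0xff cannot be stored that way,
# so perl upgrades the SV and turns the flag on.
$s .= chr(0x100);
my $after = utf8::is_utf8($s) ? 1 : 0;

printf "before: %d, after: %d, length: %d\n", $before, $after, length $s;
# prints "before: 0, after: 1, length: 5"
```

The length is counted in characters both before and after, so the
upgrade changes the storage and not the string's value.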

> 
> >> Sereal does indeed distinguish explicitly between character and octet strings:
> >> 
> >> https://github.com/Sereal/Sereal/blob/master/sereal_spec.pod
> >> 
> >> In this regard it’s consistent with the other such technologies that I mentioned above.
> > 
> > It calls the BYTES strings "binary/latin1 string", so to me it looks
> > like it only distinguishes on the encoding, not on whether it's some
> > special binary type.
> 
> The paragraph labeled “String Types” clarifies that the specification indeed envisions “binary” versus “unicode” strings. It describes binary strings with the word “encodingless”.

That's in conflict with the table of Tags.

That paragraph also says "STR_UTF8 which is expected to contain
valid canonical UTF8 encoded unicode text data", but encoding a perl
SV that contains non-Unicode code points appears to succeed.  I wonder
how it handles surrogates (properly paired or not) in Java/JavaScript
strings.
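
For reference, a rough sketch of the kind of perl string in question.
It deliberately does not call Sereal; it only shows that perl itself
will hold a code point above U+10FFFF and a lone surrogate in a string.
It assumes perl 5.14 or later, where the non_unicode and surrogate
warning categories exist:

```perl
use strict;
use warnings;
# Silence the categories that cover these code points, in case
# an operation below would otherwise warn about them.
no warnings qw(non_unicode surrogate);

my $above_max = chr(0x110000);   # beyond the Unicode maximum U+10FFFF
my $surrogate = chr(0xD800);     # an unpaired high surrogate
my $str = "a" . $above_max . $surrogate;

# Collect the code points perl reports for each character.
my @cps = map { sprintf "0x%X", ord } split //, $str;

printf "length: %d\n", length $str;
print "code points: @cps\n";
```

Whether a serializer accepts, rejects, or rewrites such a string is a
separate question that this sketch does not answer.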

I can see some value in a binary type, but I don't believe it would
ever be implemented with the current SVf_UTF8 flag, since that isn't
what that flag is for.

For back-compat we'd really need two new flags:

1. a flag which indicates that the SV is flagged as one or the other
2. a flag (only valid when 1. is set) which is set for character strings

and then anything that deals with character or binary data can
set/check those flags.
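
As a rough model of how the two flags would combine, here is a sketch
in plain perl. Nothing in it exists in perl today: the constant names,
the bit values and string_kind() are all made up for illustration, and
a real implementation would be SV flags in C.

```perl
use strict;
use warnings;

# Hypothetical flag bits, invented for this sketch.
use constant {
    STRF_TYPED     => 0x1,   # flag 1: the string is marked as one kind or the other
    STRF_CHARACTER => 0x2,   # flag 2: character string; only valid with STRF_TYPED
};

# Classify a flags value the way code checking the flags might.
sub string_kind {
    my ($flags) = @_;
    # Without flag 1 nothing is known, which is what existing
    # code that never sets either flag would produce.
    return 'unmarked' unless $flags & STRF_TYPED();
    return ($flags & STRF_CHARACTER()) ? 'character' : 'binary';
}

print string_kind(0), "\n";                               # unmarked
print string_kind(STRF_TYPED), "\n";                      # binary
print string_kind(STRF_TYPED | STRF_CHARACTER), "\n";     # character
print string_kind(STRF_CHARACTER), "\n";                  # unmarked
```

The last case is the back-compat point: flag 2 on its own means
nothing, so only code that sets flag 1 opts in to the distinction.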

Tony
