
Re: Pre-RFC: Rename SVf_UTF8 et al.

From:
Felipe Gasper
Date:
September 3, 2021 13:44
Subject:
Re: Pre-RFC: Rename SVf_UTF8 et al.
Message ID:
27978F07-9239-4167-9851-D2475054123D@felipegasper.com


> On Sep 3, 2021, at 2:30 AM, demerphq <demerphq@gmail.com> wrote:
> 
> On Thu, 2 Sept 2021 at 16:39, Dan Book <grinnz@gmail.com> wrote:
> There is way too much written here so I will be responding as I can.
> 
> On Thu, Sep 2, 2021 at 9:21 AM demerphq <demerphq@gmail.com> wrote:
> On Fri, 20 Aug 2021 at 19:48, Felipe Gasper <felipe@felipegasper.com> wrote:
> 
> > On Aug 20, 2021, at 1:05 PM, demerphq <demerphq@gmail.com> wrote:
> > 
> 
> decoding/encoding: converting a string from one form to the other in a way that transforms the codepoints from one form to a potentially different form. Traditionally we speak of decode_utf8() taking a latin1 string containing octets that make up a utf8 encoded string, and returning a string which is UTF8 on and which represents the Unicode version of those octets. For well formed input this results in no change to the underlying string, but the flag is flipped on. Vice versa we speak of encode_utf8() which converts its input to a utf8 encoded form, regardless of what form it was represented in internally.
> 
> This is incorrect. Decode converts a string of bytes at the logical level (upgraded or downgraded does not matter) and returns a string of characters at the logical level (upgraded or downgraded does not matter). It may commonly use upgraded or downgraded strings as the input or output for efficiency but this is not required.
> 
> Nope *you* are wrong.  Decoding does not use upgrading or downgrading. Decoding utf8 is logically equivalent to an upgrade operation when the string contains only codepoints 0-127. For any codepoint ABOVE that it does something very different.

Decoding doesn’t *use* upgrading or downgrading, but it accepts either form as input and may produce either form as output.
 
> Decoding most definitely DOES define the internal format of the result string. If you decode utf8 the result is a UTF8 on string. If that string contained utf8 representing codepoints above 127 then the result will be different.

This is wrong. Example:

> perl -MDevel::Peek -e'my $foo = "e"; utf8::decode($foo); Dump $foo'
SV = PV(0x7fdb0a804c70) at 0x7fdb0b00ccd0
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x7fdb0a501cc0 "e"\0
  CUR = 1
  LEN = 10

As an *implementation detail*, utf8::decode *happens* to set the flag when given UTF-8 for code points 128-255:

> perl -MDevel::Peek -e'my $foo = "é"; utf8::decode($foo); Dump $foo'
SV = PV(0x7fd600804c70) at 0x7fd6008162d0
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x7fd6018026e0 "\303\251"\0 [UTF8 "\x{e9}"]
  CUR = 2
  LEN = 10

… but it would be just as valid -- and would print() the same way -- if utf8::decode() modified the PV to contain just \xe9.
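
A minimal sketch of that equivalence, assuming a stock perl with no feature bundle enabled (the variable names are only illustrative): the same code point stored downgraded and upgraded compares equal, and has the same length() and ord().

```perl
use strict;
use warnings;

# Two copies of the same one-character string: code point 0xE9.
my $down = "\xe9";       # stored as the single octet E9, UTF8 flag off
my $up   = "\xe9";
utf8::upgrade($up);      # same character, now stored as C3 A9 with the flag on

# The internal storage differs ...
my $flag_down = utf8::is_utf8($down) ? 1 : 0;
my $flag_up   = utf8::is_utf8($up)   ? 1 : 0;

# ... but the logical string does not.
my $same_string = ($down eq $up)                 ? 1 : 0;
my $same_length = (length($down) == length($up)) ? 1 : 0;
my $same_ord    = (ord($down) == ord($up))       ? 1 : 0;

printf "UTF8 flag: down=%d up=%d\n", $flag_down, $flag_up;
printf "eq=%d length=%d ord=%d\n", $same_string, $same_length, $same_ord;
```

So either storage would be a legitimate result for utf8::decode() here; which one you get is up to the implementation.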

> FG: You’re omitting what IMO is the most obvious purpose of the flag: to indicate whether the code points that the PV stores are the plain bytes, or are the UTF-8-decoded code points. This is why you can print() the string in either upgraded or downgraded forms, and it comes out the same.
> 
> Yves: It's hard to say what you are referring to here. If you mean codepoints 0-127, then it is unsurprising as the representation of them is equivalent in ASCII and UTF8 and Latin1. But if you mean a codepoint above the ASCII plane, then no they should not come out the same. If you are piping that data to a file I would expect the octets written to that file to be different. (assuming a binary filehandle with no layers magically transforming things). If your terminal renders them the same then I assume it is doing some magic behind the scenes to deal with malformed utf8.
> 
> DB: Not correct. An upgraded or downgraded string prints identically because you are printing the logical ordinals which do not change by this operation. Whether those ordinals are interpreted as bytes or Unicode characters depends what you are printing to, but in either case the internally-stored bytes are irrelevant to the user except to determine what those logical ordinals are.
> 
> Yves: Dude, you keep saying I am not correct when what I have said is easily verifiable.
> 
> If you print chr(0xe9) to a filehandle and it does not contain the octet E9 then there is a problem
> 
> If you print chr(0xe9) to a utf8 terminal it should render a Unicode replacement character for a broken utf8 sequence.
> 
> If you print an encoded chr(0xe9) then it should render the glyph for E9.
> 
> If you think anything else is happening then prove it with code. 

These illustrate Dan’s point (assuming a UTF-8 terminal):

> perl -e'my $foo = "\xc3\xa9"; print $foo'
é

> perl -e'my $foo = "\xc3\xa9"; utf8::upgrade($foo); print $foo'
é

Whether a string is upgraded or downgraded doesn’t change its logical content; what matters is the code points.

The cases you’ve mentioned -- pattern matching, system calls, and the like -- where a string’s internal storage *does* matter, e.g.:

> perl -e'my $foo = "é"; exec "echo", $foo'
é

> perl -e'my $foo = "é"; utf8::upgrade($foo); exec "echo", $foo'
Ã©

... are bugs in Perl. This is why the feature bundles enable the features that fix (some of) those bugs. (And why IMO Sys::Binmode should join them.)


> Show me the code. As far as I know decode operations do not operate on unicode strings. Calling decode_utf8 on a string which is utf8 is a noop.

This isn’t true for either definition of “UTF-8 string”. This shows an upgraded string whose codepoints are UTF-8 being decoded:

> perl -MDevel::Peek -e'my $foo = "\xc3\xa9"; utf8::upgrade($foo); Dump $foo; utf8::decode($foo); Dump $foo;'
SV = PV(0x7f93eb804c70) at 0x7f93ec00c8d0
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x7f93eb5019c0 "\303\203\302\251"\0 [UTF8 "\x{c3}\x{a9}"]
  CUR = 4
  LEN = 10
SV = PV(0x7f93eb804c70) at 0x7f93ec00c8d0
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x7f93eb5019c0 "\303\251"\0 [UTF8 "\x{e9}"]
  CUR = 2
  LEN = 10
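
To make the "accepts either" point concrete, here is a small sketch (illustrative variable names, stock perl assumed) that decodes the same two UTF-8 octets once from a downgraded string and once from an upgraded one. Both end up as the single character U+00E9.

```perl
use strict;
use warnings;

# The octets C3 A9 (UTF-8 for U+00E9), first downgraded, then upgraded.
my $octets_down = "\xc3\xa9";
my $octets_up   = "\xc3\xa9";
utf8::upgrade($octets_up);   # still two characters, 0xC3 and 0xA9, logically

# utf8::decode() works in place and returns true on success.
my $ok_down = utf8::decode($octets_down) ? 1 : 0;
my $ok_up   = utf8::decode($octets_up)   ? 1 : 0;

printf "down: ok=%d length=%d ord=0x%X\n",
    $ok_down, length($octets_down), ord($octets_down);
printf "up:   ok=%d length=%d ord=0x%X\n",
    $ok_up, length($octets_up), ord($octets_up);
print "results are equal\n" if $octets_down eq $octets_up;
```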

> It would be just as legitimate to mutate the PV to store a single octet, 0xe9, and leave the UTF8 flag off.
> 
> Nope. That would mean that Perl would use ASCII/Latin-1 case folding rules on the result, which would be wrong. It should use Unicode case folding rules for codepoint E9 if it was decoded as that codepoint. (Change the example to \x{DF} and you can see these issues in the flesh, \x{DF} should match "ss" in Unicode, but in ASCII/Latin1 it only matches \x{DF}. The lc() version of \x{DF} is "ss" but in Latin-1/Ascii there are no multi-byte case folds).  Even more suggestive that Perl doing this would be wrong is that in fact there is NO valid Unicode encoding of codepoint E9 which is only 1 octet long. So that would be extremely wrong of Perl to use a non-Unicode encoding of unicode data, don't you think? Also, what would perl do when the codepoint doesn't fit into a single octet?  Your argument might have some merit if you were arguing that Perl could have decoded it into "\x{E9}\0\0\0" and set the UTF-32 flag, but as stated it doesn't make sense.
> 
> Not correct. Under old rules, yes, the UTF8 flag determined whether Unicode rules are used in various operations; this was an abstraction break, and so the unicode_strings feature was added to fix the problem, and enabled in feature bundles since 5.12
> 
> Ah, ok, so if you *change* the default mode of perl it does something different than I described, and that makes my comments "incorrect"? What I described is how "normal" perl without any new features enabled works. If there are features that change what I have said feel free to use them. But it doesn't change that what I said is an accurate version of how the perl internals normally function.

The problem is that Perl’s default behaviour is inconsistent. When outputting to filehandles, computing length() or ord(), comparing strings, etc., a string’s code points are the same regardless of its internal storage format. But when doing pattern matches, Perl treats upgraded/wide/UTF8-flagged strings differently from downgraded/narrow/non-flagged ones.

The latter behaviour is considered a bug.
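
A rough sketch of that inconsistency, assuming perl 5.14 or later (where unicode_strings also covers regex matching); the variable names are just for illustration. Without the feature, \w matches U+00E9 only when the string happens to be upgraded; with the feature, storage no longer matters.

```perl
use strict;
use warnings;

my $down = "\xe9";
my $up   = "\xe9";
utf8::upgrade($up);          # same logical string, different storage

my ($legacy_down, $legacy_up, $fixed_down, $fixed_up);

{
    # Default behaviour, made explicit: the match depends on storage.
    no feature 'unicode_strings';
    $legacy_down = ($down =~ /\w/) ? 1 : 0;
    $legacy_up   = ($up   =~ /\w/) ? 1 : 0;
}

{
    # With the feature on, both forms are matched under Unicode rules.
    use feature 'unicode_strings';
    $fixed_down = ($down =~ /\w/) ? 1 : 0;
    $fixed_up   = ($up   =~ /\w/) ? 1 : 0;
}

printf "without unicode_strings: down=%d up=%d\n", $legacy_down, $legacy_up;
printf "with unicode_strings:    down=%d up=%d\n", $fixed_down,  $fixed_up;
```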

-FG