
Re: Pre-RFC: Rename SVf_UTF8 et al.

From: Tom Molesworth via perl5-porters
Date: September 3, 2021 07:49
Subject: Re: Pre-RFC: Rename SVf_UTF8 et al.
Message ID: CAGXhHdkDDjKMGw-G1k23Vbmjsq7o5vxoCfcZJWgGOv_OXWXCBw@mail.gmail.com

On Fri, 3 Sept 2021 at 14:30, demerphq <demerphq@gmail.com> wrote:

> On Thu, 2 Sept 2021 at 16:39, Dan Book <grinnz@gmail.com> wrote:
>
>> On Thu, Sep 2, 2021 at 9:21 AM demerphq <demerphq@gmail.com> wrote:
>>
>>> On Fri, 20 Aug 2021 at 19:48, Felipe Gasper <felipe@felipegasper.com>
>>> wrote:
>>>
>>>>
>>>> What you call “a UTF-8 string” is what I propose we call, per existing
>>>> nomenclature (i.e., utf8::upgrade()) “a PV-upgraded string”, with
>>>> corresponding code changes. Then the term “UTF-8 string” makes sense from a
>>>> pure-Perl context without requiring Perl programmers to worry about
>>>> interpreter internals.
>>>>
>>>
>>>>
>>> No. The flag does not mean "upgraded" it means "unicode semantics, utf8
>>> encoding". Upgrading is one way to get such a string, and it might even be
>>> the most common, but the most important and likely to be correct way is
>>> explicit decoding.
>>>
>>> If we are to rename the flag then we should just rename it as the
>>> UNICODE flag. Would have saved a world of confusion.
>>>
>>
>> This is exactly what we have defined as "upgraded". Decoding does not
>> define the internal format of the resulting string at all. The only
>> internal format which is upgraded is when the UTF8 flag is on.
>>
>
> Your definition is wrong then. You seem to have "upgrading" and "decoding"
> muddled.
>
> Decoding most definitely DOES define the internal format of the result
> string. If you decode utf8 the result is a UTF8 on string. If that string
> contained utf8 representing codepoints above 127 then the result will be
> different.
>

Given this:

perl -e'use Devel::Peek; use Encode; print Dump(Encode::decode("UTF-8", "example"))'
SV = PV(0x55b88281b2b0) at 0x55b88272e4e0
  REFCNT = 2
  FLAGS = (TEMP,POK,pPOK,UTF8)
  PV = 0x55b8827f45d0 "example"\0 [UTF8 "example"]
  CUR = 7
  LEN = 10

I think the current behaviour is at least inefficient, if perhaps not
outright *wrong*... why would decoding enforce the UTF8 flag?

Put another way, if the resulting string has only codepoints 0..127, why
not leave the flag off so that string operations can be more efficient?
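
As a rough illustration of why the flag looks redundant here, the sketch below compares the decoded string with a plain literal of the same characters. It assumes only the core Encode module and the built-in utf8::is_utf8(); the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode ();

# The same seven ASCII characters, obtained two ways.
my $decoded = Encode::decode("UTF-8", "example");  # flagged, as in the Dump above
my $literal = "example";                           # plain literal

# utf8::is_utf8() reports the internal UTF8 flag, nothing more.
my $decoded_flag = utf8::is_utf8($decoded) ? 1 : 0;
my $literal_flag = utf8::is_utf8($literal) ? 1 : 0;

# String comparison works on characters, so the flag should not matter.
my $same_value = ($decoded eq $literal) ? 1 : 0;

printf "decoded: flag=%d length=%d\n", $decoded_flag, length $decoded;
printf "literal: flag=%d length=%d\n", $literal_flag, length $literal;
printf "equal:   %d\n", $same_value;
```

The two strings differ only in internal representation, which is the point of the question: for codepoints 0..127 the flag carries no extra information.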

This extends to common cases such as UTF8-safe filter chains:

echo "example" | perl -CSD -lne'use Devel::Peek; s{e$}{es}; print Dump($_)'
SV = PV(0x556c3aab2000) at 0x556c3aaeb3e8
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x556c3aae41b0 "examples"\0 [UTF8 "examples"]
  CUR = 8
  LEN = 24

If that's not taking the faster pure-ASCII path for input, this would seem
like an easy optimisation opportunity. If the behaviour only happened with
the non-validating `utf8` decoding, then maybe it could be explained away
by not wanting to walk the entire length of the string... but then I'd at
least expect it to be different with the "UTF-8" encoding layer:

echo "example" | perl -lne'use Devel::Peek; BEGIN { binmode STDIN, ":encoding(UTF-8)" } s{e$}{es}; print Dump($_)'
SV = PV(0x55d8d2b76000) at 0x55d8d2baf4a8
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x55d8d2bad300 "examples"\0 [UTF8 "examples"]
  CUR = 8
  LEN = 24
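
The two pipelines above can also be compared in one script. This is only a sketch: it reads the same ASCII line from an in-memory filehandle rather than STDIN, and flag_after_read() is a helper invented for the example. The ":raw" case is added as a baseline with no decoding layer at all.

```perl
use strict;
use warnings;

my $bytes = "example\n";   # ASCII-only input, as in the pipelines above

# Read one line through the given PerlIO layer and report whether the
# resulting scalar has the UTF8 flag set. (Helper name is hypothetical.)
sub flag_after_read {
    my ($layer) = @_;
    open my $fh, "<$layer", \$bytes or die "open failed: $!";
    my $line = <$fh>;
    close $fh;
    chomp $line;
    return utf8::is_utf8($line) ? 1 : 0;
}

my $lax_flag    = flag_after_read(":utf8");             # non-validating, like -CSD
my $strict_flag = flag_after_read(":encoding(UTF-8)");  # validating layer
my $raw_flag    = flag_after_read(":raw");              # no decoding at all

print "utf8 layer:            flag=$lax_flag\n";
print "encoding(UTF-8) layer: flag=$strict_flag\n";
print "raw layer:             flag=$raw_flag\n";
```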

So yes, decoding does set the UTF8 flag - but I'd argue that it
*shouldn't*, and the current behaviour is somewhere between a historical
accident and an oversight. To be clear, I'd expect the examples so far to
end up without the UTF8 flag, just as this literal does:

perl -e'use Devel::Peek; use utf8; my $text = "example"; print Dump($text)'
SV = PV(0x55be3864aff0) at 0x55be3866fe60
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK)
  PV = 0x55be3867f020 "example"\0
  CUR = 7
  LEN = 10
  COW_REFCNT = 1
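
For illustration only, the behaviour I'd expect can be approximated in user code. decode_lean() below is a hypothetical helper, not part of Encode: it decodes as usual and then uses utf8::downgrade() to drop the flag when every codepoint is in 0..127. It does pay for a walk over the whole string, which is the cost mentioned earlier.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of Encode): decode, then drop the UTF8
# flag when every codepoint in the result is in the range 0..127.
sub decode_lean {
    my ($encoding, $octets) = @_;
    my $str = Encode::decode($encoding, $octets);
    # Only downgrade when no character falls outside ASCII.
    utf8::downgrade($str) if $str !~ /[^\x00-\x7F]/;
    return $str;
}

my $ascii = decode_lean("UTF-8", "example");
my $wide  = decode_lean("UTF-8", "caf\xC3\xA9");   # e-acute as two octets

my $ascii_flag = utf8::is_utf8($ascii) ? 1 : 0;
my $wide_flag  = utf8::is_utf8($wide)  ? 1 : 0;

printf "ascii: flag=%d length=%d\n", $ascii_flag, length $ascii;
printf "wide:  flag=%d length=%d\n", $wide_flag,  length $wide;
```

The ASCII-only result then looks like the literal in the Dump above, while the string containing a codepoint above 127 keeps the flag.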

What am I missing here?
