perl.perl5.porters | Postings from September 2021

Re: Pre-RFC: Rename SVf_UTF8 et al.

September 2, 2021 13:53
On Mon, 23 Aug 2021 at 01:59, Yuki Kimoto wrote:

> Personally, I'm starting to agree on the goal of Felipe.
> 1. Being able to distinguish between Text and Bytes from user

It seems like what you want is to redefine our use of the Unicode-string/UTF-8
flag to mean "Text" and to call the other form "Bytes", but that doesn't
make sense. We define non-UTF-8 data to be implicitly ASCII/Latin-1: ASCII
because of the case folding rules, and Latin-1 because of the conversion to
Unicode, which defines code points 0-255 to be equivalent to code points
0-255 in Latin-1. And we implicitly assume that equivalence in our operations.
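
To make the ASCII/Latin-1 point concrete, here is a small sketch. It assumes no locale is in effect and switches off the unicode_strings feature explicitly so that the classic semantics apply; the variable names are only for illustration.

```perl
use strict;
use warnings;
no feature 'unicode_strings';   # assumption: classic semantics, no locale

my $native   = "\xE9";          # code point 233, UTF8 flag off
my $upgraded = $native;
utf8::upgrade($upgraded);       # same code point, UTF8 flag on

# Latin-1 equivalence: the two strings differ only in internal storage.
my $same_string = $native eq $upgraded ? 1 : 0;

# ASCII case rules: with the flag off, uc() leaves code point 233 alone;
# with the flag on, Unicode rules map it to code point 201.
my $native_uc   = uc($native);
my $upgraded_uc = uc($upgraded);

printf "flags: %s / %s\n",
    (utf8::is_utf8($native)   ? "on" : "off"),
    (utf8::is_utf8($upgraded) ? "on" : "off");
printf "eq: %d\n", $same_string;
printf "uc: U+%04X / U+%04X\n", ord($native_uc), ord($upgraded_uc);
```

The comparison succeeds because upgrading changes the internal representation, not the string. The differing uc() results are the flag-dependent case behaviour that the unicode_strings feature exists to remove.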

> 2. Text is Unicode code point which is represented by UTF-8

chr(65) returns a Latin-1 (i.e. non-UTF8-flagged) character/string "A" which
happens to be octet-identical but not flag-identical to the Unicode
character "A". Are you suggesting that chr() doesn't return Text? Wouldn't
that be weird? And in concatenation, what is supposed to happen when you
have Bytes . Text? Is that even legal in your scheme?

Taking this further, is an operation like lc() even legal on "Bytes"?
Currently: lc(chr(65)) eq "a". Since chr(65) doesn't return a Unicode
character, and thus is not Text, shouldn't the lc() die? Or would you also
want to change that?
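
The current behaviour behind these questions can be checked with a short sketch; the variable names are only for illustration.

```perl
use strict;
use warnings;

my $ascii = chr(65);                 # "A", UTF8 flag off
my $wide  = "\x{263A}";              # code point above 255, UTF8 flag on
my $two   = "\xC3\xA9";              # two flag-off characters (or octets)

# Concatenation is legal today: the flag-off side is upgraded as Latin-1,
# so each of its characters keeps its code point.
my $joined = $two . $wide;

# lc() works on the flag-off "A" exactly as on a flag-on "A".
my $flagged = chr(65);
utf8::upgrade($flagged);

printf "chr(65) flag: %s\n", utf8::is_utf8($ascii) ? "on" : "off";
printf "joined: length %d, first code point U+%04X\n",
    length($joined), ord($joined);
printf "lc: %s / %s\n", lc($ascii), lc($flagged);
```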

> 3. Perl config has default OS text character set and OS file system
> character set

As far as I know, the assumption that all non-Unicode data is Latin-1 is
baked into Perl in a very firm way. So I don't see how this could be related
to the OS.

> 4. Perl standard function(print, open, etc) output string by encoding
> above 3 character set if the string is Text.

I don't see how we could change this. Anyone who cares exactly how data is
emitted to disk or any other "wire" format should be using Encode to
explicitly encode their data.
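
As a minimal sketch of what "explicitly encode" means, assuming UTF-8 is the wire format you want (the sample string is made up):

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text   = "caf\x{E9} \x{263A}";        # 6 characters
my $octets = encode('UTF-8', $text);      # 9 octets, UTF8 flag off

# Write the octets through a raw handle so no I/O layer re-encodes them.
# An in-memory handle stands in for a file or socket here.
my $wire = '';
open(my $fh, '>:raw', \$wire) or die "open failed: $!";
print {$fh} $octets;
close($fh) or die "close failed: $!";

# The reader decodes explicitly at the other boundary.
my $round_trip = decode('UTF-8', $wire);

printf "chars: %d, octets: %d\n", length($text), length($octets);
```

Encoding at the output boundary and decoding at the input boundary keeps the decision about the wire format in your code rather than in any default.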

Perl strings are what perl strings are. I find that the people who have
trouble with them are usually the ones who like to pretend they work
differently than they do, instead of just respecting how they work and
being very explicit when they need to care, which for me personally has
been pretty rarely, eg, specialized output code or processing code.
(Parsing emails is a good place where you can get burned with encoding
issues and learn a lot.)

Having said that, I have seen a lot of people, for one reason or another, get
encoding wrong in various ways, especially with MySQL or other over-the-wire
situations. Double-encoding errors are common (e.g. where people accidentally
upgrade already encoded but flag-off UTF-8 data). At work we have a function
called recurse_decode_utf8() which takes a string and does its best to
"reduce" it to its minimal form by repeatedly turning off the utf8 flag,
executing decode_utf8() on the string, and then downgrading, until the
decode operation throws an error. Widespread use of this function on string
data almost completely eliminated all of our utf8 problems. (I'll post the
code in another mail.)
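
Until then, here is a rough sketch of the idea as described above. It is not the actual work code: the name recurse_decode_utf8_sketch, the round limit, the ASCII shortcut, and the use of strict 'UTF-8' decoding with FB_CROAK are all assumptions made for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical sketch: peel off layers of UTF-8 encoding until decoding fails.
sub recurse_decode_utf8_sketch {
    my ($str) = @_;
    return $str unless defined $str;

    my $max_rounds = 10;                  # assumed safety limit
    for (1 .. $max_rounds) {
        my $bytes = $str;

        # If the string cannot be downgraded it holds code points above
        # 255, so it cannot be a sequence of encoded octets.
        last unless utf8::downgrade($bytes, 1);

        # Pure ASCII decodes to itself, so there is nothing left to undo.
        last unless $bytes =~ /[^\x00-\x7F]/;

        # A failed decode means the string is already in minimal form.
        my $decoded = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
        last unless defined $decoded;

        $str = $decoded;
    }
    return $str;
}

# U+00E9 encoded to UTF-8 twice comes back as a single character.
my $double  = "\xC3\x83\xC2\xA9";
my $reduced = recurse_decode_utf8_sketch($double);
printf "length %d, code point U+%04X\n", length($reduced), ord($reduced);
```

Note that an approach like this is lossy by design: a Latin-1 string that happens to be valid UTF-8 gets decoded too, so it only suits data you already believe was meant to be UTF-8.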


perl -Mre=debug -e "/just|another|perl|hacker/"
