Re: “strict” strings?
January 7, 2020 09:04
Message ID: CANgJU+W5=FVTzZxPCE5SO3gsdjps2WOt7GxxV4a_B00KtuQadw@mail.gmail.com
On Mon, 6 Jan 2020 at 03:41, Felipe Gasper <email@example.com> wrote:
> >> The UTF8 flag doesn't mark a SV (with PV) as a character string. A SV
> >> (with PV) without the UTF8 flag may be a character string.
> > Just to clarify: such a “character string” can only contain code points 0-255, right? Whereas a character string *with* the UTF8 flag may contain any code point?
> Follow-up question: does any binary/text-aware encoder (CBOR, Sereal, etc.) ever encode a non-UTF8-flag SV as text rather than binary?
I just wanted to very firmly say that the choice of names for tags in
Sereal was fairly arbitrary, and should not be used as part of an
argument about how Perl works.
So, strictly speaking, Perl has two types of string data:
"binary/latin-1/extended-ascii" and "utf8". Technically the utf8
variant isn't even real "unicode", as we do not place certain
restrictions on which code points the utf8 sequence can contain, and
real unicode utf8 data is not allowed to contain certain code points
(surrogates, for example, or anything above U+10FFFF).
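To make the "not real unicode" point concrete, here is a minimal
sketch, assuming a reasonably modern perl. The variable names are
purely illustrative. It only shows that perl will happily build and
flag strings holding code points that well-formed Unicode UTF-8 would
reject.

```perl
use strict;
use warnings;
no warnings 'utf8';   # silence surrogate / non-Unicode warnings on older perls

# Perl's internal "utf8" is laxer than standard UTF-8: it can hold code
# points that Unicode does not permit in well-formed UTF-8.
my $above_unicode = chr(0x110000);   # past the Unicode maximum, U+10FFFF
my $surrogate     = chr(0xD800);     # a UTF-16 surrogate, not a character

# Both strings carry the UTF8 flag and round-trip through ord().
printf "U+%X flagged=%d\n", ord($above_unicode),
    (utf8::is_utf8($above_unicode) ? 1 : 0);
printf "U+%X flagged=%d\n", ord($surrogate),
    (utf8::is_utf8($surrogate) ? 1 : 0);
```

Writing such strings to a strict UTF-8 consumer is a separate question;
the sketch only covers what perl accepts internally.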
The UTF8 flag is basically a commitment between different parts of the
software that a given string can be processed properly using utf8
based functions. When it is off it does not necessarily mean "this
text is binary data", it means "this text is not utf8". There is a
tradition in perl internals hacking circles to refer to these two
flavours as "binary" and "utf8", as the alternatives to the term
"binary" are wanting in various ways. "NOTUTF8" isn't right, because
a non-utf8 buffer can contain utf8 without being flagged as such, and
vice versa, someone can take "real" binary data and accidentally
upgrade it to its equivalent utf8 representation. "LATIN-1" is close,
as codepoint-wise it is correct, but we don't apply the full set of
LATIN-1 case-folding semantics to strings that do not have the utf8
flag on, so that doesn't entirely fit either. So when you go through
the possible terms, "binary" starts sounding appealing.
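Both directions of that "NOTUTF8" problem can be sketched with the
built-in utf8:: utility functions. This is only an illustration; the
sample byte values and variable names are made up for the purpose.

```perl
use strict;
use warnings;
use bytes ();   # load bytes::length() without enabling the bytes pragma

# Case 1: an unflagged buffer that happens to hold well-formed UTF-8.
my $octets  = "\xC3\xA9";            # the UTF-8 encoding of U+00E9
my $decoded = $octets;
utf8::decode($decoded);              # reinterpret those octets as UTF-8
printf "octets:  flag=%d length=%d\n",
    (utf8::is_utf8($octets) ? 1 : 0), length($octets);
printf "decoded: flag=%d length=%d\n",
    (utf8::is_utf8($decoded) ? 1 : 0), length($decoded);

# Case 2: "real" binary data that gets upgraded along the way.
my $binary   = "\xFF\xD8\xFF\xE0";   # hypothetical binary header bytes
my $upgraded = $binary;
utf8::upgrade($upgraded);            # same code points, new internal encoding
printf "upgraded: flag=%d length=%d internal_bytes=%d equal=%s\n",
    (utf8::is_utf8($upgraded) ? 1 : 0), length($upgraded),
    bytes::length($upgraded), ($upgraded eq $binary ? "yes" : "no");
```

In the second case the string still compares equal and has the same
length at the Perl level; only the internal buffer and the flag differ.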
That something is utf8 does not mean it contains text data, and that
something is not-utf8 does not mean it is not unicode text data; both
are entirely valid scenarios. It simply says something about the text
operations that are valid and legal when processing the data. So for
instance, if you have a sequence of octets that happens to be a valid
utf8 sequence, and you feed it to the regex engine with the utf8 flag
ON the regex engine will use utf8 macros to read the codepoints it
contains and use unicode semantics to apply rules like
case-insensitive matching and whatnot. If the flag is OFF then it
will use "binary" processing semantics, and treat the sequence of
bytes as a sequence of octets, and it will apply ASCII
case-insensitive matching rules to the codepoint values that it reads.
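A small sketch of that regex behaviour, assuming the default rules are
in effect (no "use feature 'unicode_strings'", no locale, and no /u or
/a modifier). The variable names are illustrative only.

```perl
use strict;
use warnings;
# Deliberately NOT enabling feature 'unicode_strings', so the regex
# engine chooses its case-folding rules based on the UTF8 flag.

my $plain    = "\xC9";       # code point 0xC9, UTF8 flag off
my $upgraded = $plain;
utf8::upgrade($upgraded);    # same code point, UTF8 flag on

my $same      = $plain eq $upgraded ? 1 : 0;
my $match_off = $plain    =~ /\xE9/i ? 1 : 0;   # ASCII-only folding
my $match_on  = $upgraded =~ /\xE9/i ? 1 : 0;   # Unicode folding

print "strings compare equal:      $same\n";
print "flag off matches /\\xE9/i:   $match_off\n";
print "flag on  matches /\\xE9/i:   $match_on\n";
```

The two strings are equal as far as eq is concerned, yet only the
flagged one folds 0xC9 to 0xE9 under /i.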
So we could just as easily have called the BINARY tag "TEXT_BYTES" or
"OCTETS" or something like that. The choice of binary was purely a nod
to conventional discussion on this subject in the perl core
development group, in particular that related to working on the regex
engine, which is one of the few parts of Perl that really cares about
the distinction in any significant way. Most things in perl really
don't care about the utf8 flag beyond "hey this operation is kinda
dumb on a utf8 string", but the parts that have to do case-folding,
where one must consider the semantic meaning of a codepoint, are the
exception, eg, lc(), uc(), m//i, s///i, etc. Most of the rest of perl
really doesn't care, and doesn't think of either kind of string as
"text" but rather as a buffer, where the utf8 flag only determines
which codepath to use to decode the data internally.
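The same split shows up with uc(). A minimal sketch, again assuming
default rules (no 'unicode_strings' feature and no locale), with
illustrative variable names:

```perl
use strict;
use warnings;
# Deliberately NOT enabling feature 'unicode_strings' here.

my $plain    = "caf\xE9";    # last code point is 0xE9, UTF8 flag off
my $upgraded = $plain;
utf8::upgrade($upgraded);    # identical code points, UTF8 flag on

# Operations that do not care about the flag agree on both strings.
my $same_length = length($plain) == length($upgraded) ? 1 : 0;
my $same_string = $plain eq $upgraded ? 1 : 0;

# Case mapping has to interpret code points, so here the flag matters.
my $uc_plain    = uc $plain;      # 0xE9 is left untouched
my $uc_upgraded = uc $upgraded;   # 0xE9 becomes 0xC9

printf "same length=%d same string=%d\n", $same_length, $same_string;
printf "uc, flag off: last code point 0x%02X\n", ord(substr($uc_plain, -1));
printf "uc, flag on:  last code point 0x%02X\n", ord(substr($uc_upgraded, -1));
```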
perl -Mre=debug -e "/just|another|perl|hacker/"