
Re: “strict” strings?

From:
demerphq
Date:
January 8, 2020 11:58
Subject:
Re: “strict” strings?
Message ID:
CANgJU+XQLjEGVPi3_n8y1AHXm0iadC3HLtptn7XXp4F0OuKWOg@mail.gmail.com
On Wed, 8 Jan 2020 at 11:01, André Warnier (tomcat/perl) <aw@ice-sa.com> wrote:
>
> On 07.01.2020 17:41, demerphq wrote:
> > On Tue, 7 Jan 2020 at 02:59, Felipe Gasper <felipe@felipegasper.com> wrote:
> >>
> >>
> >>> On Jan 6, 2020, at 7:02 PM, Dan Book <grinnz@gmail.com> wrote:
> >>>
> >>> On Mon, Jan 6, 2020 at 11:07 AM Felipe Gasper <felipe@felipegasper.com> wrote:
> >>>
> >>> Is Sereal::Encode wrong, then? It serializes Perl strings to a format that encodes binary and text as separate types, and the current implementation uses SVfUTF8 to make that distinction.
> >
> > Just wanted to repeat what I said earlier, the choice of BINARY for
> > certain text types in Sereal is merely an accident, we didn't mean
> > "this is NOT text" by saying "BINARY", it actually means "this cannot
> > be assumed to be utf8 encoded".
> >
> > I think people get confused by this subject because they have a broken
> > mental model of what "text" is. Text is just a series of numbers which
> > are given semantic meaning by associating them with a glyph, and it is
> > that glyph which has semantic meaning to humans.
> >
> > In perl internals there are relatively few places that care about the
> > semantic meaning of these numbers, with the predominant case being
> > where case-transformations or case-insensitivity is implemented. E.g.,
> > lc().
> >
>
> and length()..
> and index()

Nope. They need to know the encoding, not the semantic value of the contents.
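
A minimal sketch of that distinction, not taken from the thread: the variable names are made up for illustration, and the only calls used are the core utf8::upgrade() and utf8::encode(). length() and index() give the same answers whichever way the string is stored internally; they only change when the string itself is turned into octets.

```perl
use strict;
use warnings;

# "\xE8" is è in Latin-1: one character, stored as one byte, UTF8 flag off.
my $native   = "caf\xE8";
my $upgraded = $native;
utf8::upgrade($upgraded);    # same characters, now stored as UTF-8 internally

# length() and index() work in characters, so the internal storage
# makes no difference to their results.
my @lengths = (length($native), length($upgraded));
my @offsets = (index($native, "\xE8"), index($upgraded, "\xE8"));

# Encoding to octets produces a different string: è becomes two octets.
my $octets = $native;
utf8::encode($octets);

print "lengths: @lengths\n";
print "offsets: @offsets\n";
print "octet length: ", length($octets), "\n";
```

Both copies compare equal with eq as well; only $octets is a different string.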

> [...]
>
> From a practical point of view, the most "annoying" cases are probably when a perl
> program or module gets text data back from a function that belongs to another module,
> without knowing what (if any) encoding/decoding is done or not by this module.
> Even examining the code of the other module often does not clear up things, because
> a) often this other module itself relies on yet another module to provide this data,
> leading to a long recursive investigation
> b) even if the called module would implicitly or explicitly do some encoding to utf8, this
> still does not guarantee that the returned text string would have the utf8 flag on, or ?
>
> But from this already long thread, I have to think that this is, and will remain, one of
> those "quirks" of perl5, because it seems to be a fundamental and difficult-to-change logic.
>
> Let me ask a question to the obviously perl-internals-experts following this thread :
> Is there any way in which a perl program, running as a stand-alone process on a Linux
> platform, calling some builtin or external function which is obviously meant to return a
> "text value", can /ensure/ that this text value would come back utf8-encoded, with the
> utf8 flag set ?

> ("utf8-encoded" in this case meaning that a "è" would always be represented by 2 bytes in
> the text variable; and "ensure" in this case meaning that I would not have to run a check
> each time I call this function (or another) in order to verify that it does not return a
> text value that does NOT have the utf8 flag set, but where my "è" IS represented by 2 bytes)
> After years of using perl5, I am still not clear about this..


utf8::upgrade($string);

If the string is already flagged as utf8 then this is a no-op, so it
relies on you not turning the flag on for data that is not utf8.
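
A rough sketch of what that looks like in practice, under the assumption that the incoming data is either Latin-1 characters or unflagged UTF-8 octets. The variable names are hypothetical; utf8::upgrade(), utf8::decode() and utf8::is_utf8() are the core functions. The second case is the one asked about above (flag off, but "è" already held as 2 bytes): there utf8::upgrade() keeps the two octets as two characters, so utf8::decode() is probably what is wanted instead.

```perl
use strict;
use warnings;

# Case 1: è held as one Latin-1 character, UTF8 flag off.
my $latin1 = "\xE8";
utf8::upgrade($latin1);              # storage becomes UTF-8, flag goes on
my $flag_on = utf8::is_utf8($latin1) ? 1 : 0;
my $chars   = length($latin1);       # still one character
utf8::upgrade($latin1);              # already flagged: nothing changes

# Case 2: octets that already hold UTF-8 for è, but with the flag off.
my $octets = "\xC3\xA8";
my $wrong  = $octets;
utf8::upgrade($wrong);               # treats the octets as two Latin-1 characters
my $right  = $octets;
utf8::decode($right);                # interprets the octets as UTF-8 instead

printf "upgraded latin1: flag=%d chars=%d\n", $flag_on, $chars;
printf "upgraded octets: chars=%d\n", length($wrong);
printf "decoded octets:  chars=%d same_as_latin1=%d\n",
    length($right), ($right eq $latin1 ? 1 : 0);
```

Which of the two calls applies depends on knowing what the other module returned, which utf8::upgrade() on its own cannot tell you.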

Yves
-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
