perl.perl5.porters | Postings from January 2020

Re: “strict” strings?

From: demerphq
Date: January 8, 2020 12:02
Subject: Re: “strict” strings?
Message ID: CANgJU+UGEaQHvj3q_Pv_guJiS97Dzc4PNWf77=SRk45a2vLhfw@mail.gmail.com

On Wed, 8 Jan 2020 at 12:58, demerphq <demerphq@gmail.com> wrote:
>
> On Wed, 8 Jan 2020 at 11:01, André Warnier (tomcat/perl) <aw@ice-sa.com> wrote:
> >
> > On 07.01.2020 17:41, demerphq wrote:
> > > On Tue, 7 Jan 2020 at 02:59, Felipe Gasper <felipe@felipegasper.com> wrote:
> > >>
> > >>
> > >>> On Jan 6, 2020, at 7:02 PM, Dan Book <grinnz@gmail.com> wrote:
> > >>>
> > >>> On Mon, Jan 6, 2020 at 11:07 AM Felipe Gasper <felipe@felipegasper.com> wrote:
> > >>>
> > >>> Is Sereal::Encode wrong, then? It serializes Perl strings to a format that encodes binary and text as separate types, and the current implementation uses SVf_UTF8 to make that distinction.
> > >
> > > Just wanted to repeat what I said earlier, the choice of BINARY for
> > > certain text types in Sereal is merely an accident, we didn't mean
> > > "this is NOT text" by saying "BINARY", it actually means "this cannot
> > > be assumed to be utf8 encoded".
> > >
> > > I think people get confused by this subject because they have a broken
> > > mental model of what "text" is. Text is just a series of numbers which
> > > are given semantic meaning by associating them with a glyph, and it is
> > > that glyph which has semantic meaning to humans.
> > >
> > > In perl internals there are relatively few places that care about the
> > > semantic meaning of these numbers, with the predominant case being
> > > where case-transformations or case-insensitivity is implemented. Eg,
> > > lc().
> > >
> >
> > and length()..
> > and index()
>
> Nope. They need to know the encoding, not the semantic value of the contents.
>
> > [...]
> >
> > From a practical point of view, the most "annoying" cases are probably when a perl
> > program or module gets text data back from a function that belongs to another module,
> > without knowing what (if any) encoding/decoding is done or not by this module.
> > Even examining the code of the other module often does not clear up things, because
> > a) often this other module itself relies on yet another module to provide this data,
> > leading to a long recursive investigation
> > b) even if the called module would implicitly or explicitly do some encoding to utf8, this
> > still does not guarantee that the returned text string would have the utf8 flag on, or ?
> >
> > But from this already long thread, I have to think that this is, and will remain, one of
> > those "quirks" of perl5, because it seems to be a fundamental and difficult-to-change logic.
> >
> > Let me ask a question to the obviously perl-internals-experts following this thread :
> > Is there any way in which a perl program, running as a stand-alone process on a Linux
> > platform, calling some builtin or external function which is obviously meant to return a
> > "text value", can /insure/ that this text value would come back utf8-encoded, with the
> > utf8 flag set ?
>
> > ("utf8-encoded" in this case meaning that a "è" would always be represented by 2 bytes in
> > the text variable; and "insure" in this case meaning that I would not have to run a check
> > each time I call this function (or another) in order to verify that it does not return a
> > text value that does NOT have the utf8 flag set, but where my "è" IS represented by 2 bytes)
> > After years of using perl5, I am still not clear about this..
>
>
> utf8::upgrade();

E.g.:

my $string= "Ba\x{DF}";
utf8::upgrade($string);

Note that utf8::upgrade() trusts that strings with the UTF8 flag set
ARE actually utf8.
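
A minimal sketch of what that call does and does not do, using only the
core utf8:: functions and bytes::length. The variable names ($copy,
$bytes, $upgraded, $decoded) are made up for illustration. It assumes the
script has no "use utf8" and that the "è" arrives as two raw UTF-8 bytes
with the flag off, which is the case asked about above:

```perl
use strict;
use warnings;
use bytes ();    # load only, so bytes::length() is callable; no pragma in effect

# "\x{DF}" is below 0x100, so this literal starts out with the UTF8 flag off.
my $string = "Ba\x{DF}";
my $copy   = $string;

# Switch the internal storage to UTF-8 and set the flag.
utf8::upgrade($string);

# The flag and the internal byte count change; the logical string does not.
printf "flag:   before=%d after=%d\n",
    (utf8::is_utf8($copy) ? 1 : 0), (utf8::is_utf8($string) ? 1 : 0);
printf "chars:  before=%d after=%d\n", length($copy), length($string);
printf "octets: before=%d after=%d\n",
    bytes::length($copy), bytes::length($string);
printf "equal:  %s\n", ($copy eq $string ? "yes" : "no");

# Assumption: some function handed back the UTF-8 *encoding* of U+00E8
# as two bytes, with the flag off.
my $bytes = "\xC3\xA8";

# upgrade() does not decode: the two bytes stay two characters
# (U+00C3 and U+00A8), now merely stored as UTF-8 internally.
my $upgraded = $bytes;
utf8::upgrade($upgraded);

# decode() interprets the bytes as UTF-8 and yields the single character.
my $decoded = $bytes;
utf8::decode($decoded);

printf "upgraded chars=%d, decoded chars=%d, decoded ord=0x%X\n",
    length($upgraded), length($decoded), ord($decoded);
```

So upgrade() gives a predictable internal representation for a string
whose characters are already right. If the data is really encoded bytes,
it has to be decoded (utf8::decode() or Encode::decode()) by whoever knows
what encoding the other module produced.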

cheers,
yves


-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
