develooper Front page | perl.perl5.porters | Postings from January 2020

Re: “strict” strings?

From:
André Warnier (tomcat/perl)
Date:
January 8, 2020 10:00
Subject:
Re: “strict” strings?
Message ID:
d2be5e83-5cb4-5a80-8fce-723cbd9515e0@ice-sa.com
On 07.01.2020 17:41, demerphq wrote:
> On Tue, 7 Jan 2020 at 02:59, Felipe Gasper <felipe@felipegasper.com> wrote:
>>
>>
>>> On Jan 6, 2020, at 7:02 PM, Dan Book <grinnz@gmail.com> wrote:
>>>
>>> On Mon, Jan 6, 2020 at 11:07 AM Felipe Gasper <felipe@felipegasper.com> wrote:
>>>
>>> Is Sereal::Encode wrong, then? It serializes Perl strings to a format that encodes binary and text as separate types, and the current implementation uses SVf_UTF8 to make that distinction.
> 
> Just wanted to repeat what I said earlier, the choice of BINARY for
> certain text types in Sereal is merely an accident, we didn't mean
> "this is NOT text" by saying "BINARY", it actually means "this cannot
> be assumed to be utf8 encoded".
> 
> I think people get confused by this subject because they have a broken
> mental model of what "text" is. Text is just a series of numbers which
> are given semantic meaning by associating them with a glyph, and it is
> that glyph which has semantic meaning to humans.
> 
> In perl internals there are relatively few places that care about the
> semantic meaning of these numbers, with the predominant case being
> where case-transformations or case-insensitivity is implemented. E.g.,
> lc().
> 

and length()..
and index()

[...]

From a practical point of view, the most "annoying" cases are probably those where a perl
program or module gets text data back from a function that belongs to another module,
without knowing what encoding/decoding (if any) that module performs.
Even examining the code of the other module often does not clear things up, because
a) this other module often relies on yet another module to provide the data,
leading to a long recursive investigation
b) even if the called module implicitly or explicitly encodes to utf8, this still does
not guarantee that the returned text string has the utf8 flag on, or does it ?
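
To make point b) concrete, here is a minimal sketch using the core Encode module. The
variable names are only for illustration, and this shows Encode's own behaviour, not what
any particular third-party module does internally: the result of encode() is a byte string
with the utf8 flag off, and only the result of decode() carries the flag.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# One character, U+00E8 (è), written as an escape so the example
# does not depend on the encoding of the source file.
my $chars = "\x{E8}";

# Encoding to UTF-8 yields two octets (0xC3 0xA8) with the utf8 flag off.
my $bytes = encode('UTF-8', $chars);
printf "encoded: length=%d flag=%s\n",
    length($bytes), utf8::is_utf8($bytes) ? 'on' : 'off';

# Decoding those octets yields one character again, with the flag on.
my $decoded = decode('UTF-8', $bytes);
printf "decoded: length=%d flag=%s\n",
    length($decoded), utf8::is_utf8($decoded) ? 'on' : 'off';
```

So a module that "encodes to utf8" before returning hands back exactly the combination
that is hard to recognise from the outside: 2 bytes for the "è", and no flag.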

But from this already long thread, I have to conclude that this is, and will remain, one of
those "quirks" of perl5, because it seems to be fundamental and difficult-to-change logic.

Let me ask a question to the perl-internals experts following this thread:
Is there any way in which a perl program, running as a stand-alone process on a Linux
platform and calling some builtin or external function which is obviously meant to return a
"text value", can /ensure/ that this text value comes back utf8-encoded, with the
utf8 flag set ?
("utf8-encoded" here means that an "è" would always be represented by 2 bytes in
the text variable; "ensure" here means that I would not have to run a check
each time I call this function (or another one) to verify that it does not return a
text value that does NOT have the utf8 flag set, but in which my "è" IS represented by 2 bytes.)
After years of using perl5, I am still not clear about this...
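
For reference, the per-call check I would like to avoid could look roughly like the sketch
below. The helper name looks_like_undecoded_utf8 is hypothetical (it is not part of any
module), and the test is only a heuristic: it assumes that an unflagged string which
happens to be valid UTF-8 really is undecoded UTF-8, which cannot be known for certain.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns 1 if $value has no utf8 flag but its
# bytes form valid, non-ASCII UTF-8 (so an "è" is probably 2 bytes).
sub looks_like_undecoded_utf8 {
    my ($value) = @_;
    return 0 if utf8::is_utf8($value);        # flag already set
    return 0 unless $value =~ /[\x80-\xFF]/;  # pure ASCII: nothing to detect
    my $copy = $value;                        # work on a copy of the bytes
    my $ok = eval {
        # FB_CROAK makes decode() die on malformed input.
        Encode::decode('UTF-8', $copy,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

print looks_like_undecoded_utf8("\xC3\xA8") ? "suspect\n" : "fine\n";
```

The two bytes 0xC3 0xA8 are also perfectly valid Latin-1 text ("Ã¨"), so such a check can
only guess, which is exactly why I would prefer a guarantee over a test after every call.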
