Re: Perl 7: Fix string leaks?

From:
Felipe Gasper
Date:
April 1, 2021 12:06
Subject:
Re: Perl 7: Fix string leaks?
Message ID:
175C4EA5-3F9A-4944-A7D7-89A4B70209A6@felipegasper.com


> On Mar 31, 2021, at 1:55 PM, Salvador Fandiño <sfandino@gmail.com> wrote:
> 
> On 31/3/21 5:02, Felipe Gasper wrote:
>>> On Mar 30, 2021, at 4:48 PM, Salvador Fandiño <sfandino@gmail.com> wrote:
>>> 
>>> On 30/3/21 20:53, Felipe Gasper wrote:
>>>>> On Mar 30, 2021, at 12:16 PM, Salvador Fandiño <sfandino@gmail.com> wrote:
>>>>> 
>>>>> Encoding/decoding should be done at the boundaries (syscall, XS, etc.) using sensible defaults and/or allowing the user to set them (as in PerlIO). That would IMO fix most of the problems.
>>>> “Encoding at the boundaries” is essentially what Sys::Binmode achieves for POSIX OSes, FWIW.
>>> 
>>> Yeah, that part is right! What is wrong is that it uses the wrong encoding (latin1) most of the time!!!
>>> 
>>> Actually, I think it would be pretty easy to modify your module so that it gets the encoding from the environment or alternatively taking it as a parameter at least on UNIX and alikes:
>>> 
>>>  use Sys::Binmode "latin5";
>>> 
>>> Windows is a different story...
>> I’d be curious to see an implementation of what you have in mind.
> 
> I have forked your module on GitHub:
> 
>  https://github.com/salva/p5-Sys-Binmode
> 
> Note that this is not something to incorporate in your version, just a proof of concept for experimentation.
> 
> With my modified version you can say:
> 
>  use Sys::Binmode "latin3";
> 
> And it would encode data in the "latin3" encoding before doing any IO.
> 
> Also, if you say:
> 
>  use Sys::Binmode;
> 
> It inspects your environment (LC_CTYPE, LC_ALL, etc.), and sets the encoding to utf8 or latin1 depending on what it finds there.

Interesting!

I did consider an “auto-encode” feature like this but decided just to fix the built-ins’ immediate undefined-behaviour bug. What you’re talking about would be an analogue to PerlIO.

I’d have no issue with adding something like this, but I wouldn’t want the environment-encoding to be default. I’d probably favour something like `use Sys::Binmode ':env'` or such.
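
For illustration only, here is a minimal sketch of the kind of environment lookup being discussed. It is not taken from Sys::Binmode or from Salvador's fork; `guess_env_encoding` is a hypothetical helper, and the only assumptions are the usual POSIX precedence of LC_ALL over LC_CTYPE over LANG and the `language_TERRITORY.codeset@modifier` shape of locale names. It recognises just the two encodings mentioned above and gives up on anything else.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Sys::Binmode): guess an encoding name
# from POSIX locale variables. Takes a hash reference so it can be
# exercised without touching the real %ENV.
sub guess_env_encoding {
    my ($env) = @_;

    # POSIX precedence: the first of these that is set and non-empty wins.
    for my $name (qw(LC_ALL LC_CTYPE LANG)) {
        my $value = $env->{$name};
        next if !defined $value || !length $value;

        # Locale names look like "language_TERRITORY.codeset@modifier".
        if ($value =~ m/\.([^@]+)/) {
            my $codeset = lc $1;
            $codeset =~ tr/-_//d;    # "UTF-8" and "utf8" compare equal
            return 'UTF-8'      if $codeset eq 'utf8';
            return 'ISO-8859-1' if $codeset eq 'iso88591';
        }

        # No codeset (e.g. "C"), or one this sketch does not know about.
        return;
    }

    return;
}

my $encoding = guess_env_encoding(\%ENV);
print defined $encoding ? "guessed: $encoding\n" : "no guess\n";
```

An opt-in import such as the `':env'` idea above would presumably call something like this once at import time; how an unrecognised locale should be handled is a design question this sketch leaves open.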

One issue here is that, as I understand things, quite a lot of Perl code out there forgoes character decoding entirely. (My own $work, for example.) This works fine--and is documented to work--as long as you treat all the streams as binary, which can include such things as munging/creating HTTP headers.

> 5) Also, I have taken a look at some of the Windows code, and it is pretty clear to me that the only way to get this working on Windows is doing it at a lower level.

Yeah, Perl has to know what a “character” is and what a “byte” is. Right now it just treats all strings as character strings.

> 
>> “Latin-1” is kind of a funny term; as Perl uses it internally it’s not really an “encoding” so much as “just bytes”. So I disagree (of course) that Sys::Binmode uses “the wrong encoding” because the “encoding” that it gives you is whatever bytes the string stores. It’s the same “encoding” that SvPVbyte provides for XSUBs.
> 
> That's just how you decide to reason about bytes, encodings, characters, etc. There are several points of view and all of them can be valid and have their advantages.
> 
> But the issue I see here is that if I have a variable in Perl say, for instance, $fn then doing...
> 
>  open $f, ">", $fn;
> 
> should create a file with the correct name, without me, the programmer having to worry about encoding issues.
> 
> We are in 2021, almost every operating system released in the last decade uses some form of Unicode by default. So, I don't think that something that by default just uses an encoding (or a no-encoding) that doesn't match what your OS uses could be a good idea.

I actually agree, except Perl itself uses a “no-encoding” to talk to the OS. So Sys::Binmode--even as it stands--just makes that behaviour consistent regardless of the SV internals.

If we could, though, teach Perl to differentiate a “character string” from a “byte string”, then we could do something like:

my $foo_utf8_bytes = 'épée';

my $foo_char = text::decode($foo_utf8_bytes);

mkdir $foo_char;   # auto-encodes per the environment

> 
> And I already understand that the programmer can call Encode::encode("utf8", $fn) before calling open. Actually, if he does that, he doesn't need your module at all!

... as long as that programmer can assume Perl stores that encoded string downgraded, which AFAIK isn’t guaranteed.
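
A small sketch of the downgraded/upgraded distinction, using only core Perl and Encode. The explicit `utf8::upgrade` call stands in for whatever might upgrade a string behind the programmer's back (that stand-in is an assumption for the demo, not a claim about when Perl actually does it). The two scalars compare equal as Perl strings, yet their internal buffers differ, and the internal buffer is what a built-in hands to the OS when it ignores the distinction.

```perl
use strict;
use warnings;
use Encode ();

# "épée" as characters, then encoded to UTF-8: six bytes.
my $encoded = Encode::encode('UTF-8', "\x{e9}p\x{e9}e");

# Same string contents, but force the upgraded internal representation.
my $upgraded = $encoded;
utf8::upgrade($upgraded);

my $equal        = ($encoded eq $upgraded) ? 1 : 0;
my $logical_len  = length $upgraded;
my $internal_len = do { use bytes; length $upgraded };

printf "equal=%d logical=%d internal=%d upgraded=%d\n",
    $equal, $logical_len, $internal_len,
    utf8::is_utf8($upgraded) ? 1 : 0;

# Being explicit: downgrade before handing the string to the OS.
# utf8::downgrade() dies if the string has code points above 0xFF,
# which cannot happen for correctly encoded bytes.
my $for_os = $upgraded;
utf8::downgrade($for_os);
my $for_os_len = do { use bytes; length $for_os };
```

Here the logical length stays 6 while the upgraded internal buffer grows to 10 bytes, because each of the four bytes above 0x7F takes two bytes once upgraded.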

-F