Re: "use v5.36.0" should imply UTF-8 encoded source

From: Felipe Gasper
Date: August 1, 2021 00:54
Subject: Re: "use v5.36.0" should imply UTF-8 encoded source
Message ID: E58800AA-9374-4706-B2CB-AA90DE54E848@felipegasper.com


> On Jul 31, 2021, at 5:18 PM, Dan Book <grinnz@gmail.com> wrote:
> 
> On Sat, Jul 31, 2021 at 3:33 PM Darren Duncan <darren@darrenduncan.net> wrote:
> On 2021-07-31 12:17 p.m., Darren Duncan wrote:
> > Now conversely, I don't have a problem with actually waiting until v5.38 to 
> > fully implement the change IF 5.36 contained some kind of precursor to prepare 
> > the way, such as that 5.36 would issue warnings for code with a "use 5.36" that 
> > wasn't valid UTF-8, saying that this code might parse differently under "use 
> > 5.38".  That would let users know in a transitional version what might be a 
> > problem before it is.
> 
> So to clarify, I have a very specific proposal:
> 
> 1.  That a "use 5.36;" will behave the same with respect to the utf8 stuff as 
> "use 5.34;", but that if the source file / input stream is not entirely valid 
> UTF-8 under a strict interpretation, the Perl interpreter will issue a warning 
> saying so and why it matters.
> 
> 2.  That a "use 5.38;", if the source file / input stream is not entirely valid 
> UTF-8 under a strict interpretation, the Perl interpreter will issue a fatal 
> error / die saying so and why it matters, and that as a result the parsing has 
> failed.
> 
> So a key thing is that the UTF-8 mode triggered by 5.36/5.38 is strict, doesn't 
> use substitution characters or delete characters, it either passes the input 
> unchanged as valid UTF-8 or it complains.  If "use utf8;" already does this then 
> it's the same, and otherwise it is stricter.
> 
> Since this isn't spelled the same as "use utf8;" the new feature doesn't need to 
> be identical in every way, we don't have to limit ourselves to that and the 
> issues of silent corruption from substitution/deleting being the implicit 
> operation, if that is what it used to do.
> 
> On a further point, unlike a lot of the other "use" statements, I assume there 
> is no good reason for a single file to be a mixture of literal encodings, and so 
> having multiple "use encoding" statements in a file, either explicit or implied 
> by a "use 5.38" etc, should be considered an error, and any occurrence of one 
> would be expected to describe the entire file and not just the lexical scope it 
> appears in; unlike strict/warnings/etc, it's not flipped on or off mid-file.
> 
> You seem to be interpreting the major problem here as "source code which is not valid or intended as UTF-8". This is not a significant issue and its failure mode is rather obvious. There isn't a further discussion to be had there.
> 
> The subtle issue is that "use utf8" changes (valid UTF-8) non-ASCII literal strings in the code to have different contents. Literal strings *must* be used differently depending on whether "use utf8" was active where they were written. Without "use utf8", it's a byte string; with "use utf8", it's a character string.

Another way to look at it: the content of the parsed strings actually differs between the two:

my $x = do { no utf8; "éé" };
my $y = do { use utf8; "éé" };

In the above, $x is a sequence of 4 code points (195, 169, 195, 169), whereas $y is a sequence of 2 code points (233, 233). That’s it; there is no other difference between $x and $y. Perl doesn’t know that $x is a “byte string” and $y is a “character string”; it just knows the code points.
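The difference can be checked without relying on how the file is saved. What follows is only a sketch: it writes the UTF-8 bytes of "éé" as hex escapes and uses utf8::decode() to stand in for what the parser does to a literal under "use utf8". The helper code_points() is a name made up for this illustration, not anything from core.

```perl
use strict;
use warnings;

# The UTF-8 encoding of "éé", spelled with escapes so the result does not
# depend on the encoding of this source file. This stands in for the
# literal as parsed WITHOUT "use utf8".
my $x = "\xC3\xA9\xC3\xA9";

# Decoding a copy in place stands in for the literal as parsed WITH
# "use utf8". (Illustrative assumption, not the parser's actual code path.)
my $y = $x;
utf8::decode($y) or die "input was not valid UTF-8";

# Hypothetical helper: list a string's code points, comma-separated.
sub code_points { return join ',', map { ord } split //, $_[0] }

print code_points($x), "\n";    # 195,169,195,169
print code_points($y), "\n";    # 233,233
print length($x), " vs ", length($y), "\n";    # 4 vs 2
```

Both strings are plain Perl scalars; only the stored code points differ, which is why code that handles such a literal has to know which form it was written in.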

This would, I think, easily be the most disruptive, potentially “surprising” change yet introduced to a feature bundle.

-FG