perl.perl5.porters | Postings from July 2021

Re: "use v5.36.0" should imply UTF-8 encoded source

From: Dan Book
Date: July 31, 2021 21:18
Subject: Re: "use v5.36.0" should imply UTF-8 encoded source
Message ID: CABMkAVVdtZfzNwDrf4NGkwmRG2YfHo4DuZ6+5mHWtc8Lc+osGQ@mail.gmail.com

On Sat, Jul 31, 2021 at 3:33 PM Darren Duncan <darren@darrenduncan.net>
wrote:

> On 2021-07-31 12:17 p.m., Darren Duncan wrote:
> > Now conversely, I don't have a problem with actually waiting until v5.38
> > to fully implement the change IF 5.36 contained some kind of precursor to
> > prepare the way, such as that 5.36 would issue warnings for code with a
> > "use 5.36" that wasn't valid UTF-8, saying that this code might parse
> > differently under "use 5.38".  That would let users know in a
> > transitional version what might be a problem before it is.
>
> So to clarify, I have a very specific proposal:
>
> 1.  That a "use 5.36;" will behave the same with respect to the utf8 stuff
> as "use 5.34;", but that if the source file / input stream is not entirely
> valid UTF-8 under a strict interpretation, the Perl interpreter will issue
> a warning saying so and why it matters.
>
> 2.  That a "use 5.38;", if the source file / input stream is not entirely
> valid UTF-8 under a strict interpretation, the Perl interpreter will issue
> a fatal error / die saying so and why it matters, and that as a result the
> parsing has failed.
>
> So a key thing is that the UTF-8 mode triggered by 5.36/5.38 is strict,
> doesn't use substitution characters or delete characters, it either passes
> the input unchanged as valid UTF-8 or it complains.  If "use utf8;" already
> does this then it's the same, and otherwise it is stricter.
>
> Since this isn't spelled the same as "use utf8;" the new feature doesn't
> need to be identical in every way, we don't have to limit ourselves to that
> and the issues of silent corruption from substitution/deleting being the
> implicit operation, if that is what it used to do.
>
> On a further point, unlike a lot of the other "use" statements, I assume
> there is no good reason for a single file to be a mixture of literal
> encodings, and so having multiple "use encoding" statements in a file,
> either explicit or implied by a "use 5.38" etc, should be considered an
> error, and any occurrence of one would be expected to describe the entire
> file and not just the lexical scope it appears in, unlike
> strict/warnings/etc, it's not flipped on or off mid-file.
>

You seem to be interpreting the major problem here as "source code which is
not valid or intended as UTF-8". This is not a significant issue and its
failure mode is rather obvious. There isn't a further discussion to be had
there.

The subtle issue is that "use utf8" changes (valid UTF-8) non-ASCII literal
strings in the code to have different contents. Literal strings *must* be
used differently depending on whether "use utf8" was active where they were
written. Without "use utf8", it's a byte string; with "use utf8", it's a
character string.
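
A minimal sketch of that difference, assuming a source file saved as UTF-8
containing the literal "é". The variable names are only illustrative, and the
\x escapes stand in for the literal so that the example does not depend on how
this message itself is encoded:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# In a UTF-8 encoded file, "é" is stored as the two bytes 0xC3 0xA9.
# Without "use utf8", the literal is those two bytes (a byte string).
my $without_utf8 = "\xC3\xA9";

# Under "use utf8", the same literal is the single character U+00E9
# (a character string).
my $with_utf8 = "\x{E9}";

printf "without use utf8: length %d\n", length $without_utf8;   # 2
printf "with use utf8:    length %d\n", length $with_utf8;      # 1

# The byte string has to be decoded before character-level operations...
my $decoded = decode('UTF-8', $without_utf8);

# ...and the character string has to be encoded before byte-oriented output.
my $encoded = encode('UTF-8', $with_utf8);

print "decoded bytes match the character string\n" if $decoded eq $with_utf8;
print "encoded characters match the byte string\n" if $encoded eq $without_utf8;
```

Code written for one kind of literal (for example, code that decodes it) does
the wrong thing if it is handed the other kind, which is why the pragma cannot
simply be switched on for existing code.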

-Dan
