Front page | perl.perl5.porters |
Postings from March 2022
Re: Karl’s auto-detect WAS Re: tightening up source code encoding semantics
From:
Karl Williamson
Date:
March 1, 2022 16:20
Subject:
Re: Karl’s auto-detect WAS Re: tightening up source code encoding semantics
Message ID:
3794a608-977c-16a4-9104-bb37392952eb@khwilliamson.com
On 2/28/22 12:17, Dan Book wrote:
> On Mon, Feb 28, 2022 at 1:59 PM Karl Williamson <public@khwilliamson.com> wrote:
>
> On 2/27/22 17:51, Felipe Gasper wrote:
> >
> >> On Feb 27, 2022, at 14:06, Karl Williamson <public@khwilliamson.com> wrote:
> >>
> >> On 2/27/22 08:51, Felipe Gasper wrote:
> >>>> On Feb 23, 2022, at 09:18, Karl Williamson <public@khwilliamson.com> wrote:
> >>>>
> >>>> An option to think about is that it's possible to pretty
> reliably guess the encoding upon encountering the first line
> containing non-ASCII. Pod::Simple does this successfully and the
> choices are UTF-8 vs Windows CP1252, which is quite a bit harder to
> distinguish from UTF-8 than our alternative, Latin1. There have
> been no reports of problems with its technique since I beefed it up
> some years ago.
> >>>>
> >>>> The confusables for the Latin1 vs UTF-8 case all look like a
> Latin1 letter or the multiplication sign or division sign, followed
> by one or more Latin1 punctuation/symbols or C1 controls. If you
> look at their graphics, they all look like mojibake. Hence I'm
> confident, even without the Pod::Simple experience, that it is
> extremely unlikely we would guess wrong.
> >>>>
> >>>> Here's how it could work.
> >>>>
> >>>> You wouldn't need an encoding declaration in your file unless
> >>>> 1) the very unlikely case where we guessed wrong
> >>>> 2) you want to forbid non-ASCII in your file, as the original
> email thread discussed.
> >>>>
> >>>> Absent such a declaration, Perl would parse the file like it
> does today. When it encounters the first line containing a
> non-ASCII, it would make its guess, and if the guess is UTF-8, raise
> a warning, if enabled.
> >>>>
> >>>> 'no utf8' would be the way to say "Don't guess UTF-8". It
> would throw an error if we had already seen what we took as UTF-8.
> >>>>
> >>>> 'use ascii' (however it is spelled) would cause an error to be
> thrown if a non-ASCII is encountered within its scope.
> >>>>
> >>>> 'no ascii' would be a no-op outside the scope of 'use ascii'.
> Otherwise it would restore the behavior to whatever it was when the
> 'use ascii' was encountered.
> >>>>
> >>>> I believe the only existing programs this scenario would
> affect are ones that (most likely, unsafely) mix UTF-8 and Latin1.
> >>>>
> >>>> An advantage is that a 'use utf8' would no longer be required
> in almost all circumstances.
> >>> I’m assuming by this that you mean to preserve utf8.pm’s
> auto-decode behaviour.
> >>
> >> I don't know what 'auto-decode' behavior means, but I suspect it
> is orthogonal to my proposal.
> >>
> >> And I'll say, that I never understand it when someone uses the
> word 'decode'. In just the context of characters, they are always
> encoded as something or other. So it is impossible to un-encode
> them, which to me the term decode implies. It is possible to change
> the encoding from 'this' to 'that'. But it makes no sense to say
> something is decoded.
> >
> > This is a bit off-topic, but: Would you argue, then, that all of
> Perl’s standard library and C API interfaces that use this
> terminology are improperly named?
> >
> > “Decode” makes sense (IMO) insofar as there are “blobs” (aka byte
> strings) versus “text strings”, the former being “decoded” into the
> latter. Perl, of course, “unhappily confuses” blobs with having
> “decoded” bytes from Latin-1. Much good would arise from a clearer
> distinction between these in Perl.
> >
> >>> I respectfully think this is just as bad as making utf8.pm
> part of the feature bundle. It’ll aggravate, not
> mitigate, the situation’s complexity.
> >>> It’ll also invalidate Vadim’s program:
> >>> perl -e'print "привет\n";'
> >>> … but at least that will warn. This one:
> >>> perl -e'print "¡Hola!\n";'
> >>> … won’t even afford that convenience; it’ll just spit out
> invalid UTF-8.
> >>
> >> Again I don't understand these, or how they are different, or
> how it invalidates one, or what warning you are referring to.
> >>
> >> (For those of you who don't know these languages, привет is
> Russian and similar languages and is pronounced something like preev
> yet. Hola is Spanish and pronounced O la. Both words mean hello in
> their respective languages)
> >>
> >> If you're referring to the warning I am proposing adding, I don't
> see how the two examples differ. Both are in UTF-8, and don't say
> that, so both would warn. If you meant for the ¡Hola! to be in
> Latin1 (which it could be, unlike привет), it wouldn't be converted
> to UTF-8, and hence wouldn't spit out UTF-8.
> >
> > The warning I referred to is the wide-character warning that
> `print "привет\n"` will throw in Perl’s default configuration.
> >
> > Is the idea that the encoding detection logic you propose would
> always trigger a warning? If so, that seems reasonable.
> >
> > -F
>
> It would trigger a warning iff it interpreted the source as UTF-8 when
> no declaration to that effect had been made. So, in both your cases, it
> would warn. I don't understand how it "invalidates" things. I'm
> thinking the warning actually improves things from the current
> situation.
>
> If ¡Hola! were in Latin1, it would treat it as being a single byte
> encoding, and print it as such.
>
>
> The warning is certainly important and useful, but it would also be a
> behavior change - the bytes would now be interpreted as UTF-8 rather
> than the default of ISO-8859-1, so the code would contain different
> strings once parsed. This is what Felipe was referring to with the
> one-liner examples, which represent a huge amount of existing code that
> expects to take bytes and output those same bytes. So regardless of the
> merits or benefits this would need to be opt-in; at which point (IMO) we
> might as well make it simple and explicit with no possibility of
> incorrect guesses.
>
> -Dan
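For anyone following along, the guess being discussed, and the kind of input where it could go wrong, can be sketched roughly as below. This is an illustration only, in Python rather than Perl so that the byte/text split is explicit. It assumes the simplest possible rule (a line with non-ASCII bytes is taken as UTF-8 if and only if it is strictly well-formed UTF-8); it is not Pod::Simple's actual algorithm, and the names `guess_line_encoding` and `first_guess` are hypothetical.

```python
def guess_line_encoding(line: bytes) -> str:
    """Guess 'ascii', 'utf-8', or 'latin-1' for one line of source bytes.

    Assumed rule (not taken from any real implementation): a line that
    contains non-ASCII bytes is UTF-8 if it decodes strictly, else Latin-1.
    """
    if all(b < 0x80 for b in line):
        return "ascii"
    try:
        line.decode("utf-8")  # strict decoding rejects malformed sequences
    except UnicodeDecodeError:
        return "latin-1"
    return "utf-8"


def first_guess(source: bytes) -> str:
    """Return the guess made at the first line that contains non-ASCII."""
    for line in source.splitlines(keepends=True):
        guess = guess_line_encoding(line)
        if guess != "ascii":
            return guess
    return "ascii"


# UTF-8 source: the lead byte 0xC2 is followed by a continuation byte.
utf8_hola = "¡Hola!".encode("utf-8")      # b'\xc2\xa1Hola!'
# Latin-1 source: a lone 0xA1 is never well-formed UTF-8.
latin1_hola = "¡Hola!".encode("latin-1")  # b'\xa1Hola!'
# A confusable: a Latin-1 letter followed by a Latin-1 symbol happens to
# form a well-formed UTF-8 sequence, so the guess would be wrong if the
# author really meant these two characters.
confusable = "Ã©".encode("latin-1")       # b'\xc3\xa9'

print(guess_line_encoding(utf8_hola))
print(guess_line_encoding(latin1_hola))
print(guess_line_encoding(confusable))
```

The third case is the mojibake-looking pattern described upthread: text that reads as "Ã©" in Latin-1 is byte-for-byte the UTF-8 encoding of "é", so a validity-based guess picks UTF-8.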
Thanks, that is evidence of an unwanted behavior change from my proposal.
But the original proposal has the same issue. To remind you, that proposal
was: you must declare source encoding before any non-ASCII byte is encountered.
These one-liners would all fail to compile under that proposal.
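To make the behavior change for these one-liners concrete, here is a rough sketch of what the same source bytes become under each reading. It is written in Python as a stand-in, not as a model of perl's internals: decoding as "latin-1" plays the role of today's default, decoding as "utf-8" plays the role of the guessed interpretation, and encoding back to "latin-1" approximates printing to a handle with no encoding layer. The function name `reinterpret` is hypothetical.

```python
def reinterpret(source_bytes: bytes) -> dict:
    """Compare the Latin-1 and UTF-8 readings of the same source bytes."""
    as_latin1 = source_bytes.decode("latin-1")  # one character per byte
    as_utf8 = source_bytes.decode("utf-8")      # multi-byte sequences collapse

    result = {
        "latin1_chars": len(as_latin1),
        "utf8_chars": len(as_utf8),
        # Read as Latin-1 and written back one byte per character:
        # the output is identical to the input bytes.
        "latin1_output": as_latin1.encode("latin-1"),
    }
    try:
        # If every character fits in one byte, a handle with no encoding
        # layer would emit exactly one byte per character.
        result["utf8_output"] = as_utf8.encode("latin-1")
    except UnicodeEncodeError:
        # Characters above U+00FF do not fit in a byte; this corresponds
        # to the case where the wide-character warning is raised.
        result["utf8_output"] = None
    return result


hola = "¡Hola!\n".encode("utf-8")
privet = "привет\n".encode("utf-8")

print(reinterpret(hola))
print(reinterpret(privet))
```

For the ¡Hola! bytes, the Latin-1 reading round-trips unchanged, while the UTF-8 reading comes back out as a lone 0xA1 followed by `Hola!`, which is not well-formed UTF-8. For привет, the UTF-8 reading cannot be written one byte per character at all, which is the wide-character case mentioned upthread.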