perl.perl5.porters | Postings from March 2022

Re: Karl’s auto-detect WAS Re: tightening up source code encoding semantics

Karl Williamson
March 1, 2022 16:20
On 2/28/22 12:17, Dan Book wrote:
> On Mon, Feb 28, 2022 at 1:59 PM Karl Williamson wrote:
>     On 2/27/22 17:51, Felipe Gasper wrote:
>      >
>      >> On Feb 27, 2022, at 14:06, Karl Williamson wrote:
>      >>
>      >> On 2/27/22 08:51, Felipe Gasper wrote:
>      >>>> On Feb 23, 2022, at 09:18, Karl Williamson wrote:
>      >>>>
>      >>>> An option to think about is that it's possible to pretty
>     reliably guess the encoding upon encountering the first line
>     containing non-ASCII. Pod::Simple does this successfully and the
>     choices are UTF-8 vs Windows CP1252, which is quite a bit harder to
>     distinguish from UTF-8 than our alternative, Latin1.  There have
>     been no reports of problems with its technique since I beefed it up
>     some years ago.
>      >>>>
>      >>>> The confusables for the Latin1 vs UTF-8 case all look like a
>     Latin1 letter or the multiplication sign or division sign, followed
>     by one or more Latin1 punctuation/symbols or C1 controls.  If you
>     look at their graphics, they all look like mojibake.  Hence I'm
>     confident, even without the Pod::Simple experience, that it is
>     extremely unlikely we would guess wrong.
>      >>>>
>      >>>> Here's how it could work.
>      >>>>
>      >>>> You wouldn't need an encoding declaration in your file unless
>      >>>> 1) the very unlikely case where we guessed wrong
>      >>>> 2) you want to forbid non-ASCII in your file, as the original
>     email thread discussed.
>      >>>>
>      >>>> Absent such a declaration, Perl would parse the file like it
>     does today.  When it encounters the first line containing a
>     non-ASCII, it would make its guess, and if the guess is UTF-8, raise
>     a warning, if enabled.
>      >>>>
>      >>>> 'no utf8' would be the way to say "Don't guess UTF-8".  It
>     would throw an error if we had already seen what we took as UTF-8.
>      >>>>
>      >>>> 'use ascii' (however it is spelled) would cause an error to be
>     thrown if a non-ASCII is encountered within its scope.
>      >>>>
>      >>>> 'no ascii' would be a no-op outside the scope of 'use ascii'. 
>     Otherwise it would restore the behavior to whatever it was when the
>     'use ascii' was encountered.
>      >>>>
>      >>>> I believe the only existing programs this scenario would
>     affect are ones that (most likely, unsafely) mix UTF-8 and Latin1.
>      >>>>
>      >>>> An advantage is that a 'use utf8' would no longer be required
>     in almost all circumstances.
>      >>> I’m assuming by this that you mean to preserve
>     <>’s auto-decode behaviour.
>      >>
>      >> I don't know what 'auto-decode' behavior means, but I suspect it
>     is orthogonal to my proposal.
>      >>
>      >> And I'll say that I never understand it when someone uses the
>     word 'decode'.  In just the context of characters, they are always
>     encoded as something or other.  So it is impossible to un-encode
>     them, which to me the term decode implies.  It is possible to change
>     the encoding from 'this' to 'that'.  But it makes no sense to say
>     something is decoded.
>      >
>      > This is a bit off-topic, but:  Would you argue, then, that all of
>     Perl’s standard library and C API interfaces that use this
>     terminology are improperly named?
>      >
>      > “Decode” makes sense (IMO) insofar as there are “blobs” (aka byte
>     strings) versus “text strings”, the former being “decoded” into the
>     latter. Perl, of course, “unhappily confuses” blobs with having
>     “decoded” bytes from Latin-1. Much good would arise from a clearer
>     distinction between these in Perl.
>      >
>      >>> I respectfully think this is just as bad as making
>     <> part of the feature bundle. It’ll aggravate, not
>     mitigate, the situation’s complexity.
>      >>> It’ll also invalidate Vadim’s program:
>      >>> perl -e'print "привет\n";'
>      >>> … but at least that will warn. This one:
>      >>> perl -e'print "¡Hola!\n";'
>      >>> … won’t even afford that convenience; it’ll just spit out
>     invalid UTF-8.
>      >>
>      >> Again I don't understand these, or how they are different, or
>     how it invalidates one, or what warning you are referring to.
>      >>
>      >> (For those of you who don't know these languages, привет is
>     used in Russian and similar languages and is pronounced something
>     like "preev yet".  Hola is Spanish and is pronounced "O la".  Both
>     words mean hello in their respective languages.)
>      >>
>      >> If you're referring to the warning I am proposing adding, I don't
>     see how the two examples differ.  Both are in UTF-8, and don't say
>     that, so both would warn.  If you meant for the ¡Hola! to be in
>     Latin1 (which it could be, unlike привет), it wouldn't be converted
>     to UTF-8, and hence wouldn't spit out UTF-8.
>      >
>      > The warning I referred to is the wide-character warning that
>     `print "привет\n"` will throw in Perl’s default configuration.
>      >
>      > Is the idea that the encoding detection logic you propose would
>     always trigger a warning? If so, that seems reasonable.
>      >
>      > -F
>     It would trigger a warning iff it interpreted the source as UTF-8 when
>     no declaration to that effect had been made.  So, in both your cases,
>     it would warn.  I don't understand how it "invalidates" things.  I'm
>     thinking the warning actually improves things from the current
>     situation.
>     If ¡Hola! were in Latin1, it would treat it as being in a single-byte
>     encoding, and print it as such.
> The warning is certainly important and useful, but it would also be a 
> behavior change - the bytes would now be interpreted as UTF-8 rather 
> than the default of ISO-8859-1, so the code would contain different 
> strings once parsed. This is what Felipe was referring to with the 
> one-liner examples, which represent a huge amount of existing code that 
> expects to take bytes and output those same bytes. So regardless of the 
> merits or benefits this would need to be opt-in; at which point (IMO) we 
> might as well make it simple and explicit with no possibility of 
> incorrect guesses.
> -Dan

Thanks, that is evidence of an unwanted behavior change from my proposal.
But the original proposal has the same issue.  To remind you:

    you must declare source encoding before any non-ASCII byte is encountered

These one-liners would all fail to compile under that proposal.
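
For concreteness, here is a rough sketch of the two behaviors discussed in this thread.  It is neither Pod::Simple's actual heuristic nor a proposed implementation: guess_line_encoding is a hypothetical helper that simply calls a line UTF-8 if its bytes are well-formed UTF-8 (as judged by utf8::decode, which is somewhat laxer than strict UTF-8) and Latin1 otherwise.  The second half illustrates the behavior change Dan describes: the same source bytes yield a different string once they are interpreted as UTF-8.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration, not the real
# Pod::Simple logic): classify one line of source bytes.
sub guess_line_encoding {
    my ($bytes) = @_;

    # Pure ASCII lines tell us nothing about the file's encoding.
    return 'ascii' unless $bytes =~ /[^\x00-\x7F]/;

    # utf8::decode works in place and returns false if the bytes are
    # not well-formed UTF-8, so try it on a copy.
    my $copy = $bytes;
    return utf8::decode($copy) ? 'utf8' : 'latin1';
}

# The same one-liner body, written as raw bytes in each encoding.
my $hola_utf8   = qq{print "\xC2\xA1Hola!\\n";};   # U+00A1 as UTF-8
my $hola_latin1 = qq{print "\xA1Hola!\\n";};       # U+00A1 as Latin1

print guess_line_encoding(q{print "hello\n";}), "\n";
print guess_line_encoding($hola_utf8),          "\n";
print guess_line_encoding($hola_latin1),        "\n";

# Behavior change: the UTF-8 bytes of "¡Hola!" are 7 characters when
# read as Latin1 (today's default) but 6 once decoded as UTF-8.
my $as_latin1 = "\xC2\xA1Hola!";
my $as_utf8   = $as_latin1;
utf8::decode($as_utf8);
printf "latin1 view: %d chars, utf8 view: %d chars\n",
    length($as_latin1), length($as_utf8);
```

A real detector would need to look harder at the confusable cases mentioned earlier in the thread; this sketch only checks well-formedness.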
