On Sat, Jul 31, 2021 at 3:33 PM Darren Duncan <darren@darrenduncan.net> wrote:

> On 2021-07-31 12:17 p.m., Darren Duncan wrote:
> > Now conversely, I don't have a problem with actually waiting until v5.38 to
> > fully implement the change IF 5.36 contained some kind of precursor to prepare
> > the way, such as that 5.36 would issue warnings for code with a "use 5.36" that
> > wasn't valid UTF-8, saying that this code might parse differently under "use
> > 5.38". That would let users know in a transitional version what might be a
> > problem before it is.
>
> So to clarify, I have a very specific proposal:
>
> 1. That a "use 5.36;" will behave the same with respect to the utf8 stuff as
> "use 5.34;", but that if the source file / input stream is not entirely valid
> UTF-8 under a strict interpretation, the Perl interpreter will issue a warning
> saying so and why it matters.
>
> 2. That a "use 5.38;", if the source file / input stream is not entirely valid
> UTF-8 under a strict interpretation, the Perl interpreter will issue a fatal
> error / die saying so and why it matters, and that as a result the parsing has
> failed.
>
> So a key thing is that the UTF-8 mode triggered by 5.36/5.38 is strict, doesn't
> use substitution characters or delete characters, it either passes the input
> unchanged as valid UTF-8 or it complains. If "use utf8;" already does this then
> it's the same, and otherwise it is stricter.
>
> Since this isn't spelled the same as "use utf8;" the new feature doesn't need to
> be identical in every way, we don't have to limit ourselves to that and the
> issues of silent corruption from substitution/deleting being the implicit
> operation, if that is what it used to do.
>
> On a further point, unlike a lot of the other "use" statements, I assume there
> is no good reason for a single file to be a mixture of literal encodings, and so
> having multiple "use encoding" statements in a file, either explicit or implied
> by a "use 5.38" etc, should be considered an error, and any occurrence of one
> would be expected to describe the entire file and not just the lexical scope it
> appears in, unlike strict/warnings/etc, it's not flipped on or off mid-file.

You seem to be interpreting the major problem here as "source code which is
not valid or intended as UTF-8". This is not a significant issue and its
failure mode is rather obvious. There isn't a further discussion to be had
there.

The subtle issue is that "use utf8" changes (valid UTF-8) non-ASCII literal
strings in the code to have different contents. Literal strings *must* be used
differently depending on whether "use utf8" was active where they were written.
Without "use utf8", it's a byte string; with "use utf8", it's a character
string.

-Dan
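
The difference between the two kinds of literal can be seen in a minimal sketch (an illustration, not code from the thread). It assumes the file is saved as UTF-8 and uses the euro sign, which is stored as three bytes. The variable names are hypothetical.

```perl
use strict;
use warnings;

# Assumes this file is saved as UTF-8. The euro sign is U+20AC,
# stored in the file as the three bytes E2 82 AC.

# Without "use utf8" the literal holds the raw source bytes.
my $bytes = do { no utf8; "€" };

# With "use utf8" the same source bytes become one character.
my $chars = do { use utf8; "€" };

printf "without use utf8: length %d\n", length $bytes;   # 3
printf "with use utf8:    length %d\n", length $chars;   # 1

# The byte string must be decoded before it compares equal to the
# character string. utf8::decode is a core function that works
# without loading the pragma.
my $decoded = $bytes;
utf8::decode($decoded);
print "not equal before decoding\n" if $bytes ne $chars;
print "equal after decoding\n"      if $decoded eq $chars;
```

Because the pragma is lexically scoped, both forms can appear in one file, and a string written under one setting cannot simply be handled like a string written under the other.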