perl.perl5.porters | Postings from May 2018

[perl #133183] RFE: use heuristic for utf8 usage w/-Mutf8 in PERL5OPT

From: Linda Walsh
Date: May 7, 2018 22:48
Subject: [perl #133183] RFE: use heuristic for utf8 usage w/-Mutf8 in PERL5OPT
Message ID: rt-4.0.24-4357-1525733308-90.133183-75-0@perl.org
# New Ticket Created by  Linda Walsh 
# Please include the string:  [perl #133183]
# in the subject line of all future correspondence about this issue. 
# <URL: https://rt.perl.org/Ticket/Display.html?id=133183 >



This is a bug report for perl from perl-diddler@tlinx.org,
generated with the help of perlbug 1.39 running under perl 5.16.3.


-----------------------------------------------------------------
[Please describe your issue here]

For some time I had odd output in one of my programs
where I tried to use a right-pointing double angle quotation mark,
U+00BB (»).  It always came out as "Â»".  I had "use utf8;" in
my source, even had "use utf8::all;" in some, but most of all
I thought I was safe with "-Mutf8 -CSA" in PERL5OPT.
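
A likely mechanism for this kind of output can be sketched as follows
(in Python rather than Perl, purely as an illustration; the function
name is hypothetical).  The assumption is that "use utf8" is lexically
scoped, so a module without its own declaration has the two source
bytes 0xc2 0xbb read as two Latin-1 characters, and the UTF-8 output
layer requested by -CS then encodes each of them again.

```python
def double_encode(source_bytes: bytes) -> bytes:
    """Read UTF-8 bytes as Latin-1 characters, then encode them as UTF-8."""
    # Step 1: with no UTF-8 source declaration, each byte becomes one character.
    chars = source_bytes.decode("latin-1")
    # Step 2: a UTF-8 output layer writes every character above 0x7f as two bytes.
    return chars.encode("utf-8")


original = "\u00bb".encode("utf-8")   # the two bytes c2 bb
mangled = double_encode(original)     # four bytes: c3 82 c2 bb
print(original.hex(" "), "->", mangled.hex(" "))
```

On a UTF-8 terminal the four resulting bytes display as two
characters, U+00C2 followed by U+00BB.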

Once I'd finished development on an older module, I simply
used it.  If I ran the module as a program under the debugger,
it seemed to work.  The problem was that I simply wanted
perl to assume that modern sources should be treated as
utf8, or at worst to output the same bytes it was given as input.
bash does this:

> a="»"
> printf "%s\n" "$a"
»
> printf "%s\n" "$a"|hexdump -C
00000000  c2 bb 0a
---

C does this:

#include <stdio.h>

int main(void) {
	/* "»" is two bytes in UTF-8 (0xc2 0xbb) plus the terminating NUL */
	char arr[3] = "»";
	printf("%s\n", arr);
	return 0;
}

> gcc ar.c -o ar
> ar
»

I can't think of any other language that forces bytes in
the range 0x80-0xff in source or input into a different
encoding on output.

*Ideally*, perl wouldn't either.  However, some would complain
of compatibility problems (though, as far as I'm aware, doing
this hasn't been the end of the world for bash or C).

BUT, at the very least... a compromise heuristic could
be used.  A first level heuristic would be:

1) if 0xc2 or 0xc3, followed by a continuation byte in the range
0x80-0xbf, occurs in the source, presume it is utf8 encoded.
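
A minimal sketch of this first-level check, written in Python only to
show the byte logic (the names are hypothetical, and this is not how
perl's tokenizer is structured).  It assumes the second byte is limited
to 0x80-0xbf, since that is the only range a valid UTF-8 continuation
byte can occupy; 0xc2 or 0xc3 plus one such byte covers exactly
U+0080 through U+00FF.

```python
import re

# 0xc2 or 0xc3 followed by one continuation byte (0x80-0xbf) is how UTF-8
# encodes the code points U+0080 through U+00FF.
TWO_BYTE_LATIN1 = re.compile(rb"[\xc2\xc3][\x80-\xbf]")


def looks_like_utf8(source: bytes) -> bool:
    """Return True if the source contains at least one such byte pair."""
    return TWO_BYTE_LATIN1.search(source) is not None
```

A source file that really is Latin-1 and happens to contain "Â"
followed by a character in that range would also match, which is the
kind of incompatibility the next step tries to limit.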

For some though, that would still let too much incompat slip
through.

To that I say, add: 

2) if the ENV var PERL5OPT has -Mutf8 in it AND rule 1 matches,
then assume the source is utf8.  It might not be 100%
compatible, BUT it lets the local user set a presumption
for their system.  If they run into a module that
doesn't work, they can work around it.  Alternatively,
have perl read a site config file (I think it can be
configured to use one in /etc/?) where they can set a
flag to specify it.
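
The combined rule could look roughly like this.  Again this is a Python
sketch with hypothetical names, assuming PERL5OPT is split on
whitespace and that the byte-pair test uses the valid continuation
range 0x80-0xbf.

```python
import re

# Byte-pair test from the first-level heuristic.
TWO_BYTE_LATIN1 = re.compile(rb"[\xc2\xc3][\x80-\xbf]")


def perl5opt_has_mutf8(perl5opt: str) -> bool:
    """True if -Mutf8 appears as a whitespace-separated switch."""
    return "-Mutf8" in perl5opt.split()


def presume_utf8(source: bytes, perl5opt: str) -> bool:
    """The user opted in via PERL5OPT and the byte-pair test matches."""
    return (perl5opt_has_mutf8(perl5opt)
            and TWO_BYTE_LATIN1.search(source) is not None)
```

With the PERL5OPT value from this report, "-Mutf8 -CSA
-I/home/law/bin/lib", a module containing the bytes c2 bb would be
presumed utf8, while the same module under an empty PERL5OPT would
not.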

If more safety were wanted, as an add-on step to 1 or 2:

2) or 3) put out a one-time warning, on a per-module basis, naming
the first byte combination that triggers the utf8 presumption.
That way, the user could either silence the warning or simply
add 'use utf8' to the beginning of that module (the
latter being more logical).
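
The once-per-module warning might be tracked like this.  This is a
Python sketch with hypothetical names and a made-up message format; no
such warning exists in perl today.

```python
import re

# Byte-pair test: 0xc2/0xc3 followed by a UTF-8 continuation byte.
TWO_BYTE_LATIN1 = re.compile(rb"[\xc2\xc3][\x80-\xbf]")
_already_warned = set()


def utf8_presumption_warning(module_name: str, source: bytes):
    """Return a warning the first time a module triggers the rule, else None."""
    match = TWO_BYTE_LATIN1.search(source)
    if match is None or module_name in _already_warned:
        return None
    _already_warned.add(module_name)
    pair = match.group().hex(" ")
    return (f"{module_name}: bytes {pair} at offset {match.start()} "
            "look like UTF-8; presuming 'use utf8'")
```

Only the first matching byte pair in each module is reported, so a
module full of non-ASCII literals still produces a single line.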

-----------------

Tangential, but related: if a config file is
used, it should also be possible to specify that stdin/out/err
default to the locale's encoding, the assumption being that
streamed I/O is not how one would normally access binary
data.  The idea is to have perl be [mostly] binary clean
with regard to streamed input & output.  (I realize some want
to flag errors on invalid utf8.  That's not my first choice, but
I don't see a problem with it in streamed I/O, as the
assumption is that one wouldn't use a variable-length encoding
for storing binary data.)
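
As a rough illustration of "defaulting to the locale", the codeset for
the standard streams could be derived from the environment along these
lines.  This is a Python sketch with a hypothetical function name; it
assumes the usual POSIX precedence of LC_ALL, then LC_CTYPE, then LANG,
and treats a locale with no codeset (such as "C") as meaning raw bytes.

```python
def locale_codeset(env: dict):
    """Return the codeset named by the locale variables, or None if none is named."""
    # POSIX precedence for the character-type category.
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:  # unset and empty both mean "not specified"
            break
    else:
        return None
    # Locale names look like language[_territory][.codeset][@modifier].
    value = value.split("@", 1)[0]
    if "." not in value:
        return None  # e.g. "C" or "POSIX": leave the stream as bytes
    return value.split(".", 1)[1]
```

For the environment listed in this report (LC_CTYPE=en_US.UTF-8 with
LANG unset), the function returns "UTF-8".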

This might assist in putting the infamous perl utf8 bug
to rest (at least for the most part).  It also introduces
the idea of trying to give or do what the user wants based on
increasing levels of evidence.  Admittedly an imperfect
science, but better than using rigid standards when it comes
to humans.  Perl should "just be smarter".

This isn't version related as it happens under perl 5.24.0
as well as 5.16.3.


[Please do not change anything below this line]
-----------------------------------------------------------------
---
Flags:
    category=core
    severity=wishlist
---
Site configuration information for perl 5.16.3:

Configured by law at Wed Jan 22 12:58:58 PST 2014.

Summary of my perl5 (revision 5 version 16 subversion 3) configuration:
   
  Platform:
    osname=linux, osvers=3.12.0-isht-van, archname=x86_64-linux-thread-multi-ld
    uname='linux ishtar 3.12.0-isht-van #1 smp preempt wed nov 13 16:50:51 pst 2013 x86_64 x86_64 x86_64 gnulinux '
    config_args=''
    hint=previous, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=define, uselongdouble=define
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -fstack-protector -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-g -O2',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -fstack-protector -D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -fstack-protector -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -fstack-protector -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
    ccversion='', gccversion='4.8.1 20130909 [gcc-4_8-branch revision 202388]', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='long double', nvsize=16, Off_t='off_t', lseeksize=8
    alignbytes=16, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags ='-g -fstack-protector -fPIC'
    libpth=/usr/lib64 /lib64
    libs=-lnsl -lndbm -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc -lgdbm_compat
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
    libc=/lib/libc-2.18.so, so=so, useshrplib=true, libperl=libperl-5.16.3.so
    gnulibc_version='2.18'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E -Wl,-rpath,/home/perl/perl-5.16.3/lib/x86_64-linux-thread-multi-ld/CORE'
    cccdlflags='-fPIC', lddlflags='-shared -g -O2 -fstack-protector -fPIC'

Locally applied patches:
    

---
@INC for perl 5.16.3:
    /home/law/bin/lib
    /home/perl/perl-5.16.3/lib/site/x86_64-linux-thread-multi-ld
    /home/perl/perl-5.16.3/lib/site
    /home/perl/perl-5.16.3/lib/x86_64-linux-thread-multi-ld
    /home/perl/perl-5.16.3/lib
    .

---
Environment for perl 5.16.3:
    HOME=/home/law
    LANG (unset)
    LANGUAGE (unset)
    LC_COLLATE=C
    LC_CTYPE=en_US.UTF-8
    LC_MESSAGES=C
    LC_MONETARY=C
    LC_NUMERIC=C
    LC_TIME=C
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/perl/perl-5.24/usr/bin:.:/sbin:/home/law/bin/lib:/home/law/bin:/usr/local/bin:/usr/bin:/bin:/opt/kde3/bin:/usr/sbin:/etc/local/func_lib:/home/law/lib
    PERL5OPT=-Mutf8 -CSA -I/home/law/bin/lib
    PERL_BADLANG (unset)
    SHELL=/bin/bash



