Re: [perl #95160] Unicode readdir bugs
From:
Brian Fraser
Date:
October 23, 2011 18:26
Subject:
Re: [perl #95160] Unicode readdir bugs
Message ID:
CA+nL+nau=KFKzfdOg0Tb8cChm03kHhZaAX4urk8Y17o66n2heg@mail.gmail.com
On Sun, Oct 23, 2011 at 7:23 PM, Father Chrysostomos via RT <
perlbug-followup@perl.org> wrote:
> On Tue Oct 04 09:02:13 2011, aristotle wrote:
> > * Eric Brine <ikegami@adaelis.com> [2011-09-19 03:20]:
> > > File names are meant to be read as text, so one can't really claim
> > > they're just octet sequences. So the real question is what should we
> > > do when readdir encounters a file name that doesn't cleanly decode
> > > using the encoding it's expected to be encoded with
>
> If that happens, then it’s not really text, is it?
>
> > > (e.g. a file name
> > > that's not valid UTF-8 on a box with a UTF-8 locale).
>
> No, no, please don’t start using the locale to determine what the file
> names are. That would mean that a change to an environment variable
> would cause configuration files to start referring to other
> ‘nonexistent’ files (which exist when the locale is set correctly). We
> should *only* support Unicode file names when the file system itself has
> encoding information.
>
> Mac OS X, for instance, stores the encoding in the file system (so each
> volume could theoretically use a different encoding), but the low-level
> drivers that read the volume translate everything to UTF-8. If you try
> to create a file whose name is an invalid UTF-8 sequence, you get an
> ‘Invalid argument’ error.
>
> On the other hand, if we keep things completely consistent on a given
> platform (treat Linux as UTF-8, for instance, regardless of any
> environment settings), then we could follow Aristotle’s suggestion below
> for platforms that do not have an inherent file name encoding system.
>
> Also, nobody has answered my question: What do we call the pragma?
> unicode::filenames? I suppose we need to make a list first of which
> functions will be affected, so here goes:
>
> dbmopen -X chdir chmod chown chroot fcntl glob link lstat mkdir open
> opendir readlink rename rmdir stat symlink sysopen umask unlink utime do
> require use
>
> Those are all file name functions.
>
> But what about user and group names?
>
> exec, system, syscall, readpipe, bind, connect, getsockopt, shmwrite and
> the various network functions (e.g., getservbyname) should produce ‘Wide
> character’ warnings. (Someone who understands non-ASCII domain names
> should speak up now.)
>
>
(Reading the Python thread is still on my TODO list, so I'm not commenting
on that yet)
There are a couple of things being grouped as one here. Ignoring
require/use/do for a moment, most of those functions already have bug
reports on them because, let me quote tchrist here,
> *Who* told Perl it was ok to let me blithely use wide characters in
> creat but then forbad me from using them in readdir? That's stupid.
> Perl should forbid unencoded wide characters in syscalls. It already
> does in syswrite.
So, first thing: be like syswrite. -All- syscalls, except for
say/print/printf/warn/die, which already have exceptions, should croak if
passed non-downgradeable scalars. This needn't be a backwards-incompatible
nightmare -- save for exec and system, Classic::Perl could override them to
do something like
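    require Encode;
    *CORE::GLOBAL::rename = sub ($$) {
        my @n = @_; Encode::_utf8_off($_) for @n;   # clear the UTF8 flag on copies of both names
        return CORE::rename($n[0], $n[1]);          # then call the built-in rename
    };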
And there you go. You get Perl's previous ultralax behavior.
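For contrast, the strict behaviour proposed above could be sketched roughly as follows. This is only an illustration, not the patch mentioned later in this message: bytes_or_croak and strict_rename are hypothetical names, and the only assumption about Perl itself is that utf8::downgrade with a true second argument returns false instead of dying when a string holds characters above 0xFF.

```perl
use strict;
use warnings;
use Carp qw(croak);

# Hypothetical helper (not part of Perl or of any existing module):
# return a byte-string copy of $name, or croak if it cannot be downgraded.
sub bytes_or_croak {
    my ($name, $func) = @_;
    my $copy = $name;                  # leave the caller's scalar untouched
    utf8::downgrade($copy, 1)          # true 2nd argument: return false on failure
        or croak "Wide character in $func";
    return $copy;
}

# Hypothetical strict wrapper around the built-in rename.
sub strict_rename {
    my ($old, $new) = @_;
    return CORE::rename(bytes_or_croak($old, 'rename'),
                        bytes_or_croak($new, 'rename'));
}
```

A name such as "caf\x{e9}" passes through unchanged, while "\x{30cb}" croaks with "Wide character in rename", which is the same kind of failure syswrite already produces.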
Second, there should be a way to avoid doing an encode/decode on every
syscall. Since I haven't read the Python thread yet I can't say much on
this, but for a while I've had an open-like pragma for this in mind, e.g.
    use syscalls IN => ":encoding(...)", OUT => ":encoding(...)";
or
    use syscalls ':dir' => { IN => ":encoding(...)", OUT => ":encoding(...)" };
Or somesuch, which won't solve problems in, say, Windows, but hopefully it
won't make them any worse. Then you could implement unicode::filenames as a
wrapper around that, and if you want to grab that layer from a locale
setting, that's entirely up to you (just don't ask me to debug it later).
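To make the idea a bit more concrete, here is a rough sketch of what such a pragma would have to do under the hood for directory reads: encode names on the way to the system and decode them on the way back. The helper names (encode_name, decode_name, readdir_decoded) are hypothetical, the encoding is always passed in explicitly rather than guessed, and how this would map onto the IN and OUT arguments above is left open, since that isn't pinned down yet.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helpers; the caller supplies the encoding name.
# FB_CROAK makes malformed or unmappable input fatal; LEAVE_SRC keeps
# Encode from modifying the argument in place.
sub encode_name {
    my ($enc, $name) = @_;
    return encode($enc, $name, Encode::FB_CROAK() | Encode::LEAVE_SRC());
}

sub decode_name {
    my ($enc, $bytes) = @_;
    return decode($enc, $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
}

# Read a directory and return its entries as character strings.
sub readdir_decoded {
    my ($enc, $dir) = @_;
    opendir(my $dh, encode_name($enc, $dir)) or die "opendir failed: $!";
    my @names = map { decode_name($enc, $_) } readdir $dh;
    closedir $dh;
    return @names;
}
```

With FB_CROAK, a file name that does not decode cleanly is a hard error rather than silently mangled text; whether that is the right default is exactly the question raised earlier in the thread.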
Third, require/use/do. I recall Python having some problems with this (if
the thread that I've neglected reading touches on this, I apologize) -- and
actually, I don't know of any language that supports it without issues,
though pointers are of course welcome.
Zefram had a great idea for this a while ago -- If a module has Unicode in
its path, it should get an alias, reachable through some escaping scheme or
another. So if I had a module Eeyup::\x{30cb}::Bothersome, Bothersome.pm
would be reachable through Eeyup/\x{30cb}/, and, failing that,
unialias/Eeyup/130cb/
Here's the nicest thing -- I implemented the first point and a prototype of
the second in a couple of hours, so it's certainly doable, though I haven't
touched that in a while because I can't figure out a way to test the second
portably.