[perl #95160] Unicode readdir bugs

From: Father Chrysostomos via RT
Date: October 23, 2011 14:24
Subject: [perl #95160] Unicode readdir bugs
Message ID: rt-3.6.HEAD-31297-1319405032-1973.95160-15-0@perl.org

On Tue Oct 04 09:02:13 2011, aristotle wrote:
> * Eric Brine <ikegami@adaelis.com> [2011-09-19 03:20]:
> > File names are meant to be read as text, so one can't really claim
> > they're just octet sequences. So the real question is what should we
> > do when readdir encounters a file name that doesn't cleanly decode
> > using the encoding it's expected to be encoded with
 
If that happens, then it’s not really text, is it?

> > (e.g. a file name
> > that's not valid UTF-8 on a box with a UTF-8 locale).

No, no, please don’t start using the locale to determine what the file
names are.  That would mean that a change to an environment variable
would cause configuration files to start referring to other
‘nonexistent’ files (which exist when the locale is set correctly).  We
should *only* support Unicode file names when the file system itself has
encoding information.
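
To illustrate the concern, here is a minimal sketch (in Python, purely for demonstration; the file name `café.conf` and the two encodings are assumptions, not anything from this ticket) of how the same bytes on disk turn into different names, and how a configured name stops matching, when only the locale encoding changes:

```python
# Bytes as they sit in a directory entry (assumed to have been written as UTF-8).
on_disk = "café.conf".encode("utf-8")

# The same bytes interpreted under two different locale encodings.
seen_as_utf8 = on_disk.decode("utf-8")
seen_as_latin1 = on_disk.decode("latin-1")

# A name stored as text in a configuration file, encoded for lookup
# under each locale encoding.
configured = "café.conf"
lookup_utf8 = configured.encode("utf-8")
lookup_latin1 = configured.encode("latin-1")

print(seen_as_utf8 == seen_as_latin1)  # the decoded names differ
print(lookup_utf8 == on_disk)          # matches only under the encoding used at creation
print(lookup_latin1 == on_disk)        # under the other locale the file looks 'nonexistent'
```

Nothing on disk changed between the two lookups; only the assumed encoding did.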

Mac OS X, for instance, stores the encoding in the file system (so each
volume could theoretically use a different encoding), but the low-level
drivers that read the volume translate everything to UTF-8.  If you try
to create a file whose name is an invalid UTF-8 sequence, you get an
‘Invalid argument’ error.

On the other hand, if we keep things completely consistent on a given
platform (treat Linux as UTF-8, for instance, regardless of any
environment settings), then we could follow Aristotle’s suggestion below
for platforms that do not have an inherent file name encoding system.

Also, nobody has answered my question:  What do we call the pragma?
unicode::filenames? I suppose we need to make a list first of which
functions will be affected, so here goes:

dbmopen, -X, chdir, chmod, chown, chroot, fcntl, glob, link, lstat,
mkdir, open, opendir, readlink, rename, rmdir, stat, symlink, sysopen,
umask, unlink, utime, do, require, use

Those are all file name functions.

But what about user and group names?

exec, system, syscall, readpipe, bind, connect, getsockopt, shmwrite and
the various network functions (e.g., getservbyname) should produce ‘Wide
character’ warnings.  (Someone who understands non-ASCII domain names
should speak up now.)

> One could take a page from Python here and use its surrogate escape
> error handling. There was a subthread about it a while ago:
> http://www.nntp.perl.org/group/perl.perl5.porters/;msgid=A8767ACF-E6A0-498A-B402-54A12D26523B@activestate.com
> 
> What this approach effectively does is allow strings to unambiguously
> represent a mixture of bytes and characters, which in a roundabout way
> essentially solves the problem that Perl only has a single string
> type.
> But do note the later message about the security implications. It will
> take some thought to get this clean, but there is a lot of potential in
> it.
> 
> I love the idea and it is one of my todos to add this to Encode should
> no one else get there first. The core could then use this method to
> provide clean and nice interfaces to any OS APIs which are textual in
> intent but binary in practice – as Python does.
> 
> It would be a major step forward for Perl.
> 
> Regards,
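
For reference, a rough sketch of the Python behaviour the quoted text refers to, using Python's built-in `surrogateescape` error handler. The file name bytes are made up for illustration, and this shows only how Python behaves, not how Perl or Encode would implement it:

```python
# A file name that is not valid UTF-8 (0xFF can never appear in UTF-8).
raw = b"report-\xff.txt"

# Decoding with surrogateescape maps each undecodable byte 0xNN to the
# lone surrogate U+DCNN instead of raising an error.
name = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler turns the surrogate back into the
# original byte, so the round trip is lossless.
round_tripped = name.encode("utf-8", errors="surrogateescape")

# A strict encode refuses the lone surrogate, which is how such a string
# is kept from leaking out as if it were ordinary text.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(repr(name), round_tripped == raw, strict_ok)
```

The escaped string can be passed around as text and still be handed back to the OS byte-for-byte, which is the 'mixture of bytes and characters' described in the quote.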



