
Re: on broken manpages, trolling, inconsistent implementation and the difficulty to fix bugs

Ben Morrow
May 20, 2008 03:03

Quoth (Tom Christiansen):
> But I still think that you are asking a lot if you want to make the
> claim that filenames as used to access the system's underlying files 
> VIA ITS OWN INTERFACES are data rather than metadata.  And I don't think
> that filesystem metadata is reliably treated as anything but bytes, at
> least on systems with which I am conversant.

But this is exactly where the thread started... Win32 (and, IIUC, other
systems such as VMS) don't treat filenames as sequences of bytes, but
as sequences of Unicode characters. Win32 at least also has two sets of
APIs: one takes parameters in some currently-selected encoding and
converts them to Unicode for you (the 'ANSI' API), and the other takes
arguments in Unicode directly (the 'Unicode' API).

This leaves three possibilities. At the moment (I think), all IO happens
through the ANSI API, which leaves Perl in the unfortunate position of
being unable to open files with names that don't fit in the current
character set.
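
To make the limitation concrete, here is a minimal sketch in plain Perl.
It does not call Win32 at all; it only shows which bytes each API family
would be handed. The helper names ansi_bytes and wide_bytes are
hypothetical, and cp1252 is assumed as the current ANSI code page.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: the bytes an 'ANSI' call would receive, assuming
# the current ANSI code page is cp1252. Returns undef when the name holds
# a character that code page cannot represent.
sub ansi_bytes {
    my ($name) = @_;
    my $copy  = $name;    # encode() may modify its input when CHECK is set
    my $bytes = eval { encode('cp1252', $copy, Encode::FB_CROAK()) };
    return $bytes;
}

# Hypothetical helper: the NUL-terminated UTF-16LE buffer a 'Unicode'
# (wide) call would receive.
sub wide_bytes {
    my ($name) = @_;
    return encode('UTF-16LE', $name . "\0");
}

for my $name ("\x{e0}.txt", "\x{100}.txt") {
    my $ansi = ansi_bytes($name);
    printf "first char U+%04X: ANSI %s, wide buffer %d bytes\n",
        ord($name),
        (defined $ansi ? 'representable' : 'not representable'),
        length wide_bytes($name);
}
```

Under these assumptions a name starting with U+0100 has no cp1252 form,
so only the wide buffer can carry it.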

I believe what Jan was suggesting (please correct me if I've
misunderstood) was

    - filenames which are !SvUTF8 use the ANSI    API,
    - filenames which are SvUTF8  use the Unicode API.

However, since this would mean that

    my $fn = "\xe0";                        # not SvUTF8
    open my $F, '<', $fn;

and

    my $fn = substr "\xe0\x{100}", 0, 1;    # same character, but SvUTF8
    open my $F, '<', $fn;

potentially opened different files, Perl's auto-upgrading on Win32 would
also have to be changed to use the current ANSI encoding instead of
ISO8859-1. My complaint with this is that it means that when a string is
upgraded, the values reported by 'chr' (for instance) will mysteriously
and silently change.
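
Both halves of this objection can be seen from plain Perl. The sketch
below is only an illustration: api_for is a hypothetical stand-in for
the dispatch rule described above, upgrade_latin1 and upgrade_ansi are
hypothetical helpers, and cp1252 is assumed as the ANSI code page.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical dispatch rule: choose the API from the string's
# internal UTF-8 flag.
sub api_for {
    my ($name) = @_;
    return utf8::is_utf8($name) ? 'Unicode' : 'ANSI';
}

my $plain    = "\xe0";
my $upgraded = substr "\xe0\x{100}", 0, 1;

# The two strings compare equal in Perl, yet would take different routes.
printf "equal: %s, routes: %s / %s\n",
    ($plain eq $upgraded ? 'yes' : 'no'),
    api_for($plain), api_for($upgraded);

# Upgrading as Perl does today (ISO8859-1) keeps the character value.
sub upgrade_latin1 {
    my ($bytes) = @_;
    utf8::upgrade($bytes);
    return $bytes;
}

# Upgrading through an ANSI code page instead (cp1252 is an assumption).
sub upgrade_ansi {
    my ($bytes) = @_;
    return decode('cp1252', $bytes);
}

printf "ord after upgrade: U+%04X vs U+%04X\n",
    ord(upgrade_latin1("\x80")), ord(upgrade_ansi("\x80"));
```

With cp1252, byte 0x80 stays U+0080 under the ISO8859-1 upgrade but
becomes U+20AC under the code-page upgrade, which is the silent change
in character values described above.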

The potential alternative I was proposing was that all filenames should
be upgraded to SvUTF8 (using ISO8859-1, as currently) and then passed to
the Unicode API. This has the advantage of maintaining current in-Perl
string semantics, and the disadvantage of breaking all Win32 programs
that currently use non-ASCII filenames.
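
A rough sketch of this alternative, again in plain Perl with no Win32
calls: name_for_unicode_api is a hypothetical helper that upgrades the
name with ISO8859-1 semantics and produces the UTF-16LE buffer a wide
API would take. The cp1252 example is an assumption about what an
existing program might be doing.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: every filename is upgraded (ISO8859-1 semantics,
# as Perl does today) and then encoded for the Unicode API.
sub name_for_unicode_api {
    my ($name) = @_;
    utf8::upgrade($name);    # encode() sees the same characters either
                             # way; shown to mirror the proposal
    return encode('UTF-16LE', $name . "\0");
}

my $plain    = "\xe0";
my $upgraded = substr "\xe0\x{100}", 0, 1;

# Equal strings now yield the same buffer, whatever their internal flag.
print "same buffer: ",
    (name_for_unicode_api($plain) eq name_for_unicode_api($upgraded)
        ? "yes" : "no"), "\n";

# The breakage: a program that stored cp1252 bytes in its filename
# (0x80 meaning the euro sign) now asks for U+0080 instead of U+20AC.
printf "0x80 becomes U+%04X\n",
    unpack 'v', name_for_unicode_api("\x80");
```

In-Perl string semantics are preserved, but any byte string that relied
on the ANSI code page for characters outside ASCII now names a different
file.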

I don't think there's any way forward without breaking *something*. The
question is what will cause least damage.


Like all men in Babylon I have been a proconsul; like all, a slave ... During
one lunar year, I have been declared invisible; I shrieked and was not heard,
I stole my bread and was not decapitated.
                      Jorge Luis Borges, 'The Babylon Lottery'