develooper Front page | perl.perl5.porters | Postings from August 2018

Re: [perl #130831] Perl's open() has broken Unicode file name support

From:
pali
Date:
August 20, 2018 08:48
Subject:
Re: [perl #130831] Perl's open() has broken Unicode file name support
Message ID:
15300_1534754879_5B7A803E_15300_22_1_20180820084749.vjnfmkplxj42qiyc@pali
On Tuesday 28 February 2017 00:35:45 Leon Timmermans wrote:
> On Mon, Feb 27, 2017 at 10:21 PM,  <pali@cpan.org> wrote:
> > Windows has two sets of functions for accessing files. The first, with
> > an -A suffix, takes file names in the encoding of the current 8-bit
> > codepage. The second, with a -W suffix, takes file names in Unicode
> > (more precisely, in the Windows variant of UTF-16). With the -A
> > functions it is possible to access only those files whose names contain
> > only characters available in the current 8-bit codepage. Internally,
> > all file names are stored in Unicode, so the -W functions must be used
> > to have access to every file name. Therefore, on Windows we need a
> > Unicode file name in Perl's open() to have access to any file stored
> > on disk.
> >
> > Linux stores file names as binary octets; there is no encoding or
> > requirement for Unicode. Therefore, to access any file on Linux, Perl's
> > open() function should take a downgraded/non-Unicode file name.
> >
> > Which means there is no way to have uniform, identical multiplatform
> > support for file access without hacks.
> 
> Correct observations. Except OS X makes this more complicated still:
> it uses UTF-8 encoded bytes, normalized using a non-standard variation
> of NFD.

For completeness:

Windows uses UCS-2 for file names, and the corresponding WinAPI -W
functions likewise operate on UCS-2 file names. It is not UTF-16, as file
names may really contain unpaired surrogates.
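To illustrate the distinction (in Python rather than Perl or C, purely for brevity): a lone surrogate such as U+D800 is a perfectly valid single UCS-2 unit, but strict UTF-16/UTF-8 encoders must reject it.

```python
# A lone surrogate (U+D800) is representable as one UCS-2 unit,
# but it is not a valid code point for strict UTF-8/UTF-16.
lone = '\ud800'

# Under UCS-2-like semantics (Python's 'surrogatepass' handler)
# it encodes to a single 16-bit unit without complaint:
assert lone.encode('utf-16-le', 'surrogatepass') == b'\x00\xd8'

# A strict UTF-8 encoder rejects it:
try:
    lone.encode('utf-8')
    raised = False
except UnicodeEncodeError:
    raised = True
assert raised
```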

OS X uses a non-standard variant of Unicode NFD, encoded in UTF-8.

Linux uses just binary octets.


An idea for how to handle file names in Perl:

Store file names in an extended version of Perl's Unicode (with code
points above U+1FFFFF). Non-extended code points would represent normal
Unicode code points, and code points above U+1FFFFF would represent parts
of a file name that cannot be unambiguously represented in Unicode.

On Linux, take the file name (which is a char*) and decode it from
UTF-8. Byte sequences that cannot be decoded as UTF-8 would instead be
decoded as sequences of extended code points (e.g. U+200000 - U+2000FF).
This operation has an inverse, so it can be used to convert any file
name stored on a Linux system. It is also UTF-8 friendly: if file names
in the VFS are stored in UTF-8 (which is now common), then perl's say
function can print them correctly.
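Python implements an analogous escape scheme (PEP 383's 'surrogateescape' handler), with the one difference that it maps undecodable bytes to lone surrogates U+DC80..U+DCFF rather than to code points above U+1FFFFF. It demonstrates the invertibility property described above:

```python
raw = b'caf\xe9'  # Latin-1 'café'; the \xe9 byte is not valid UTF-8

# Python's analogue of the proposed scheme: invalid bytes become
# lone surrogates U+DC80..U+DCFF (the proposal uses U+200000..
# instead, which keeps real surrogates free for the Windows case).
name = raw.decode('utf-8', 'surrogateescape')
assert name == 'caf\udce9'

# The operation has an inverse, so any byte string round-trips:
assert name.encode('utf-8', 'surrogateescape') == raw
```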

On OS X, take the file name (which is a char*, but in UTF-8) and just
decode it from UTF-8. For the conversion from Perl's Unicode back to
char*, apply the non-standard NFD normalization and encode to UTF-8.
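For reference, here is standard NFD in Python; Apple's variant differs in details (HFS+ exempts certain code point ranges from decomposition), so this shows only the standard normalization, not what HFS+ does byte-for-byte:

```python
import unicodedata

# 'é' as a single precomposed code point U+00E9
name = 'caf\u00e9'

# NFD decomposes it into 'e' plus combining acute accent U+0301
nfd = unicodedata.normalize('NFD', name)
assert nfd == 'cafe\u0301'

# Encoded to UTF-8, the combining accent becomes two bytes
assert nfd.encode('utf-8') == b'cafe\xcc\x81'
```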

On Windows, take the file name (a wchar_t*, i.e. uint16_t*) as passed to
the -W WinAPI functions, which represents a UCS-2 sequence, and decode it
to Unicode. There can be unpaired surrogates; represent them either as
Unicode surrogate code points or as extended Perl code points (above
U+1FFFFF). The reverse process (from perl's Unicode to
wchar_t*/uint16_t*) is obvious.
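The decoding step above can be sketched as follows (a Python sketch with hypothetical names; the 16-bit units and resulting code points are plain integers, since the extended code points would not fit in an ordinary string type):

```python
def decode_ucs2(units):
    """Decode a sequence of 16-bit units (as the -W WinAPI functions
    use) into code points: pair surrogates where possible, and pass
    unpaired surrogates through as surrogate code points."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # Valid surrogate pair -> supplementary-plane code point
            out.append(0x10000 + ((u - 0xD800) << 10)
                       + (units[i + 1] - 0xDC00))
            i += 2
        else:
            # BMP unit, or an unpaired surrogate kept as-is
            out.append(u)
            i += 1
    return out

# 'A', a valid surrogate pair for U+10000, then a lone surrogate:
assert decode_ucs2([0x0041, 0xD800, 0xDC00, 0xD800]) == [0x41, 0x10000, 0xD800]
```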


