Re: on the almost impossibility to write correct XS modules
From: Marc Lehmann
May 19, 2008 12:40
Message ID: 20080519193644.GG28206@schmorp.de
On Mon, May 19, 2008 at 11:45:49AM -0700, Jan Dubois <firstname.lastname@example.org> wrote:
> I guess this is where the confusion originates. I'm not talking about
> the Perl language level at all,
That's bad, because you would break it at the Perl language level.
> internals. I thought this was obvious from the reference to SvUTF8,
> which doesn't exist at the Perl language level.
It doesn't matter what *you* are talking about. The change you propose would
break Perl, the language.
> Inside the Perl internals there is an implicit assumption about the
Inside Perl, there is no such assumption.
> at least on operating systems that care about this stuff at
> the low level (e.g. Windows).
Yes, and there this assumption is simply wrong. It doesn't matter why the
win32 parts expect that; the assumption is wrong.
> On Windows all 8-bit APIs to the filesystem expect filenames to be
> encoded in the CP_ACP (ANSI) codepage (yes, I know you can switch this
> for some APIs to CP_OEM, but please let's ignore that).
Yes, but Perl isn't win32-specific. You cannot change the perl character
handling unilaterally on win32 only and thus break portability to other
systems that don't mangle codepoints on upgrades.
> Since Perl (interpreter internals) doesn't perform any encoding changes
> for filenames passed to and from the operating system APIs,
It doesn't, but it silently upgrades and probably also downgrades them.
> therefore follows that strings internally are assumed to be ANSI
> encoded. If you call
This assumption is untrue. The win32 parts assume in many places that a
string is ANSI encoded *if* the utf8 flag is not set, and unicode otherwise.
> open(my $fh, "<", $filename) or die;
> the octets stored in $filename are interpreted according to the ANSI
> codepage. This is documented in perluniintro.pod as using "native"
> encoding for 8-bit strings.
The problem is not open, but e.g. functions like Win32::SetCwd and lots of
other places where the filename is interpreted as ANSI if SvUTF8 is false,
and unicode otherwise.
This is broken, because it forces the interpretation of the utf-8 flag to be
an encoding for the perl string, which it isn't.
Again: Perl doesn't attach an encoding to a scalar string.
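A minimal sketch of that point, using only the core utf8:: functions (the
variable names are illustrative and not from this thread): the same string
stored in both internal forms is indistinguishable at the Perl language
level, so the flag says nothing about an encoding.

```perl
use strict;
use warnings;

# Four characters; the last one is codepoint 0xE9 (below 256).
my $plain    = "caf\xe9";
my $upgraded = $plain;

# Switch the internal storage of the copy to utf-8. The characters
# (codepoints) in the string are not changed by this.
utf8::upgrade($upgraded);

printf "flag:   %d vs %d\n",
    (utf8::is_utf8($plain)    ? 1 : 0),
    (utf8::is_utf8($upgraded) ? 1 : 0);                          # 0 vs 1
printf "equal:  %s\n", ($plain eq $upgraded ? "yes" : "no");     # yes
printf "length: %d vs %d\n", length($plain), length($upgraded);  # 4 vs 4
```

Code that branches on the flag (ANSI if off, unicode if on) would treat
these two equal strings differently.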
> from the native encoding with the UTF8 encoding. Since UTF8 is able to encode
> all code points encoded by the native encoding,
This is wrong in general.
> but not vice versa, we
This is also wrong in general.
> have to re-encode the native strings into UTF8 before we can perform
> operations like concatenation. This re-encoding cannot be done without
> knowing (or assuming) the encoding of the native strings.
Yes. And your change would force the encoding to be whatever the local
codepage is, and this is simply wrong.
> Currently Perl doesn't have any code to treat the native encoding on
> Windows correctly and therefore (incorrectly) assumes that all 8-bit
> strings are Latin1 encoded.
That's where you are confused: perl doesn't assume anything about latin1 or
unicode. perl uses utf-8 to store codepoints >255 internally, but it does not
treat those strings as unicode, unless forced by usage.
> This is the part that needs to be fixed if strings with SvUTF8 are going
> to be correct.
This doesn't change what I said before, unfortunately. You are simply wrong
in your assumptions in that you believe perl stores or attaches an encoding
to its strings: it does so neither externally (Perl level) nor internally.
What it does is store strings with codepoints >255 encoded in utf-8,
regardless of the encoding they are in. It also has the ability to store
strings with codepoints <256 in a more space-efficient octet form.
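The two storage forms can be observed roughly like this. This is only a
sketch with made-up variable names, relying on the core utf8::downgrade
function with its FAIL_OK argument:

```perl
use strict;
use warnings;

my $narrow = "\xe9";        # one codepoint below 256
my $wide   = "\x{263a}";    # one codepoint above 255

# Concatenation yields a string that must hold a codepoint >255, so the
# result is stored as utf-8 internally. No encoding was asked for.
my $joined = $narrow . $wide;
printf "joined length: %d\n", length($joined);      # 2

# The first character is still codepoint 0xE9 after the concatenation.
my $first = substr($joined, 0, 1);
printf "first codepoint: 0x%X\n", ord($first);      # 0xE9

# Strings with only codepoints <256 can go back to the octet form;
# a string containing a codepoint >255 cannot (FAIL_OK avoids a die).
my $narrow_ok = utf8::downgrade($first, 1) ? 1 : 0;
my $wide_ok   = utf8::downgrade($wide,  1) ? 1 : 0;
printf "downgrade ok: narrow=%d wide=%d\n", $narrow_ok, $wide_ok;  # 1, 0
```

In both forms the string is the same sequence of codepoints; only the
storage differs.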
Nowhere in perl is there an assumption about the actual encoding of perl
scalars themselves. Only in certain places (regex matching would be one,
although I heard there are problems, another one is open, and yet another
one is Win32::SetCwd) does it assume so, and this is only after the user
tells perl to do so.
As long as I don't use a function in perl (such as open) that expects some
encoding, perl does not attach encodings to strings.
As such, your proposal to force the encoding to ANSI would completely
> > I really don't have the time to lecture on unicode again and again to
> > the masses who don't understand it.
> I guess I don't understand why you are engaging in a discussion about
> Unicode implementation on perl5-porters if you don't have the time to
> actually argue your points instead of your general handwaving about
Because I already have argued my points. Also, I *expect* people engaging
in a discussion about bugs or problems in the current implementation to have
a working understanding of how perl treats strings.
The fact that you continue to make outrageously wrong claims such as perl
assuming octet-encoded strings would be latin1-encoded proves that you
a) haven't researched the topic (see mailinglist archives)
b) don't know how perl stores strings and how it interprets them
> how things should work at the high level and then calling people who
> point out problems in the actual implementation clueless. Are you
> just trolling?
What do you call what you are doing by coming into a discussion without
having working knowledge of perl string handling, despite this being
discussed many times over the past years?
Again: perl does *not*, neither internally nor externally, attach an
encoding to scalars.
I and others have explained this a number of times. I can explain it to
you privately again if you so desire, but this thread is about practical
problems and bugs, and not about how to break perl in incompatible ways on
windows or about spreading FUD like "perl interprets octet-encoded
strings as latin1" or other obviously wrong stuff.
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      email@example.com