perl.perl5.porters | Postings from February 2001

Re: IV preservation (was Re: [PATCH 5.7.0] compiling on OS/2)

From: nick
Date: February 17, 2001 09:51
Subject: Re: IV preservation (was Re: [PATCH 5.7.0] compiling on OS/2)
Message ID: E14UBSj-0003nH-00@roam1

Jarkko Hietaniemi <jhi@iki.fi> writes:
>On Fri, Feb 16, 2001 at 09:47:39PM +0000, nick@ing-simmons.net wrote:
>> Jarkko Hietaniemi <jhi@iki.fi> writes:
>> >> Given transparency, you do not need such a thing.
>> >
>> >Wrong.  
>> 
>> I am (as you know already) on Ilya's "side" here...
>
>I have no problem with someone producing a better model or a new
>implementation.  I do have somewhat of a problem (because there's only
>one me :-) with people suggesting such things and showing neither the
>code, nor detailed description of the new model (otherwise discussing
>the new model is bumping in the dark).  Write down two "simple" things:
>the model and the code.

The Ilya/Nick model is in fact the "old" model; what we have in 5.7+ is
"our" model with extra promises.
What we both gripe about is expensive scans of strings just to keep
things "tidy".

Nothing new below - just me re-stating it all again ...

The fundamental tenet of the model is that representation is supposed to
be transparent to perl code. Perl strings are sequences of characters
taken from the Unicode repertoire. Any other representation is the
responsibility of XS code (e.g. Encode) or the IO system. The old
pre-perl5.6 behaviour is a strict subset of this (on ASCII-ish machines),
so we should be 100% backward compatible.

The model says that perl can represent characters as either bytes
or UTF-8 encoded sequences. Perl indicates which form a particular SV is
in via the SvUTF8 bit. Perl is at liberty to change the representation
as it manipulates/copies/passes to XS etc. All C code in perl and XS
must be prepared to accept strings in either form. C and XS code
may generate strings in either form. API routines are
provided to convert a string from one form to the other; C code should
call these if passed a string in the form it does not want.
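
A minimal sketch of the two-form model described above, written in Python
purely as an illustration. ModelSV, chars, upgrade, downgrade and
wants_utf8 are hypothetical names invented for this sketch; they are not
the perl API, and the utf8 field only stands in for the SvUTF8 bit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSV:
    """Hypothetical stand-in for the string slot of a Perl scalar."""
    buf: bytes   # octets as stored
    utf8: bool   # models the SvUTF8 bit: True means buf is UTF-8 encoded


def chars(sv: ModelSV) -> str:
    """Return the character sequence, whichever form the buffer is in."""
    # In byte form each octet is the character with that code point (0..255).
    return sv.buf.decode("utf-8" if sv.utf8 else "latin-1")


def upgrade(sv: ModelSV) -> ModelSV:
    """Convert byte form to UTF-8 form; the characters do not change."""
    if sv.utf8:
        return sv
    return ModelSV(chars(sv).encode("utf-8"), True)


def downgrade(sv: ModelSV) -> ModelSV:
    """Convert UTF-8 form to byte form; only possible if all chars <= 255."""
    if not sv.utf8:
        return sv
    # encode("latin-1") raises UnicodeEncodeError for code points above 255.
    return ModelSV(chars(sv).encode("latin-1"), False)


def wants_utf8(sv: ModelSV) -> bytes:
    """A consumer that needs UTF-8 octets converts on entry, then proceeds."""
    return upgrade(sv).buf


as_bytes = ModelSV(b"caf\xe9", False)     # "cafe" with e-acute, byte form
as_utf8 = ModelSV(b"caf\xc3\xa9", True)   # the same characters, UTF-8 form
```

Under this sketch both scalars hold the same four characters, so code that
only looks at characters cannot tell them apart; only code that inspects
buf directly sees the difference.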

The two spanners in the works are: 

A. Camel-III's mention of 'use bytes' - which exposes the internal
   representation, which has led to an expectation that representation
   be predictable. That expectation is only a problem when meeting it
   is expensive.

B. EBCDIC. EBCDIC machines' legacy use of chr(), ord() etc. violates the
   premise that strings are sequences of Unicode code points.
   So applying the Nick/Ilya model strictly will break legacy EBCDIC code.
   So we have a Simon et al. EBCDIC model where the two representations
   are instead the IBM-1047 code page or UTF-8 encoded Unicode.
   Semantics of chr/ord are unclear. My guess is that chr of 0..255 produces
   characters according to IBM-1047, and characters above that are Unicode.
   This should be "safe" iff IBM-1047 is a one-to-one, bidirectional mapping
   to iso8859-1 (i.e. the lowest 256 Unicode code points).
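
The "safe iff one-to-one" condition in B can be sketched as a check on a
256-entry table from native byte value to Unicode code point. This is
illustrative Python only: the tables below are toy data, NOT the real
IBM-1047 mapping, and model_chr/model_ord are hypothetical names that
follow the guessed semantics rather than any defined perl behaviour.

```python
def is_bijection_onto_low_256(table):
    """True iff table maps bytes 0..255 one-to-one onto code points 0..255."""
    return len(table) == 256 and sorted(table) == list(range(256))


def model_chr(n, table):
    """Guessed chr(): native code page below 256, plain Unicode above."""
    return chr(table[n]) if n < 256 else chr(n)


def model_ord(ch, table):
    """Inverse of model_chr; well defined only when table is a bijection."""
    cp = ord(ch)
    return table.index(cp) if cp < 256 else cp


# Toy tables for illustration (assumptions, not a real code page).
identity = list(range(256))

swapped = identity.copy()                  # still a permutation of 0..255
swapped[0x41], swapped[0xC1] = swapped[0xC1], swapped[0x41]

lossy = identity.copy()                    # two bytes share one code point
lossy[0xC1] = 0x41
```

With a permutation such as swapped, model_ord(model_chr(n)) returns n for
every n, so legacy chr/ord code keeps round-tripping. With lossy, byte
0xC1 comes back as 0x41, which is the breakage the "iff" guards against.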

My personal axe to grind is that tk8.1+ (the Unicode-aware one) wants and
expects UTF-8. So continually normalizing 128..255 back to bytes is a pain
in the neck.
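
To make the cost concrete, here is a rough Python sketch of the round trip
being complained about. The function names are hypothetical and this is
not how perl or Tk implement anything; it only shows that normalizing has
to look at every character, and that a UTF-8 consumer then has to undo it.

```python
def normalize_to_bytes(utf8_buf: bytes):
    """Return byte form if every character is <= 255, else None.

    The whole buffer must be decoded to find out, so this costs
    O(length) each time it is applied.
    """
    text = utf8_buf.decode("utf-8")
    if all(ord(ch) < 256 for ch in text):
        return text.encode("latin-1")
    return None


def hand_to_utf8_consumer(buf: bytes, is_utf8: bool) -> bytes:
    """Produce what a UTF-8-expecting library needs on entry."""
    if is_utf8:
        return buf                                 # already right, no scan
    return buf.decode("latin-1").encode("utf-8")   # re-expand 128..255


label = "na\u00efve".encode("utf-8")   # contains U+00EF, in range 128..255
```

If the string is left in UTF-8 form, the hand-off returns the same buffer
untouched. If it is normalized first, one full scan shrinks it and a
second full pass expands it again before the library can use it.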


-- 
Nick Ing-Simmons



