Jarkko Hietaniemi <jhi@iki.fi> writes:
>On Fri, Feb 16, 2001 at 09:47:39PM +0000, nick@ing-simmons.net wrote:
>> Jarkko Hietaniemi <jhi@iki.fi> writes:
>> >> Given transparency, you do not need such a thing.
>> >
>> >Wrong.
>>
>> I am (as you know already) on Ilya's "side" here...
>
>I have no problem with someone producing a better model or a new
>implementation. I do have somewhat of a problem (because there's only
>one me :-) with people suggesting such things and showing neither the
>code, nor detailed description of the new model (otherwise discussing
>the new model is bumping in the dark). Write down two "simple" things:
>the model and the code.

The Ilya/Nick model is in fact the "old" model; what we have in 5.7+ is
"our" model with extra promises. What we both gripe about is expensive
scans of strings just to keep things "tidy".

Nothing new below - just me re-stating it all again ...

The fundamental tenet of the model is that representation is supposed to
be transparent to Perl code. Perl strings are sequences of characters
taken from the Unicode repertoire. Any other representation is the
responsibility of XS code (e.g. Encode) or the IO system. The old
pre-5.6 behaviour is a strict subset of this (on ASCII-ish machines), so
we should be 100% backward compatible.

The model says that perl can represent characters as either bytes or
UTF-8 encoded sequences. Perl indicates which a particular SV is by
setting the SvUTF8 bit. Perl is at liberty to change the representation
as it manipulates/copies strings or passes them to XS etc. All C code in
perl and XS must be prepared to accept strings in either form. C and XS
code may generate strings in either form. API routines are provided to
convert a string from one form to the other; C code should call these if
passed a string in the form it does not want.

The two spanners in the works are:

A.
Camel-III's mention of 'use bytes', which exposes the internal
representation and has led to an expectation that representation be
predictable. That is only bad when it is expensive.

B. EBCDIC. EBCDIC machines' legacy use of chr(), ord() etc. violates the
"sequences of Unicode code points" premise, so applying the Nick/Ilya
model strictly will break legacy EBCDIC code. So we have a Simon et al.
EBCDIC model where the two representations are instead the IBM-1047 code
page, or UTF-8 encoded Unicode. The semantics of chr/ord there are
unclear. My guess is that chr of 0..255 produces characters according to
IBM-1047, and characters above that are Unicode. This should be "safe"
iff IBM-1047 is a one-to-one bi-directional mapping to ISO 8859-1 (i.e.
the lowest 256 Unicode code points).

My personal axe to grind is that Tk 8.1+ (the Unicode-aware one) wants
and expects UTF-8, so continually normalizing 128..255 back to bytes is
a pain in the neck.
-- 
Nick Ing-Simmons