
Re: IV preservation (was Re: [PATCH 5.7.0] compiling on OS/2)

From: nick
Date: February 17, 2001 11:11
Subject: Re: IV preservation (was Re: [PATCH 5.7.0] compiling on OS/2)
Message ID: E14UChj-0004IS-00@roam1
Jarkko Hietaniemi <jhi@iki.fi> writes:
>> A. Camel-III's mention of 'use bytes' - which exposes the internal 
>>    representation, which has led to an expectation that representation
>>    be predictable. That is only bad when it is expensive.
>
>If we have a strong case we can convince Larry that use bytes is bad.

But there are lots of paper copies which folk will believe.
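To make the "exposes the internal representation" point concrete, here is a
rough sketch (using today's utf8::upgrade() spelling as a stand-in; take the
exact names with a grain of salt): plain length() gives the same answer either
way, but length() under 'use bytes' changes with the representation.

    use strict;
    use warnings;

    my $s = "caf\xe9";                     # four characters, 0xE9 at the end

    printf "chars: %d\n", length $s;       # 4, whatever the representation

    {
        use bytes;
        printf "octets: %d\n", length $s;  # 4 while stored as bytes...
    }

    utf8::upgrade($s);                     # switch to the UTF-8 representation
    {
        use bytes;
        printf "octets: %d\n", length $s;  # ...but 5 once upgraded
    }

Same characters, two different answers - which is exactly the expectation
that gets expensive to honour.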

>
>If the need to produce explicitly UTF-8 in compile time is found to be
>false, we can do away with qu.

In the strict Ilya/Nick model you never _need_ qu// or UTF-8 at compile
time other than as an optimization hint. 

>
>> B. EBCDIC. EBCDIC machines' legacy use of chr(), ord(), etc. violates the 
>>    sequences-of-Unicode-codepoints premise.
>>    So applying the Nick/Ilya model strictly will break legacy EBCDIC code.
>>    So we have a Simon et al. EBCDIC model where the two representations
>>    are instead the IBM-1047 code page, or UTF-8 encoded Unicode.
>>    Semantics of chr/ord are unclear. My guess is that chr of 0..255 produces
>>    characters according to IBM-1047, and characters above that are Unicode.
>
>Unless I'm mistaken that is what happens now (just like what happened
>in pre-Unicode).  It's not only chr/ord, we also have to decide what
>happens with \xHH, \x{HH}, and \Oooo.  

I was using chr/ord as shorthand; all the mechanisms should be self-
consistent. This is why we ended up with qu - \x{HH} got defined to mean the
same as \xHH for legacy stuff, so there was no way to specify U+0041
if one knew the Unicode value. That is, qu// was a sop to the Unicode purists
like me.
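For the ASCII world the ambiguity is invisible, which is part of the problem.
A rough sketch, with pack "U" standing in for the explicit "I mean the Unicode
code point" spelling that qu// was meant to give literals:

    # On an ASCII platform all four spellings coincide:
    my $a1 = chr(0x41);          # native code point
    my $a2 = "\x41";             # legacy byte escape
    my $a3 = "\x{41}";           # defined to match \x41 for legacy code
    my $a4 = pack("U", 0x41);    # explicitly Unicode U+0041

    print "all 'A'\n" if $a1 eq $a2 && $a2 eq $a3 && $a3 eq $a4;

    # Under IBM-1047 the first three are not 'A' (EBCDIC 'A' is 0xC1),
    # so without qu// or pack "U" there is no way to ask for U+0041
    # by number.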

>Ditto for vNNN (though that
>isn't so important for EBCDIC folks since the vNNN was introduced in
>5.6, and 5.6 was broken for them in so many ways that the vNNN is
>irrelevant for them).  And how about pack/unpack("U"/"C", ...)?

pack/unpack should be consistent with the other ways of doing the same thing.

Thus "C" (which is legacy) has to do the same thing as chr/ord.

The perldoc for "U" explicitly says it uses Unicode (code points) - it also
blathers on about UTF-8, which should be a don't-care at the perl level.
Defaulting to UTF-8 probably makes sense, but it could just as well be bytes
if that made sense in context.
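The sort of consistency I mean, sketched on the assumption that "C" stays
byte/legacy flavoured and "U" stays code-point flavoured:

    my $byte = pack("C", 0xE9);       # one octet, legacy flavour
    my $char = pack("U", 0x263A);     # U+263A, code-point flavour

    print "C agrees with chr/ord\n"
        if ord($byte) == 0xE9 && $byte eq chr(0xE9);
    print "U agrees with chr/ord\n"
        if unpack("U", $char) == 0x263A && $char eq chr(0x263A);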

>
>But these all are details from the greater picture.

Details are all that are left. (At least in the ASCII world.)

>
>Maybe I'm slow but I need to see the whole picture before I can
>start doing it.  

The big picture is done - you did a great job ;-)

>Listing nits here and nits there just makes me
>trash and do nothing.

Fine by me. Nits that come without patches and are not biting you are 
ignorable. My nit problem is currently solved by not working on Tk.
What I don't want to happen (as has happened at least once) is for me to "fix" 
a nit only to have (say) Simon effectively "revert" it so that 
representation-is-predictable.

So if we need anything from the Pumpking it is rulings (or appeals to Larry)
as to whether time-efficiency or predictable-representation takes
precedence in the general case, and whether representation should be
transparent to perl code.

My own view is that time-efficiency should dominate and that representation
should be invisible to perl code.
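By "invisible" I mean something like the following sketch (again using today's
utf8:: helper names as stand-ins): ordinary string operations give the same
answers whichever form the scalar happens to be in.

    my $bytes = "caf\xe9";            # stored as bytes
    my $chars = "caf\xe9";
    utf8::upgrade($chars);            # same characters, UTF-8 representation

    print "eq\n"     if $bytes eq $chars;
    print "length\n" if length($bytes) == length($chars);
    print "index\n"  if index($bytes, "f\xe9") == index($chars, "f\xe9");

    # Only representation-peeking tools (use bytes, utf8::is_utf8) should
    # be able to tell the two apart.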

>
>If we choose to make chr(65) produce 'A' and ord('A') produce 65
>on all platforms, including EBCDIC, one possible kludge to keep the
>EBCDIC legacy apps happy would be to have 'use ebcdic', which would
>effectively make chr/ord/\x/etc. bypass the mapping to Unicode and
>back and use the raw EBCDIC bytes instead.

Peter/Simon et al. seem to be (almost) happy with what they have.
All I want from them really is some documentation on what that is,
so that I can make Encode and PerlIO do the right thing.
(It may even be there already; I have not gone and looked...)
In particular it is madness to have a2e/e2a tables in the core _AND_ 
in Encode.
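That is, something like the sketch below (assuming Encode carries a cp1047
table - I have not double-checked the exact encoding name it registers)
should be the one place the mapping lives:

    use Encode qw(encode decode);

    my $text   = "Hello, world";
    my $ebcdic = encode("cp1047", $text);    # characters -> IBM-1047 octets
    my $back   = decode("cp1047", $ebcdic);  # IBM-1047 octets -> characters

    print "round trip ok\n" if $back eq $text;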

>
>>    This should be "safe" iff IBM-1047 is a one-to-one bi-directional mapping
>>    to iso8859-1 (i.e. the lowest 256 Unicode code points).
>>
>> My personal axe to grind is that Tk 8.1+ (the Unicode-aware one) wants
>> and expects UTF-8. So continually normalizing 128..255 back to bytes
>> is a pain in the neck.

The example that pained me, as I recall, was a string 'tied' to a Tk widget.
An append like

$string .= 'no-high';  # or whatever

carefully scanned the whole string, found no chars > 255, and so downgraded it.
So Tk had to upgrade it again so that its font lookups worked.

Doing this on an N-thousand-line "file" in a Text widget is tedious.
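Roughly the cost in question, sketched with explicit downgrade/upgrade calls
standing in for the implicit normalization pass and Tk's forced re-upgrade:

    use Time::HiRes qw(time);

    my $text = ("x" x 80 . "\n") x 50_000;   # a few MB of ASCII-only lines
    utf8::upgrade($text);                    # Tk wants the UTF-8 form

    my $t0 = time;
    utf8::downgrade($text);                  # "no chars > 255, so downgrade"
    utf8::upgrade($text);                    # ...and Tk must upgrade it again
    printf "one rescan round trip: %.3fs\n", time - $t0;

And that happens on every append to the tied variable.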

If the decision goes against speed in favour of predictability I just 
need to go and re-write Tk's internals to accept the dual form - which is 
not what I want to do - but so be it. (Tcl's internals are UTF-8 only.)


-- 
Nick Ing-Simmons



