perl.perl5.porters | Postings from February 2001

Re: IV preservation (was Re: [PATCH 5.7.0] compiling on OS/2)

February 16, 2001 13:51
Jarkko Hietaniemi writes:
>> Given transparency, you do not need such a thing.

I am (as you know already) on Ilya's "side" here...

>There are standards and protocols, and other pieces of
>software, out there that *require* producing UTF-8.  IIRC LDAP
>is one of those outside bits.  Java would be another [*].
>Perl must be able to interface with the outside world.

Quite. But such things will always be one of:
A. XS code (which Ilya has already said has to be SvUTF8 aware).
B. IO, which has/should-have its own ways of dealing with this and
   indeed must be able to cope with SVs arriving in either form.
C. Code that just treats the things as sequences of bytes - in which case
   the bytes themselves can be represented either way ;-)

Case C was Graham's LDAP case. It relied on perl producing the UTF-8 encoded
form for 128..255 and then did 'use bytes' to peek at it.
It broke when 5.6+ decided to keep 128..255 as 'byte' anyway.
The right way to do this is to export the trivial XS code which
does an upgrade and then turns off the flag. (As the current Encode does.)
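
To make the two representations of 128..255 concrete, here is a minimal
sketch in Python rather than XS. It only models the byte-level effect of
"upgrade, then turn off the flag"; the helper names are hypothetical and
nothing here is perl's actual implementation.

```python
# Hypothetical helper names; this models the byte-level effect only,
# not perl's XS code or the Encode module.

def as_byte_form(chars: str) -> bytes:
    """One byte per character; only possible for code points 0..255."""
    return chars.encode("latin-1")

def as_utf8_form(chars: str) -> bytes:
    """UTF-8 encoded form; code points 128..255 take two bytes each."""
    return chars.encode("utf-8")

text = "caf\u00e9"                 # U+00E9 lies in the 128..255 range
byte_form = as_byte_form(text)     # what a 'byte' string would hold
utf8_form = as_utf8_form(text)     # what an upgraded string would hold

# "Upgrade, then turn off the flag" amounts to handing out utf8_form
# as a plain sequence of bytes.
print(byte_form)   # b'caf\xe9'
print(utf8_form)   # b'caf\xc3\xa9'
```

Code that expected the second form but was handed the first is, roughly,
the breakage described for the LDAP case.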

>[*] Though don't get me started on how Java's readUTF8() and writeUTF8()
>do not do real UTF-8 as defined by the RFC :-)

>> ord('A') should be the same on all the systems, unless use locale or
>It isn't.
>In EBCDIC that produces 0xC1, or 193.
>It might be nice if it did.
>Changing it would break existing code.

We _still_ have not got a definition from EBCDIC folk on what the 
backward compatible version _does_.
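
For reference, the 0xC1 figure quoted above can be checked with a short
Python sketch. The choice of cp037 as the EBCDIC code page is an
assumption made for illustration; the thread does not say which code page
is meant.

```python
# cp037 is one common EBCDIC code page, picked here as an assumption.
ascii_value = "A".encode("ascii")[0]
ebcdic_value = "A".encode("cp037")[0]

print(ascii_value)         # 65
print(hex(ebcdic_value))   # 0xc1
```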

>Your sentence is essentially saying that utf8-marking is a hint (that
>might be false) that the string might contain chars above 127,
>instead of the current implementation where it is a guarantee of that.

Hmm, last time I relied on it being more than a hint perl let me down.
What current perl attempts to do is say that the UTF8 bit means there
are chars above _255_ - that is, it tries to turn the bit off and downgrade
for chars 128..255. But this is expensive to get right. If I remove
the last remaining big char from a 16M string you have to scan the whole
thing to find out...
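
The cost argument can be sketched as follows, again in Python and with
hypothetical names. It assumes the guarantee is "flag on means some char
above 255 is present", as described above, and is not how perl itself
does the check.

```python
def needs_utf8_flag(chars: str) -> bool:
    """True if any character is above 255, so the flag must stay on."""
    # any() stops early when a big char is found, but when none remain
    # it has to walk the entire string before it can answer False.
    return any(ord(c) > 0xFF for c in chars)

# A long string (kept far smaller than 16M here) with one big char at the end.
big = "x" * 100_000 + "\u263a"
print(needs_utf8_flag(big))        # True

# Remove the last remaining big char: only a full scan can confirm
# that the string may now be downgraded.
trimmed = big[:-1]
print(needs_utf8_flag(trimmed))    # False
```

Keeping the flag as a strict guarantee means paying for that scan, or
tracking equivalent bookkeeping, on every modification that might have
removed the last big char.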

Nick Ing-Simmons