Re: The State of The Unicode

From: Nick Ing-Simmons
Date: February 20, 2001 03:14
Subject: Re: The State of The Unicode
Message ID: 200102201114.LAA23421@mikado.tiuk.ti.com

Jarkko Hietaniemi <jhi@iki.fi> writes:
>> > My kingdom for one example.
>
>Socket I/O?
>
>Protocols: if all I know is that my output is 500 Unicode characters
>long, how am I to print out Content-Length?

By deciding which encoding you want to use in the transaction,
encoding the string to that form with Encode, and calling length()
on the result. The answer is different for:

Content-Type: text/plain; charset=utf-8
Content-Length: 

vs 

Content-Type: text/plain; charset=iso-8859-1
Content-Length: 

vs 

Content-Type: text/plain; charset=big5
Content-Length: 
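
A minimal sketch of that recipe, assuming the encode() function of the
Encode module as it ships with current Perls (the interface at the time
of this message may have differed). The message body and the two
charsets tried are made up for illustration; big5 would be handled the
same way wherever that encoding is installed.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical message body: four characters, one of them outside ASCII.
my $body = "caf\x{e9}";

# Commit to an encoding first, then count the bytes of the encoded form.
my %content_length;
for my $charset ('UTF-8', 'ISO-8859-1') {
    my $octets = encode($charset, $body);
    $content_length{$charset} = length($octets);
    print "Content-Type: text/plain; charset=$charset\n";
    print "Content-Length: $content_length{$charset}\n\n";
}
```

The same four characters give a different Content-Length for each
charset, which is why the character count alone cannot answer the
question.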


>
>If I have a scalar which according to length() is 10E7 Unicode characters,
>will it fit within my disk quota of which I have 20E7 bytes left?

Depends on how you encode it on disk: as UCS-4, no, it won't;
as UCS-2 it will (just); as UTF-8 it depends on the characters.

This is a good case for exporting utf8_length() from Encode.
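
To make the arithmetic concrete: 10E7 characters need 40E7 bytes as
UCS-4 (4 bytes each), 20E7 bytes as UCS-2 (2 bytes each), and as UTF-8
anywhere from 10E7 bytes (all ASCII) to 30E7 bytes (all 3-byte
characters, staying within what UCS-2 can represent). A rough sketch of
measuring this for a given string follows. It assumes the encode()
function of current Encode rather than the utf8_length() proposed
above, uses UTF-32BE as a stand-in for UCS-4, and the sample string is
made up.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample: characters needing 1, 2 and 3 bytes in UTF-8.
my $text = "a\x{e9}\x{1234}";

# Encode to each on-disk form and count the resulting bytes.
my %bytes;
for my $form ('UTF-32BE', 'UCS-2BE', 'UTF-8') {
    $bytes{$form} = length(encode($form, $text));
    printf "%-8s %2d bytes for %d characters\n",
        $form, $bytes{$form}, length($text);
}
```

Encoding the whole string just to count bytes costs a copy, which is
part of the appeal of a dedicated length function.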

>
>> But you don't have to go that low level. uuencode & base64 work with 8-bit
>> bytes. Taking your Unicode string, looking at it as bytes, uuencode it,
>> send it, receive it, uudecode it and looking at it again as Unicode will
>> work - as long as you can get to the bytes representation.
>
>Any encoding which hasn't yet been encoded in Encode?

Er, you can do Encode just fine with what we have - until recently
Encode was written in Perl. What we had was a hash keyed by the Unicode
string, with the value being the byte sequence in the encoded form,
or a hash keyed by the encoded form, with the values being the Unicode
strings.

The problem _was_ that hash keys were not "transparent", so that code
points in the 128..255 range sometimes failed to look up. As far as I
know this is fixed now.
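
A toy version of such a table, to show the shape of the idea. This is
not the old Encode code: the two entries (in the style of a
Windows-1252 mapping), the table_encode() helper, and the '?' fallback
are all made up for illustration. The last lines check the
"transparent" behaviour on a current Perl: a key in the 128..255 range
is found whether or not the lookup string is stored internally as
UTF-8.

```perl
use strict;
use warnings;

# Hypothetical table: Unicode character => byte in the encoded form.
my %to_bytes = (
    "\x{e9}"   => "\xE9",   # LATIN SMALL LETTER E WITH ACUTE
    "\x{20ac}" => "\x80",   # EURO SIGN
);
# The reverse direction: encoded byte => Unicode character.
my %to_unicode = reverse %to_bytes;

# Encode character by character: table entries are mapped, ASCII is
# passed through, anything else becomes '?'.
sub table_encode {
    my ($string) = @_;
    return join '', map {
        exists $to_bytes{$_} ? $to_bytes{$_}
      : ord($_) < 128        ? $_
      :                        '?'
    } split //, $string;
}

my $encoded = table_encode("caf\x{e9} \x{20ac}5");

# Same character, two internal representations, one hash key.
my $plain    = "\x{e9}";
my $upgraded = $plain;
utf8::upgrade($upgraded);
my $transparent = exists $to_bytes{$plain} && exists $to_bytes{$upgraded};
```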

>
>> A lot of existing compression and encryption software just look at the
>> data to be compressed or encrypted as bit or byte streams. There is no
>> reason to create Unicode aware versions of those tools before they can
>> be used on Unicode data. But to create Perl programs that compresses or
>> encrypts data that can be decompressed or decrypted with the existing
>> tools, your Perl program needs to be able to look at the data as a
>> sequence of bytes.
>> 
>> When in Rome....

When in Rome you speak Italian these days, but once Latin was required.

You still have to "commit" to a representation before you can compress
or encrypt it. But no big deal:

open(FOO, ">:gzip:utf8", $file);   # $file: whatever output path you want
print FOO "My string with \x{1234} muddles";

or if you haven't got Nick C's module:

open(FOO,"|-:utf8","gzip -9");
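
The same "commit to a representation, then compress" step can also be
done in memory. The sketch below is a swapped-in alternative, not the
layer approach above: it assumes the IO::Compress::Gzip and
IO::Uncompress::Gunzip modules bundled with current Perls, and picks
UTF-8 as the representation. The round trip back to the original
string only works because both ends agree on that encoding.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use IO::Compress::Gzip     qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

my $string = "My string with \x{1234} muddles";

# Commit to a byte representation first; gzip only ever sees octets.
my $octets = encode('UTF-8', $string);
gzip(\$octets => \my $compressed)
    or die "gzip failed: $GzipError";

# Reverse the two steps in the opposite order on the way back.
gunzip(\$compressed => \my $restored_octets)
    or die "gunzip failed: $GunzipError";
my $restored = decode('UTF-8', $restored_octets);
```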

-- 
Nick Ing-Simmons <nik@tiuk.ti.com>
Via, but not speaking for: Texas Instruments Ltd.



