Jarkko Hietaniemi <jhi@iki.fi> writes:
>> > My kingdom for one example.
>
>Socket I/O?
>
>Protocols: if all I know is that my output is 500 Unicode characters
>long, how am I to print out Content-Length?

By deciding what encoding you want to use in the transaction,
Encode-ing it to that form and using length() on the result.

The answer is different for

Content-Type: text/plain; charset=utf8
Content-Length:

vs

Content-Type: text/plain; charset=iso-8859-1
Content-Length:

vs

Content-Type: text/plain; charset=big5
Content-Length:

>
>If I have a scalar which according to length() is 10E7 Unicode characters,
>will it fit within my disk quota of which I have 20E7 bytes left?

Depends how you encode it on disk - as UCS-4 no it won't, as UCS-2 it will,
as UTF-8 it depends. This is a good case for exporting utf8_length() from
Encode.

>
>> But you don't have to go that low level. uuencode & base64 work with 8-bit
>> bytes. Taking your Unicode string, looking at it as bytes, uuencode it,
>> send it, receive it, uudecode it and looking at it again as Unicode will
>> work - as long as you can get to the bytes representation.
>
>Any encoding which hasn't yet been encoded in Encode?

Er, you can do Encode just fine with what we have - until recently Encode
was written in perl. What we had was a hash keyed by the Unicode string with
the value being the byte sequence in the encoded form, or a hash keyed by the
encoded form with the values being the Unicode strings.

The problem _was_ that hash keys were not "transparent", so that code points
in the 128..255 range sometimes failed to look up. As far as I know this is
fixed now.

>
>> A lot of existing compression and encryption software just look at the
>> data to be compressed or encrypted as bit or byte streams. There is no
>> reason to create Unicode aware versions of those tools before they can
>> be used on Unicode data. But to create Perl programs that compresses or
>> encrypts data that can be decompressed or decrypted with the existing
>> tools, your Perl program needs to be able to look at the data as a
>> sequence of bytes.
>>
>> When in Rome....

When in Rome you speak Italian these days, but once Latin was required.

You still have to "commit" to a representation before you can compress or
encrypt it. But no big deal:

  open(FOO,">:gzip:utf8",$file);   # $file: whatever you are writing to
  print FOO "My string with \x{1234} muddles";

or if you haven't got Nick C's module:

  open(FOO,"|-:utf8","gzip -9");

-- 
Nick Ing-Simmons <nik@tiuk.ti.com>
Via, but not speaking for: Texas Instruments Ltd.
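
For what it's worth, the Content-Length and disk-quota answers above can be sketched roughly like this. It is only an illustration, not anything from the Encode distribution: the sample string, the encoded_length() helper and the choice of encodings are made up for the example, and it assumes an Encode that knows the UTF-16BE and UTF-32BE names.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample body: 4 characters, one of them (U+00E9) outside ASCII.
my $body = "caf\x{E9}";

# Hypothetical helper: byte length of $text once encoded as $charset.
# This is the number that belongs in Content-Length, not length($text).
sub encoded_length {
    my ($charset, $text) = @_;
    return length(encode($charset, $text));
}

# Same characters, different byte counts depending on the encoding chosen.
my %bytes = map { ($_ => encoded_length($_, $body)) }
    qw(UTF-8 iso-8859-1 UTF-16BE UTF-32BE);

for my $charset (sort keys %bytes) {
    printf "Content-Type: text/plain; charset=%s\nContent-Length: %d\n\n",
        $charset, $bytes{$charset};
}

# Disk quota question: 10E7 characters against 20E7 bytes free.
# Fixed-width encodings can be answered without looking at the data
# (the UCS-2 figure assumes every character is in the 16-bit range);
# UTF-8 needs the real string, e.g. encoded_length('UTF-8', $scalar).
my ($chars, $quota) = (10e7, 20e7);
my $ucs4_fits = ($chars * 4 <= $quota) ? "yes" : "no";
my $ucs2_fits = ($chars * 2 <= $quota) ? "yes" : "no";
print "UCS-4 fits: $ucs4_fits\n";
print "UCS-2 fits: $ucs2_fits\n";
```

The point is only that length() is applied to the encoded bytes, so the encoding has to be picked before the header can be written.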