
Re: The State of The Unicode

From: Nick Ing-Simmons
Date: February 20, 2001 04:06
Subject: Re: The State of The Unicode
Message ID: 200102201206.MAA23617@mikado.tiuk.ti.com

Nathan Torkington <gnat@frii.com> writes:
>Andrew Pimlott writes:
>> > Protocols: if all I know is that my output is 500 Unicode characters
>> > long, how am I to print out Content-Length?
>> 
>> As I said to abigail, I would love a concrete explanation of what
>> you have in mind.  In particular, what is your mechanism for
>> ensuring that perl is representing $output as utf8?
>
>I'm not sure what your question means.  Here are some situations in
>more detail.
>
>I'm writing a module that encodes things in base64.  I get a string to
>encode.  I need to process it byte-by-byte to produce a base64
>encoding.  How do I do that?

Your module had better be passed a string which consists of characters 
in the range 0..255 then, because base64 is not defined for anything else.

Then you just do the same as you always have, substr or what have you.

It is up to whatever calls your code to make sure it has transformed 
the Unicode code points into a sequence of bytes. Only the layer 
above you knows if the bytes you get are utf8, big5, shift-jis or whatever.
Of course if you are base64 encoding a JPEG, then the bytes are just 
bytes and nobody needs to mess with them at all.
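
A minimal sketch of that division of labour, assuming the Encode and
MIME::Base64 modules as they ship with later Perl releases (they postdate
this message). The helper name base64_of_text is hypothetical. The caller
names the encoding, and only octets ever reach the base64 step:

```perl
use strict;
use warnings;
use Encode qw(encode);
use MIME::Base64 qw(encode_base64);

# Hypothetical helper: the caller picks the encoding, because only the
# caller knows what the bytes are supposed to mean.
sub base64_of_text {
    my ($text, $encoding) = @_;
    my $octets = encode($encoding, $text);   # characters -> octets (0..255)
    return encode_base64($octets, '');       # '' suppresses the line ending
}

# The same characters give different base64 under different encodings.
print base64_of_text("caf\x{e9}", 'UTF-8'), "\n";        # Y2Fmw6k=
print base64_of_text("caf\x{e9}", 'ISO-8859-1'), "\n";   # Y2Fm6Q==
```

Data that is already bytes, such as a JPEG, would go straight to
encode_base64 with no encode step.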

The problem area would seem to be that, according to Jarkko's 
table, pack/unpack(C,...) does NOT downgrade automatically - so if your
legacy app does unpack(C) you would accidentally get the UTF-8 encoding of 
chr(0xFF) rather than 0xFF.
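
To make the two outcomes concrete, here is a small sketch that prints the
octets in question. It assumes the Encode module from later Perl releases and
encodes explicitly, so it says nothing about what unpack does with Perl's
internal representation; the helper name octets_of is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: the octets of one character in a given encoding.
sub octets_of {
    my ($encoding, $char) = @_;
    return unpack 'C*', encode($encoding, $char);
}

my $char = chr(0xFF);    # one character, code point 255

# The UTF-8 form is two octets; the Latin-1 form is the single octet 0xFF.
printf "UTF-8:      %s\n", join ' ', map { sprintf '%02X', $_ } octets_of('UTF-8', $char);
printf "ISO-8859-1: %s\n", join ' ', map { sprintf '%02X', $_ } octets_of('ISO-8859-1', $char);
```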

>
>I'm writing a network server where part of a response is the number of
>octets to expect in the message body.  If the subroutine that sends
>the response gets a string encoded in UTF-8, how does it calculate
>the number of octets?

If it gets the encoded string it uses length($encoded);
if it gets the unencoded string it has to encode it first.
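
A sketch of "encode first, then take the length", again assuming the Encode
module from later Perl releases. The helper name with_content_length and the
choice of "\r\n" as the header terminator are illustrative assumptions.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: returns the header line and the octets to send.
sub with_content_length {
    my ($text, $encoding) = @_;
    my $octets = encode($encoding, $text);                        # encode first ...
    my $header = 'Content-Length: ' . length($octets) . "\r\n";   # ... then count octets
    return ($header, $octets);
}

my $output = "\x{263A}" x 500;    # 500 characters
my ($header, $octets) = with_content_length($output, 'UTF-8');
print $header;                    # 1500 octets: U+263A takes three octets in UTF-8
```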

> 
>> Let me show you what I would fancy (modulo syntax, which I haven't
>> been following):
>> 
>>     $eh = new EncodingHandler 'UTF-8';
>>     $out = new IO::Socket {
>>         output_discipline => $eh->output_discipline, ... };
>>     print $out "Content-length: " . $eh->length($output);
>>     print $out $output;
>
>What's an Encoding Handler?  I must have been asleep when this
>discussion took place :-) 

perldoc Encode

needs to be cleaned up to explain this stuff.


>That is redolent of all the OO stuff that
>makes Java so unpleasant.  I'm not sure how Perl's making an easy
>thing easy here.

open(TMP,"+>:base64:encoding(big5)") || die;
print TMP "String with Chinese";   # layers encode on the way out
my $length = tell(TMP);            # octets actually written
seek(TMP,0,0);                     # rewind to read the result back
print MAIL "Content-Length: $length\n";
print MAIL "Content-Transfer-Encoding: base64\n\n";
print MAIL <TMP>;
close(TMP);

The above hides all the mess in TMP. The need for these TMP 
handles is why I keep suggesting  

open(TMP,"+>...",undef);     # anon but real file 
open(TMP,"+>...",\$buffer);  # puts data in the $buffer

With the latter you can do: 

open(my $body,"+>:base64:gzip",\$buffer);  # puts data in the $buffer
print $body Whatever();
print MAIL "Content-Length: ",tell($body),"\n\n",$buffer;
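
For comparison, a runnable sketch of the buffer idea using what later Perl
releases provide: an in-memory filehandle opened on a scalar reference, with
an :encoding layer. The :base64 and :gzip layers above are not assumed to
exist, so :encoding(UTF-8) stands in for them here, and the helper name
encoded_body is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: push text through an encoding layer into a buffer.
sub encoded_body {
    my ($text) = @_;
    my $buffer = '';
    open(my $body, '>:encoding(UTF-8)', \$buffer)
        or die "cannot open in-memory handle: $!";
    print {$body} $text;
    close($body) or die "close failed: $!";   # flushes the layers into $buffer
    return $buffer;                           # octets, ready to send
}

my $payload = encoded_body("caf\x{e9} \x{263A}");
print "Content-Length: ", length($payload), "\n\n", $payload;
```

Closing the handle before measuring avoids depending on when the layers
flush; length($payload) then counts octets, not characters.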


>
>Aha, I think I see your point here.  Any time you send something
>outside your program, there are potentially subroutines munging your
>data before it escapes.  Therefore you shouldn't try to calculate the
>size of the output yourself, as you don't know what the size of your
>output will be.  Therefore, you have to ask whatever's munging your
>output how long a string will be when it's output.

I think he's got it ;-)

>
>Someone gives me a UTF-8 string and says "produce Martian Normal Form
>from this!"  Now I think about this one more, I think it isn't a
>problem after all.  If I know I'm getting UTF-8 given to me, then I
>will only be concerned with characters, not bytes.  If I don't know
>what I'm being given, I need to convert it to UTF-8 or some other
>known character encoding first, and then deal with it char-by-char.

Right, you "decode" the incoming UTF-8 into "characters",
then just frolic through the Unicode code point sequence spewing 
bytes of your choice.
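
A sketch of that decode-then-walk pattern, assuming the Encode module from
later Perl releases; the helper name code_points is hypothetical, and the
"Martian" output step is left as a plain printf.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: UTF-8 octets in, list of Unicode code points out.
sub code_points {
    my ($octets) = @_;
    my $chars = decode('UTF-8', $octets);   # octets -> characters
    return map { ord } split //, $chars;    # one code point per character
}

# "caf\xC3\xA9" is the UTF-8 form of four characters.
for my $cp (code_points("caf\xC3\xA9")) {
    printf "U+%04X\n", $cp;                 # emit whatever form you need here
}
```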

>
>So I guess all the things one would want to do will still be
>available, I'm just worried that they'll require lots of OO crap
>to do.
>
>Perhaps length() etc. should work with the currently selected
>filehandle?  Oh man, what misery this opens.

tell()

>
>Nat
-- 
Nick Ing-Simmons <nik@tiuk.ti.com>
Via, but not speaking for: Texas Instruments Ltd.



