perl.perl5.porters | Postings from February 2001

Re: The State of The Unicode

Nathan Torkington
February 19, 2001 17:48
Andrew Pimlott writes:
> > Protocols: if all I know is that my output is 500 Unicode characters
> > long, how am I to print out Content-Length?
> As I said to abigail, I would love a concrete explanation of what
> you have in mind.  In particular, what is your mechanism for
> ensuring that perl is representing $output as utf8?

I'm not sure what your question means.  Here are some situations in
more detail.

I'm writing a module that encodes things in base64.  I get a string to
encode.  I need to process it byte-by-byte to produce a base64
encoding.  How do I do that?
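
One way the byte-by-byte case could look, as a minimal sketch only: it assumes the Encode and MIME::Base64 modules bundled with later perls (neither is settled in this thread), and it assumes UTF-8 is the encoding the caller wants. The sample string is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);
use MIME::Base64 qw(encode_base64);

# A character string with one non-ASCII character (U+00E9).
my $text = "caf\x{e9}";

# base64 is defined on octets, so choose an encoding explicitly first.
# Assumption: the consumer expects UTF-8.
my $octets = encode('UTF-8', $text);

# The empty second argument suppresses the trailing newline.
my $b64 = encode_base64($octets, '');

print "$b64\n";    # prints Y2Fmw6k=
```

The point of the sketch is that the module never walks the internal representation; it asks for octets in a named encoding and works on those.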

I'm writing a network server where part of a response is the number of
octets to expect in the message body.  If the subroutine that sends
the response gets a string encoded in UTF-8, how does it calculate
the number of octets?
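
A hedged sketch of the octet count, again assuming the later Encode module and assuming the body goes out as UTF-8. `build_response` is a hypothetical helper name, not something proposed in the thread.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: turn a character string into header plus body octets.
sub build_response {
    my ($body_chars) = @_;

    # Encode once, then measure what will actually go on the wire.
    my $body_octets = encode('UTF-8', $body_chars);

    # length() on the encoded byte string counts octets, not characters.
    my $header = "Content-Length: " . length($body_octets) . "\r\n\r\n";
    return $header . $body_octets;
}

# 7 characters: U+00EF takes 2 octets in UTF-8, U+263A takes 3.
my $response = build_response("na\x{ef}ve \x{263A}");

my ($first_line) = split /\r\n/, $response;
print "$first_line\n";    # prints Content-Length: 10
```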

> Let me show you what I would fancy (modulo syntax, which I haven't
> been following):
>     $eh = new EncodingHandler 'UTF-8';
>     $out = new IO::Socket {
>         output_discipline => $eh->output_discipline, ... };
>     print $out "Content-length: " . $eh->length($output);
>     print $out $output;

What's an Encoding Handler?  I must have been asleep when this
discussion took place :-) That is redolent of all the OO stuff that
makes Java so unpleasant.  I'm not sure how Perl's making an easy
thing easy here.

> > If I have a scalar which according to length() is 10E7 Unicode characters,
> > will it fit within my disk quota of which I have 20E7 bytes left?
> Again, it depends on the output discipline you will use to get it on
> disk, and thus should be part of whatever library you use for output
> disciplines.  Why do you think it should be otherwise?

Aha, I think I see your point here.  Any time you send something
outside your program, there are potentially subroutines munging your
data before it escapes.  Therefore you shouldn't try to calculate the
size of the output yourself, as you don't know what the size of your
output will be.  Therefore, you have to ask whatever's munging your
output how long a string will be when it's output.
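
To illustrate why the size has to come from whatever does the encoding, here is a small sketch (assuming the later Encode module; `octets_as` is a hypothetical helper) that measures the same six characters under three encodings.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: how many octets would $chars occupy in $encoding?
sub octets_as {
    my ($encoding, $chars) = @_;
    return length(encode($encoding, $chars));
}

# Six characters: two U+263A plus four ASCII letters.
my $chars = ("\x{263A}" x 2) . "abcd";

my %size = map { $_ => octets_as($_, $chars) } qw(UTF-8 UTF-16LE UTF-32LE);

printf "%-8s %d octets\n", $_, $size{$_} for sort keys %size;
# UTF-16LE 12 octets
# UTF-32LE 24 octets
# UTF-8    10 octets
```

length($chars) is 6 in every case; only the encoder knows the answer to the quota question.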

Oy.  Once again, I'm not sure "easy things easy" has been preserved.
Not that I have better suggestions for how to do it, mind.

> > Any encoding which hasn't yet been encoded in Encode?
> In that case, how did it ever get internally represented as utf8?

Someone gives me a UTF-8 string and says "produce Martian Normal Form
from this!"  Now that I think about this one more, I think it isn't a
problem after all.  If I know I'm getting UTF-8 given to me, then I
will only be concerned with characters, not bytes.  If I don't know
what I'm being given, I need to convert it to UTF-8 or some other
known character encoding first, and then deal with it char-by-char.
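
A sketch of that "convert first, then go char-by-char" step, assuming the later Encode module and assuming the incoming octets are known to be UTF-8. The sample input is invented for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Octets as they might arrive from outside: U+00E9 and "!" encoded as UTF-8.
my $octets = "\xc3\xa9!";

# Convert to a character string before doing any per-character work.
my $chars = decode('UTF-8', $octets);

# split // on a decoded string walks characters, not bytes.
my @codepoints;
for my $ch (split //, $chars) {
    push @codepoints, sprintf('U+%04X', ord($ch));
}

print "@codepoints\n";    # prints U+00E9 U+0021
```

Three octets come in, two characters come out, and everything after the decode deals only in characters.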

So I guess all the things one would want to do will still be
available, I'm just worried that they'll require lots of OO crap
to do.

Perhaps length() etc. should work with the currently selected
filehandle?  Oh man, what misery this opens.

Nat