On Tue, Feb 20, 2001 at 01:11:12AM +0100, abigail@foad.org wrote:
> On Mon, Feb 19, 2001 at 06:07:14PM -0500, Andrew Pimlott wrote:
> > On Mon, Feb 19, 2001 at 04:47:53PM -0600, Jarkko Hietaniemi wrote:
> > >
> > > As far "what is broken", I do understand the concern of "exposing too
> > > much of the internal representation" (which at the moment happens to
> > > be UTF-8) to the user, having bytes and character is confusing at
> > > best. However, I'm not fully convinced that completely hiding it is
> > > wise, either. If from Perl level one cannot reach back to the bytes
> > > comprising the UTF-8 representation of the characters, I feel we are
> > > trying to pad the cell too softly.
> >
> > My kingdom for one example.
>
> If you step out of the box, it's easy to come up with examples.

If you have time, can you be more concrete about your examples, for
example with code snippets? I have the idea that you're getting at
something, but I can't quite figure out what.

> When ever you need to interface with something that has no understanding
> of Unicode, for which everything is data, you want to be able to look
> bytewise to your strings. When talking to a serial device for instance,
> or a hard disk, whose capacity will be measured in bytes, not variable
> width characters. Device drivers might not be commonly written in Perl,
> but that doesn't mean it should be impossible.

But these bytes will not be treated as the bytes of a utf8 encoded
Perl string. Do you mean that you will require that perl represent
your byte strings as now, and throw an error if it ever felt the
need to represent them in utf8?

> But you don't have to go that low level. uuencode & base64 work with 8-bit
> bytes. Taking your Unicode string, looking at it as bytes, uuencode it,
> send it, receive it, uudecode it and looking at it again as Unicode will
> work - as long as you can get to the bytes representation.

Interesting example (cf as well the recent i18n of DNS); here are my
questions.

What if perl were currently representing your string in your local
encoding? How would you coerce it to utf8?

Would you expect coercion to happen automatically inside a block
with some pragma? Would there be an explicit function? Would there
be a magic incantation like the following?

    $str .= "\x{ffff}"; chop $str;

Would you accept a method that would fail if perl ever decided to
use UCS-4 as its wide string internal representation?

I tend to find it saner to do something like

    my $utf8str = to_utf8 $str;
    foreach my $byte (split //, $utf8str) { ... }

i.e., so that the Perl characters in $utf8str are not the same
characters as in $str; they are just byte values. Is your concern
merely efficiency, or am I missing something?

> A lot of existing compression and encryption software just look at the
> data to be compressed or encrypted as bit or byte streams. There is no
> reason to create Unicode aware versions of those tools before they can
> be used on Unicode data. But to create Perl programs that compresses or
> encrypts data that can be decompressed or decrypted with the existing
> tools, your Perl program needs to be able to look at the data as a
> sequence of bytes.

The big question (to me) is, why would perl ever think of these
strings as utf8 encoded in the first place? A rephrasing: You can
obviously write these programs with non-Unicode perl. What do you
expect could change in Unicode-enabled perl that could break them?

I absolutely expect Unicode-enabled perl to work well with bytes. I
expect that it would have an internal representation that would
efficiently hold strings in which each character value is <= 0xff.
What does this have to do with getting the bytes out of a possible
other representation?

Andrew