develooper Front page | perl.perl5.porters | Postings from February 2001

Unicode - remaining Camel-III conflicts

February 24, 2001 12:52

I just re-read the Unicode chapter in Camel-III.

Here are my observations:

Camel calls out gethostbyaddr() as an example of an interface
that should 'downgrade' a UTF-8 string - but pp_sys.c doesn't 
do that.
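
A minimal sketch of what "downgrading" means for a byte-oriented interface such as gethostbyaddr(). It assumes a later perl (5.8 or newer) where utf8::upgrade and utf8::downgrade are available; the address value is made up for illustration, and the actual lookup is left commented out.

```perl
use strict;
use warnings;

# A packed IPv4 address: four octets, two of them above 127.
my $addr = pack('C4', 192, 168, 0, 1);

# Simulate the same string having been upgraded to perl's internal
# UTF-8 representation somewhere else in the program.
my $upgraded = $addr;
utf8::upgrade($upgraded);
my $was_upgraded = utf8::is_utf8($upgraded) ? 1 : 0;

# Downgrade by hand before handing the string to a byte-oriented call.
# The true second argument makes utf8::downgrade return false instead
# of dying if the string holds a character above 255.
my $ok = utf8::downgrade($upgraded, 1);

# The string is now four plain octets again and safe to pass on, e.g.:
# my $name = gethostbyaddr($upgraded, 2);   # 2 is AF_INET on most systems
```

The complaint in the post is that pp_sys.c does not do this step itself, so the caller has to.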

Camel says pack/unpack letters "c" and "C" "do not change".
So that endorses Karsten Sperling et al.'s view. Fine.

Unfortunately it also says:
"However there is a new "U" specifier that will convert between UTF-8
characters and integers:

  pack('U*',1,20,300,4000) eq v1.20.300.4000

Which explicitly contradicts my assertion that 'U' was intended to 
get perl's characters into a UTF-8 encoded octet sequence.
Oh well. 
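
The two readings of 'U' can be told apart by counting. The sketch below assumes a perl 5.8 or newer, where utf8::encode is available to produce the octet sequence explicitly; the variable names are illustrative only.

```perl
use strict;
use warnings;

# Camel's reading: 'U' builds a string of characters from integers.
my $chars = pack('U*', 1, 20, 300, 4000);
my $same  = ($chars eq v1.20.300.4000) ? 1 : 0;   # 1
my $nchars = length($chars);                      # 4 characters

# The other reading: the UTF-8 encoded octets of those characters.
my $octets = $chars;
utf8::encode($octets);                            # in-place, characters -> octets
my $noctets = length($octets);                    # 1 + 1 + 2 + 3 = 7 octets
```

Under Camel's definition the result is the four-character string; getting the seven octets takes a separate encoding step.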

Then we have 'use bytes'. In my opinion we have the transparency 
in a good enough state that 'use bytes' is unnecessary. 
My worry is that Camel encourages its use in some sense without 
clearly defining what it means.

The Camel says this about 'use bytes':

In this case you may put a use bytes declaration around the byte-oriented
code to force it to use byte semantics even on strings marked as utf8 
strings. You are then responsible for any necessary conversions.

The upshot of all this is that a typical builtin operator will operate
on characters unless it is in the scope of a use bytes pragma.


The use bytes pragma will never turn into a no-op. Not only is 
it necessary for byte-oriented code, but it also has the side effect 
of defining byte-oriented wrappers around certain functions for use outside
the scope of use bytes. As of this writing, the only wrapper is for 
length, but there are likely to be more as time goes by. To use such a 
wrapper say:

  use bytes (); # load wrappers without importing byte semantics
  $charlen =        length("\x{ffff_ffff}");  # returns 1 
  $charlen = bytes::length("\x{ffff_ffff}");  # returns 7

Outside the scope of a use bytes declaration, perl version 5.6 works
(or at least is intended to work) like this:

Which is wonderfully vague. The remainder of the chapter is devoted
to defining character semantics, but there is no definition of 
"byte semantics" there.
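
The length wrapper from the quoted passage can be tried with a smaller code point than Camel's, which avoids depending on how a particular perl handles very large values. This is a sketch rather than Camel's own example; chr(300) is chosen only because it needs two octets in UTF-8.

```perl
use strict;
use warnings;
use bytes ();   # load the wrappers without enabling byte semantics

my $s = chr(300);

my $charlen = length($s);          # 1: one character
my $bytelen = bytes::length($s);   # 2: two octets in the internal UTF-8 form
```

Note that bytes::length reports the size of the internal representation, which is the only working definition of "byte semantics" the example gives.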

But later it says (in the Functions chapter, p. 680):

Perl purposefully confuses bytes with characters in the scope of 
a use bytes declaration, so whenever we say character you should 
take it to mean byte in a use bytes context. In other words, use bytes
just wraps the definition of a character back to what it was in older
versions of Perl.

The last sentence seems to be the best on offer as to what "byte semantics"
is - and so defines what "use bytes" does.

But as it invokes "older versions of perl" as the reference it is 
not useful in describing what happens when a "use bytes" fragment
encounters a character larger than 255.

I would argue that in such a case the character would have been 
wrapped at the point of creation in "older perls".


my $s = chr(256);

{
  use bytes;
  $s .= 'A';  # must wrap all chars to 0..255
}

# $s should now be v0.65

This is however contradicted by the bytes::length discussion.
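
The "wrap to 0..255" reading can be written out by hand. The helper below, wrap_to_bytes, is hypothetical and exists only to illustrate the proposed semantics; it is not something 'use bytes' does.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce every character to its value modulo 256,
# which is what an "older perl" would have stored at creation time.
sub wrap_to_bytes {
    my ($str) = @_;
    return join '', map { chr(ord($_) % 256) } split //, $str;
}

my $s = wrap_to_bytes(chr(256)) . 'A';

my $is_expected = ($s eq v0.65) ? 1 : 0;   # 1: chr(0) followed by 'A'
my $len         = length($s);              # 2
```

Under this reading bytes::length of a single wrapped character would always be 1, which is where it conflicts with Camel's bytes::length example.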

The worry I have is that the Camel implies that adding 

use bytes;

makes your code "safer for byte oriented code". However, the 
current implementation exposes the _current_ representation
at the point of call. Thus 'use bytes' is _less_ safe than 
just letting perl do-the-right-thing and downgrade the representation.
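
A small sketch of how 'use bytes' exposes the representation. It assumes a perl 5.8 or newer for utf8::upgrade; the two strings hold the same single character and differ only in how perl happens to store them.

```perl
use strict;
use warnings;

my $native   = "\xE9";      # one character, stored as one octet
my $upgraded = "\xE9";
utf8::upgrade($upgraded);   # same character, stored internally as two octets

# Under character semantics the two strings are indistinguishable.
my $same = ($native eq $upgraded) ? 1 : 0;   # 1

my ($len_native, $len_upgraded);
{
    use bytes;
    # Byte semantics report whatever representation each string has now.
    $len_native   = length($native);     # 1
    $len_upgraded = length($upgraded);   # 2
}
```

Code inside the 'use bytes' block sees two different answers for what the rest of the program treats as one value, so its result depends on the string's history rather than its content.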

Nick Ing-Simmons
