develooper Front page | perl.perl5.porters | Postings from March 2007


Juerd Waalboer
March 3, 2007 15:30
Hi all,

While I'm working on updating Perl documentation wrt Unicode and UTF8,
one of the biggest problems is that perlunicode itself seems to be
written in the future tense, while that future is already the current
situation:

> In future, Perl-level operations will be expected to work with
> characters rather than bytes.

Nearly all Perl level operations already work with characters rather
than bytes. (Though byte == character in byte strings, of course.)
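A minimal sketch of what that looks like in practice, assuming the text
arrives as UTF-8 encoded bytes and is decoded with the core Encode
module. The sample word and the variable names are only illustrative:

```perl
use strict;
use warnings;
use Encode qw(decode);

# "naïve" as it might arrive from a file or socket: UTF-8 encoded bytes.
my $bytes = "na\xC3\xAFve";

# Decoding turns the byte string into a text (character) string.
my $text = decode('UTF-8', $bytes);

my $byte_len = length $bytes;        # 6: the byte string has 6 bytes
my $char_len = length $text;         # 5: the text string has 5 characters
my $third    = substr $text, 2, 1;   # the single character U+00EF
my $reversed = reverse $text;        # reversed per character, not per byte
```

On the decoded string, length, substr and reverse all count characters.
On the undecoded string they count bytes, which is the same thing as
characters there.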

> (about utf8)
> when character semantics become the default, this pragma may become a
> no-op.

The pragma now indicates that the /source/ is in utf8. This has nothing
to do with character semantics. Perl (5) is highly unlikely to ever
switch to utf8 by default, for Perl source code, because utf8 cannot be
safely distinguished from latin1, and there is a lot of latin1 code in
the wild.
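A small sketch of both points, assuming the script file itself is saved
as UTF-8 (if it is saved in another encoding the numbers will differ).
The variable names are only illustrative:

```perl
use strict;
use warnings;
use Encode qw(decode);

my ($len_without, $len_with, $codepoint);
{
    no utf8;                       # literals are taken as raw bytes
    $len_without = length "€";     # 3: the bytes E2 82 AC
}
{
    use utf8;                      # the source text is decoded as UTF-8
    $len_with  = length "€";       # 1: the character U+20AC
    $codepoint = ord "€";          # 0x20AC
}

# The same two bytes are valid in both encodings, so Perl cannot tell
# from the bytes alone which one the author of the code intended.
my $raw       = "\xC3\xA9";
my $as_latin1 = decode('ISO-8859-1', $raw);   # 2 characters: U+00C3 U+00A9
my $as_utf8   = decode('UTF-8', $raw);        # 1 character:  U+00E9
```

The pragma only changes how the literals in the source are read. The
operators applied to them are the same in both blocks.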

> Unless explicitly stated, Perl operators use character semantics
> for Unicode data and byte semantics for non-Unicode data.

This is inherently wrong. All text strings are Unicode strings. Even a
string that is still encoded as latin1 internally is Unicode. Codepoint
65 is "A", just as in US-ASCII and in ISO-8859-1, but that doesn't mean
it's not Unicode.

It's not "Unicode data", but the "UTF8 flag", that makes Perl decide
differently. And that means it's "data that is internally encoded as
UTF-8".

This sentence could be changed to:

| Unless explicitly stated, Perl operators use character semantics for
| strings that are internally encoded as UTF-8, and byte semantics for
| strings that are not.

However, since byte == character when a string is not encoded as
UTF-8, this sentence is effectively a no-op, and it would suffice to
write:

| Perl operators use character semantics.
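To make the UTF8 flag mentioned above concrete, here is a rough sketch
using the always-available utf8::upgrade and utf8::is_utf8 functions.
It only checks length, eq and ord, so it illustrates the idea and does
not prove it for every operator. The sample string is illustrative:

```perl
use strict;
use warnings;

my $latin1   = "caf\xE9";      # 4 characters, stored internally as 4 bytes
my $upgraded = $latin1;
utf8::upgrade($upgraded);      # same characters, now stored as UTF-8

my $flag_before = utf8::is_utf8($latin1)   ? 1 : 0;   # 0: flag is off
my $flag_after  = utf8::is_utf8($upgraded) ? 1 : 0;   # 1: flag is on

my $same_string = $latin1 eq $upgraded ? 1 : 0;       # 1: still equal
my $len_before  = length $latin1;                     # 4
my $len_after   = length $upgraded;                   # 4
my $last_char   = ord substr($upgraded, -1);          # 0xE9 either way

# Only when explicitly asking for the internal bytes does the
# difference in storage show up.
my $internal_len;
{
    use bytes;
    $internal_len = length $upgraded;                 # 5
}
```

The flag describes how the string is stored, not whether it is Unicode:
both variables hold the same four characters.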

My feeling is that perlunicode.pod is a bit outdated in some parts, and
uses suboptimal jargon in some others.

I'm wondering if it's worth updating, or if rewriting makes more sense.

For now, I'm skipping perlunicode.

Kind regards,

  juerd waalboer:  perl hacker
  convolution:     ict solutions and consultancy

I do not trust voting computers.
