
[PATCH] unicode/utf8 pod

From: Juerd Waalboer
Date: March 3, 2007 16:27
Subject: [PATCH] unicode/utf8 pod
Message ID: 20070304002549.GE4723@c4.convolution.nl
Here's my work in progress.

Attached is the clean diff. In the message body is an annotated version.

> -    print v9786;              # prints UTF-8 encoded SMILEY, "\x{263a}"
> +    print v9786;              # prints SMILEY, "\x{263a}"

The encoding for output depends on the effective layers.

Outputting "wide characters" without specifying an encoding is
considered wrong, and indeed does emit a warning.
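
For illustration (my example; the exact warning text may depend on the
Perl version):

    use warnings;

    print "\x{263a}\n";      # "Wide character in print" warning; Perl
                             # guesses and emits UTF-8

    binmode STDOUT, ':encoding(UTF-8)';   # declare the output encoding
    print "\x{263a}\n";      # no warning; encoded as requested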

> -(S utf8) (F) Perl detected something that didn't comply with UTF-8
> -encoding rules.
> +(S utf8) (F) Perl detected a string that didn't comply with UTF-8
> +encoding rules, even though it had the UTF8 flag on.

Indicates that the problem has nothing to do with UTF8 *byte* strings.

> -One possible cause is that you read in data that you thought to be in
> -UTF-8 but it wasn't (it was for example legacy 8-bit data).  Another
> -possibility is careless use of utf8::upgrade().
> +One possible cause is that you set the UTF8 flag yourself for data that
> +you thought to be in UTF-8 but it wasn't (it was for example legacy
> +8-bit data). To guard against this, you can use Encode::decode_utf8.

I couldn't think of, or find, any situation in which utf8::upgrade would
cause malformed UTF8.

So instead, I documented some common causes for this error message. In
general, seeing this message means that you shouldn't have set the UTF8
flag manually or with :utf8.

> +If you use the C<:encoding(UTF-8)> PerlIO layer for input, invalid byte
> +sequences are handled gracefully, but if you use C<:utf8>, the flag is
> +set without validating the data, possibly resulting in this error
> +message.

Many documents use :utf8, but that blindly sets the UTF8 flag,
regardless of the actual data.
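
To illustrate the difference (hypothetical file name; not part of the
patch):

    # :encoding(UTF-8) decodes the bytes and validates them on the way in.
    open my $ok, '<:encoding(UTF-8)', 'input.txt' or die $!;

    # :utf8 merely flags whatever bytes are read as UTF-8, without
    # checking, so invalid data only blows up later, far from where it
    # was read.
    open my $risky, '<:utf8', 'input.txt' or die $!;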

> +See also L<Encode/"Handling Malformed Data">.

Might come in handy :)
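
For example (a sketch; the invalid byte sequence is made up):

    use Encode qw(decode);

    my $bytes = "\xc3\x28";    # not valid UTF-8

    # Default handling replaces malformed sequences with U+FFFD ...
    my $lenient = decode('UTF-8', $bytes);

    # ... while Encode::FB_CROAK makes decoding die on the first bad one.
    my $strict = eval { decode('UTF-8', $bytes, Encode::FB_CROAK) };
    warn "input is not valid UTF-8: $@" if $@;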

> -To mark FILEHANDLE as UTF-8, use C<:utf8>.
> +To mark FILEHANDLE as UTF-8, use C<:utf8>. This does not validate the
> +data, and invalid UTF-8 sequences will cause errors later on;
> +C<:encoding(UTF-8)> is a safer (but slightly less efficient) choice.

Again, :utf8 is often a bad idea.

> -chr(0x263a) is a Unicode smiley face.  Note that characters from 128
> -to 255 (inclusive) are by default not encoded in UTF-8 Unicode for
> -backward compatibility reasons (but see L<encoding>).
> +chr(0x263a) is a Unicode smiley face.  
> (...)
> -Note that under the C<bytes> pragma the NUMBER is masked to
> -the low eight bits.
> +Note that characters from 128 to 255 (inclusive) are by default
> +internally not encoded as UTF-8 for backward compatibility reasons (but
> +see L<encoding>).

Moved the encoding part to the end, and rephrased it (keyword:
"internally"). Removed the part about the bytes pragma, because I think
it's better documented in bytes.pm only. (I believe "use bytes" is very
commonly a bad idea.)

>  Note the I<characters>: if the EXPR is in Unicode, you will get the
>  number of characters, not the number of bytes.  To get the length
> -in bytes, use C<do { use bytes; length(EXPR) }>, see L<bytes>.
> +of the internal string in bytes, use C<bytes::length(EXPR)>, see
> +L<bytes>.  Note that the internal encoding is variable, and the number
> +of bytes usually meaningless.  To get the number of bytes that the
> +string would have when encoded as UTF-8, use
> +C<length(Encode::encode_utf8(EXPR))>.

Explicitly mentions a common myth/misunderstanding, and removes one of
the offending sources for it.
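
For example (my illustration, not part of the patch), the character
count and the UTF-8 byte count of the same string:

    use Encode qw(encode_utf8);

    my $str = "caf\x{e9}";                   # 4 characters
    print length($str), "\n";                # 4, regardless of internal storage
    print length(encode_utf8($str)), "\n";   # 5: bytes of the UTF-8 encoded form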

> -  open(FH, "<:utf8", "file")
> +  open(FH, "<:encoding(UTF-8)", "file")

Again, :utf8 should only be used for reading when you're absolutely
certain that your data is valid UTF-8. And even then, it's probably
still a premature optimization.

> -The string should not contain any character with the value > 255 (which
> -can only happen if you're using UTF-8 encoding).  If it does, it will be
> -treated as something that is not UTF-8 encoded.  When the C<vec> was
> -assigned to, other parts of your program will also no longer consider the
> -string to be UTF-8 encoded.  In other words, if you do have such characters
> -in your string, vec() will operate on the actual byte string, and not the
> -conceptual character string.
> +If the string happens to be encoded as UTF-8 internally (and thus has
> +the UTF8 flag set), this is ignored by C<vec>, and it operates on the
> +internal byte string, not the conceptual character string, even if you
> +only have characters with values less than 256. 

The original documentation for vec is just wrong. It's not about the
values of the characters in the string, but about the UTF8 flag, which
might be set because those values I<used to be> there.
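
One way to stay out of trouble (my suggestion, not part of the patch) is
to downgrade the string before handing it to vec(), which is only
possible when every character is below 256 anyway:

    my $str = "caf\x{e9}";          # may or may not carry the UTF8 flag

    # Force one byte per character internally (dies if a character >= 256
    # is present), so vec() sees the bytes you expect.
    utf8::downgrade($str);
    print vec($str, 3, 8), "\n";    # 233: the byte for "\x{e9}"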

> -a variable number of bytes to represent a character, instead of just
> -one. You can learn more about Unicode at http://www.unicode.org/
> +a variable number of bytes to represent a character. You can learn more
> +about Unicode and Perl's Unicode model in L<perlunicode>.

A lot of relevant information about Unicode is already in our docs, and
it's much easier to grok than the info at unicode.org.

> -The API function C<is_utf8_string> can help; it'll tell you if a string
> -contains only valid UTF-8 characters. However, it can't do the work for
> -you. On a character-by-character basis, C<is_utf8_char> will tell you
> -whether the current character in a string is valid UTF-8.
> +In general, you either have to know what you're dealing with, or you
> +have to guess. 

Explicitly makes clear that there's no way to automatically detect the
encoding and be sure.
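
One common way to guess (this heuristic is my own illustration, not
something the patch prescribes):

    use Encode qw(decode);

    my $octets = read_raw_input();   # hypothetical: the raw input bytes

    # Try strict UTF-8 first; fall back to latin-1, which always decodes.
    my $text = eval { decode('UTF-8', $octets, Encode::FB_CROAK) };
    $text = decode('ISO-8859-1', $octets) if $@;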

> -character. Characters with values 1...128 are stored in one byte, just
> -like good ol' ASCII. Character 129 is stored as C<v194.129>; this
> +character. Characters with values 0...127 are stored in one byte, just
> +like good ol' ASCII. Character 128 is stored as C<v194.128>; this

Need I say anything about this? Let's just forget it ever happened ;)

>  Currently, Perl deals with Unicode strings and non-Unicode strings
> -slightly differently. If a string has been identified as being UTF-8
> -encoded, Perl will set a flag in the SV, C<SVf_UTF8>. You can check and
> -manipulate this flag with the following macros:
> +slightly differently. A flag in the SV, C<SVf_UTF8>, indicates that the
> +string is internally encoded as UTF-8. Without it, the byte value is the
> +codepoint number and vice versa (in other words, the string is encoded
> +as iso-8859-1). You can check and manipulate this flag with the
> +following macros:

"identified as being UTF-8" suggests that we can tell. In reality, we
have to tell Perl.
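
From Perl space, the same flag and the iso-8859-1/UTF-8 duality can be
observed with Devel::Peek (my illustration, not part of the patch):

    use Devel::Peek qw(Dump);

    my $str = "\xe9";        # byte 0xE9 == codepoint 0xE9; no UTF8 flag
    Dump($str);              # no UTF8 in FLAGS; PV holds the single byte 0xE9

    utf8::upgrade($str);     # same character, now stored as UTF-8
    Dump($str);              # UTF8 now in FLAGS; PV holds 0xC3 0xA9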

>  The problem comes when you have, for instance, a string that isn't
> -flagged is UTF-8, and contains a byte sequence that could be UTF-8 -
> +flagged as UTF-8, and contains a byte sequence that could be UTF-8 -
>  especially when combining non-UTF-8 and UTF-8 strings.

Typo.

>  =head2 How do I convert a string to UTF-8?
> -If you're mixing UTF-8 and non-UTF-8 strings, you might find it necessary
> -to upgrade one of the strings to UTF-8. If you've got an SV, the easiest
> -way to do this is:
> +If you're mixing UTF-8 and non-UTF-8 strings, it is necessary to upgrade
> +one of the strings to UTF-8. If you've got an SV, the easiest way to do
> +this is:

"might find it necessary" suggests that not having to upgrade is the
common situation, while actually, I couldn't think of *any* mixed
latin1/utf8 situation where upgrading isn't needed.

> -by the end user, it can cause problems.
> +by the end user, it can cause problems in deficient code.

Indeed Perl porters should be careful with upgrading, but at least
indicate that the end user can actually do something about the code, by
fixing it.

Code broken by automatic upgrading is generally code that fails to
decode input or encode output.
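
For instance (my example), code that treats strings as character strings
keeps working when Perl upgrades one of them behind the scenes; only
code that peeks at the internal bytes breaks:

    my $latin = "caf\xe9";       # stored as single bytes
    my $wide  = "\x{263a}";      # stored as UTF-8

    my $both = $latin . $wide;   # $latin is upgraded automatically
    print length($both), "\n";   # 5, as expected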

>  point of view) characters in a single byte while encoding the rarer
>  ones in three or more bytes.
> -So what has this got to do with C<pack>? Well, if you want to convert
> -between a Unicode number and its UTF-8 representation you can do so by
> -using template code C<U>. As an example, let's produce the UTF-8
> -representation of the Euro currency symbol (code number 0x20AC):
> +Perl uses UTF-8, internally, for most Unicode strings.
> +So what has this got to do with C<pack>? Well, if you want to compose a
> +Unicode string (that is internally encoded as UTF-8), you can do so by
> +using template code C<U>. As an example, let's produce the Euro currency
> +symbol (code number 0x20AC):

pack "U" gives you the *unicode* string, not the *UTF-8* string. Yes,
internally it'll be UTF-8, but conceptually, it's encodingless unicode,
and the user needs to explicitly request encoding.

>     $UTF8{Euro} = pack( 'U', 0x20AC );
> +   # Equivalent to: $UTF8{Euro} = "\x{20ac}";

Because it is.

> -Inspecting C<$UTF8{Euro}> shows that it contains 3 bytes: "\xe2\x82\xac". The
> -round trip can be completed with C<unpack>:
> +Inspecting C<$UTF8{Euro}> shows that it contains 3 bytes:
> +"\xe2\x82\xac". However, it contains only 1 character, number 0x20AC.
> +The round trip can be completed with C<unpack>:

Stress that it's one character.

> +Unpacking using the C<U> template code also works on UTF-8 encoded byte
> +strings.

Because it does.

> +Please note: in the general case, you're better off using
> +Encode::decode_utf8 to decode a UTF-8 encoded byte string to a Perl
> +unicode string, and Encode::encode_utf8 to encode a Perl unicode string
> +to UTF-8 bytes. These functions provide means of handling invalid byte
> +sequences and generally have a friendlier interface.

Advises against using pack when the "normal" functions suffice.
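
For example (a sketch, not part of the patch):

    use Encode qw(encode_utf8 decode_utf8);

    my $char  = "\x{20ac}";             # 1 character: the Euro sign
    my $bytes = encode_utf8($char);     # "\xe2\x82\xac", 3 bytes
    my $again = decode_utf8($bytes);    # back to 1 character

    print length($char), " ", length($bytes), " ", length($again), "\n";  # 1 3 1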

>  standard ASCII character set.  Perl now supports I<Unicode>, a standard
>  for representing the alphabets from virtually all of the world's written
> -languages, and a host of symbols.  Perl uses the UTF-8 encoding, in which 
> -ASCII characters are still encoded as one byte, but characters greater 
> -than C<chr(127)> may be stored as two or more bytes.
> +languages, and a host of symbols.  Perl's text strings are unicode strings, so
> +they can contain characters with a value (codepoint or character number) higher
> +than 255

Conceptually, we have unicode strings. Internal representation is
irrelevant at this point, and utterly confusing to the newcomer.

>  What does this mean for regexps? Well, regexp users don't need to know
>  much about Perl's internal representation of strings.  But they do need
> -to know 1) how to represent Unicode characters in a regexp and 2) when
> -a matching operation will treat the string to be searched as a
> -sequence of bytes (the old way) or as a sequence of Unicode characters
> -(the new way).  The answer to 1) is that Unicode characters greater
> -than C<chr(127)> may be represented using the C<\x{hex}> notation,
> -with C<hex> a hexadecimal integer:
> +to know 1) how to represent Unicode characters in a regexp and 2) that
> +a matching operation will treat the string to be searched as a sequence
> +of characters, not bytes.  The answer to 1) is that Unicode characters
> +greater than C<chr(255)> are represented using the C<\x{hex}> notation,
> +because the \0 octal and \x hex (without curly braces) don't go further
> +than 255.

\xab and \x{ab} are the same. Don't suggest that they're significantly
different...

And re 2), the regex engine always works on characters; even when it
operates on latin1 data, a byte is a character there, so the distinction
matters even less.
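
To illustrate both points (my example, not from the patch):

    my $str = "smiley: \x{263a}";

    print "matched\n" if $str =~ /\x{263a}/;   # matching is per character
    print "same\n"    if "\xab" eq "\x{ab}";   # \xAB and \x{AB} are the same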

> -Unicode characters in the range of 128-255 use two hexadecimal digits
> -with braces: C<\x{ab}>.  Note that this is in general different than
> -C<\xab>, which is just a hexadecimal byte with no Unicode significance,
> -except when your script is encoded in UTF-8 where C<\xab> has the
> -same byte representation as C<\x{ab}>.

More \x stuff.

> -The answer to requirement 2), as of 5.6.0, is that if a regexp
> -contains Unicode characters, the string is searched as a sequence of
> -Unicode characters.  Otherwise, the string is searched as a sequence of
> -bytes.  If the string is being searched as a sequence of Unicode
> -characters, but matching a single byte is required, we can use the C<\C>
> -escape sequence.  C<\C> is a character class akin to C<.> except that
> -it matches I<any> byte 0-255.  So
> -
> -    use charnames ":full"; # use named chars with Unicode full names
> -    $x = "a";
> -    $x =~ /\C/;  # matches 'a', eats one byte
> -    $x = "";
> -    $x =~ /\C/;  # doesn't match, no bytes to match
> -    $x = "\N{MERCURY}";  # two-byte Unicode character
> -    $x =~ /\C/;  # matches, but dangerous!
> -
> -The last regexp matches, but is dangerous because the string
> -I<character> position is no longer synchronized to the string I<byte>
> -position.  This generates the warning 'Malformed UTF-8
> -character'.  The C<\C> is best used for matching the binary data in strings
> -with binary data intermixed with Unicode characters.

"Unicode characters"? All latin1 characters are unicode characters too!
This cannot be used to discriminate.

\C should not be used. It's painful. It should especially not be
mentioned in a *tutorial*.

> +The answer to requirement 2), as of 5.6.0, is that a regexp uses unicode
> +characters. Internally, this is encoded to bytes using either UTF-8 or a
> +native 8 bit encoding, depending on the history of the string, but
> +conceptually it is a sequence of characters, not bytes. See
> +L<perlunitut> for a tutorial about that.

Rephrased 2), more in sync with current jargon.

> -Let us now discuss the rest of the character classes.  Just as with
> +Let us now discuss Unicode character classes.  Just as with Unicode

Character classes were discussed many, many paragraphs ago, so now
talking about "the rest" is a bit weird.

> +People who want to learn to use Unicode in Perl, should probably read
> +L<the Perl Unicode tutorial|perlunitut> before reading this reference
> +document.

Advertise my own tutorial. No, really, beginners are frightened by the
heap of technical information in perlunicode. That's why I wrote
perlunitut in the first place :)

>  The regular expression compiler produces polymorphic opcodes.  That is,
>  the pattern adapts to the data and automatically switches to the Unicode
> -character scheme when presented with Unicode data--or instead uses
> -a traditional byte scheme when presented with byte data.
> +character scheme when presented with data that is internally encoded in
> +UTF-8 -- or instead uses a traditional byte scheme when presented with
> +byte data.

"Unicode data" again.

>  C<$^V eq v5.6.0>.  Note that the characters in this string value can
> -potentially be in Unicode range.
> +potentially be greater than 255.

latin1 is in the "Unicode range" too. Much like "unicode data", this is
too vague.

And in fact, the characters can even be way outside the unicode range,
because Perl doesn't care:

    v65.999999999.65


Still to do:

- perlunicode
- perluniintro
- perlunitut
- utf8
- bytes
- Encode
- encoding
-- 
kind regards,

  juerd waalboer:  perl hacker  <juerd@juerd.nl>  <http://juerd.nl/sig>
  convolution:     ict solutions and consultancy <sales@convolution.nl>

I don't trust voting computers.
See <http://www.wijvertrouwenstemcomputersniet.nl/>.
