develooper Front page | perl.unicode | Postings from December 2004

Re: Make support the real UTF-8

Bjoern Hoehrmann
December 2, 2004 03:43
* Gisle Aas wrote:
>As you probably know perl's version of UTF-8 is not the real thing.  I
>thought I would hack up a patch to support the encoding as defined by
>Unicode.  That involves rejecting illegal chars (like surrogates,
>"\x{FFFF}" and "\x{FDD0}"), chars above 0x10FFFF, overlong sequences
>and such.

I would very much like to have this functionality available in some
standard module. What exactly do you mean by rejecting, though?
For example, by default, I would expect

  decode("UTF-8" => "Bj\xF6rn")

to return "Bj\x{FFFD}rn" as documented in `perldoc Encode`; would
this change (i.e., would it croak instead)?
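
To make the two behaviours concrete, here is a minimal sketch using
only the existing Encode interface (the CHECK argument with FB_CROAK
and LEAVE_SRC). It assumes an Encode whose "UTF-8" decoder substitutes
U+FFFD by default, as the documentation describes; the variable names
are just for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "Bj\xF6rn" is Latin-1; the lone \xF6 byte is not valid UTF-8.
my $octets = "Bj\xF6rn";

# Default CHECK (FB_DEFAULT): the malformed byte is replaced by U+FFFD.
my $lenient = decode("UTF-8", $octets);

# FB_CROAK makes decode() die instead; LEAVE_SRC keeps $octets intact.
my $strict = eval {
    decode("UTF-8", $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
};
my $croaked = !defined $strict;
```

With the default check, $lenient holds the replacement character in
place of the bad byte; with FB_CROAK the eval fails and $croaked is
true. The question is which of these a plain decode("UTF-8" => ...)
would do after the change.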

>Before I do this I would like to get some feedback on the interface.
>My preferred interface would be to make:
>   encode("UTF-8", $string)
>imply the official restricted form and then have
>   encode("UTF-8-Perl", $string)
>be used as the name for Perl's relaxed and extended version of the
>encoding.  The encode_utf8($string) function would continue to be the
>same as encode("UTF-8-Perl", $string).

I would prefer there was no semantic overloading of "UTF-8" at all.
I generally expect that anything called UTF-8 refers to UTF-8 as
defined in the Unicode standard or RFC 3629. I was, for example,
surprised that Encode::is_utf8(...) considers sequences to be UTF-8
that are not UTF-8 as defined in those specifications (the
documentation explicitly states "well-formed UTF-8").

Now that we have this problem, introducing more places where one needs
to carefully check the documentation to find out what is considered
UTF-8 does not seem like the best option. Having decode_utf8() and
decode(utf8=>...) mean something different is likely to cause
confusion. Maybe this could go the other way round, i.e. introduce a
new encoding "UTF-8-Strict" or something.

>This implies that encode("UTF-8", $string) can start failing while
>previously it could not.

As above, by default I do not think it should fail but rather use a
replacement character instead of croaking. The result should be the
same as (using RFC-3629-UTF-8 to mean the non-Perl UTF-8)

  encode("RFC-3629-UTF-8" => decode("RFC-3629-UTF-8" => $string))

where decode("RFC-3629-UTF-8") would always return an RFC-3629-UTF-8
string with no illegal sequences (and as that should not fail, the
above should not fail either). I.e.

  encode("RFC-3629-UTF-8" => $string) eq
  encode("RFC-3629-UTF-8" => decode("RFC-3629-UTF-8" => $string))

would always hold true (assuming that decode("RFC-3629-UTF-8") would
ignore that the UTF-8 flag on $string is already set and decode

>Other suggestions or comments?

There should be a corresponding is_foo function that checks whether
a sequence of octets (or a string with the UTF-8 flag set) is actually
UTF-8 as defined in the relevant specifications, maybe by adding one
more argument to Encode::is_utf8 like

  Encode::is_utf8($string, $perl_utf8_check, $real_utf8_check)
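
As a rough sketch of what such a check could look like for a string of
octets, the following transcribes the UTF-8 grammar from RFC 3629 into
a regular expression. The name is_rfc3629_utf8 is hypothetical and not
part of Encode. Note that this only tests well-formedness (no overlong
forms, no surrogates, nothing above U+10FFFF); it does not reject
noncharacters such as U+FFFF, which RFC 3629 itself does not forbid.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Encode: returns 1 if $octets is a
# well-formed UTF-8 byte sequence per the RFC 3629 grammar, else 0.
sub is_rfc3629_utf8 {
    my ($octets) = @_;
    return $octets =~ /\A(?:
          [\x00-\x7F]                          # ASCII
        | [\xC2-\xDF][\x80-\xBF]               # 2 bytes, no overlongs
        | \xE0[\xA0-\xBF][\x80-\xBF]           # 3 bytes, no overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # 3 bytes, general
        | \xED[\x80-\x9F][\x80-\xBF]           # 3 bytes, no surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}        # 4 bytes, no overlongs
        | [\xF1-\xF3][\x80-\xBF]{3}            # 4 bytes, general
        | \xF4[\x80-\x8F][\x80-\xBF]{2}        # 4 bytes, at most 10FFFF
    )*\z/x ? 1 : 0;
}
```

For example, is_rfc3629_utf8("Bj\xC3\xB6rn") is true, while the
Latin-1 octets "Bj\xF6rn", the overlong "\xC0\x80" and the encoded
surrogate "\xED\xA0\x80" are all rejected.
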
Björn Höhrmann · ·
Weinh. Str. 22 · Telefon: +49(0)621/4309674 ·
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · 
