Re: [PATCH] Improved hibit text literals

Larry Wall
February 10, 2000 18:35
Gisle Aas writes:
: This patch relative to 5.5.650 makes perl do the right thing for
: literals containing hibit characters.  The following behaviour will
: change if you apply this patch:
:     - a \x{} escape will not force the UTF8 flag on, unless the value
:       is actually higher than \xFF.


:     - the "\xff will produce malformed UTF-8 character; use \x{ff}"
:       warning is gone, since we now always do the right thing :-)


:     - under 'use utf8', hibit chars that are illegal utf8 are encoded
:       using utf8; basically this automatically turns latin1 into utf8.
:       This ensures that there will never be illegal UTF8 sequences in
:       a literal string that has the UTF8 flag set.

I know I originally put in the comment, "could cvt latin-1 to utf8
here", but I'm currently thinking that if a file has utf8 mixed with
latin-1, it's probably already in serious trouble by the time it gets
to the latin-1, so it had probably better croak.  Especially if the
filehandle was implicitly put into utf8 mode by thinking it saw utf8
earlier, when in fact it only saw bizarre latin-1.  The better approach
is to make them go back and insert "use charset 'latin-1'" or some such
at the beginning.
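
To make the "croak rather than convert" idea concrete, here is a minimal
sketch of strict UTF-8 checking.  It is not what the patch does: it
assumes a later perl that ships the Encode module, and the helper name
strict_utf8 is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: return the decoded string, or undef if the bytes
# are not well-formed UTF-8.  FB_CROAK makes decode() die on bad input;
# LEAVE_SRC stops it from modifying the caller's buffer.
sub strict_utf8 {
    my ($bytes) = @_;
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $text;
}

my $utf8_bytes   = "caf\xC3\xA9";   # e-acute encoded as two UTF-8 bytes
my $latin1_bytes = "caf\xE9";       # e-acute as one bare Latin-1 byte

print defined strict_utf8($utf8_bytes)   ? "valid\n" : "malformed\n";  # valid
print defined strict_utf8($latin1_bytes) ? "valid\n" : "malformed\n";  # malformed
```

The Latin-1 byte \xE9 looks like the start of a multi-byte UTF-8 sequence
with nothing after it, which is why a strict reader can refuse it
instead of guessing.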

:     - Octal escapes like \400 and \777 will actually do the right thing now.
:       Previously you only got the low 8 bits.

Hmm.  An argument could be made that those should be illegal, though
I don't know that I want to make it.
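
For reference, the arithmetic behind "only the low 8 bits" can be
sketched as below.  This only computes the numbers involved; it does not
reproduce the tokenizer, and the helper name octal_escape_values is
invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: for the digits of an octal escape, return the
# full value and the value a parser would keep if it truncated to 8 bits.
sub octal_escape_values {
    my ($digits) = @_;
    my $value = oct($digits);
    return ($value, $value & 0xFF);
}

for my $digits (qw(377 400 777)) {
    my ($value, $low8) = octal_escape_values($digits);
    printf "\\%s = %d (0x%X), low 8 bits = 0x%02X\n",
        $digits, $value, $value, $low8;
}
# \377 = 255 (0xFF), low 8 bits = 0xFF
# \400 = 256 (0x100), low 8 bits = 0x00
# \777 = 511 (0x1FF), low 8 bits = 0xFF
```

So under truncation \400 collapses to a NUL and \777 becomes
indistinguishable from \377.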

: But, it still looks like the \N{} support will not work as it is
: now. It never sets the UTF8 flag on the string by itself.

Well, it should resolve to a character that's either above \xFF or not,
so it seems conceptually simple.  But I have to confess to not
understanding the \N code at all:

    print "\N{WHITE SMILING FACE}";

    constant(\N{...}): %^H is not localized at - line 2, within string

Talk about obscure error messages!  I think it means that \N will need
to be taught about pulling in the Unicode names by default.  Previously,
I think it assumed the Unicode names would come in with a "use utf8",
but that's going away, so we need to make it the default if \N doesn't
otherwise recognize its name, I imagine.
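
As a sketch of what pulling in the Unicode names explicitly looks like,
assuming the charnames pragma is the piece that supplies them (the
wrapper sub smiley_codepoint is made up for the example):

```perl
use strict;
use warnings;
use charnames ':full';   # load the full Unicode name table for \N{...}

# Hypothetical wrapper: resolve the name and return its code point.
sub smiley_codepoint {
    my $char = "\N{WHITE SMILING FACE}";
    return ord($char);
}

my $cp = smiley_codepoint();
printf "U+%04X is %s \\xFF\n", $cp, ($cp > 0xFF ? "above" : "at or below");
# U+263A is above \xFF
```

Since this name resolves to a code point above \xFF, it is one of the
cases where the resulting string would need the UTF8 flag.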

But thanks!  It's easy to sit on the sidelines and carp, but we need more
real code whackers like you.

