Re: should "use byte" be "use bytes"?

From: Gisle Aas
Date: February 11, 2000 11:28
Subject: Re: should "use byte" be "use bytes"?
Message ID: m33dqzmtdg.fsf@eik.g.aas.no

Larry Wall <larry@wall.org> writes:

> There is, however, a difference between code that is explicitly in the
> scope of a "use bytes" and code that is implicitly binary.  And that
> difference lies in how utf8 data from other modules is treated.  The
> default assumption is that any data that has been marked as utf8 should
> be treated as utf8, so an implicitly binary script will try to do
> the right thing, such as promoting binary/latin-1 strings to utf8
> if you concatenate it with utf8, for instance.
> 
> But in the scope of "use bytes", no such promotion happens, because
> "use bytes" basically says, "I don't care if the string is marked as
> utf8 or not, just treat it as a bucket of bits."  So if you concatenate
> a latin-1 string with a utf8 string, you'll get nonsense.  But that's
> your problem, just as it is in old Perl.  The "use bytes" declaration
> indicates you're willing to accept that responsibility.

I think this is the wrong thing to do.  My thinking is that 'use byte'
should never expose the internal UTF8 encoding.  It should be possible
to switch to UCS4 encoding internally without any user visible change.

I suggest that if you operate on a UTF8-flagged string in 'use byte'
scope, then the string should be downgraded to a binary/latin-1 string
if it doesn't contain any chars with ord() > 255, and we should croak
if it contains chars that can't be represented in 8 bits.  We should
never end up with UTF8-flagged strings that contain illegal UTF8
sequences.

There should never be any user visible difference between two strings
containing the same logical chars, just because one has the UTF8-flag
set and the other has not.  This makes perl free to internally convert
between UTF8 and binary representation for strings when it feels like
it.  I think this freedom is needed.

What I am saying is that functions like length() and substr() should
not change behaviour at all when in 'use byte' scope, but they might
croak if they try to operate on chars with ord() > 255.  For string
concatenation we will also downgrade UTF8 operands and croak if that
is not possible.

Downgrading could happen with a function like this (untested).

SV*
sv_utf8_off(SV* sv)
{
    if (SvPOK(sv) && SvUTF8(sv)) {
        char *c = SvPVX(sv);
        char *first_hi = 0;
        /* need to figure out if this is possible at all first */
        while (c < SvEND(sv)) {
            if (*c & 0x80) {
                I32 len;
                UV uv = utf8_to_uv((U8*)c, &len);
                if (uv > 255)
                    croak("Big byte");
                if (!first_hi)
                    first_hi = c;
                c += len;
            }
            else {
                c++;
            }
        }

        if (first_hi) {
            char *src = first_hi;
            char *dst = first_hi;
            while (src < SvEND(sv)) {
                if (*src & 0x80) {
                    I32 len;
                    U8 u = (U8)utf8_to_uv((U8*)src, &len);
                    *dst++ = u;
                    src += len;
                }
                else {
                    *dst++ = *src++;
                }
            }
            *dst = '\0';
            SvCUR_set(sv, dst - SvPVX(sv));
        }
        SvUTF8_off(sv);
    }
    return sv;
}
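
The two-pass idea above (check first, then rewrite in place) can be
tried outside of perl.  What follows is only a minimal standalone
sketch under stated assumptions: it works on a plain byte buffer
instead of an SV, the name utf8_downgrade_inplace is hypothetical, and
it reports failure with a return value where the function above would
croak.

```c
#include <stddef.h>

/* Hypothetical helper, not part of perl: downgrade a UTF-8 buffer to
 * Latin-1 in place.  Code points 128..255 are always encoded as a lead
 * byte 0xC2 or 0xC3 followed by one continuation byte, so any other
 * high byte means "does not fit in 8 bits" or "malformed".
 * Returns the new length, or (size_t)-1 with the buffer untouched. */
size_t
utf8_downgrade_inplace(unsigned char *buf, size_t len)
{
    size_t i = 0, out = 0;

    /* Pass 1: figure out if the downgrade is possible at all. */
    while (i < len) {
        if (buf[i] < 0x80) {
            i++;
        }
        else if ((buf[i] == 0xC2 || buf[i] == 0xC3)
                 && i + 1 < len && (buf[i + 1] & 0xC0) == 0x80) {
            i += 2;
        }
        else {
            return (size_t)-1;
        }
    }

    /* Pass 2: rewrite in place; the write index never passes the
     * read index because the output is never longer than the input. */
    i = 0;
    while (i < len) {
        if (buf[i] < 0x80) {
            buf[out++] = buf[i++];
        }
        else {
            buf[out++] = (unsigned char)(((buf[i] & 0x03) << 6)
                                         | (buf[i + 1] & 0x3F));
            i += 2;
        }
    }
    return out;
}
```

Given the five bytes 'c' 'a' 'f' 0xC3 0xA9, the sketch returns 4 and
leaves 0xE9 in the fourth position.  Given the three-byte encoding of
U+20AC, it returns (size_t)-1 and leaves the buffer alone, which is the
case where the proposal would croak.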

I also think that perl ought to expose some way to explicitly
decode/encode a string as UTF8.  Perhaps some new pack()-letters can
be used for that?  This is also untested code.

SV*
sv_utf8_encode(SV *sv)
{
    if (SvPOK(sv)) {
        if (SvUTF8(sv)) {
            /* already UTF8 internally; just expose the bytes */
            SvUTF8_off(sv);
        }
        else {
            int hicount = 0;
            char *c;
            for (c = SvPVX(sv); c < SvEND(sv); c++) {
                if (*c & 0x80)
                    hicount++;
            }
            if (hicount) {
                char *src, *dst;
                /* each high char grows from one byte to two */
                SvGROW(sv, SvCUR(sv) + hicount + 1);

                src = SvEND(sv) - 1;
                SvCUR_set(sv, SvCUR(sv) + hicount);
                dst = SvEND(sv) - 1;
                *SvEND(sv) = '\0';

                /* expand from the end so nothing is overwritten early */
                while (src < dst) {
                    if (*src & 0x80) {
                        dst--;
                        uv_to_utf8((U8*)dst, (U8)*src--);
                        dst--;
                    }
                    else {
                        *dst-- = *src--;
                    }
                }
            }
        }
    }
   return sv;
}
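
The back-to-front expansion used above can also be checked in
isolation.  This is a rough standalone sketch, not perl API: the name
latin1_to_utf8_inplace and the explicit capacity argument are
assumptions standing in for the SV and SvGROW, and a too-small buffer
is reported by return value.

```c
#include <stddef.h>

/* Hypothetical helper, not part of perl: expand a Latin-1 buffer to
 * UTF-8 in place, copying from the end towards the start.
 * cap is the total room available in buf.
 * Returns the new length, or (size_t)-1 if cap is too small. */
size_t
latin1_to_utf8_inplace(unsigned char *buf, size_t len, size_t cap)
{
    size_t hicount = 0, i, src, dst;

    /* Every byte >= 0x80 becomes two bytes in UTF-8. */
    for (i = 0; i < len; i++)
        if (buf[i] & 0x80)
            hicount++;

    if (len + hicount > cap)
        return (size_t)-1;

    src = len;              /* one past the last unread input byte */
    dst = len + hicount;    /* one past the last unwritten output byte */

    /* dst - src equals the number of high bytes still unread, so once
     * the two meet the remaining prefix is pure ASCII and already in
     * its final position. */
    while (src < dst) {
        unsigned char ch = buf[--src];
        if (ch & 0x80) {
            buf[--dst] = (unsigned char)(0x80 | (ch & 0x3F));
            buf[--dst] = (unsigned char)(0xC0 | (ch >> 6));
        }
        else {
            buf[--dst] = ch;
        }
    }
    return len + hicount;
}
```

Working backwards is what makes the in-place rewrite safe: each byte
is read before its slot can be overwritten, so no scratch buffer is
needed.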

SV*
sv_utf8_decode(SV *sv)
{
    if (SvPOK(sv)) {
        char *c;
        bool has_utf = FALSE;
        sv_utf8_off(sv);

        /* it is actually just a matter of turning the utf8 flag on, but
         * we want to make sure everything inside is valid utf8 first.
         */
        c = SvPVX(sv);
        while (c < SvEND(sv)) {
            if (*c & 0x80) {
                I32 len;
                (void)utf8_to_uv((U8*)c, &len);
                if (len == 1) {
                    /* bad utf8 */
                    return sv;
                }
                c += len;
                has_utf = TRUE;
            }
            else {
                c++;
            }
        }

        if (has_utf)
            SvUTF8_on(sv);
    }

    return sv;
}

Regards,
Gisle


