Larry Wall <larry@wall.org> writes:

> There is, however, a difference between code that is explicitly in the
> scope of a "use bytes" and code that is implicitly binary.  And that
> difference lies in how utf8 data from other modules is treated.  The
> default assumption is that any data that has been marked as utf8 should
> be treated as utf8, so an implicitly binary script will try to do
> the right thing, such as promoting binary/latin-1 strings to utf8
> if you concatenate it with utf8, for instance.
>
> But in the scope of "use bytes", no such promotion happens, because
> "use bytes" basically says, "I don't care if the string is marked as
> utf8 or not, just treat it as a bucket of bits."  So if you concatenate
> a latin-1 string with a utf8 string, you'll get nonsense.  But that's
> your problem, just as it is in old Perl.  The "use bytes" declaration
> indicates you're willing to accept that responsibility.

I think this is the wrong thing to do.  My thinking is that 'use bytes'
should never expose the internal UTF8 encoding.  It should be possible
to switch to UCS4 encoding internally without any user visible change.

I suggest that if you operate on a UTF8-flagged string in 'use bytes'
scope, then the string should be downgraded to a binary/latin-1 string
if it doesn't contain any chars with ord() > 255, and we should croak
if chars that can't be represented in 8 bits exist.  We should never
end up with UTF8-flagged strings that contain illegal UTF8-sequences.

There should never be any user visible difference between two strings
containing the same logical chars, just because one has the UTF8-flag
set and the other does not.  This makes perl free to internally convert
between UTF8 and binary representation for strings when it feels like
it.  I think this freedom is needed.

What I am saying is that functions like length() and substr() should
not change behaviour at all in 'use bytes' scope, but they might croak
if they try to operate on chars with ord() > 255.  For string
concatenation we will also downgrade UTF8 stuff and croak if it is not
possible (a sketch of this follows the downgrade function below).

Downgrading could happen with a function like this (untested):

SV*
sv_utf8_off(SV* sv)
{
    if (SvPOK(sv) && SvUTF8(sv)) {
        char *c = SvPVX(sv);
        char *first_hi = 0;
        /* need to figure out if this is possible at all first */
        while (c < SvEND(sv)) {
            if (*c & 0x80) {
                I32 len;
                UV uv = utf8_to_uv((U8*)c, &len);
                if (uv > 255)
                    croak("Big byte");
                if (!first_hi)
                    first_hi = c;
                c += len;
            }
            else {
                c++;
            }
        }
        if (first_hi) {
            /* collapse each multi-byte sequence to a single byte,
             * shifting the tail of the string down in place */
            char *src = first_hi;
            char *dst = first_hi;
            while (src < SvEND(sv)) {
                if (*src & 0x80) {
                    I32 len;
                    U8 u = (U8)utf8_to_uv((U8*)src, &len);
                    *dst++ = u;
                    src += len;
                }
                else {
                    *dst++ = *src++;
                }
            }
            SvCUR_set(sv, dst - SvPVX(sv));
            *SvEND(sv) = '\0';  /* keep the PV NUL-terminated */
        }
        SvUTF8_off(sv);
    }
    return sv;
}
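To make the concatenation point concrete, here is a minimal sketch
(untested, like the rest) of what a bytes-aware concatenation helper
might do with the downgrade function above.  The name do_bytes_concat
and the IN_BYTES_SCOPE() test are hypothetical shorthand for however
the pragma gets detected, not existing API; sv_catsv() is the normal
string append:

static void
do_bytes_concat(SV *left, SV *right)
{
    /* hypothetical sketch: under 'use bytes', force both operands
     * down to the byte representation first, so the result never
     * mixes UTF8 and binary data.  sv_utf8_off() croaks if a char
     * won't fit in 8 bits, exactly as argued above. */
    if (IN_BYTES_SCOPE()) {     /* hypothetical pragma test */
        sv_utf8_off(left);
        sv_utf8_off(right);
    }
    sv_catsv(left, right);
}

Outside of bytes scope the normal promotion to UTF8 would still happen.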
I also think that perl ought to expose some way to explicitly
decode/encode a string as UTF8.  Perhaps some new pack()-letters could
be used for that?  This is also untested code:

SV*
sv_utf8_encode(SV *sv)
{
    if (SvPOK(sv)) {
        if (SvUTF8(sv)) {
            /* the buffer already holds the UTF8 bytes;
             * just drop the flag */
            SvUTF8_off(sv);
        }
        else {
            int hicount = 0;
            char *c;
            for (c = SvPVX(sv); c < SvEND(sv); c++) {
                if (*c & 0x80)
                    hicount++;
            }
            if (hicount) {
                /* each hi byte expands to two bytes, so grow the
                 * buffer and copy backwards in place */
                char *src, *dst;
                SvGROW(sv, SvCUR(sv) + hicount + 1);
                src = SvEND(sv) - 1;
                SvCUR_set(sv, SvCUR(sv) + hicount);
                dst = SvEND(sv) - 1;
                while (src < dst) {
                    if (*src & 0x80) {
                        dst--;
                        uv_to_utf8((U8*)dst, (U8)*src--);
                        dst--;
                    }
                    else {
                        *dst-- = *src--;
                    }
                }
                *SvEND(sv) = '\0';  /* keep the PV NUL-terminated */
            }
        }
    }
    return sv;
}

SV*
sv_utf8_decode(SV *sv)
{
    if (SvPOK(sv)) {
        char *c;
        bool has_utf = FALSE;
        sv_utf8_off(sv);
        /* it is actually just a matter of turning the utf8 flag on, but
         * we want to make sure everything inside is valid utf8 first. */
        c = SvPVX(sv);
        while (c < SvEND(sv)) {
            if (*c & 0x80) {
                I32 len;
                (void)utf8_to_uv((U8*)c, &len);
                if (len == 1) {
                    /* bad utf8 */
                    return sv;
                }
                c += len;
                has_utf = TRUE;
            }
            else {
                c++;
            }
        }
        if (has_utf)
            SvUTF8_on(sv);
    }
    return sv;
}
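As a rough illustration of the intended round trip (again untested,
and only a sketch of how the functions above would be exercised from
C code):

    SV *sv = newSVpv("caf\xe9", 4);  /* latin-1 "café", UTF8 flag off */
    sv_utf8_encode(sv);              /* buffer now "caf\xc3\xa9", flag off */
    sv_utf8_decode(sv);              /* same bytes, UTF8 flag turned on */
    sv_utf8_off(sv);                 /* back to "caf\xe9", flag off */

Each step leaves the logical chars recoverable; only the representation
moves, which is exactly the property I want 'use bytes' to preserve.

Regards,
Gisle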