
Re: Future Perl development

February 7, 2007 01:45
On 2/7/07, Marvin Humphrey wrote:
> However, all that encode/decode overhead would kill the performance
> of these libraries, rendering them far less useful.  It would be nice
> if Perl's internal encoding was always, officially UTF-8 -- then
> there wouldn't be a conflict.  But I imagine that might be very hard
> to pull off on EBCDIC systems, so maybe it's better this way -- I get
> to choose not to support EBCDIC systems (along with systems that
> don't use IEEE 754 floats, and systems where chars are bigger than a
> byte).

I for one would argue that if we were going to go to a single internal
encoding, utf8 would be the wrong one. Utf-16 would be much
better. It would allow us to take advantage of the large amount of
utf-16 code out there, ranging from DFA regexp engines to other
algorithms and libraries. On Win32 the OS natively does utf-16 so much
of the work would be done by the OS. I'd bet that this was also a
reason why other languages chose to use utf-16. In fact I wouldn't be
surprised if we were the primary language using utf8 internally at

I mean heck, utf8 was a kludge worked out on a napkin to make it
possible to store Unicode filenames in a unix style filesystem. (utf8
has the property that no encoding of a high codepoint contains any
special character used by a unix filesystem.) WTF would we use a kludge
as our primary internal representation when there are better
representations to use? Especially when you consider the performance
impact of doing so (use unicode and watch the regex engine get much

IMO UTF8 internally makes sense only when you consider that most of
the time stuff is happening using latin-1, or whatever you want to call
the single byte encoding we use.

> >> I don't care whether $string is a text-string or byte-string, I
> >> just want
> >> it to return the same string.
> >
> > Perhaps you should care. In a language such as Java, you are forced to
> > care, as byte[] and String are different types. Perl blurs this
> > difference,
> > and lets you believe that you should not need to care.
> I agree, Mark.  Silent upgrading of bytes to Unicode strings cost me
> a bunch of debugging time when I learned the hard way that you need
> to care.  I was writing a serializer that concatenated Unicode
> strings together with packed integers to make sort keys.  It never
> occurred to me that such a concat operation would corrupt the packed
> integer, and it took me a long time to hunt down why my sort op was
> failing.
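
For anyone who hasn't been bitten by this yet, here is a minimal sketch of one way the corruption can show up. It assumes a sort key built from a text string plus a 32-bit big-endian integer; the variable names and the sample data are made up for illustration and are not Marvin's actual code.

```perl
use strict;
use warnings;

# A text string with a character above U+00FF, so Perl has to store it
# internally as UTF-8 (the UTF8 flag is on).
my $text = "caf\x{e9} \x{263a}";

# A packed 32-bit big-endian integer: four octets, the last one is 0xC8.
my $packed = pack("N", 200);

# Naive concatenation: the result is a character string, so the packed
# part is treated as four characters and upgraded along with the rest.
my $naive = $text . $packed;

# The octets you get when that string is serialized as UTF-8.
my $naive_bytes = $naive;
utf8::encode($naive_bytes);

# Explicit alternative: turn the text into octets first, then append
# the packed integer, so both operands are plain byte strings.
my $text_bytes = $text;
utf8::encode($text_bytes);
my $key = $text_bytes . $packed;

printf "naive: %d octets, explicit: %d octets\n",
    length($naive_bytes), length($key);
```

In the naive version the 0xC8 octet comes out as a two-octet UTF-8 sequence, so the last four octets no longer unpack to the original integer. In the explicit version the packed integer is still intact at the end of the key.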

I think this is the real itch. Before utf8 it was fine to think of
strings as "byte buffers", but they aren't byte buffers and never have
been, they are strings, and strings don't contain bytes (whatever
Gerrard thinks :-), they contain characters. \x{} doesn't _ever_
produce a specific byte, it produces a specific _character_ (that just
happens to be represented by a byte in one of the internal encodings we use).

The correct way to get a specific byte is via pack, not via any
string escape, as string escapes operate on characters. The fact that a
given perl may internally encode strings as bytes is irrelevant.
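
A small sketch of the distinction, with variable names chosen here purely for illustration: the same character can sit in either internal representation without Perl code seeing any difference, whereas pack with the C template is an explicit request for an octet value.

```perl
use strict;
use warnings;

my $char = "\x{e9}";          # the character U+00E9
my $same = $char;
utf8::upgrade($same);         # switch the internal storage to UTF-8

# Still the same one-character string as far as Perl code can tell.
my $equal  = ($char eq $same) ? 1 : 0;
my $length = length($same);

# Only the internal flag differs between the two copies.
my $flag_before = utf8::is_utf8($char) ? 1 : 0;
my $flag_after  = utf8::is_utf8($same) ? 1 : 0;

# pack states the intent directly: one octet with the value 0xE9.
my $byte  = pack("C", 0xE9);
my $value = unpack("C", $byte);

print "equal=$equal length=$length flags=$flag_before/$flag_after byte=$value\n";
```

Note that nothing here marks $byte as "bytes, not characters"; that intent lives only in the programmer's head, which is the gap discussed in this message.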

So, imo what we need is an SV flag that says "this is a byte buffer,
it does not contain characters and anytime it is concatenated with a
string the string should also be treated as a byte buffer regardless
of its actual state." Thus concatenating a byte-buffer with a utf8
string would not upgrade the bytebuffer.

The problem here is not our internal encoding, which should be opaque,
but rather our lack of support for an explicitly byte-oriented storage
and our heritage of treating strings as character buffers, even though
they aren't really.


perl -Mre=debug -e "/just|another|perl|hacker/"
