perl.perl5.porters | Postings from May 2008

On the problem of strings and binary data in Perl.

May 20, 2008 06:51
As we have seen in recent threads, we have been somewhat inconsistent
in how we deal with strings.

I believe I have a proposal which would allow us to bypass these
problems while at the same time maintaining backwards compatibility. I
believe that this solution is compatible with some other proposals,
such as adding better support for case-modifying options and something
like "use unicode semantics" for regexes.

My proposal is this:


Make it so that the utf8 flag being on means that the string contains
Unicode code points encoded as UTF-8.

When the utf8 flag is off, an additional field in the SV would be used
to determine what type of string the data contains. (I guess this
would be a pointer to some struct or an offset into a table.)

If a string was not explicitly marked as something else, it would by
default be assumed to be Latin-1. (null pointer or offset=0)
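
For reference, the Latin-1 default matches what perl already does today
when it has to upgrade a string whose utf8 flag is off. A minimal
sketch using only the core utf8:: functions (the variable names are
just for illustration):

```perl
use strict;
use warnings;

# A byte string holding the single byte 0xE9 (e-acute in Latin-1).
my $bytes = "\xE9";
my $flag_before = utf8::is_utf8($bytes) ? 1 : 0;    # flag is off

# Upgrading reinterprets each byte as the Latin-1 code point of the
# same value and stores the result internally as UTF-8.
my $upgraded = $bytes;
utf8::upgrade($upgraded);
my $flag_after = utf8::is_utf8($upgraded) ? 1 : 0;  # flag is on

# The logical character is unchanged; only the internal form differs.
my $same_char = ($upgraded eq $bytes) ? 1 : 0;
my $codepoint = ord($upgraded);

printf "flag before=%d after=%d same=%d codepoint=U+%04X\n",
    $flag_before, $flag_after, $same_char, $codepoint;
# prints: flag before=0 after=1 same=1 codepoint=U+00E9
```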

Two strings would only be legally concatenable if they were of the
same type, or if there existed defined conversion routines from both
types to Unicode. In the case of a string type mismatch, both would be
upgraded to utf8 according to their type. An exception to this rule
would be a binary string type, which would be concatenable with
anything, and which would never be modified nor cause anything else to
be modified when concatenated with it.
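
The concatenation rule above could be modelled roughly like this in
pure Perl. This is only a sketch, not how the core would do it: the
tagged-hash representation and the names tagged(), to_utf8_bytes() and
concat() are hypothetical, Encode stands in for the "defined conversion
routines", and the choice that binary plus anything yields binary is my
assumption, since the rule does not say what type the result has.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical model: a tagged string is { type => NAME, bytes => OCTETS }.
# 'utf8' means the octets are UTF-8; 'binary' means opaque data.
sub tagged {
    my ($type, $bytes) = @_;
    return +{ type => $type, bytes => $bytes };
}

# Convert a tagged string to UTF-8 octets, or die if no converter exists.
sub to_utf8_bytes {
    my ($s) = @_;
    return $s->{bytes} if $s->{type} eq 'utf8';
    my $enc = Encode::find_encoding($s->{type})
        or die "no conversion from '$s->{type}' to Unicode\n";
    my $octets = $s->{bytes};    # work on a copy, leave the source alone
    return Encode::encode('UTF-8', $enc->decode($octets));
}

sub concat {
    my ($x, $y) = @_;
    # Binary joins with anything and nothing is converted.
    # Assumption: the result is itself treated as binary.
    if ($x->{type} eq 'binary' or $y->{type} eq 'binary') {
        return tagged('binary', $x->{bytes} . $y->{bytes});
    }
    # Same type: plain byte concatenation, type preserved.
    if ($x->{type} eq $y->{type}) {
        return tagged($x->{type}, $x->{bytes} . $y->{bytes});
    }
    # Type mismatch: upgrade both sides to UTF-8 according to their type.
    return tagged('utf8', to_utf8_bytes($x) . to_utf8_bytes($y));
}

my $latin1 = tagged('iso-8859-1', "caf\xE9");   # e-acute is 0xE9
my $cp1252 = tagged('cp1252',     "\x80");      # euro sign is 0x80
my $blob   = tagged('binary',     "\x00\xFF");

my $same  = concat($latin1, $latin1);   # stays iso-8859-1
my $mixed = concat($latin1, $cp1252);   # both upgraded, result is utf8
my $bin   = concat($blob,   $latin1);   # bytes joined untouched
```

A string tagged with a type that has no converter would make the
mismatch case die, which corresponds to the "not legally concatenable"
case.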

We would provide something like bless to mark strings as being of a
particular charset and encoding combination.

WRT Win32:

All strings would be forced to Unicode* and the wide-character APIs
would be used (possibly unless the string was of type ANSI or of type
Binary, in which case the 8-bit APIs would be used).


I'm not sure how this would impact XS. I think it would leave existing
XS unchanged, and make new XS easier to write. But I'm open to being
told I'm all wrong. :-)

* This would throw an error if the string was not of a type that can
be converted to Unicode.

PS: I saw the proposal for a UPV type. I'm at a loss to understand how
this would do anything more than make the situation worse.

perl -Mre=debug -e "/just|another|perl|hacker/"
