perl.perl5.porters | Postings from March 2007

stronger type determination (was Re: the utf8 flag ...)

Darren Duncan
March 31, 2007 03:36
Juerd Waalboer said on March 28, 2007 02:13:
>  What I want (and I think you want too) is a real type system, to 
>have two different distinct types: byte strings and character 
>strings. It would be bad to use a flag called "UTF8" for this, 
>because a byte string can also be UTF8 encoded. Perl already suffers 
>from this problem, but because the UTF8 flag is *INTERNAL*, it's not 
>a big deal. It would be if it surfaced and was used by Perl coders.

Yes, a stronger type system is exactly what I want, and that is what 
my example library (in real life, QDRDBMS) wants to use internally; 
it internally treats character data (which is encoding agnostic) and 
binary data (undifferentiated bits) and integers and non-integer 
numbers all as disjoint data types that must be explicitly converted 
between.  The aforementioned 4 are like Perl 6's Str, Blob, Int, Num, 
except that Perl 6 provides implicit conversion in many cases.

I want to emphasize here that I am knowingly asking for access to 
details that normal programmers, and users of my library, shouldn't 
have to know about, because I am conceptually enhancing the language 
itself, though most consequences of that occur behind a wall.

Part of my rationale here is that I want my library to be highly 
deterministic, which means there should be zero ambiguity as to what 
the input data is, and its semantics should be consistent and easy to 

A Perl 5 string with its utf8 flag off is ambiguous if we want to 
treat it as anything other than an undifferentiated string of bytes. 
If it is character data, there is a wide multitude of encodings that 
it could possibly be; latin-1 is just one of many 8-bit encodings for 

I prefer for my library to only accept strongly vetted and 
unambiguous data, and to let the user program deal with the 
consequences of Perl 5's weak scalar type system by explicitly 
resolving weak values into strong ones itself.  I'm not just going to 
*assume* that strings with the flag off are latin-1.
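
To illustrate why assuming latin-1 is unsafe, here is a minimal sketch 
that decodes the same flag-off byte under two 8-bit encodings. The 
choice of byte 0xA4 and of ISO-8859-15 as the second encoding is mine, 
purely for illustration; only the core Encode module is used.

```perl
use strict;
use warnings;
use Encode qw(decode);

# One flag-off byte string, two equally plausible 8-bit readings.
my $bytes = "\xA4";

my $as_latin1 = decode('iso-8859-1',  $bytes);   # U+00A4 CURRENCY SIGN
my $as_latin9 = decode('iso-8859-15', $bytes);   # U+20AC EURO SIGN

printf "iso-8859-1:  U+%04X\n", ord $as_latin1;
printf "iso-8859-15: U+%04X\n", ord $as_latin9;
```

Nothing in the byte string itself says which reading is intended, so 
only the caller can make that decision.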

I will note that the user invoking Encode routines or setting 
filehandle traits is an explicit action on their part, so conversion 
between bytes and characters *is* being done explicitly, and so users 
are thinking about it and the results should not be ambiguous.

>  A whole type system is a bit too much to implement in Perl 5, I 
>think. Our current unicode string semantics are a great way to deal 
>with not having types, in my opinion.

While Perl 5 doesn't officially have a strong type system, unlike 
Perl 6, I do recognize that it still treats each scalar value 
internally as one of several distinct data types, and this is largely 
exposed in the language, and I want to exploit it so that I can get 
as close to strong semantics as I can under the circumstances.

For example, to my mind is_utf8() says whether Perl considers a 
scalar to be characters (the internal encoding doesn't matter) or 
undifferentiated bits, and in normal cases that flag would be set 
true by something like a successful invocation of 
Encode::decode_utf8(), since that function vetted the data and so 
moved the string from ambiguous to unambiguous.
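
A minimal sketch of that "vet, then flag" step. The helper name 
vetted_text() is hypothetical, and it uses Encode::decode() with 
FB_CROAK rather than decode_utf8(), as an assumption about how strict 
vetting might be requested, so that malformed input dies instead of 
being silently patched up.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: decode strictly, dying on malformed UTF-8.
# Works on a copy, so the caller's byte string is left untouched.
sub vetted_text {
    my ($bytes) = @_;
    return decode('UTF-8', $bytes, FB_CROAK);
}

my $text = vetted_text("caf\xC3\xA9");     # 5 bytes in
print length($text), "\n";                 # 4 characters out
print utf8::is_utf8($text) ? "flag on\n" : "flag off\n";

# Bytes that are not valid UTF-8 never become a character string.
my $ok = eval { vetted_text("\xFF\xFE"); 1 };
print "malformed input rejected\n" unless $ok;
```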

Since Perl 5 lacks strong data types in the general sense, unlike 
Perl 6, I am trying the best I can to use whatever clues Perl 5 can 
give me, such as that flag, or access to some internal flag to say 
whether a scalar is in string or number mode.

Frankly, I would like to easily pass/fail on these examples:

   wants_int( 42 );       # allows
   wants_int( "42" );     # routine throws exception
   wants_int( 0+$foo );   # allows
   wants_int( ''.$foo );  # routine throws exception

   wants_text( 42 );      # routine throws exception
   wants_text( "42" );    # allows
   wants_text( 0+$foo );  # routine throws exception
   wants_text( ''.$foo ); # allows

This is assuming that Perl actually records 42 and "42" differently; 
if it doesn't, then I won't ask for the ability to discriminate since 
Perl itself doesn't; but if Perl treats those differently, I want to 
as well.
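
One way to probe this is through the core B module, which exposes the 
flags Perl keeps on each scalar. What follows is only a sketch under 
that assumption, not an endorsed interface: the bodies of wants_int() 
and wants_text() are my guess at one possible implementation, the 
flags are interpreter internals, and a scalar that has been used in 
both numeric and string context can end up with several flags set at 
once.

```perl
use strict;
use warnings;
use B ();

# Flags Perl has recorded on the caller's scalar ($_[0] is an alias).
sub _flags { B::svref_2object(\$_[0])->FLAGS }

sub wants_int {
    my $f = _flags($_[0]);
    die "not an integer value\n"
        unless ($f & B::SVf_IOK()) && !($f & B::SVf_POK());
    return 1;
}

sub wants_text {
    my $f = _flags($_[0]);
    die "not a text value\n"
        unless ($f & B::SVf_POK())
            && !($f & (B::SVf_IOK() | B::SVf_NOK()));
    return 1;
}

my $foo = "42";
for my $case (
    [ 'wants_int( 42 )',      sub { wants_int(42) } ],
    [ 'wants_int( "42" )',    sub { wants_int("42") } ],
    [ 'wants_int( 0+$foo )',  sub { wants_int(0+$foo) } ],
    [ q{wants_int( ''.$foo )}, sub { wants_int(''.$foo) } ],
    [ 'wants_text( 42 )',     sub { wants_text(42) } ],
    [ 'wants_text( "42" )',   sub { wants_text("42") } ],
) {
    my ($label, $code) = @$case;
    printf "%-22s %s\n", $label, eval { $code->(); 1 } ? 'allows' : 'throws';
}
```

Requiring IOK without POK (and the reverse for text) is deliberately 
strict: a value carrying both flags is treated as ambiguous and 
rejected by both routines.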

Juerd also said:
>  How often should Perl check for this? Directly after decoding only, 
>or also after mutating operations like substr, or s///?

The utf8 flag would be turned on or off only as a result of, e.g., 
decode() or encode(); a string mutation would not change it.
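
A small sketch of that behaviour for the two mutations Juerd names, 
substr and s///. The sample string is made up, and this checks only 
these particular operations; it makes no claim about every operation 
that can affect the flag.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $text   = decode('UTF-8', "caf\xC3\xA9 noir");
my $before = utf8::is_utf8($text);      # set by decode()

substr($text, 0, 1) = 'C';              # in-place mutation
$text =~ s/noir/au lait/;               # substitution

my $after = utf8::is_utf8($text);       # unchanged by the mutations
my $bytes = encode('UTF-8', $text);     # back to a flag-off byte string

print "before: ", ($before ? "on" : "off"), "\n";
print "after:  ", ($after  ? "on" : "off"), "\n";
print "bytes:  ", (utf8::is_utf8($bytes) ? "on" : "off"), "\n";
```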

As a corollary to what I said before, pack() should always return a 
string with the flag off, as its result is a bit string, and likewise 
the string that unpack() decodes should be expected to have the 
flag off, because its actual bit pattern is significant.

Also, it should be an error for, e.g., $raw_jpeg_image_data to have 
the utf8 flag on, since it is obviously a bit pattern.
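
A sketch of both points, assuming a hypothetical wants_blob() guard 
(the name and error message are mine). It shows a pack('N', ...) 
result with the flag off, and a flagged string being rejected as 
binary data.

```perl
use strict;
use warnings;

# A 32-bit big-endian integer packed into 4 raw bytes.
my $packed = pack('N', 0xDEADBEEF);
print utf8::is_utf8($packed) ? "flag on\n" : "flag off\n";

# Hypothetical guard: binary data must not carry the utf8 flag.
sub wants_blob {
    die "binary data must not carry the utf8 flag\n"
        if utf8::is_utf8($_[0]);
    return 1;
}

wants_blob($packed);                          # allows

my $flagged = "\x{263A}";                     # character string, flag on
my $ok = eval { wants_blob($flagged); 1 };
print "flagged string rejected\n" unless $ok;
```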

-- Darren Duncan
