perl.perl5.porters | Postings from March 2007
stronger type determination (was Re: the utf8 flag ...)
From: Darren Duncan
Date: March 31, 2007 03:36
Subject: stronger type determination (was Re: the utf8 flag ...)
Message ID: p06240801c233ad244e90@[192.168.1.101]
Juerd Waalboer said on March 28, 2007 02:13:
> What I want (and I think you want too) is a real type system, to
>have two different distinct types: byte strings and character
>strings. It would be bad to use a flag called "UTF8" for this,
>because a byte string can also be UTF8 encoded. Perl already suffers
>from this problem, but because the UTF8 flag is *INTERNAL*, it's not
>a big deal. It would be if it surfaced and was used by Perl coders.
Yes, a stronger type system is exactly what I want, and that is what
my example library (in real life, QDRDBMS) wants to use internally;
it internally treats character data (which is encoding agnostic) and
binary data (undifferentiated bits) and integers and non-integer
numbers all as disjoint data types that must be explicitly converted
between. The aforementioned 4 are like Perl 6's Str, Blob, Int, Num,
except that Perl 6 provides implicit conversion in many cases.
I want to emphasize here that I knowingly want to access
details that normal programmers, and users of my library, shouldn't
have to know about, because I am conceptually enhancing the language
itself, though most consequences of that occur behind a wall.
Part of my rationale here is that I want my library to be highly
deterministic, which means there should be zero ambiguity as to what
the input data is, and its semantics should be consistent and easy to
understand.
A Perl 5 string with its utf8 flag off is ambiguous if we want to
treat it as anything other than an undifferentiated string of bytes.
If it is character data, there is a wide multitude of encodings that
it could possibly be; latin-1 is just one of many 8-bit encodings for
example.
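To make the ambiguity concrete, here is a small sketch (an illustration only, not code from QDRDBMS) that decodes the same single byte under two 8-bit encodings with the core Encode module. The variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# One byte with the high bit set; nothing in the string itself says
# which encoding it was written in.
my $octets = "\xE9";

# Decode the same byte under two different 8-bit encodings.
my $as_latin1   = decode('ISO-8859-1', $octets);
my $as_cyrillic = decode('ISO-8859-5', $octets);

printf "ISO-8859-1: U+%04X\n", ord $as_latin1;     # U+00E9
printf "ISO-8859-5: U+%04X\n", ord $as_cyrillic;   # U+0449
```

The byte is identical in both calls; only the encoding named by the caller decides which character comes out.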
I prefer for my library to only accept strongly vetted and
unambiguous data, and let the user program deal with the consequences
of Perl 5's weak scalar type system, where they themselves explicitly
resolve weak values into strong ones. I'm not just going to
*assume* that strings with the bit off are latin-1.
I will note that the user invoking Encode routines or setting
filehandle traits is an explicit action on their part, so conversion
between bytes and characters *is* being done explicitly, and so users
are thinking about it and the results should not be ambiguous.
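As a sketch of what "explicit" looks like in practice, the following shows the two routes mentioned above: an Encode call and an encoding layer on a filehandle. It assumes UTF-8 input and uses an in-memory file so that it needs no real file; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $octets = "caf\xC3\xA9\n";    # UTF-8 bytes for "café" plus a newline

# Route 1: the caller names the encoding in an Encode call.
my $via_encode = decode('UTF-8', $octets);

# Route 2: the caller names the encoding as a layer on the filehandle
# (an in-memory file here, opened read-only on the same bytes).
open my $fh, '<:encoding(UTF-8)', \$octets or die "open failed: $!";
my $via_layer = <$fh>;
close $fh;

printf "%d bytes in, %d characters out\n",
    length($octets), length($via_layer);
```

In both routes the encoding is stated by the caller, so the library receiving the result never has to guess.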
> A whole type system is a bit too much to implement in Perl 5, I
>think. Our current unicode string semantics are a great way to deal
>with not having types, in my opinion.
While Perl 5 doesn't officially have a strong type system, unlike
Perl 6, I do recognize that it does still conceive each scalar value
as one of several distinct data types internally, and this is largely
exposed in the language, and I want to exploit it so that I can get
as close to strong semantics as I can under the circumstances.
For example, is_utf8() to my mind reports whether Perl considers a
scalar to be characters (internal encoding doesn't matter) or
undifferentiated bits, and in normal cases that flag would be set
true by something like a successful invocation of
Encode::decode_utf8(), since that function vetted the data and so
moved the string from ambiguous to something unambiguous.
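A minimal sketch of that round trip, using utf8::is_utf8() and the Encode functions named above. The sample deliberately uses non-ASCII input, where the state of the flag after decoding is clear-cut; this shows only what the flag reports in that case and is not a claim about every string.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

my $octets = "\xC3\xA9";             # two bytes: UTF-8 for U+00E9
my $chars  = decode_utf8($octets);   # one character after decoding
my $back   = encode_utf8($chars);    # bytes again after encoding

for my $item ([octets => $octets], [chars => $chars], [back => $back]) {
    my ($name, $value) = @$item;
    printf "%-6s length=%d flag=%s\n",
        $name, length($value), utf8::is_utf8($value) ? 'on' : 'off';
}
```

Only the decoded string reports the flag as on; the original bytes and the re-encoded bytes both report it as off.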
Since Perl 5 lacks strong data types in the general sense, unlike
Perl 6, I am trying the best I can to use whatever clues Perl 5 can
give me, such as that flag, or access to some internal flag to say
whether a scalar is in string or number mode.
Frankly, I would like to easily pass/fail on these examples:
  wants_int( 42 );        # allows
  wants_int( "42" );      # routine throws exception
  wants_int( 0+$foo );    # allows
  wants_int( ''.$foo );   # routine throws exception
  wants_text( 42 );       # routine throws exception
  wants_text( "42" );     # allows
  wants_text( 0+$foo );   # routine throws exception
  wants_text( ''.$foo );  # allows
This is assuming that Perl actually records 42 and "42" differently;
if it doesn't, then I won't ask for the ability to discriminate since
Perl itself doesn't; but if Perl treats those differently, I want to
as well.
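Perl does keep separate internal flags for "holds an integer" and "holds a string", and the core B module can read them. What follows is one possible sketch of the two routines on that basis; it is an assumption about how they could be written, not an existing API, and the helper _sv_flags is hypothetical. One caveat: the flags describe a scalar's history as well as its value, so a string variable that has been used in numeric context can end up carrying both kinds of flag.

```perl
use strict;
use warnings;
use B ();

# Read the internal flag bits of the caller's scalar (not of a copy).
sub _sv_flags { return B::svref_2object(\$_[0])->FLAGS }

# Hypothetical check: accept only scalars holding an integer and no string.
sub wants_int {
    my $flags = _sv_flags($_[0]);
    die "expected an integer scalar\n"
        unless ($flags & B::SVf_IOK()) && !($flags & B::SVf_POK());
    return 1;
}

# Hypothetical check: accept only scalars holding a string and no number.
sub wants_text {
    my $flags = _sv_flags($_[0]);
    die "expected a string scalar\n"
        unless ($flags & B::SVf_POK())
            && !($flags & (B::SVf_IOK() | B::SVf_NOK()));
    return 1;
}

my $foo = 42;
for my $case (
    [ 'wants_int( 42 )'       => sub { wants_int(42) } ],
    [ 'wants_int( "42" )'     => sub { wants_int("42") } ],
    [ 'wants_int( 0+$foo )'   => sub { wants_int(0 + $foo) } ],
    [ q{wants_int( ''.$foo )}  => sub { wants_int('' . $foo) } ],
    [ 'wants_text( 42 )'      => sub { wants_text(42) } ],
    [ 'wants_text( "42" )'    => sub { wants_text("42") } ],
    [ 'wants_text( 0+$foo )'  => sub { wants_text(0 + $foo) } ],
    [ q{wants_text( ''.$foo )} => sub { wants_text('' . $foo) } ],
) {
    my ($label, $code) = @$case;
    printf "%-24s %s\n", $label, eval { $code->() } ? 'allows' : 'throws';
}
```

The checks read $_[0] directly rather than copying the argument, so they inspect the scalar the caller actually passed.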
Juerd also said:
> How often should Perl check for this? Directly after decoding only,
>or also after mutating operations like substr, or s///?
The utf8 flag is turned on or off only as a result of, e.g.,
decode() or encode(); a string mutation would not change it.
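A small sketch of the mutation case, limited to two operations (a four-argument substr() and an append) with ASCII-only replacement text. It shows the flag before and after for one decoded string and one byte string; it is not a survey of every mutating operation.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $chars = decode('UTF-8', "caf\xC3\xA9");   # decoded: flag on
my $bytes = "caf\xC3\xA9";                    # plain byte literal: flag off

my @before = (utf8::is_utf8($chars) ? 'on' : 'off',
              utf8::is_utf8($bytes) ? 'on' : 'off');

# Mutate both strings in place with ASCII-only text.
substr($chars, 0, 1, 'C');
substr($bytes, 0, 1, 'C');
$chars .= '!';
$bytes .= '!';

my @after = (utf8::is_utf8($chars) ? 'on' : 'off',
             utf8::is_utf8($bytes) ? 'on' : 'off');

print "chars flag: $before[0] -> $after[0]\n";
print "bytes flag: $before[1] -> $after[1]\n";
```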
As a corollary to what I said before, pack() should always return a
string with the flag off, as its result is a bit string, and likewise
the string that unpack() decodes should be expected to have the
flag off, because its actual bit pattern is significant.
Also, it should be an error for, e.g., $raw_jpeg_image_data to have the
utf8 flag on, since it is obviously a bit pattern.
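The two points above could be enforced with a guard along these lines. This is a sketch only: assert_octets is a hypothetical helper, the 'N n' template is just one sample, and the four bytes standing in for $raw_jpeg_image_data are the start of a JPEG header rather than a real image.

```perl
use strict;
use warnings;

# Hypothetical guard: refuse any scalar whose utf8 flag is on.
sub assert_octets {
    my ($name, $value) = @_;
    die "$name has the utf8 flag on\n" if utf8::is_utf8($value);
    return $value;
}

# pack() with integer templates yields a byte string, which passes.
my $packed = pack 'N n', 0xDEADBEEF, 42;   # 4 + 2 bytes, big-endian
assert_octets(packed => $packed);
my ($long, $short) = unpack 'N n', $packed;
printf "%d bytes -> 0x%X, %d\n", length($packed), $long, $short;

# Stand-in for raw image data: passes while it is still a byte string.
my $raw_jpeg_image_data = "\xFF\xD8\xFF\xE0";
assert_octets(jpeg => $raw_jpeg_image_data);

# The same data after being upgraded to the flagged form is rejected.
my $flagged = $raw_jpeg_image_data;
utf8::upgrade($flagged);
my $ok = eval { assert_octets(jpeg => $flagged); 1 };
print $ok ? "accepted\n" : "rejected: $@";
```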
-- Darren Duncan