Re: on the almost impossibility to write correct XS modules

Glenn Linderman
May 21, 2008 10:44
On approximately 5/21/2008 9:06 AM, came the following characters from 
the keyboard of Juerd Waalboer:
> Glenn Linderman skribis 2008-05-21  8:50 (-0700):
>> On approximately 5/21/2008 1:29 AM, came the following characters from 
>> the keyboard of Rafael Garcia-Suarez:
>>> Some way to mark PVs as "binary" and not upgradeable to SvUTF8 would be
>>> handy, though.
>> What's the goal?
> Dual:
> 1. To provide a means of indicating that something is binary rather than
> text. This can be useful in encoding-capable DBI drivers/wrappers, for
> example, to indicate that a "?" placeholder is already binary, and
> should not be text-encoded. (You'd want to do this based on column
> introspection but that's very slow and very hard to write portably.)
> Another use case involves data serialization for exchange with languages
> that have native binary strings, like Java.

Does your parenthetical remark mean that it should be done by detecting 
that the column is SQL BINARY or VARBINARY vs SQL CHAR or VARCHAR?  If 
so, I agree with it.  It is not clear why it is slow, nor hard to write 
portably.  SQL BINARY has existed for years, and shouldn't have any 
portability problems.

So has the argument that SvUTF8 is "only one bit", which implies that it 
might be hard to add another bit, or it might have been done years ago.

I don't understand the limits or costs of the internals well enough to 
know whether an extra flag can be added; and if it has to be added at the 
cost of giving every string more space overhead, I'm not sure if it is worth the 

Data serialization for diverse types is not unique to strings; Perl's 
concept of "number" which might be represented as a sequence of ASCII 
numerals, a binary integer, or a binary float, has similar issues when 
serializing.  Appropriate type specification (and possibly conversions) 
must be done at the point of serialization to meet the external 
specification for numbers; doing that for strings (what character 
encoding for character data, or none at all but make sure it is bytes 
for binary byte streams) is no harder, and would leverage the same 
type-specification interface to the serialization module or function.
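
As a minimal sketch of that idea (not taken from any existing serialization module): the caller names the external type of each value, and the serializer converts accordingly. The function name serialize_field and the type names 'uint32', 'decimal', 'text' and 'binary' are hypothetical, chosen only for illustration; pack, Encode::encode and utf8::downgrade are the standard Perl facilities.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: the caller states the external type of each value,
# so the serializer never guesses from Perl's internal representation.
sub serialize_field {
    my ($type, $value) = @_;

    if ($type eq 'uint32') {
        return pack('N', $value);                 # 32-bit big-endian integer
    }
    if ($type eq 'decimal') {
        return sprintf('%d', $value);             # integer as ASCII numerals
    }
    if ($type eq 'text') {
        return Encode::encode('UTF-8', $value);   # characters -> UTF-8 octets
    }
    if ($type eq 'binary') {
        my $bytes = $value;
        # With a true second argument, utf8::downgrade returns false
        # instead of dying when a character above 255 is present.
        utf8::downgrade($bytes, 1)
            or die "binary field contains characters above 255\n";
        return $bytes;
    }
    die "unknown type '$type'\n";
}

my $record = join '',
    serialize_field('uint32', 42),
    serialize_field('text',   "caf\x{e9}"),
    serialize_field('binary', "\x00\xFF");
```

The 'binary' branch is the string counterpart of the numeric branches: it does no encoding at all, it only insists that the value really is a sequence of bytes.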

> 2. To prevent programming errors; you should see this as a matter of
> strictures. Most silly mistakes made in Unicode programming are related
> to people who fail to understand the difference between binary and text
> strings, and as a result of that, they sometimes add text strings to
> binary strings. While conceptually that's always a mistake, it happens
> so often and it's such an easy mistake to make (apparently) that it
> would be nice to have language support that changes "upgrade entire
> string to SvUTF8" to "add only the new portion as UTF8 (encoded, not
> SvUTF8 marked), keep the original as it is"
>> If the goal is to prevent the cost of upgrading and downgrading, well, 
>> just fix the bug that attached the upgraded data... and the cost of 
>> doing so also vanishes.
> Detecting upgrades is hard. There's a module (encoding::warnings) that
> enables warnings for it globally, but you often want it on a single
> string instead. Indeed the bug where characters >255 are added to the
> binary string should be fixed, but finding out where/when that happens
> can be a lot of work and currently requires knowledge of internals.

No, it currently requires knowledge of how to use utf8::is_utf8() to 
help isolate where the upgrade happens.  Unless utf8::is_utf8() is 
removed (which would break some amount of compatibility), that is all 
the knowledge required.
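
A minimal sketch of that kind of debugging, assuming a binary string that is built up in several steps: call utf8::is_utf8() at checkpoints and see between which two the flag turns on. The helper name checkpoint and the sample data are hypothetical.

```perl
use strict;
use warnings;

my @trace;   # [label, flag] pairs recorded at each checkpoint

# Hypothetical debugging helper: record whether the string is currently
# stored in Perl's upgraded (UTF-8) internal format.
sub checkpoint {
    my ($label, $str) = @_;
    push @trace, [ $label, utf8::is_utf8($str) ? 1 : 0 ];
}

my $packet = pack('n', 0xCAFE);       # two bytes of binary data
checkpoint('after pack', $packet);

$packet .= "\x01\x02";                # appending bytes: still binary
checkpoint('after byte append', $packet);

my $name = "caf\x{e9} \x{263A}";      # text with a character above 255
$packet .= $name;                     # the bug: text appended to binary data
checkpoint('after text append', $packet);

printf "%-20s upgraded=%d\n", @$_ for @trace;
```

The first checkpoint that reports upgraded=1 brackets the statement that mixed text into the binary string.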

One can make the case that utf8::is_utf8() exposes internals, so to that 
extent, it requires internals knowledge.  One can make the case that it 
should be called something other than utf8::is_utf8(), and I'd agree, 
but utf8::is_utf8() already exists and probably can't be easily removed 

Fixing the other problems would mean that utf8::is_utf8() would be a 
debug tool only, rather than an "outwit the semantic changes caused by 
storage format" crutch as well.
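
For readers who have not run into it, a small sketch of the kind of semantic change meant here. It assumes Perl's default rules (no "use feature 'unicode_strings'", no locale, no /u or /a regex modifier); the variable names are only illustrative.

```perl
use strict;
use warnings;

my $native   = "\xE9";       # e-acute, stored as a single byte
my $upgraded = $native;
utf8::upgrade($upgraded);    # same character, stored internally as UTF-8

# The two strings compare equal ...
my $equal = ($native eq $upgraded) ? 1 : 0;

# ... yet \w treats them differently, depending only on storage format.
my $native_is_word   = ($native   =~ /\w/) ? 1 : 0;
my $upgraded_is_word = ($upgraded =~ /\w/) ? 1 : 0;

print "equal=$equal native=$native_is_word upgraded=$upgraded_is_word\n";
```

Code that calls utf8::is_utf8() to predict which of the two behaviours it will get is using it as the crutch described above, not as a debugging aid.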

If it is cheap enough to add the "stricture", I'm not against having it 
added; I'm just trying to figure out if there are enough benefits to pay 
the cost, whatever the cost is.  Certainly it is more convenient to be 
told the exact line where the stricture would be violated, but a few 
debugging calls to utf8::is_utf8() would tell the tale.  I don't yet see 
any significant benefits that would outweigh a significant 
implementation cost (if it is significant) in terms of added time or space.

Glenn --
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking