On approximately 5/21/2008 9:06 AM, came the following characters from
the keyboard of Juerd Waalboer:
> Glenn Linderman skribis 2008-05-21 8:50 (-0700):
>> On approximately 5/21/2008 1:29 AM, came the following characters from
>> the keyboard of Rafael Garcia-Suarez:
>>> Some way to mark PVs as "binary" and not upgradeable to SvUTF8 would
>>> be handy, though.
>> What's the goal?
>
> Dual:
>
> 1. To provide a means of indicating that something is binary rather
> than text. This can be useful in encoding-capable DBI drivers/wrappers,
> for example, to indicate that a "?" placeholder is already binary and
> should not be text-encoded. (You'd want to do this based on column
> introspection, but that's very slow and very hard to write portably.)
> Another use case involves data serialization for exchange with
> languages that have native binary strings, like Java.

Does your parenthetical remark mean that it should be done by detecting
that the column is SQL BINARY or VARBINARY vs. SQL CHAR or VARCHAR? If
so, I agree with it. But it is not clear why that is slow, nor why it
is hard to write portably: SQL BINARY has existed for years and
shouldn't have any portability problems.

So has the argument that SvUTF8 is "only one bit", which suggests
either that adding another bit is hard, or that it would have been done
years ago. I don't understand the internals well enough to know whether
an extra flag can be added cheaply; if it has to be added at the cost
of extra space overhead for every string, I'm not sure the tradeoff is
worth it.

Data serialization for diverse types is not unique to strings. Perl's
concept of "number", which might be represented as a sequence of ASCII
numerals, a binary integer, or a binary float, has similar issues when
serializing. Appropriate type specification (and possibly conversion)
must be done at the point of serialization to meet the external
specification for numbers; doing the same for strings (choosing a
character encoding for character data, or none at all, while making
sure binary byte streams stay bytes) is no harder, and would leverage
the same type-specification interface to the serialization module or
function.

> 2. To prevent programming errors; you should see this as a matter of
> strictures. Most silly mistakes made in Unicode programming are
> related to people who fail to understand the difference between binary
> and text strings, and as a result of that, they sometimes add text
> strings to binary strings. While conceptually that's always a mistake,
> it happens so often and it's such an easy mistake to make (apparently)
> that it would be nice to have language support that changes "upgrade
> entire string to SvUTF8" to "add only the new portion as UTF8
> (encoded, not SvUTF8 marked), keep the original as it is".
>
>> If the goal is to prevent the cost of upgrading and downgrading,
>> well, just fix the bug that attached the upgraded data... and the
>> cost of doing so also vanishes.
>
> Detecting upgrades is hard. There's a module (encoding::warnings) that
> enables warnings for it globally, but you often want it on a single
> string instead. Indeed the bug where characters >255 are added to the
> binary string should be fixed, but finding out where/when that happens
> can be a lot of work and currently requires knowledge of internals.

No, it currently requires knowledge of how to use utf8::is_utf8() to
help isolate where the upgrade happens. Unless utf8::is_utf8() is
removed (which would break some amount of compatibility), that is all
the knowledge required.
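For illustration, here is a minimal sketch of that debugging approach
(the variable names $bytes and $text are mine, chosen for
illustration): probe before and after the suspect operation and watch
for the flag to flip.

    use strict;
    use warnings;

    my $bytes = "\xC3\xA9";           # raw octets; SvUTF8 is not set
    my $text  = "snowman: \x{2603}";  # a character > 255 forces SvUTF8

    # Probe before the suspect operation; without a trailing newline,
    # warn appends the file and line number.
    warn 'before: ', utf8::is_utf8($bytes) ? 'upgraded' : 'bytes';

    $bytes .= $text;  # the mistake: text appended to a binary string

    # Probe after: the flag has flipped, so the upgrade happened here.
    warn 'after: ', utf8::is_utf8($bytes) ? 'upgraded' : 'bytes';

A few probes like these, moved around in binary-search fashion, pin
down the offending statement without any further internals knowledge.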
One can make the case that utf8::is_utf8() exposes internals, and to
that extent it requires internals knowledge. One can also make the case
that it should be called something other than utf8::is_utf8(), and I'd
agree, but utf8::is_utf8() already exists and probably can't be easily
removed anyway... Fixing the other problems would make utf8::is_utf8()
a debugging tool only, rather than also an "outwit the semantic changes
caused by storage format" crutch.

If it is cheap enough to add the "stricture", I'm not against having it
added; I'm just trying to figure out whether there are enough benefits
to pay the cost, whatever that cost is. Certainly it is more convenient
to be told the exact line where the stricture would be violated, but a
few debugging calls to utf8::is_utf8() would tell the tale. I don't yet
see benefits significant enough to outweigh a significant
implementation cost (if it is significant) in terms of added time or
space.

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking