On May 2, 2015, at 11:12 AM, Karl Williamson <public@khwilliamson.com> wrote:

> When | & ^ ~ are executed on strings, or when the new |. &. ^. ~. operators are run, the internal representation of those strings is relied on (and hence exposed). This means different behaviors will often result on EBCDIC vs ASCII platforms.
>
> More importantly, whether a string is in UTF-8 or not may affect the result. There is no such problem if the string is composed solely of ASCII characters (on ASCII machines, or ASCII-equivalent characters plus the C1 controls on EBCDIC machines), which is why people may not have been bitten much by this in the past.
>
> So what to do if the string has non-ASCII characters and is in UTF-8? I see the following possibilities:
>
> A) No change from current behavior; document it better. (This is what will happen in v5.22.)
>
> B) Warn.
>
> C) Do the operation on the underlying code points (that is, effectively convert to U32 or U64 before the operation, and convert back at the end).
>
> D) Downgrade if possible and leave the result downgraded, or possibly upgrade the result. I suppose warn if it is not possible to downgrade.
>
> E) **Your ideas here**

I vote for D, since it closely matches what is done elsewhere. Alternatively we could croak for any character above 255, which I think we already do for some sys calls. In either case it would be ‘Wide character in whatever’.

Sorry for being silent of late. I have suddenly had an extra workload that used up all my spare time, but things seem to be quieting down a little now.
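
To make D a bit more concrete, here is a rough user-level sketch of the downgrade-or-croak behaviour, written as a wrapper rather than as the actual op implementation; the helper name and the exact message are only illustrative:

    use strict;
    use warnings;
    use Carp ();

    # Illustrative only: operate on the octets of both operands after
    # downgrading them, and croak when downgrading is impossible because
    # a string contains a code point above 0xFF.
    sub bitand_str {
        my ($left, $right) = @_;          # copies; the caller's scalars stay untouched
        for ($left, $right) {
            utf8::downgrade($_, 1)        # fail_ok: return false instead of dying
                or Carp::croak("Wide character in string bitwise and");
        }
        return $left & $right;            # plain byte-wise & on octet strings
    }

    my $x = "\xE9";                       # U+00E9, representable as one octet
    utf8::upgrade($x);                    # force the internal UTF-8 form
    printf "%vd\n", bitand_str($x, "\xFF");   # 233 either way: the internal
                                              # representation no longer leaks
    # bitand_str("\x{100}", "\xFF");      # would croak: cannot be downgraded

The fail_ok argument to utf8::downgrade is what makes it possible to turn the failure into a warning or a croak in a controlled way rather than dying inside the downgrade itself.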