perl.perl6.users | Postings from September 2019

Is it possible for Str to not be well formed?

From: Darren Duncan
Date: September 15, 2019 21:43
Subject: Is it possible for Str to not be well formed?
Message ID: 618ecba4-74bc-24e8-2215-42ed23bffc1f@darrenduncan.net

I'm defining an API that accepts only well-formed Str objects, meaning a Str 
whose Unicode codepoints are all in the set {0..0xD7FF,0xE000..0x10FFFF}, so 
that in particular it contains no UTF-16 surrogate codepoints. The check would 
be a strict yes/no test that rejects a Str rather than coercing anything 
outside of the set.
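
To make the stricture concrete, here is a minimal sketch of the yes/no test. It 
is written in Python only because Python strings can hold lone surrogates, which 
lets the check be exercised; the names is_well_formed and accept are 
hypothetical, and none of this says how Perl 6 itself behaves.

```python
def is_well_formed(s: str) -> bool:
    """Return True if every codepoint is in {0..0xD7FF, 0xE000..0x10FFFF}."""
    # Python codepoints never exceed 0x10FFFF, so only the surrogate
    # block 0xD800..0xDFFF has to be excluded.
    return not any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)


def accept(s: str) -> str:
    """Hypothetical API entry point: reject bad input, never coerce it."""
    if not is_well_formed(s):
        raise ValueError("string contains a surrogate codepoint")
    return s
```

The point of the design is that accept either returns its argument unchanged or 
fails; it never substitutes a replacement character for the offending codepoint.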

I am aware that behind the scenes Perl 6 uses multiple levels of abstraction 
for Str objects, and in particular that it may often use Normalization Form 
Grapheme (NFG), with synthetic codepoints outside the 0..0x10FFFF range, to 
represent each grapheme in constant space.

I have a few questions:

1. Do I even have to test the Str at all?  Does Perl 6 guarantee that all Str 
are well formed?  For example, if one tried to decode UTF-16 that contained 
invalid surrogate codepoints (single ones, or ones not properly paired up), 
would this fail early, or could a Str be created without fuss that contains the 
invalid surrogates?  I suspect Perl 6's inherent laziness would make passing 
invalid codepoints through more likely, but perhaps that isn't the case.

2. Does Perl 6 ever have Str that are not internally in some normal form?  That 
is, if a file contains, say, a mixture of NFC and NFD, would the actual 
codepoints be preserved at first, until some operation requires them to be in a 
normal form?  I'm thinking this may be a good case for laziness: you don't need 
normal forms just to move data around, but they help if you want to count 
graphemes, so a Str would only be normalized when such an operation happens.

3. If a Str can contain invalid surrogates or be wrong in some other way, what 
is the best / most performant way to test that a Str is valid?  The context is 
akin to a "Str where ..." constraint, and the question is what to put in the 
"...".

4. How can I get the actual codepoints from a Str without normalizing them 
first?  I realize that for typical use cases, explicitly using the NFC/NFD etc. 
methods, or "ords" which uses NFC, is the most correct, but if, say, I just 
want the codepoints as they are already stored, how would I do that?  I realize 
the result may not be particularly useful in the face of NFG.
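
To illustrate what questions 2 and 4 are getting at, here is a small sketch of 
how the codepoints of a "mixed" string differ from its NFC and NFD views. It 
uses Python and its standard unicodedata module as a stand-in, since Python 
strings keep codepoints exactly as given; the variable names are only for 
illustration, and this is not a claim about what Perl 6 stores.

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, one codepoint (NFC)
decomposed = "e\u0301"   # "e" + COMBINING ACUTE ACCENT, two codepoints (NFD)
mixed = composed + decomposed


def codepoints(s: str) -> list[int]:
    """List the codepoints of s exactly as stored, with no normalization."""
    return [ord(ch) for ch in s]


raw = codepoints(mixed)                                # as stored
nfc = codepoints(unicodedata.normalize("NFC", mixed))  # fully composed
nfd = codepoints(unicodedata.normalize("NFD", mixed))  # fully decomposed

print([hex(cp) for cp in raw])  # ['0xe9', '0x65', '0x301']
print([hex(cp) for cp in nfc])  # ['0xe9', '0xe9']
print([hex(cp) for cp in nfd])  # ['0x65', '0x301', '0x65', '0x301']
```

All three lists describe the same two graphemes, which is why the "as stored" 
view is only distinguishable if the string type preserves it.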

For a wider context, I know that on other platforms such as .NET or Java it is 
possible for strings to contain invalid surrogates, and I'm trying to figure 
out whether Perl 6 can have the same problem or not.
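
As an example of the kind of behavior I mean, here is a short Python sketch 
(Python being another language whose strings can carry lone surrogates). It 
decodes UTF-16 bytes containing an unpaired high surrogate, once strictly and 
once with the standard "surrogatepass" error handler. The byte string is made 
up for the example, and whether Perl 6 has any comparable lenient path is 
exactly what I am asking.

```python
# UTF-16LE bytes: a lone high surrogate U+D800 followed by "A" (U+0041).
data = b"\x00\xd8\x41\x00"

# Strict decoding fails early on the unpaired surrogate.
try:
    data.decode("utf-16-le")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# The "surrogatepass" handler lets the surrogate through into the string.
lenient = data.decode("utf-16-le", "surrogatepass")

print(strict_ok)                       # False
print([hex(ord(ch)) for ch in lenient])  # ['0xd800', '0x41']
```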

Thank you.

-- Darren Duncan
