develooper Front page | perl.perl6.internals | Postings from June 2001

RE: The internal string API

From: Dan Sugalski
Date: June 19, 2001 12:10
Subject: RE: The internal string API
Message ID: 5.1.0.14.0.20010619145819.020707a0@24.8.96.48

At 11:53 AM 6/19/2001 -0700, Hong Zhang wrote:

> > * Convert from and to UTF-32
> > * lengths in bytes, characters, and possibly glyphs
> > * character size (with the variable length ones reporting in negative
> >   numbers)
>
>What do you mean by character size if it does not support variable length?

Well, if strings are to be treated relatively abstractly, and we still want 
to poke around through the string buffer, we need to know how big a 
character is.
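
As a rough illustration of why a character size is useful when poking through the buffer, here is a minimal Python sketch. Nothing in it is an actual Perl internals API: the `CHAR_SIZE` table and `byte_offset` function are hypothetical names, and treating the magnitude of a negative size as the maximum width is my assumption, not something settled in this thread.

```python
# Hypothetical table (an assumption for illustration only):
#   positive = fixed number of bytes per character
#   negative = variable length; the magnitude is the assumed maximum width
CHAR_SIZE = {
    "ascii": 1,
    "utf-32-le": 4,
    "utf-8": -4,
}


def byte_offset(buf, char_index, encoding):
    """Return the byte offset at which character number char_index starts."""
    size = CHAR_SIZE[encoding]
    if size > 0:
        # Fixed width: plain arithmetic, no need to look at the buffer.
        return char_index * size
    if encoding != "utf-8":
        raise NotImplementedError(encoding)
    # Variable width: walk the buffer and count lead bytes.
    seen = 0
    for offset, byte in enumerate(buf):
        if (byte & 0xC0) != 0x80:  # not a UTF-8 continuation byte
            if seen == char_index:
                return offset
            seen += 1
    if seen == char_index:
        return len(buf)  # one past the last character
    raise IndexError(char_index)
```

With a positive size the caller can index straight into the buffer; a negative size tells it that a scan is unavoidable.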

> > * get and set the locale (This might not be the spot for this)
>
>The locale should be context based. Each thread should have its own
>locale.

I'm thinking locale is, in some ways, like tainting where it's really a 
property of the data rather than a property of the code region. We have 
more than just Unicode data to deal with as well--plain ASCII will be there 
too, and locale is more applicable to the data than it seems in Unicode.

On the other hand, the case of mixed-data strings was one that hadn't 
occurred to me. With that in mind, it's a far less useful thing to tag data 
with.

> > * normalize (a noop for non-Unicode data)
> > * Get the encoding name
>
>The encoding name is tricky. Neither Java nor POSIX defines its
>naming scheme. I personally prefer full name with lower case,
>such as "iso8859-1", the API converts name to lower automatically.
>The encoding name must be strict ASCII. Some common aliases
>may be provided. There must be an API to list all supported encoding
>during runtime.

Yep, I fully agree. (Well, I'm not sure of the ASCII restriction on the 
name, but I can live with that as a lowest-common-denominator sort of thing)

The name's really a tag of sorts that code can use to make some sort of 
reasonable decisions about things so I don't much care how we specify it as 
long as it's unique. (I do wish that there was an external naming scheme as 
a standard we could snag--I hate inventing that sort of thing)
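
The naming rules discussed above (lower-case canonical names, strict ASCII, a few aliases, and a way to list what is supported at runtime) could be sketched roughly as follows. This is only an illustration in Python: the function names and the particular set of encodings and aliases are made up for the example, not a proposed registry.

```python
# Illustrative data only; the real set of encodings and aliases is undecided.
_CANONICAL = {"ascii", "iso8859-1", "utf-8", "utf-16", "utf-32"}
_ALIASES = {
    "us-ascii": "ascii",
    "latin1": "iso8859-1",
    "iso-8859-1": "iso8859-1",
    "utf8": "utf-8",
}


def canonical_encoding_name(name):
    """Map a user-supplied encoding name to its unique canonical tag."""
    if not name.isascii():
        raise ValueError("encoding names must be strict ASCII")
    lowered = name.lower()                    # the API lowercases automatically
    lowered = _ALIASES.get(lowered, lowered)  # resolve common aliases
    if lowered not in _CANONICAL:
        raise LookupError("unsupported encoding: %r" % name)
    return lowered


def supported_encodings():
    """List every supported canonical encoding name at runtime."""
    return sorted(_CANONICAL)
```

Since the name is only a tag, the one property the rest of the code relies on is that each encoding maps to exactly one canonical string.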

> > * Do a substr operation by character and glyph
>
>The byte based is more useful. I have utf-8, and I want to substr it
>to another utf-8. It is painful to convert it or linear search for
>character position.

The pain is the reason for specifying it in the API. If we force the pain 
to be local to the encoding then it means that we don't have to embed it in 
the core.
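
To make "local to the encoding" concrete, here is a small Python sketch in which the core asks an encoding object for a character-based substr and never sees how the position is found. The class and function names are hypothetical, and the UTF-8 scan is deliberately the naive linear one.

```python
class FixedWidthEncoding:
    """Encoding where every character occupies the same number of bytes."""

    def __init__(self, width):
        self.width = width

    def substr(self, buf, start, length):
        # Character positions map to byte positions by multiplication.
        return buf[start * self.width:(start + length) * self.width]


class Utf8Encoding:
    """Variable-width encoding: the linear search lives here, not in the core."""

    def _offset(self, buf, char_index):
        seen = 0
        for offset, byte in enumerate(buf):
            if (byte & 0xC0) != 0x80:  # lead byte, i.e. start of a character
                if seen == char_index:
                    return offset
                seen += 1
        return len(buf)  # clamp positions past the end

    def substr(self, buf, start, length):
        begin = self._offset(buf, start)
        end = self._offset(buf, start + length)  # rescans; simple, not fast
        return buf[begin:end]


def core_substr(encoding, buf, start, length):
    """What the core would call: no encoding-specific logic in here."""
    return encoding.substr(buf, start, length)
```

The result is still bytes in the same encoding, so the "utf-8 in, utf-8 out" case needs no conversion; only the position lookup costs anything, and that cost stays inside `Utf8Encoding`.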

> > I don't know if we want to treat encoding and data format separately--it
> > would seem to make sense to be able to have a string tell us it's
> > Unicode/UTF-32/Korean rather than just UTF-32/Korean, since I
> > don't see why it wouldn't be allowable to use the UTF-8 or UTF-16 encoding
> > on non-Unicode data. (Not that it'd necessarily be all that useful, and I
> > can see just not allowing it)
>
>I don't see why the core should support language/locale in this detail.
>I deal with a lot of mixed Chinese/English text files. There is no way to
>represent them using a plain string, unless you want to make the string a
>rich-format-text buffer. The current locale or an explicit locale parameter
>will suffice for your goal.

Fair enough. I can see things like mixed Chinese/Japanese/Korean text being 
even more problematic. Almost enough to make me think we *should* build 
some sort of rich text format support into the core (if we were general we 
could use it for XML/HTML/SGML data as well) but I think I'll leave the 
dictating of that to Larry.

					Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
dan@sidhe.org                         have teddy bears and even
                                      teddy bears get drunk

