develooper Front page | perl.perl6.internals | Postings from February 2001

Re: PDD 2, vtables

Dan Sugalski
February 10, 2001 11:17
At 08:47 AM 2/10/2001 -0200, Branden wrote:
>Dan Sugalski wrote:
> > >The string API should be sufficiently smart to be able to convert data
> > >from one encoding to another as it's more convenient.
> >
> > No, the vtable functions for the variables should know how to convert from
> > and to perl's preferred string representations, and can do whatever
> > Magic they care to internally.
> >
>I don't see why Perl couldn't deal with multiple representations internally.
>Conversion could be done on the way in, internally for efficiency on certain
>operations, and on the way out, again.

It can, and it will. The question is "which ones". The regex engine will 
almost undoubtedly deal with only fixed-sized characters. Perl itself will 
probably restrict itself to fixed width characters as well. Individual 
variable classes can store data in any form they want. (If someone wants to 
leverage zlib to write a class that compresses its data, I'm fine with that)
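
The zlib idea can be sketched roughly as follows. Python is used here only as convenient notation, and nothing below is Perl's actual vtable API: the class name `CompressedString` and the methods `set_string`, `get_string` and `stored_size` are hypothetical stand-ins for vtable entries that convert to and from the interpreter's preferred string form.

```python
import zlib


class CompressedString:
    """Hypothetical variable class that keeps its value zlib-compressed.

    set_string/get_string stand in for vtable entries; they are assumptions
    made for illustration, not part of any real Perl interface.
    """

    def __init__(self, text=""):
        self.set_string(text)

    def set_string(self, text):
        # Convert from the "preferred" form (a Python str here) on the way in.
        self._blob = zlib.compress(text.encode("utf-8"))

    def get_string(self):
        # Convert back to the preferred form on the way out.
        return zlib.decompress(self._blob).decode("utf-8")

    def stored_size(self):
        # Number of bytes actually held by the private representation.
        return len(self._blob)


value = CompressedString("teddy bears get drunk " * 1000)
```

Callers only ever see the preferred form through `get_string`; how the class stores the data internally is its own business, which is the point being made above.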

> > >On the other side, for a string that is matched against regexps, it
> > >doesn't matter much if it has variable character length, since regexps
> > >normally scan all the string anyway, and indexing characters isn't much
> > >of a concern.
> >
> > You underestimate the impact of variable-length data, I think. Regexes
> > should go rather faster on fixed-length than variable length data. How
> > much so depends on your processor. (I can guarantee that Alphas will run a
> > darned sight faster on UTF-32 than UTF-8...)
> >
>Agreed. Should go faster. But maybe I don't need it that fast!

That's fine. Speed is my #1 priority. Memory usage is secondary. (An 
important secondary, but secondary nonetheless) Which doesn't rule out 
UTF-8, of course--it may turn out that converting things is slower than 
dealing with variable width data, in which case priority #1 wins.

>(I really think it shouldn't be so much slower than doing it on an ASCII
>string with the same total buffer size, it only would have to fetch another
>byte on certain conditions and build the extended character representation,
>which isn't hard either.)

You might not think so, but you would be wrong. You have a test and 
potential branch (possibly more--folks with lots of UTF-8 data, which 
includes everyone with a non-latin alphabet) on *every* character. That is 
not cheap on modern processors. Yes, you're pulling in significantly less 
data, which has an impact with UTF-32 (and garbage collection) but I'm not 
sure you'll find it a win.

We can benchmark it and see if my feeling is wrong once we get some code 
and a testing scaffold built.
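
To make the per-character test concrete, here is a minimal sketch of the two decoding loops. It is an illustration only, not code from any Perl source: the function names `decode_utf8` and `decode_utf32le` are made up, the UTF-8 decoder assumes well-formed input of at most four bytes per character, and timings taken in Python would say little about how the equivalent C behaves on a given processor.

```python
def decode_utf8(data):
    """Decode well-formed UTF-8 bytes into a list of code points.

    Assumes valid input: there are no checks for truncation, overlong
    forms or surrogates, all of which a real decoder would need.
    """
    out = []
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        # This test runs for every character; it is the cost under discussion.
        if lead < 0x80:        # 1 byte:  0xxxxxxx
            cp, extra = lead, 0
        elif lead < 0xE0:      # 2 bytes: 110xxxxx
            cp, extra = lead & 0x1F, 1
        elif lead < 0xF0:      # 3 bytes: 1110xxxx
            cp, extra = lead & 0x0F, 2
        else:                  # 4 bytes: 11110xxx
            cp, extra = lead & 0x07, 3
        for k in range(1, extra + 1):
            # Each continuation byte (10xxxxxx) contributes six bits.
            cp = (cp << 6) | (data[i + k] & 0x3F)
        out.append(cp)
        i += extra + 1
    return out


def decode_utf32le(data):
    """Decode little-endian UTF-32: fixed four-byte stride, no per-character test."""
    return [int.from_bytes(data[i:i + 4], "little") for i in range(0, len(data), 4)]
```

Both functions return the same code points for the same text. The difference is that the UTF-8 loop has to inspect each lead byte before it knows how far to advance, while the UTF-32 loop always advances by four bytes.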

> > >It would be nice if the user had some control over this, for example by
> > >saying
> > >"I don't care this string will be used by substr, leave it in UTF-8 since
> > >it's too big and I don't want to waste memory!", or "This string isn't
> > >big, so I should convert it to bloated UTF-32 at once!", or even "use
> > >'memory';".
> >
> > That would be:
> >    my str $foo : utf8 : fixed;
> > or possibly
> >    use less qw(memory);
> >
>Probably not my str $foo :utf8 :fixed, since then if I have $bar = $foo it
>would convert the string value from $foo to anything else, right?

Might. Larry's not set the rules on what attributes are passed on with 
assignment. If you're really worried, there's no reason not to set 
attributes on $bar either.

> > Generally speaking you probably don't want to do this. Odds are if you
> > think you know what's going on better than the compiler, you're wrong.
> > (Not always, but in a non-trivial number of cases, in my experience)
> >
>I can't beat the compiler, that's for sure. But I really don't think I want
>to read a 100KB file into a variable all at once and end up with 400KB
>memory usage only for that file. And I really don't care if `regexps' go
>slower on that, I can live with it...

If it's binary data or 8-bit characters, you won't. If it's UTF-8 you might 
see expansion, but how much depends on how many 7-bit characters you have. 
And then only if something actually asks for the data in UTF-32 format.
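
The arithmetic behind the 100KB versus 400KB worry can be checked with a short sketch. The helper name `storage_sizes` and the sample strings are invented for illustration.

```python
def storage_sizes(text):
    """Return (utf8_bytes, utf32_bytes) needed to store text."""
    # "utf-32-le" is used so that no byte-order mark is counted.
    return len(text.encode("utf-8")), len(text.encode("utf-32-le"))


ascii_text = "a" * (100 * 1024)       # 100KB of 7-bit characters in UTF-8
greek_text = "\u03b1" * (50 * 1024)   # 100KB of two-byte characters in UTF-8

ascii_sizes = storage_sizes(ascii_text)
greek_sizes = storage_sizes(greek_text)
```

Here `ascii_sizes` is `(102400, 409600)`, the full fourfold expansion, while `greek_sizes` is `(102400, 204800)`, only twofold. The more multi-byte characters the UTF-8 data already contains, the smaller the relative growth when it is converted to UTF-32.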

This has been enough to convince me that there should be UTF-8 as one of 
the base character types for vtables, even if we don't use it in many 
places internally. For stuff that's just read and printed, it'll save 
memory, I think. Hope, at least. (Though it probably means the regex engine 
should deal with variable-width characters, and I'd really rather it didn't)

> > >And I believe 8-bit ASCII will always be an option, for those who don't
> > >care about extended characters and want the best of both worlds on speed
> > >and memory usage.
> >
> > 8-bit characters in general, yep. (ASCII is really 7-bit) ASCII, EBCDIC,
> > raw byte buffers.
> >
>That includes Latin-1, Latin-etc. (I believe they're 10 or 12), which are
>the same as the ISO-8859-1, ISO-8859-(etc).

Yes. Anything that doesn't require UTF-8.
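
One way to read "anything that doesn't require UTF-8" is: any text in which every character fits in a single byte of some 8-bit encoding. A minimal sketch of that test follows. The helper name `fits_in_8bit` is hypothetical; `latin-1` and `cp037` (an EBCDIC code page) are standard Python codec names.

```python
def fits_in_8bit(text, encoding="latin-1"):
    """Return True if every character of text has a one-byte form in encoding."""
    try:
        return len(text.encode(encoding)) == len(text)
    except UnicodeEncodeError:
        # At least one character has no representation in this encoding.
        return False
```

For example, `fits_in_8bit("caf\u00e9")` is true because Latin-1 covers the accented letter, while `fits_in_8bit("\u2603")` is false because the snowman character lies outside every 8-bit set and needs a Unicode encoding.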


--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
                                      have teddy bears and even
                                      teddy bears get drunk