perl.perl5.porters | Postings from November 2010

Re: "perl: utf8.c:1997: Perl_swash_fetch: Assertion `klen <= sizeof(PL_last_swash_key)'failed." [5.12.1]

karl williamson
November 24, 2010 20:22
Chip Salzenberg wrote:
> I'm experimenting with some text scanning code against a very large corpus,
> and I've got Perl 5.12.1 dying on this assertion failure:
>    perl: utf8.c:1997: Perl_swash_fetch: Assertion `klen <=
> sizeof(PL_last_swash_key)' failed.
> I'm not at all familiar with what swashes even are, let alone what they
> precisely do; I know they're related to Unicode attributes, that's about it.
>  Any clues out there?  I'm scanning through so many pages with such a highly
> distributed program that I will have some difficulty identifying the
> specific files that trigger it, but perhaps this can be fixed from first
> principles.

I can say a little about them.  I've single-stepped through the mostly
undocumented code using gdb more than once, and still don't grasp how it
all comes together.

On a 32-bit machine, Perl allows the ordinal of a character to range
between 0 and 2**32-1.  Storing, say, whether each such character is a
digit would require an array of that many bits (2**32 bits is 512 MiB),
unless games are played.  And there are dozens of properties; memory
would be exhausted.  The actual data for the properties are stored on
disk.  Perl uses concepts similar to those of operating systems to
essentially page in the portions of the arrays that are being looked at.
I presume it uses an LRU algorithm to keep a working set of a certain
size, and pages out portions of the arrays that haven't been used
recently (although, since they are read-only, it doesn't actually have
to page them out, just overwrite them).
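
To make the paging idea concrete, here is a minimal sketch in C of a demand-loaded bit array with a small LRU working set.  It is not Perl's code: every name in it (swatch, is_digit_paged, load_swatch_bits), the 64-bit swatch size, and the four-slot cache are assumptions made for illustration, the LRU policy is only the presumption stated above, and the "disk read" is a stand-in that knows ASCII digits only.

```c
#include <assert.h>
#include <stdint.h>

#define SWATCH_BITS 64u  /* code points covered by one swatch (assumed) */
#define CACHE_SLOTS 4u   /* size of the resident working set (assumed) */

typedef struct {
    int           valid;      /* slot holds real data */
    uint32_t      base;       /* first code point covered by this swatch */
    uint64_t      bits;       /* one bit per code point in the swatch */
    unsigned long last_used;  /* LRU timestamp; 0 means never used */
} swatch;

static swatch        cache[CACHE_SLOTS];
static unsigned long tick;   /* logical clock for LRU ordering */
static unsigned      loads;  /* how many times we "went to disk" */

/* Hypothetical stand-in for reading property data from disk.
 * It only knows that '0'..'9' are digits. */
static uint64_t load_swatch_bits(uint32_t base)
{
    uint64_t bits = 0;
    for (uint32_t i = 0; i < SWATCH_BITS; i++) {
        uint32_t cp = base + i;
        if (cp >= '0' && cp <= '9')
            bits |= (uint64_t)1 << i;
    }
    loads++;
    return bits;
}

/* Look up one bit of the conceptual 2**32-bit array, paging in the
 * swatch that contains it if it is not already resident. */
static int is_digit_paged(uint32_t cp)
{
    uint32_t base = cp - (cp % SWATCH_BITS);
    swatch *victim = &cache[0];

    for (unsigned i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].valid && cache[i].base == base) {
            cache[i].last_used = ++tick;          /* hit: refresh LRU age */
            return (int)((cache[i].bits >> (cp - base)) & 1u);
        }
        /* Track the least recently used slot; empty slots have age 0. */
        if (cache[i].last_used < victim->last_used)
            victim = &cache[i];
    }

    /* Miss: the data is read-only, so the old swatch is simply overwritten. */
    assert(victim != 0);
    victim->valid = 1;
    victim->base = base;
    victim->bits = load_swatch_bits(base);
    victim->last_used = ++tick;
    return (int)((victim->bits >> (cp - base)) & 1u);
}
```

Calling is_digit_paged('7') and then is_digit_paged('3') loads one swatch and reuses it; touching five distinct 64-character ranges forces the least recently used one to be overwritten.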

I'm guessing that the name came from the term 'swatch', like a piece of 
fabric that is part of a greater whole.

So, these swatches are like virtual memory.  The program can refer to 
any element in an array, but transparently, except for performance and 
bugs, only portions of it are actually resident at any moment.  (Jarkko 
long ago said this was a poor choice of data structure, and we should 
move to inversion lists instead, and I'm thinking about doing that for 
5.16; but that doesn't help you in the present.)

Here's a relevant comment:
     /* Given a UTF-X encoded char 0xAA..0xYY,0xZZ
      * then the "swatch" is a vec() for all the chars which start
      * with 0xAA..0xYY
      * So the key in the hash (klen) is length of encoded char -1

One way klen could be wrong is malformed UTF-8 in the input, I believe.
But my guess is that there is a bug in the paging code and it has its
pointers screwed up.
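
The key scheme in that comment, and the way a bad lead byte could push klen past a fixed-size buffer, can be sketched in C as follows.  This is an illustration under assumptions, not the code in utf8.c: the names (swatch_ref, seq_len_from_lead, swatch_locate) are hypothetical, the length rule is a simplified "count the leading 1 bits" that may differ from Perl's own table, the handling of single-byte characters is a guess, and the buffer capacity is a parameter because the real size is not given here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sequence length implied by the first byte alone.  Simplified rule:
 * count the leading 1 bits.  Plain ASCII and stray continuation bytes
 * are treated as length 1. */
static size_t seq_len_from_lead(unsigned char lead)
{
    size_t n = 0;
    while (n < 8 && (lead & (0x80u >> n)))
        n++;
    return n < 2 ? 1 : n;
}

typedef struct {
    unsigned char key[16];  /* the prefix bytes 0xAA..0xYY */
    size_t        klen;     /* length of encoded char minus 1 */
    unsigned      offset;   /* index of the char inside its swatch */
} swatch_ref;

/* Split an encoded character into (key, offset).
 * avail is the number of readable bytes at p; cap is the capacity of a
 * hypothetical "last key" buffer.  Returns 0 on success, -1 if the
 * sequence is truncated or klen would not fit in cap. */
static int swatch_locate(const unsigned char *p, size_t avail, size_t cap,
                         swatch_ref *out)
{
    assert(p != NULL && out != NULL);
    if (avail == 0)
        return -1;

    size_t len  = seq_len_from_lead(p[0]);
    size_t klen = len - 1;

    if (len > avail)
        return -1;                       /* lead byte promises too much */
    if (klen > cap || klen > sizeof out->key)
        return -1;                       /* the condition the assertion guards */

    memcpy(out->key, p, klen);
    out->klen = klen;
    /* Multi-byte: the low 6 bits of the final byte select the entry.
     * Single byte: use the byte itself (an assumption of this sketch). */
    out->offset = klen ? (p[klen] & 0x3Fu) : p[0];
    return 0;
}
```

For the two-byte sequence 0xC3 0xA9 this gives a one-byte key 0xC3 and offset 41; a lead byte such as 0xFD claims a six-byte sequence, so klen is 5 and the lookup is rejected whenever cap is smaller than that.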
