Chip Salzenberg wrote:
> I'm experimenting with some text scanning code against a very large corpus,
> and I've got Perl 5.12.1 dying on this assertion failure:
>
> perl: utf8.c:1997: Perl_swash_fetch: Assertion `klen <=
> sizeof(PL_last_swash_key)' failed.
>
> I'm not at all familiar with what swashes even are, let alone what they
> precisely do; I know they're related to Unicode attributes, that's about it.
> Any clues out there? I'm scanning through so many pages with such a highly
> distributed program that I will have some difficulty identifying the
> specific files that trigger it, but perhaps this can be fixed from first
> principles.
>

I can say a little about them. I've single-stepped through the mostly undocumented code using gdb more than once, and still don't grasp how it all comes together.

On a 32-bit machine, Perl allows the ordinal of characters to range between 0 and 2**32-1. To store, say, whether such a character is a digit would require an array of that many bits, unless games are played. And there are dozens of properties; memory would be exhausted.

The actual data for the properties are stored on disk. Perl uses concepts similar to those in operating systems to essentially page in the portions of the arrays that are being looked at. I presume it uses an LRU algorithm to keep a certain-sized working set, and pages out portions of the arrays that haven't been used recently (although, since they are read-only, it doesn't have to actually page them out, just overwrite them). I'm guessing that the name came from the term 'swatch', like a piece of fabric that is part of a greater whole.

So, these swatches are like virtual memory. The program can refer to any element in an array, but transparently, except for performance and bugs, only portions of it are actually resident at any moment.
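To make the paging idea concrete, here is a minimal sketch in C. It is not Perl's actual code: the 64-code-point swatch size, the four-slot cache, the LRU policy, and every name in it (swatch_cache, page_in, is_digit_cached, backing_is_digit) are assumptions made for illustration, and the "on-disk" table is faked with a function that only knows the ASCII digits.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SWATCH_BITS 64u  /* code points covered by one swatch (assumed) */
#define CACHE_SLOTS 4u   /* size of the resident working set (assumed) */

/* Hypothetical stand-in for the property table stored on disk. */
static int backing_is_digit(uint32_t cp) { return cp >= '0' && cp <= '9'; }

struct swatch {
    int      valid;
    uint32_t base;       /* first code point covered by this swatch */
    uint64_t bits;       /* one bit per code point in [base, base + 64) */
    unsigned last_used;  /* stamp used to pick the least recently used slot */
};

struct swatch_cache {
    struct swatch slot[CACHE_SLOTS];
    unsigned clock;      /* incremented on every lookup */
    unsigned loads;      /* how many times a swatch had to be paged in */
};

static void cache_init(struct swatch_cache *c) { memset(c, 0, sizeof *c); }

/* Fill an empty slot, or overwrite the least recently used one. The data is
 * read-only, so nothing has to be written back before overwriting. */
static struct swatch *page_in(struct swatch_cache *c, uint32_t base)
{
    struct swatch *victim = &c->slot[0];
    for (unsigned i = 0; i < CACHE_SLOTS; i++) {
        struct swatch *s = &c->slot[i];
        if (!s->valid) { victim = s; break; }
        if (s->last_used < victim->last_used) victim = s;
    }
    victim->valid = 1;
    victim->base = base;
    victim->bits = 0;
    for (uint32_t i = 0; i < SWATCH_BITS; i++)
        if (backing_is_digit(base + i)) victim->bits |= (uint64_t)1 << i;
    c->loads++;
    return victim;
}

/* Look up one code point; only the swatch that contains it must be resident. */
static int is_digit_cached(struct swatch_cache *c, uint32_t cp)
{
    assert(c != NULL);
    uint32_t base = cp & ~(uint32_t)(SWATCH_BITS - 1);
    struct swatch *hit = NULL;
    for (unsigned i = 0; i < CACHE_SLOTS; i++) {
        if (c->slot[i].valid && c->slot[i].base == base) {
            hit = &c->slot[i];
            break;
        }
    }
    if (!hit) hit = page_in(c, base);
    hit->last_used = ++c->clock;
    return (int)((hit->bits >> (cp - base)) & 1u);
}
```

Looking up '7' and then '0' touches the same swatch, so only the first call pages anything in; looking up 'a' lands in a different swatch and costs a second load. The caller never sees the difference except in speed, or when the bookkeeping is wrong.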
(Jarkko long ago said this was a poor choice of data structure and that we should move to inversion lists instead. I'm thinking about doing that for 5.16, but that doesn't help you in the present.)

Here's a relevant comment:

    /* Given a UTF-X encoded char 0xAA..0xYY,0xZZ
     * then the "swatch" is a vec() for all the chars which start
     * with 0xAA..0xYY
     * So the key in the hash (klen) is length of encoded char -1
     */

One way klen could be wrong is malformed UTF-8 in the input, I believe. But my guess is that there is a bug in the paging code and it's got its pointers screwed up.
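As a rough illustration of how the key length follows from the encoded character, and how a bad lead byte could push it past a fixed-size buffer, here is a small C sketch. It is an assumption-laden simplification, not what utf8.c does: KEY_CAPACITY is a made-up size standing in for sizeof(PL_last_swash_key), implied_len uses the original 1 to 6 byte UTF-8 lead-byte scheme rather than Perl's extended encoding, and split_char and its status codes are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define KEY_CAPACITY 3  /* hypothetical: enough for the key of a 4-byte char */

enum split_status { SPLIT_OK, SPLIT_MALFORMED, SPLIT_KEY_TOO_LONG };

struct split {
    unsigned char key[KEY_CAPACITY]; /* every byte except the last */
    size_t        klen;              /* length of encoded char - 1 */
    unsigned char last;              /* final byte, selects the bit in the swatch */
};

/* Sequence length implied by the lead byte: the number of leading 1 bits.
 * Returns 0 for a continuation byte (0x80..0xBF) and for 0xFE/0xFF. */
static size_t implied_len(unsigned char lead)
{
    if (lead < 0x80) return 1;
    size_t n = 0;
    while (lead & 0x80) { n++; lead = (unsigned char)(lead << 1); }
    return (n >= 2 && n <= 6) ? n : 0;
}

/* Split one encoded character into hash key and final byte, reporting an
 * error instead of asserting when the key would not fit. */
static enum split_status split_char(const unsigned char *s, size_t avail,
                                    struct split *out)
{
    assert(s != NULL && out != NULL);
    if (avail == 0) return SPLIT_MALFORMED;

    size_t len = implied_len(s[0]);
    if (len == 0 || len > avail) return SPLIT_MALFORMED;

    /* Every byte after the lead must look like 10xxxxxx. */
    for (size_t i = 1; i < len; i++)
        if ((s[i] & 0xC0) != 0x80) return SPLIT_MALFORMED;

    size_t klen = len - 1;
    if (klen > sizeof out->key) return SPLIT_KEY_TOO_LONG;

    memcpy(out->key, s, klen);
    out->klen = klen;
    out->last = s[len - 1];
    return SPLIT_OK;
}
```

With these assumptions, the two bytes 0xD9 0xA3 (U+0663) give a one-byte key 0xD9 and a final byte 0xA3, while a stray 0xFC lead byte followed by five continuation bytes implies a five-byte key and is rejected as too long for the buffer. Whether anything like that is what trips the real assertion is a guess.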