develooper Front page | perl.cpan.testers.discuss | Postings from October 2019

Data Retention Policies

Doug Bell
October 17, 2019 17:33
The latest outage lasted 5 days. We gave up trying to negotiate with the down server and got someone to physically reboot it. Because we still have data in MyISAM tables, a hard reboot carries several risks, not the least of which is that rebuilding the MyISAM indexes afterwards can take days (luckily that did not happen, and we seem to be back online).

When I joined the project, one of the initial goals was to move away from MyISAM onto InnoDB (or, possibly, another DB entirely). My efforts to do that continually run into problems:

* Some parts of the data _will not_ convert to InnoDB as-is due to differences between MyISAM and InnoDB.
* The program I wrote to convert that data into a format that InnoDB can store will take months to complete.
* Relatedly, I have no reason to suspect moving all that data to a different database would take any less time.
* The only reason we need two servers dedicated solely to the database is the database's size.

These issues all have a common root: There is a lot of data. I might say too much data.

CPAN Testers has accepted 100+ million test reports since it came online. Some of these reports are for distributions no longer available on CPAN. Reports are still being submitted for abandoned modules not updated in decades, and for out-of-support Perl versions. Every development release of the Perl interpreter gets tested against some (most? all?) of CPAN on multiple platforms. This adds up to thousands of reports per day, and if the database were up I could check what percentage of them are ever visited by human eyes (my guess is 5-10%).

Even if the data is not seen by humans, it's useful in the aggregate: Regression analysis requires as much data as possible to make its hypotheses and suggestions. Nor is old data necessarily useless: Old versions of modules can still be installable from CPAN, and folks are still running old versions of Perl.

That said, timely data is more useful than untimely data. Do we need reports submitted in 2006? Data for modules only available on BackPAN isn't actionable, so do we need to keep that information?

In the end, irrelevant data is worse than useless: it is actively detrimental to the site's stability (as I mentioned above). For that reason, I propose to implement the following data retention policies:

1. Full text reports will be kept for a maximum of 5 years
2. Report summaries will be kept for all distributions installable from CPAN or, if the distribution is no longer installable from CPAN, for 5 years
	* This means that someone will still know if a distribution passes/fails, but if an author wants to know why they'll have to reproduce it themselves
3. Along with (2), release summaries for distributions not installable from CPAN and older than 5 years will be removed
	* This ensures that the release summaries can be rebuilt from the report summaries, and that there isn't a strange difference in numbers between the CPAN Testers website and consumers of the release data

So, this means that for all distributions available on CPAN, we will still know pass/fail/na/unknown and which Perls and platforms. For the first five years after the report's submission, one can view the entire text of the report. If the distribution is still on CPAN, the full text report will be deleted 5 years after it was submitted, but the summary information will remain. If the distribution is removed from CPAN, all reports and all summary information older than 5 years will be deleted.
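The rules above can be sketched as a small decision function. This is only an illustration of the proposal, not the tool that will do the purging: the names `Retention`, `years_before`, and `retention_for` are made up for this sketch, and it assumes that a report submitted exactly 5 years ago is still kept, which the proposal does not specify.

```python
from dataclasses import dataclass
from datetime import date

RETENTION_YEARS = 5  # retention window taken from the proposal


@dataclass(frozen=True)
class Retention:
    keep_full_text: bool
    keep_summary: bool


def years_before(day: date, years: int) -> date:
    """Return the date `years` before `day`, mapping Feb 29 to Feb 28."""
    try:
        return day.replace(year=day.year - years)
    except ValueError:
        # Feb 29 has no counterpart in a non-leap target year.
        return day.replace(year=day.year - years, day=28)


def retention_for(submitted: date, on_cpan: bool, today: date) -> Retention:
    """Decide what to keep for one report (hypothetical helper).

    Assumption: the boundary is inclusive, so a report submitted exactly
    RETENTION_YEARS ago still counts as recent.
    """
    recent = submitted >= years_before(today, RETENTION_YEARS)
    return Retention(
        keep_full_text=recent,            # rule 1: full text for at most 5 years
        keep_summary=recent or on_cpan,   # rules 2 and 3: summary outlives the text only while on CPAN
    )
```

For example, `retention_for(date(2010, 1, 1), on_cpan=True, today=date(2019, 10, 17))` keeps the summary but not the full text, while the same report for a distribution no longer on CPAN keeps neither.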

Purging report text older than 5 years will reduce the database by about half. For the 1TB database we have now, that reduces it to a svelte 500GB. If we purge more, we gain more, though report submissions have been increasing over the years:

| total     | 5y       | 4y       | 3y       | 2y       | 1y      |
| 107822513 | 62514949 | 48597230 | 35256342 | 21516482 | 9889931 |
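To make the trade-off concrete, here is a short sketch that turns the table into the share of reports each cutoff would purge. It assumes each "Ny" column counts the reports submitted within the last N years, which is my reading of the table rather than something stated in it. It also counts reports, not bytes, so it only approximates the change in database size. The names `TOTAL`, `WITHIN`, and `purged_share` are made up for this sketch.

```python
# Figures copied from the table above.
TOTAL = 107_822_513
# Assumption: reports submitted within the last N years, keyed by N.
WITHIN = {5: 62_514_949, 4: 48_597_230, 3: 35_256_342, 2: 21_516_482, 1: 9_889_931}


def purged_share(years: int) -> float:
    """Fraction of all reports older than `years` years, by report count."""
    return (TOTAL - WITHIN[years]) / TOTAL


for years in sorted(WITHIN, reverse=True):
    older = TOTAL - WITHIN[years]
    print(f"{years}y cutoff: {older:>11,} reports older ({purged_share(years):.1%})")
```

Under that reading, a 5-year cutoff covers roughly 42% of all reports by count, broadly in line with the "about half" estimate for database size.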

So, questions for those affected:

* Do you look at text reports older than 5 years? 3 years? 1 year?
* Are test summaries useful to you without the full text of the report?
* Are pass/fail counts older than 5 years useful to you? 3 years? 1 year?

I'd like to implement this sooner rather than later so I can build some faster recovery systems, but I'll leave discussion open at least a week while I develop the tools I need to do this anyway.

Doug Bell
