Instapaper Outage Cause & Recovery
Without knowledge of the pre-April 2014 file size limit, it was difficult to foresee and prevent this issue. As far as we can tell, there’s no information in the RDS console in the form of monitoring, alerts or logging that would have let us know we were approaching the 2TB file size limit, or that we were subject to it in the first place. Even now, there’s nothing to indicate that our hosted database has a critical issue.
If we ever had knowledge of the file size limit, it likely left with the 2013-era betaworks contractors that performed the Softlayer migration.
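A hedged sketch of the kind of check that might have surfaced this before it became an outage: periodically query information_schema for the largest tables and alert as any single table approaches a known per-file ceiling. The 2 TB threshold, hosts, and credentials below are illustrative assumptions, not anything RDS exposes or anything from the original post.

```python
# Sketch only: flag any single table whose on-disk footprint is approaching
# an assumed 2 TB per-file limit. Host, credentials, and threshold are
# placeholders.
import pymysql

LIMIT_BYTES = 2 * 1024 ** 4   # assumed 2 TB file size limit
WARN_RATIO = 0.8              # warn at 80% of the limit

def largest_tables(host, user, password, top_n=5):
    conn = pymysql.connect(host=host, user=user, password=password)
    try:
        with conn.cursor() as cur:
            # DATA_LENGTH + INDEX_LENGTH roughly tracks each table's file
            # size when innodb_file_per_table is enabled.
            cur.execute(
                "SELECT TABLE_SCHEMA, TABLE_NAME, DATA_LENGTH + INDEX_LENGTH "
                "FROM information_schema.TABLES "
                "ORDER BY DATA_LENGTH + INDEX_LENGTH DESC LIMIT %s",
                (top_n,),
            )
            return cur.fetchall()
    finally:
        conn.close()

def check(host, user, password):
    for schema, table, size_bytes in largest_tables(host, user, password):
        if size_bytes and size_bytes >= LIMIT_BYTES * WARN_RATIO:
            # Replace the print with a real alerting hook.
            print(f"WARNING: {schema}.{table} is at "
                  f"{size_bytes / LIMIT_BYTES:.0%} of the assumed limit")
```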
[…]
We didn’t have a good disaster recovery plan for the case where our MySQL instance failed with a critical filesystem issue to which all of our backups were also subject.
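The implied lesson is that backups only count as disaster recovery if they are regularly restored to a target that shares nothing with production. A minimal sketch of such a restore drill, assuming a scratch MySQL instance and mysqldump/mysql on the path; every host and credential here is a placeholder:

```python
# Sketch of a periodic restore drill: dump a production replica and load it
# into a scratch instance on an independent filesystem, then sanity-check.
import subprocess
import pymysql

def restore_drill(source_host, scratch_host, database, user, password):
    dump = subprocess.Popen(
        ["mysqldump", "--single-transaction",
         f"--host={source_host}", f"--user={user}", f"--password={password}",
         database],
        stdout=subprocess.PIPE,
    )
    subprocess.run(
        ["mysql", f"--host={scratch_host}", f"--user={user}",
         f"--password={password}", database],
        stdin=dump.stdout, check=True,
    )
    dump.stdout.close()
    if dump.wait() != 0:
        raise RuntimeError("mysqldump failed")

    # Basic sanity check: the restored copy should actually contain tables.
    conn = pymysql.connect(host=scratch_host, user=user,
                           password=password, database=database)
    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM information_schema.TABLES "
                    "WHERE TABLE_SCHEMA = %s", (database,))
        (table_count,) = cur.fetchone()
    conn.close()
    if table_count == 0:
        raise RuntimeError("restore produced no tables")
```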
[…]
When it became clear the dump would take far too long (first effort took 24 hours, second effort with parallelization took 10 hours), we began executing on a contingency plan to get an instance in a working state with limited access to Instapaper’s archives. This short-term solution launched into production after 31 hours of downtime. The total time to create that instance and launch it into production was roughly six hours.
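The post doesn’t say how the second, parallelized dump was run; one common approach (roughly what tools like mydumper automate) is to dump each table in its own mysqldump process, several at a time. A rough sketch under that assumption:

```python
# Sketch: dump each table concurrently with its own mysqldump process.
# This is an assumed approach, not a description of Instapaper's actual dump.
import subprocess
from concurrent.futures import ThreadPoolExecutor
import pymysql

def list_tables(host, user, password, database):
    conn = pymysql.connect(host=host, user=user, password=password,
                           database=database)
    with conn.cursor() as cur:
        cur.execute("SHOW TABLES")
        tables = [row[0] for row in cur.fetchall()]
    conn.close()
    return tables

def dump_table(host, user, password, database, table):
    outfile = f"{database}.{table}.sql"
    with open(outfile, "wb") as out:
        subprocess.run(
            ["mysqldump", "--single-transaction",
             f"--host={host}", f"--user={user}", f"--password={password}",
             database, table],
            stdout=out, check=True,
        )
    return outfile

def parallel_dump(host, user, password, database, workers=8):
    tables = list_tables(host, user, password, database)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(dump_table, host, user, password,
                               database, t) for t in tables]
        return [f.result() for f in futures]
```

Dumping tables in separate transactions gives up cross-table consistency, which is tolerable when the source is effectively frozen, as it was during the outage.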
[…]
Our only recourse was to restore the data to an entirely new instance on a new filesystem. This was further complicated by the fact that our only interface into the hosted instances is MySQL, which made filesystem-level solutions like rsync impossible without direct assistance from Amazon engineers.
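With MySQL as the only interface, the restore had to be logical as well: the per-table dumps get fed back through the MySQL protocol into the new instance. A minimal sketch, assuming dump files like those produced above and placeholder hostnames:

```python
# Sketch: load per-table dump files into a fresh instance over the MySQL
# protocol, the only interface a hosted instance exposes.
import glob
import subprocess

def load_dumps(new_host, user, password, database, pattern="*.sql"):
    for dump_file in sorted(glob.glob(pattern)):
        with open(dump_file, "rb") as f:
            subprocess.run(
                ["mysql", f"--host={new_host}", f"--user={user}",
                 f"--password={password}", database],
                stdin=f, check=True,
            )
```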
1 Comment
This outage got me wondering how hard it would be to roll your own Instapaper-like service hosted privately on a server you control, and how much demand there might be for an open-source solution for such a concept.
cl