MANILA, Philippines – Following a recent service outage, cloud storage provider Dropbox published a post-mortem explaining what caused the downtime and what it will do to prevent a repeat.
Dropbox’s head of infrastructure, Akhil Gupta, wrote on the Dropbox Tech Blog that planned maintenance had gone awry.
He said the maintenance was meant to upgrade the operating system (OS) on some of the machines the company uses for its databases, and that the upgrade script “checks to make sure there is no active data on the machine before installing the new OS.”
A bug in the upgrade script pushed a reinstallation command onto some machines that were active at the time. Each database has redundancy in place, with a master machine backed by two slave machines that duplicate it. Despite this, some of the databases were affected, causing Dropbox to go down.
Gupta assured the public that files were not at risk, as the affected databases did not contain file data. Rather, they were used to provide specific features, like photo album sharing and camera uploads.
He also explained that basic services were restored within 3 hours by recovering from backups, but the size of some of the databases slowed the recovery process.
The recovery process was completed only at 4:40 pm PT on January 12 (January 13, Philippine time).
The way forward
Dropbox has taken additional measures to prevent an outage of this sort from happening again.
Gupta wrote that the company added a layer of checks that makes each machine verify its own state before executing an incoming command. This should prevent machines running critical processes from carrying out commands that could cause them to break down.
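To illustrate the idea of a machine checking its own state before acting, here is a minimal Python sketch. It is not Dropbox's actual code: the names `Machine`, `MachineState`, `DESTRUCTIVE_COMMANDS`, and `UnsafeCommandError` are hypothetical, and the commands listed are assumptions chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum


class MachineState(Enum):
    ACTIVE = "active"    # assumed: machine is serving live database traffic
    DRAINED = "drained"  # assumed: no active data, safe for maintenance


# Hypothetical list of commands that would destroy data on a machine.
DESTRUCTIVE_COMMANDS = {"reinstall_os", "wipe_disk"}


class UnsafeCommandError(Exception):
    """Raised when a machine refuses a command that is unsafe in its state."""


@dataclass
class Machine:
    name: str
    state: MachineState

    def execute(self, command: str) -> str:
        # The machine checks its own state, independently of whatever
        # the remote maintenance script concluded before sending the command.
        if command in DESTRUCTIVE_COMMANDS and self.state is MachineState.ACTIVE:
            raise UnsafeCommandError(
                f"{self.name} is active; refusing to run {command}"
            )
        return f"{self.name}: ran {command}"


if __name__ == "__main__":
    active_db = Machine("db-1", MachineState.ACTIVE)
    try:
        active_db.execute("reinstall_os")
    except UnsafeCommandError as err:
        print(err)
```

In this sketch, a bug in the sending script would not be enough on its own to wipe a live machine, because the receiving machine makes the final decision.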
Dropbox also built a tool that should speed up recovery for large backups, should another downtime occur. The company plans to make this new tool open source “so others can benefit from what we’ve learned.” – Rappler.com