
I am building an app that is quickly moving into production, and I am concerned that, due to hacking, a silly personal error (like running rake db:schema:load or rake db:rollback), or some other circumstance, we may suffer data loss in one database table or even across the whole system.

While I don't find it likely that the above will happen, I would be remiss in not being prepared in case it ever does.

I am using Heroku's PG Backups (which is to be replaced with something else this month), and I also run automated daily backups to S3: http://trevorturk.com/2010/04/14/automated-heroku-backups/, successfully generating .dump files.
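
For context, the daily job amounts to roughly the following sketch; the app name and bucket are placeholders, and the pgbackups commands are the add-on's current ones, so they may change when the replacement arrives:

```
#!/bin/bash
# Sketch of a daily backup job: capture a fresh Heroku backup, download
# it, and copy it to S3. The app name and bucket are placeholders; the
# pgbackups commands may change when Heroku replaces PG Backups.
set -e

APP=myapp                            # placeholder Heroku app name
BUCKET=s3://myapp-db-backups         # placeholder S3 bucket

heroku pgbackups:capture --expire --app "$APP"
curl -s -o latest.dump "$(heroku pgbackups:url --app "$APP")"
s3cmd put latest.dump "$BUCKET/$(date +%Y-%m-%d).dump"
```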

What is the correct way to deal with data loss on a production app?

  1. How would I restore the .dump file in case I need to? Can I do a selective restore if a small part of the system is hit?
  2. In case a selective restore is not possible: assume one table loses data 4 hours after the last backup. Would fixing that table require rolling back 4 hours of users' activity? Is there a good solution to this?
  3. What is the best way to support users through the inconvenience if something like this happens?
sscirrus
  • If you're not *already* restoring the backups (on a non-production box), then you don't have a backup. – Craig Stuntz May 10 '11 at 17:57
  • @CraigStuntz - by this do you mean it's important to regularly restore backups to a kind of 'shadow' website? Or do you mean restoring them locally? What is the purpose of doing this if users only go to mysite.com? – sscirrus May 10 '11 at 19:35
  • The purpose of doing it is that backup tools can easily produce files which can't actually be restored in non-trivial installations. Backups are only good if they can actually be used to produce a working server. – Craig Stuntz May 10 '11 at 20:01
  • @CraigStuntz - do .dump files on S3 not count as being able to produce a working server? – sscirrus May 10 '11 at 20:03
  • The point of doing a backup is not "doing the backup". The point of doing a backup is to be able to restore from the backup. IT history is replete with stories of unusable backups, copies, and dumps. Don't be the next entry on the Daily WTF (http://www.thedailywtf.com) – Mike Sherrill 'Cat Recall' May 10 '11 at 21:45
  • @Catcall - ha! Very nice. I suppose my desire to NOT be in the Daily WTF is what's motivating these inquiries before disaster strikes :) – sscirrus May 10 '11 at 22:29

4 Answers


A full DR (disaster recovery) solution requires the following:

  1. Multisite. If a fire, flood, Osama Bin Laden, or what have you strikes the Amazon (or is it Salesforce?) data center that Heroku uses, you want to be sure that your data is safe elsewhere.
  2. On-going replication of the data to a separate site (or sites). That means that every transaction written to your database on one site is replicated within seconds to the mirror database on the other site. Most RDBMSs have mechanisms for master-slave replication like that.
  3. The same goes for anything you put on a filesystem outside of the database, such as images, XML configuration files etc. S3 is a good solution here - they replicate everything to multiple data centers for you.
  4. It won't hurt to create periodic (daily or so) dumps of the database and store them separately (e.g. on S3). This helps you recover from data corruption that propagates to the slave DBs.
  5. Automate the process of data recovery. You want this to just work when you need it.
  6. Test everything. Ideally, you want to automate the test process and run it periodically to ensure that your backups can actually be restored (a minimal sketch of such a check follows this list). Netflix's Chaos Monkey is an extreme example of this.
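
As a rough sketch of points 4-6, a periodic job could load the most recent dump into a throwaway database and sanity-check it; the dump path, scratch database name and checked table below are illustrative assumptions only:

```
#!/bin/bash
# Automated restore test: load the latest dump into a scratch database
# and run a basic sanity check. All names here are placeholders.
set -e

DUMP=latest.dump
SCRATCH_DB=restore_test

dropdb "$SCRATCH_DB" 2>/dev/null || true
createdb "$SCRATCH_DB"

# Heroku-style .dump files are pg_dump custom-format archives, so they
# are loaded with pg_restore rather than psql.
pg_restore --no-acl --no-owner -d "$SCRATCH_DB" "$DUMP"

# Crude check that the restore produced data; replace with whatever
# invariant matters for your schema.
psql -d "$SCRATCH_DB" -c "SELECT count(*) FROM users;"
```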

I'm not sure how you'd implement all this on Heroku. A complete solution is still priced out of reach for most companies - we're running this across our own data centers (one in the US, one in the EU) and it costs many millions. Work according to the 80-20 rule: ongoing backup to a separate site, plus a well-tested recovery plan (continuously test your ability to recover from backups), covers 80% of what you need.

As for supporting users, the best solution is simply to communicate promptly and truthfully when trouble happens, and to make sure you don't lose any data. If your users are paying for your service (i.e. you're not ad-supported), then you should probably have an SLA in place.

Elad
  • thanks a lot for this detailed answer; I especially liked the Netflix Chaos Monkey reference! I'm running a small startup, hence very little resources and a desire to do the very best job we can despite our limitations. I'm trying to see how I can set up a pretty resilient system using Heroku, which itself runs on Amazon. We have taken care of 1, 3, and 4 so far. – sscirrus May 10 '11 at 23:26
  • @sscirrus I understand - used to run a small startup myself a short while back. I think your next step should be #5, then #6 becomes a breeze. In any case, there are many ways for startups to fail, and data loss is hardly the most common one, so I'd prioritize building something that's generating enough value to be worthwhile protecting in the first place :) – Elad May 11 '11 at 05:33

With backups, you can never be 100 percent sure that no data will be lost. The best approach is to test them on another server. You must have at least two types of backup:

  • A database backup, like pg_dump. A dump is just SQL commands, so you can use it to recreate the whole database, just one table, or even just a few rows (see the sketch after this list). You lose any data added in the meantime.

  • A code backup, for example a git repository.
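
A minimal sketch of the first bullet, assuming direct access to the database; all names are placeholders. Note that Heroku's .dump files are in pg_dump's custom format rather than plain SQL, so those go through pg_restore instead (see the other answers):

```
# A plain-format dump is just SQL, so psql can replay it. Database and
# table names below are placeholders.

# Dump the whole database as SQL statements:
pg_dump mydb > mydb.sql

# Or dump only one table:
pg_dump -t users mydb > users.sql

# Recreate from the SQL dump into an empty database:
createdb mydb_restored
psql -d mydb_restored -f mydb.sql
```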

Hartator
  • I have a git repository and regular pg-dumps through Heroku. Looks like I'm passing the initial hurdle? :) – sscirrus May 10 '11 at 19:49

In addition to Hartator's answer:

  • use replication if your DB offers it, e.g. at least master/slave replication with one slave

  • do database backups on a slave DB server and store them externally (e.g. scp or rsync them off the server; see the sketch after this list)

  • use a good version control system for your source code, e.g. Git

  • use a solid deploy mechanism, such as Capistrano, and write custom tasks so nobody needs to run DB migrations by hand

  • have somebody you trust check your firewall setup and the security of your system in general
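
As a rough sketch of the second bullet (host names and paths are placeholders):

```
# Dump on the slave and push the file to a separate machine.
pg_dump -Fc mydb > /var/backups/mydb-$(date +%Y%m%d).dump
rsync -az /var/backups/ backup@offsite.example.com:/srv/db-backups/
```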

The DB dumps contain SQL commands to recreate all tables and all data... if you need to restore only one table, you can extract that portion from a copy of the dump file, (very carefully) edit it, and then restore with the modified dump file (for one table).
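
If the dump is in pg_dump's custom format (as Heroku's .dump files are), you can often avoid hand-editing entirely and let pg_restore pick out a single table. A rough sketch, with placeholder table and database names:

```
# Restore only one table from a custom-format dump into a scratch
# database for inspection.
createdb restore_check
pg_restore --no-acl --no-owner -t orders -d restore_check latest.dump

# Or list the archive's contents, trim the list to the items you want,
# and restore just those:
pg_restore -l latest.dump > items.list
# (edit items.list by hand, then:)
pg_restore --no-acl --no-owner -L items.list -d restore_check latest.dump
```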

Always restore first to an independent machine and check that the data looks right. E.g. you could use one slave server, take it offline, restore there locally, and check the data. It's good to have two slaves in your system; then while you restore to the second slave, the remaining system still has one master and one slave.

Tilo
  • do you know how the master/slave relationship would work with respect to Heroku? Thank you very much for your answer.. I'm having a little trouble determining which of these apply to me and which Heroku has already taken care of. – sscirrus May 10 '11 at 20:05

To simulate a fairly simple "total disaster recovery" on Heroku, create another Heroku project and replicate your production application completely (except use a different custom domain name).

You can add multiple remote git targets to a single git repository so you can use your current production code base. You can push your database backups to the replicated project, and then you should be good to go.
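
Concretely, that might look something like the following; the app names and backup URL are placeholders, and the pgbackups:restore command is the one the current add-on provides:

```
# Point a second git remote at the disaster-recovery Heroku app and
# deploy the same code base there. App names are made up.
git remote add dr git@heroku.com:myapp-dr.git
git push dr master

# Load the latest production backup into the DR app's database:
heroku pgbackups:restore DATABASE 'https://s3.amazonaws.com/myapp-db-backups/latest.dump' --app myapp-dr
```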

The only step missing from this exercise versus a real disaster recovery is assigning your production domain to the replicated Heroku project.

If you can afford to run two copies of your application in parallel, you could automate this exercise and have it replicate itself on a regular basis (e.g. hourly, daily) based on your data loss tolerance.

Steve Wilhelm