Automatic Offsite PostgreSQL Backups Without a Password

Topics: Command Line, Web Servers

When it comes to backups I’m paranoid and lazy. That means I need to automate the process of making redundant backups.

Pretty much everything to do with luxagraf lives in a single PostgreSQL database that gets backed up every night. To make sure I have plenty of copies of those backup files I download them to various other machines and servers around the web. That way I have copies of my database files on this server, another backup server, my local machine, several local backup hard drives, in Amazon S3 and Amazon Glacier. Yes, that’s overkill, but it’s all so ridiculously easy, why not?

Here’s how I do it.

Make Nightly Backups of PostgreSQL with pg_dump and cron

The first step is to regularly dump your database. To do that PostgreSQL provides the handy pg_dump command. If you want a good overview of pg_dump check out the excellent [PostgreSQL manual]. Here’s the basic syntax:

pg_dump -U user -hhostname database_name > backup_file.sql

So, if you had a database named mydb and you wanted to back it up to a file that starts with the name of the database and then includes today’s date, you could do something like this:

pg_dump -U user -hlocalhost mydb > mydb.`date '+%Y%m%d'`.sql

That’s pretty useful, but it’s also potentially quite a big file. Thankfully we can just pipe the results to gzip to compress them:

pg_dump -U user -hlocalhost mydb | gzip -c > mydb.`date '+%Y%m%d'`.sql.gz

That’s pretty good. In fact, for many scenarios that’s all you’ll need. Plug that into your cron file by typing crontab -e and adding this line to make a backup every night at midnight (one gotcha: cron treats % as a newline, so each one has to be escaped with a backslash):

0 0 * * * pg_dump -U user -hlocalhost mydb | gzip -c > mydb.`date '+\%Y\%m\%d'`.sql.gz

For a long time that was all I did. But then I started running a few other apps that used PostgreSQL databases (like an instance of Tiny Tiny RSS), so I needed to have quite a few lines in there. Plus I wanted to perform a VACUUM on my main database every so often. So I whipped up a shell script. As you do.

Actually, most of this I cobbled together from sources I’ve since lost track of. Which is to say, I didn’t write this from scratch. Anyway, here’s the script I use:

#!/bin/bash
#
# Daily PostgreSQL maintenance: vacuuming and backups.
#
##
set -e
for DB in $(psql -l -t -U postgres -hlocalhost |awk '{ print $1}' |grep -vE '^-|:|^List|^Name|template[0|1]|postgres|\|'); do
  echo "[`date`] Maintaining $DB"
  echo 'VACUUM' | psql -U postgres -hlocalhost -d $DB
  DUMP="/path/to/backup/dir/$DB.`date '+%Y%m%d'`.sql.gz"
  pg_dump -U postgres -hlocalhost $DB | gzip -c > $DUMP
  PREV="/path/to/backup/dir/$DB.`date -d'1 day ago' '+%Y%m%d'`.sql.gz"
  # Store only the hash, not the filename, so checksum files
  # from different days can be compared:
  md5sum -b $DUMP | awk '{ print $1 }' > $DUMP.md5
  if [ -f $PREV.md5 ] && diff $PREV.md5 $DUMP.md5; then
    rm $DUMP $DUMP.md5
  fi
done

Copy that code and save it in a file named psqlback.sh. Then make it executable:

chmod u+x psqlback.sh

Now before you run it, let’s take a look at what’s going on.

First we’re creating a loop so we can back up all our databases.

for DB in $(psql -l -t -U postgres -hlocalhost |awk '{ print $1}' |grep -vE '^-|:|^List|^Name|template[0|1]|postgres|\|'); do

This looks complicated because we’re using awk and grep to parse some output, but basically all it’s doing is querying PostgreSQL for a list of all the databases (using the postgres user so we can access all of them). Then we pipe that to awk and grep to extract the name of each database and ignore a bunch of stuff we don’t want.

Then we store the name of the database in the variable DB for the duration of the loop.
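To see what that pipeline actually extracts, here’s the awk and grep portion run against some canned output in the shape psql -l -t produces (the database names are invented for the example):

```shell
# Simulated `psql -l -t` output piped through the same filters;
# the template databases, the postgres database, and separator
# noise are all stripped, leaving only real database names:
printf ' mydb      | me       | UTF8\n template0 | postgres | UTF8\n template1 | postgres | UTF8\n postgres  | postgres | UTF8\n' \
  | awk '{ print $1}' \
  | grep -vE '^-|:|^List|^Name|template[0|1]|postgres|\|'
# prints: mydb
```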

Once we have the name of the database, the script outputs a basic logging message that says it’s maintaining the database and then runs VACUUM.

The next two lines should look familiar:

DUMP="/path/to/backup/dir/$DB.`date '+%Y%m%d'`.sql.gz"
pg_dump -U postgres -hlocalhost $DB | gzip -c > $DUMP

That’s very similar to what we did above; I just stored the file path in a variable because it gets used again. The next thing we do is build the name of yesterday’s dump (note the full path, so the comparison below looks in the right directory):

PREV="/path/to/backup/dir/$DB.`date -d'1 day ago' '+%Y%m%d'`.sql.gz"

Then we compute an md5 checksum of the dump. We keep just the hash itself; md5sum normally records the filename too, and since the filename changes every day, that would break the comparison:

md5sum -b $DUMP | awk '{ print $1 }' > $DUMP.md5

Then we compare that to yesterday’s sum, and if they’re the same we delete today’s dump, since we already have a copy.

  if [ -f $PREV.md5 ] && diff $PREV.md5 $DUMP.md5; then
    rm $DUMP $DUMP.md5
  fi

Why? Well, there’s no need to store a new backup if it matches the previous one exactly. Since sometimes nothing changes on this site for a few days, weeks, months even, this can save a good bit of space.

Okay now that you know what it does, let’s run it:

./psqlback.sh

If everything went well it should have asked you for a password and then printed out a couple messages about maintaining various databases. That’s all well and good for running it by hand, but who is going to put in the password when cron is the one running it?

Automate Your Backups with cron

First let’s set up cron to run this script every night around midnight. Open up crontab:

crontab -e

Then add a line to call the script every night at 11:30PM:

30 23 * * * /home/myuser/bin/psqlback.sh > /home/myuser/psqlback.log 2>&1

You’ll need to adjust the path to match your server, but otherwise that’s all you need (if you’d like to run it less frequently or at a different time, you can read up on the syntax in the cron manual).
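If you want something other than nightly, only the timing fields change. A couple of illustrative variations (the paths here assume the same layout as the line above):

```shell
# Every Sunday at 2:15 AM:
15 2 * * 0 /home/myuser/bin/psqlback.sh > /home/myuser/psqlback.log 2>&1

# The first of every month at 4:00 AM:
0 4 1 * * /home/myuser/bin/psqlback.sh > /home/myuser/psqlback.log 2>&1
```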

But what happens when we’re not there to type in the password? Well, the script fails.

There are a variety of ways we can get around this. In fact the PostgreSQL docs cover everything from LDAP auth to peer auth. The latter is actually quite useful, though a tad bit complicated. I generally use the easiest method — a password file. The trick to making it work for cron jobs is to create a file in your user’s home directory called .pgpass.

Inside that file you can provide login credentials for any user on any port. The format looks like this:

hostname:port:database:username:password

You can use * as a wildcard if you need it. Here’s what I use:

localhost:*:*:postgres:passwordhere

I hate storing a password in a plain text file, but I haven’t found a better way to do this.

To be fair, assuming your server security is otherwise solid, the .pgpass method should be fine. Also note that Postgres will ignore this file if its permissions are looser than 600 (that is, if anyone other than your user can read or write it). Let’s set that:

chmod 600 .pgpass

Now we’re all set. Cron will run our script every night at 11:30 PM and we’ll have a compressed backup file of all our PostgreSQL data.
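One habit worth adding: every so often, verify that a backup file is actually usable. This is just a sketch using a stand-in file rather than a real dump; gzip -t catches truncated or corrupted archives, and restoring a real dump is simply the original pipe run in reverse.

```shell
# Make a stand-in "dump" so there's something to check:
echo 'SELECT 1;' | gzip -c > mydb.sample.sql.gz

# gzip -t exits non-zero if the archive is truncated or corrupt:
if gzip -t mydb.sample.sql.gz; then
  echo "archive OK"
fi

# Restoring a real dump needs a live server, so it's only shown
# as a comment here:
#   gunzip -c mydb.20200101.sql.gz | psql -U user -hlocalhost mydb

rm mydb.sample.sql.gz
```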

Automatically Moving It Offsite

Now we have our database backed up to a file. That’s a start. That saves us if PostgreSQL does something wrong or somehow becomes corrupted. But we still have a single point of failure — what if the whole server crashes and can’t be recovered? We’re screwed.

To solve that problem we need to get our data off this server and store it somewhere else.

There are quite a few ways we could do this and I have done most of them. For example, we could install s3cmd and send the files over to an Amazon S3 bucket. I actually do that, but it requires you to pay for S3. In case you don’t want to do that, I’m going to stick with something that’s free: Dropbox.

Head over to the Dropbox site and follow their instructions for installing Dropbox on a headless Linux server. It’s just one line of copying and pasting, though you will need to authorize Dropbox with your account.

BUT WAIT

Before you authorize the server to use your account, well, don’t. Go create a second account solely for this server. Do that, then authorize that new account for this server.

Now go back to your server and symlink the folder you put in the script above, into the Dropbox folder.

cd ~/Dropbox
ln -s ~/path/to/pgbackup/directory .

Then go back to Dropbox, log in to the second account, find the database backup folder you just symlinked in, and share it with your main Dropbox account.

This way, should something go wrong and the Dropbox folder on your server become compromised, at least the bad guys only get your database backups and not the entirety of your documents folder or whatever might be in your normal Dropbox account.

Credit to Dan Benjamin, who was the first person I heard mention this dual-account idea.

The main thing to note about this method is that you’re limited to 2GB of storage (the max for a free Dropbox account). That’s plenty of space in most cases. Luxagraf has been running for more than 10 years, stores massive amounts of geodata in PostgreSQL, along with close to 1000 posts of various kinds, and a full compressed DB dump is still only about 35MB. So I can store well over 60 days worth of backups, which is plenty for my purposes (in fact I only store about half that).
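If you’d rather enforce a hard cap than rely on the deduplication trick above, a find one-liner in cron will prune old files. The 30-day cutoff is just an example; the wildcard pattern also catches the .md5 files the script writes:

```shell
# Every day at 1:00 AM, delete backups older than 30 days:
0 1 * * * find /path/to/backup/dir -name '*.sql.gz*' -mtime +30 -delete
```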

So create your second account, link your server installation to that and then share that folder with your main Dropbox account.

The last thing I suggest you do, because Dropbox is a syncing service and not a backup service, is copy the backup files out of the Dropbox folder on your local machine to somewhere else on that machine. Not move, but copy. Leave a copy in Dropbox and make a second copy on your local machine, outside of the Dropbox folder.

If you dislike Dropbox (I don’t blame you, I no longer actually use it for anything other than this) there are other ways to accomplish the same thing. The already mentioned s3cmd could move your backups to Amazon S3, good old scp could move them to another server and of course you can always download them to your local machine using scp or rsync (or SFTP, but then that wouldn’t be automated).
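For the scp and rsync routes, the one-liners look something like this (the hostname and destination paths are placeholders); drop either into cron on the receiving machine and the download becomes automatic too:

```shell
# Mirror the whole backup directory down to a local machine:
rsync -avz user@myserver:/path/to/backup/dir/ ~/backups/db/

# Or grab a single night's dump with scp:
scp user@myserver:/path/to/backup/dir/mydb.20200101.sql.gz ~/backups/db/
```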

Naturally I recommend you do all these things. I sync my nightly backups to my local machine with Dropbox and scp those to a storage server. Then I use s3cmd to send weekly backups to S3. That gives me three offsite backups which is enough even for my paranoid, digitally distrustful self.