Consistent Backups from PostgreSQL
PostgreSQL provides multi-version concurrency control. Multi-version concurrency control makes it easier to create database consistent backups. Snapshots are also possible; this is equivalent to recovering from a PostgreSQL server crash.
Logical Full Backups
pg_dumpall--create logical backups of the database into a backup file. To avoid file size restrictions due to filesystem limits, I sometimes send the output of these commands to the Unix
split command. These tools can run on any machine (not just on the machine running PostgreSQL postmaster server). For example, here's how to dump a Postgres database to multiple files with names starting with pg_backup:
# pg_dump postgresdb | split -b 1024m - pg_backup
Raw (Physical) Full Backups: Continuous Archiving
PostgreSQL 8.1 keeps track of all changes made to the database data files in Write-Ahead Logs (WALs) under the pg_xlog subdirectory of the cluster data directory. You can archive these WALs in a different directory before the database reuses them by copying them to a directory backed up by a network backup tool.
To copy the logs to the /backup/postgres/WAL directory, add a following shell command to your postgresql.conf file:
archive_command = 'cp -i %p /backup/postgres/WAL/%f < /dev/null'
Note: It is necessary to backup configuration files such as postgresql.conf, pg_hba.conf, and pg_ident.conf separately. To do a full raw online backup, as the database superuser:
- Execute the SQL command
SELECT pg_start_backup('backup_id1');The backup information (backup label, time of command execution, name of first WAL segment file) will be in the the /backup/postgres/backup_id1 file.
- Perform a backup of the filesystem backup of the database data directory (/usr/local/pgsql/data). Exclude the WAL files under the pg_xlog subdirectory. The archive command will back up these files.
- Execute the SQL command to inform that backup is complete:
SELECT pg_stop_backup();Stopping the backup creates a backup history file in the WAL archive location. The starting WAL file is used as part of the backup history file.
After executing the stop command, the WAL archival operation will be complete. Now you can back up all files under the WAL archival and the database data directory using a network backup tool.
All automatically archived WAL logs are usable as incremental backups. Use the network backup tool to perform incremental backups of your WAL archive directory.
Network-based Backup and Recovery
After getting a consistent data copy using replication or snapshots, you can use network-based backup and recovery tools to backup the data into various media, such as tapes, disks, optical devices, and NAS appliances.
Amanda is the most popular open source backup and archiving tool. Amanda is an active SourceForge project with tens of thousands of users and hundreds of developers. It has been part of the public domain since 1991 and is available as part of various Linux and BSD distributions. Latest packages are available from the Amanda downloads page. As of this writing, Amanda 2.5 is the latest stable release.
Amanda can backup multiple servers running the LAMP applications in the same backup run or in different backup runs. You can run an Amanda server on one of the LAMP application servers or on a dedicated Amanda backup server. Amanda will be active only during a backup run and will be idle at other times.
One of the unique features of Amanda is the consistent backup window. Amanda attempts to backup the same amount of data from the various backup clients running LAMP applications in each backup run. Amanda will distribute the full backup for each client (actually each backup unit) over the backup cycle days to achieve same amount of backup size each run. For system administrators it provides a consistent backup window for each backup run without having to tune the backup configuration. Figure 2 shows three client filesystems backed up daily by Amanda. In the first backup cycle, Amanda performs a full backup for all filesystems. Amanda, based on the amount of backup daily media available and amount of data changes in each client filesystem, constructs a backup plan of backup levels (full, incremental backup levels 1-9) for each run. A full backup is backup level 0. All filesystems have at least a full backup done every backup cycle days (Amanda configuration parameter). As the figure shows, the amount of data backed up is almost the same for each run resulting in consistent backup windows.
Figure 2: The amount of data backed up for each daily run
Amanda uses platform tools for backup, data compression, and encryption. It uses tools such as GNU tar, filesystem dump, and Schily tar for backup and can use any tool that is available in the client. Amanda uses the tool's data format on the backup media. This feature allows the possibility of restoring data without using Amanda. All Amanda backup media is readable using
mt commands. In fact, the Amanda media file header has the command to restore the media as a text string.
Here's how to use
dd to display Amanda tape file header:
# dd if=<amanda_media_dev> bs=32k count=1 AMANDA: FILE 20060228 natasha /boot lev 1 comp N program /bin/gtar
To restore this backup, position the tape at the start of the file and run:
dd if=<tape> bs=32k skip=1 | /bin/gtar -f... - 1+0 records in 1+0 records out
Amanda can backup to tapes, disks, optical media, media changers, and RAIT. RAIT stands for redundant array of tapes--in fact, it works for any media volume. In concept it is similar to RAID and works by striping the backup data across multiple media volumes and the parity information in another media volume. RAIT with a two-volume set is same as mirroring backup data across two media volumes, e.g., you can backup to disk (for immediate recovery needs) and to tape (for archival needs) simultaneously.
Amanda allows a lot of configuration flexibility in specifying how to perform backups, even down to the individual file granularity level. Users can specify the type of compression and encryption, whether to perform them on the Amanda client or server, and which network to be used for backups for each backup unit.
To reduce the complexity introduced by the configuration flexibility, configuration tools are available for first-time Amanda users. Amanda supports both data consistency mechanisms for LAMP application backup--snapshots as well as replication. Amanda server is scalable and can backup hundreds of LAMP application servers in each backup run. The commands to prepare the data for consistent backups can be done as a pre-backup action before the Amanda backup. Because Amanda can use any platform command for backing up applications using Amanda, it is easy to use different data protection mechanisms for different LAMP application servers in the same Amanda configuration.
Amanda developers are working on an application API that will make addition of new backup programs for applications easier. Some Amanda users backup LAMP application data, including MySQL databases, to a temporary directory in a filesystem. Later, Amanda's backup run grabs the filesystem directory. This method allows quicker recovery from the temporary filesystem and recovery from Amanda media for longer term needs. Amanda users and developers use the Amanda wiki for documenting the project and incorporating Amanda into the overall IT processes.
To recover data from the backup archive, use the Amanda recovery tools:
amrecover tool allows users to browse the index database and choose the backups to restore. Any machine with the Amanda client software installed can use this tool. It is always advisable to restore backups to a temporary directory location. Application files can be restored to the correct location on the LAMP application server as well as MySQL and PostgreSQL databases from
pg_dump backup files, respectively. Remember that you should perform full a backup restoration only when the database and application process are not running.
Here's how to perform a complete restoration for database
mysqldump and the backup file mysql_backup:
# mysql database1 < mysql_backup
Here's how to perform a complete restoration of a Postgres database from
pg_dump using backup files starting with pg_backup:
# cat pg_backup.* | psql postgresdb
To restore a PostgreSQL database from the raw backup files, recreate the PostgreSQL cluster data directory from the restored files. It is advisable to make copy of the existing contents before restoring files from the backup. The pg_xlog sub-directory can contain WAL that were not archived. The recovery process uses the recovery.conf file in the cluster data directory. This file contains the location of the archived WAL logs. Here is an example recovery.conf file:
restore_command = 'cp /backup/postgres/WAL/%f %p'
You can perform incremental restores from the MySQL binary logfiles to the temporary directory by Amanda. The incremental restore can take place at a start time to an end time or from a starting log position to ending log position in the binary logs. The log position can be used to fix operator errors using the backup files.
Here's an incremental restore of all MySQL database changes performed up to May 1, 2006:
# mysqlbinlog --stop-date="2006-05-01 12:00:00" backup-logs.[0-9]* | mysql -u <user name> -p <password>
Perform PostgreSQL incremental restores by using settings in recovery.conf in the cluster data directory. It is possible to specify end timestamps for recovery or end transaction ID for recovery. When the postmaster process restarts, it will use the
restore_command setting to retrieve the archived WAL segments up to the specified stop point--either the date/time stamp or the specific transaction ID.
It is critical to test the data recovery procedure for LAMP applications on a regular basis.
Security is important for any data protection process. The data must be secure in the backup media, which often means using encryption. You should track the keys and keep them in a secure location also under backup. Amanda's flexibility in encrypting data on the backup client where LAMP applications are running or on the Amanda server provides more options to secure the data. Use secure communication during the backup with tools such as OpenSSH or through network tunneling.
The MySQL and PostgreSQL database backup and restore users should have only the minimum privileges required to back up and restore the database. For example, the minimum privileges for a backup user for the
mysqldump tools are
Several open source data protection tools are available for LAMP applications. These tools can help you create secure backups for LAMP applications. Increased use of LAMP technologies for web applications, data warehousing applications, and enterprise applications make it necessary to protect the customer data used by these applications.
The many new and exciting open source developments in the data protection field, such as creating backups while maintaining database referential integrity and using storage available over the internet as backup media, allow the creation of just the right LAMP backup solution for an individual site's needs.
Paddy Sreenivasan is the Vice President of Engineering and a co-founder of Zmanda, Inc.
Return to O'Reilly Databases