

Data Protection for LAMP Applications

by Paddy Sreenivasan
07/13/2006

The number of web and enterprise applications based on the LAMP application stack is growing. LAMP applications use Linux or BSD as the operating system, Apache as the web server, MySQL or PostgreSQL as the application database, and PHP, Perl, or Python as the programming language. Some applications also use JBoss as middleware.

An often overlooked aspect of a LAMP application solution is the protection of the application and configuration data. This article examines how to use available open source tools to protect LAMP application data. The security of the application data itself and the securing of the LAMP application servers are beyond the scope of this article. It is also important to test data recovery scenarios before the actual need arises.

LAMP Application Data Characteristics

From a data protection perspective, one key characteristic of LAMP application and configuration data is that it resides on filesystems or in MySQL or PostgreSQL databases. The Apache configuration usually lives in a directory such as /etc/httpd/, and PHP configuration files are often in /etc/php/. MySQL and PostgreSQL data, index, and configuration files are also commonly stored in filesystems. Another characteristic is that LAMP applications may be distributed across multiple servers. Distributed LAMP application servers provide more scalability and availability for the applications. Enterprise and web applications may require high levels of availability, with very little downtime available for backups.

These characteristics of LAMP applications data (database and filesystem storage, as well as cross-server distribution) can make regular backups challenging. The system administrator needs to pay attention to the data in all layers of the LAMP application in order to get a consistent data backup for the whole LAMP application stack. Figure 1 shows a sample application and its data.

Figure 1. A LAMP application and its data

The type of data protection solution to implement depends on its impact on application performance, application availability, the types of failures to recover from, and the cost of implementing the solution. One of the key requirements for backed-up data is the level of data consistency required. There are two levels of data consistency: crash consistency, where the backup reflects the state the data would have after a sudden server crash, and application consistency, where the application has quiesced its I/O so the backup reflects a clean point in time.

Very few applications can temporarily quiesce their I/O at a consistent point in time to allow for backups; database servers are among the few that provide such a capability.

Backup recovery often takes place to recover from operator errors, disk/filesystem failures, application data corruption due to bugs, or complete application host failures. The data protection solution you choose depends on the type of failures from which you want to recover and how quickly you need to recover your data.

Consistent MySQL Backups

MySQL provides two tools for backing up databases. Both are open source and available as part of the MySQL distribution. The mysqldump tool creates a logical backup of the database, whereas the mysqlhotcopy tool creates a physical (or raw) backup. A logical backup archive contains the MySQL statements needed to reconstruct the database data, whereas a physical backup archive contains the database data, indexes, and logs in raw, database-specific format. Neither utility by itself maintains database referential integrity; they do not back up all of the tables at the same time or to the same consistency point.

The tool to use for backing up MySQL depends on the storage engines used by the tables in the database. Transaction-capable storage engines, including InnoDB, BDB, Solid, and (in the future) MySQL's Falcon, support transactional backups with no effect on the database application's availability. These backups are usually referred to as "hot" or "live" backups. Storage engines that do not support transactions, such as MyISAM, ARCHIVE, and MERGE, require obtaining a global read lock on the database and flushing active database data and logs to disk.
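
To find out which storage engines your tables actually use, you can query the information_schema database (a minimal sketch; information_schema is available in MySQL 5.0 and later):

mysql -e "SELECT table_schema, table_name, engine
          FROM information_schema.tables
          WHERE table_schema NOT IN ('information_schema', 'mysql')"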

Logical Full Backups with mysqldump

A logical backup consists of SQL statements (e.g., CREATE TABLE, INSERT INTO) that can re-create the complete database or a table in the database. The mysqldump tool creates logical backups of MySQL databases or tables for all storage engines. The necessary command-line options depend on whether the underlying storage engine supports transactions:

mysqldump [options] db_name [tables]

mysqldump [options] --databases db_name1 [db_name2 db_name3...]

mysqldump [options] --all-databases

Two common mysqldump options are --opt and --extended-insert; they generate optimized SQL statements, which speeds up the restore operation and produces smaller backup files. The --opt option is also necessary to create correct DDL statements (CREATE TABLE). The --lock-all-tables and --flush-logs options lock all tables in the database and flush the data to disk; all pending transactions are committed. These options are necessary for non-transactional storage engines such as MyISAM, ARCHIVE, and MERGE. Note that obtaining a read lock on all tables of an active database during the backup may not be feasible.

If you are using snapshot or replication data consistency mechanisms, it may not be necessary to lock all tables and flush logs. The --single-transaction option can create consistent backups for transactional storage engines such as InnoDB and BDB. Use either --lock-all-tables or --single-transaction; the two are mutually exclusive. --master-data=2 writes the binary logfile name and position to the backup output as a commented-out SQL statement, which helps with point-in-time recovery.
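
For example, to back up all databases whose tables use only transactional engines such as InnoDB, without blocking other clients (a sketch; the output file name is a placeholder):

mysqldump --opt --extended-insert --single-transaction \
    --routines --triggers --master-data=2 --all-databases > full_backup.sql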

For example, to back up all databases containing only MyISAM tables:

mysqldump --opt --extended-insert --lock-all-tables --flush-logs \
    --routines --triggers --master-data=2 --all-databases

One advantage of logical backups is that you can restore the data to another database server; a logical backup is independent of the server architecture and filesystem type. It is also possible to correct operator errors (such as erroneous DROP TABLE statements) by editing the logical backup before restoring. This is a significant advantage compared to raw backups.

You can also run the mysqldump tool on a different machine from the MySQL server. The disadvantage of logical backups is that the backup file can be larger than the actual database (sometimes twice the size). Restoration from a logical backup can also take significant time, because all SQL statements in the backup have to be re-run during the restore process.

Raw (Physical) Full Backups with mysqlhotcopy

A raw backup consists of the database data in a binary format and differs across machine architectures. The mysqlhotcopy tool is a Perl script that copies the data, index, and data dictionary files to the local machine or to a remote machine. It takes advantage of the fact that MySQL stores database table data, indexes, and logfiles in the database data directory configured by the database administrator. Consequently, mysqlhotcopy works only for non-transactional storage engines; it does not work with transactional storage engines, whose tables may not be stored in the database data directory.

The tool obtains a read lock on the tables or databases being backed up and flushes the database pages to disk before performing the backup.

mysqlhotcopy --flushlog db_name_1 ... db_name_n /path/to/new_directory

Restoring from raw backups is significantly faster than restoring from logical backups because there are no SQL statements to execute, and the raw backup size is the same as the size of the database data and logs. Because it is a raw copy, however, it is not possible to correct operator errors.

Most engines that support transactions provide tools that help create raw backups. The SOLID storage engine provides its own backup tool for binary backups. InnoDB provides a commercial tool (ibbackup) that creates raw backups of InnoDB tables, as well as the innobackup script, which backs up InnoDB tables along with the data dictionary (.frm files) and any MyISAM tables (using the mysqldump tool).

Incremental Backups

The MySQL server can maintain binary logs that contain all events (MySQL statements) or updated data. Binary logging must be enabled to perform incremental MySQL backups. Binary logs track all changes to the data in the database and can be replayed later to restore the database, but enabling binary logging can reduce database performance.

You can configure MySQL to put the binary log on a different disk from the MySQL data directory to reduce the performance impact. This will also help improve availability. Here's the command to enable binary logs with the prefix backup_logs:

# mysqld --log-bin=backup_logs

MySQL rotates its binary logs when the server restarts; you can also rotate the logs to mark the end of an incremental backup with the mysqladmin flush-logs command. You can then back up the binary logs created since the last full or incremental backup using any network backup tool, as the sketch below shows.
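
Here is one way to script that (a minimal sketch; it assumes the binary logs live in /var/lib/mysql with the backup_logs prefix from the example above, that /backup/mysql-binlogs is the staging directory your backup tool covers, and that credentials come from ~/.my.cnf):

#!/bin/sh
# Rotate the binary log so the current logfile is closed.
mysqladmin flush-logs
cd /var/lib/mysql || exit 1
# Copy every closed log (all but the newest, which is now active)
# to the staging directory for the network backup tool.
ls backup_logs.[0-9]* | sort | head -n -1 | \
    while read f; do cp -p "$f" /backup/mysql-binlogs/; done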

Data Consistency Using Snapshots

You can store your MySQL database data directory in logical volumes using the Logical Volume Manager (LVM) or the Enterprise Volume Management System (EVMS). Using logical volumes to store data and index files allows the volume to extend across physical disks. (You can even extend the filesystem afterward with the filesystem's grow command, such as xfs_growfs.)

Mirroring logical volumes can increase database availability, and both LVM and EVMS allow taking a snapshot of a logical volume's contents onto another logical volume. The snapshot volume is a copy-on-write volume: data gets written to the snapshot volume only when a data block on the original volume changes. No data copying takes place when you create the snapshot.

To create a consistent copy of the database, create a snapshot of the logical volume that contains the MySQL database data directory. It is necessary to flush tables and obtain a read lock before taking a snapshot, and obtaining a read lock on an active database might take some time. Because snapshots are copy-on-write copies, the snapshot operation itself does not take long, and you can release the lock right after the snapshot completes rather than holding it during the backup process. It is not necessary to obtain a read lock for tables with transactional storage engines; when you restore such a backup copy, the database recovers from its transaction logs. In fact, obtaining a read lock on a transactional storage engine's tables commits all pending transactions, which can significantly affect performance.

Here are the steps to take a snapshot.

  1. Obtain a read lock on the database and flush the logs. This step is necessary if the database contains tables with non-transactional storage engines, or if you must back up specific tables with non-transactional storage engines.

    mysql> FLUSH TABLES WITH READ LOCK;
  2. Create a snapshot volume for the logical volume where the database data directory resides (/dev/mysqlvolume in this example). The following LVM command creates a snapshot volume, mysqlbackup, of size 50 MB. The data changes during the backup cannot exceed 50 MB. Usually, a snapshot volume size is 15% to 20% of the original volume size. If the database is really active, you might need a larger snapshot volume size.

    # lvcreate -L50M -s -n mysqlbackup /dev/mysqlvolume
  3. Release the read lock on the database if you obtained one in Step 1.

    mysql> UNLOCK TABLES;
  4. Mount the snapshot volume in a different directory.

    # mount /dev/mysqlbackup /backup
  5. Use a network-based backup tool to back up the database backup files from the /backup directory and the application files to backup media, such as disks, tapes, or optical media.
  6. Unmount and remove the snapshot logical volume.

    # umount /backup
    # lvremove -f /dev/mysqlbackup
Using EVMS instead of LVM for logical volumes provides additional advantages and reduces the complexity of the backup procedure. EVMS allows you to expand snapshot volumes, which takes care of long backup windows when the initial snapshot volume size turns out to be too small.

EVMS also allows you to create a permanent snapshot volume: reinitialize it at the start of the backup instead of creating it in Step 2, and forgo removing it at the end. The snapshot volume then acts as a data backup, and snapshot rollback provides a quick restore mechanism.

Using filesystem freeze and unfreeze operations along with data snapshots can provide better data consistency for the database and the application files. Freezing is a feature of filesystems such as XFS and GFS, and it works in conjunction with LVM snapshots to obtain a consistent view of the database and application files. If you use freezes, freeze before Step 2 and unfreeze after Step 2. If you use the ext3 filesystem for the LAMP application data and have enabled metadata journaling, you can skip the freeze operation.

To freeze the XFS filesystem /var/lib/mysql in preparation for the backup:

# xfs_freeze -f /var/lib/mysql

To unfreeze the XFS filesystem:

# xfs_freeze -u /var/lib/mysql
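
Putting the steps together, a pre-backup script might look like the following. This is a minimal sketch under several assumptions: the database uses only transactional (InnoDB) tables, so no read lock is needed; the data directory sits on the XFS filesystem /var/lib/mysql backed by the logical volume /dev/mysqlvolume from the example above; and the snapshot device appears as /dev/mysqlbackup. For MyISAM tables, you would additionally hold FLUSH TABLES WITH READ LOCK from a single client session across the snapshot, as in Steps 1 and 3.

#!/bin/sh
# Freeze the filesystem so the snapshot is consistent.
xfs_freeze -f /var/lib/mysql
# Take the copy-on-write snapshot (fast; no data is copied).
lvcreate -L50M -s -n mysqlbackup /dev/mysqlvolume
# Resume I/O immediately; the backup reads from the snapshot.
xfs_freeze -u /var/lib/mysql
# Mount the snapshot read-only (-o nouuid avoids the XFS
# duplicate-UUID complaint when mounting a snapshot).
mount -o ro,nouuid /dev/mysqlbackup /backup
# ... run the network backup tool against /backup here ...
umount /backup
lvremove -f /dev/mysqlbackup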

Data Consistency Using Replication

MySQL allows the creation of replication slaves for a master server. A master can support multiple replication slaves, and a replication slave can itself act as a master server. All transactions executed on the master replay on the slave asynchronously: the MySQL statements are sent to the slave server for execution. The slave can lag behind the master if it is slow in executing those statements. This statement-based replication is available for all storage engines.

MySQL 5.1 also supports row-based replication, in which the master server writes to its binary log how each statement affects the table rows and replicates that log to the slave. Row-based replication correctly handles constructs that are unsafe for statement-based replication, such as non-deterministic user-defined functions in stored routines and triggers. Statement-based replication, on the other hand, produces smaller logs that are easier to audit.

There is also a "mixed" mode of replication in MySQL 5.1; it uses statement-based replication by default and switches to row-based replication for MySQL statements that use functions that would give different results when executed on the master and the slave server.
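
You select the replication format on the master at startup or in my.cnf (a sketch of the relevant lines; the binlog-format option is available in MySQL 5.1):

[mysqld]
log-bin       = backup_logs
# STATEMENT, ROW, or MIXED
binlog-format = MIXED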

Replication is often used for database high availability and load balancing, but another use is backup. The replication slave has a consistent view of the database, and you can back it up using any of the MySQL backup methods described earlier.

Assuming that you have a replication slave set up for backups, here are the steps.

  1. Stop replication and check the slave status on the slave server to confirm that replication has stopped:

    mysql> STOP SLAVE;
  2. Capture the master logfile name and position:

    mysql> SHOW SLAVE STATUS;
  3. Perform any of the previous MySQL backup methods to capture the database to files.
  4. Add a CHANGE MASTER statement to the backup file using the values obtained in Step 2. This records the master logfile name and the position within it, which makes backup restoration easier.
  5. Use Amanda to back up the database backup files and application files to backup media, such as disks, tapes, or optical media.
  6. Start the slave process:

    mysql> START SLAVE;
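
As a sketch, the slave-side procedure can be scripted as follows (assuming credentials come from ~/.my.cnf, a mysqldump-based backup, and /backup/mysql as the staging directory that Amanda later picks up):

#!/bin/sh
# Pause replication so the slave holds a consistent view (Step 1).
mysql -e "STOP SLAVE;"
# Record the master logfile name and position for the
# CHANGE MASTER statement in Step 4 (Step 2).
mysql -e "SHOW SLAVE STATUS\G" > /backup/mysql/slave_status.txt
# Dump all databases while replication is paused (Step 3).
mysqldump --opt --all-databases > /backup/mysql/full_backup.sql
# Resume replication (Step 6).
mysql -e "START SLAVE;"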

Using replication for backup does not protect against operator errors, because such errors are replicated as well. A replication solution is more expensive than a snapshot solution because it needs additional MySQL hosts. On the other hand, if you already use replication for availability or load balancing, it is easy to use it for backup purposes as well.

Consistent Backups from PostgreSQL

PostgreSQL provides multi-version concurrency control (MVCC), which makes it easier to create consistent database backups. Snapshot backups are also possible; restoring from a snapshot is equivalent to recovering from a PostgreSQL server crash.

Logical Full Backups

Two tools--pg_dump and pg_dumpall--create logical backups of the database into a backup file. To avoid file size restrictions due to filesystem limits, I sometimes send the output of these commands to the Unix split command. These tools can run on any machine, not just the machine running the PostgreSQL postmaster server. For example, here's how to dump a Postgres database to multiple files with names starting with pg_backup:

# pg_dump postgresdb | split -b 1024m - pg_backup

Raw (Physical) Full Backups: Continuous Archiving

PostgreSQL 8.1 keeps track of all changes made to the database data files in write-ahead logs (WALs) under the pg_xlog subdirectory of the cluster data directory. You can archive these WALs before the database reuses them by copying them to a directory that a network backup tool backs up.

To copy the logs to the /backup/postgres/WAL directory, add the following shell command to your postgresql.conf file:

archive_command = 'cp -i %p /backup/postgres/WAL/%f < /dev/null'

Note: it is necessary to back up configuration files such as postgresql.conf, pg_hba.conf, and pg_ident.conf separately. To perform a full raw online backup, as the database superuser:

  1. Execute the SQL command SELECT pg_start_backup('backup_id1');. The backup information (backup label, time of command execution, name of the first WAL segment file) is recorded in a backup label file in the cluster data directory.
  2. Perform a filesystem backup of the database data directory (/usr/local/pgsql/data). Exclude the WAL files under the pg_xlog subdirectory; the archive command backs up those files.
  3. Execute the SQL command SELECT pg_stop_backup(); to signal that the backup is complete. Stopping the backup creates a backup history file in the WAL archive location, named after the starting WAL segment.

After the stop command executes, the WAL archival operation is complete. Now you can back up all files under the WAL archive directory, together with the filesystem backup of the database data directory, using a network backup tool.
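
A minimal sketch of the whole procedure (assuming the cluster data directory /usr/local/pgsql/data, a superuser connection to the postgres database, and GNU tar):

#!/bin/sh
# Mark the start of the online base backup.
psql -d postgres -c "SELECT pg_start_backup('backup_id1');"
# Copy the data directory, skipping pg_xlog: archive_command is
# already saving the WAL segments separately.
tar --exclude=pg_xlog -czf /backup/postgres/base_backup.tar.gz \
    -C /usr/local/pgsql data
# Mark the end of the backup; this creates the backup history file.
psql -d postgres -c "SELECT pg_stop_backup();"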

Incremental Backups

All automatically archived WAL logs are usable as incremental backups. Use the network backup tool to perform incremental backups of your WAL archive directory.

Network-based Backup and Recovery

After getting a consistent data copy using replication or snapshots, you can use network-based backup and recovery tools to back up the data to various media, such as tapes, disks, optical devices, and NAS appliances.

Amanda is the most popular open source backup and archiving tool. Amanda is an active SourceForge project with tens of thousands of users and hundreds of developers. It has been freely available since 1991 and is included in various Linux and BSD distributions. The latest packages are available from the Amanda downloads page. As of this writing, Amanda 2.5 is the latest stable release.

Amanda can back up multiple servers running LAMP applications in the same backup run or in different backup runs. You can run the Amanda server on one of the LAMP application servers or on a dedicated Amanda backup server. Amanda is active only during a backup run and idle at other times.

One of Amanda's unique features is its consistent backup window. Amanda attempts to back up the same amount of data from the various backup clients running LAMP applications in each backup run. To do so, it distributes the full backup of each client (actually, each backup unit) across the days of the backup cycle, so each run moves roughly the same amount of data. For system administrators this provides a consistent backup window for each run without manual tuning of the backup configuration. Figure 2 shows three client filesystems backed up daily by Amanda. In the first backup cycle, Amanda performs a full backup of all filesystems. Based on the amount of backup media available daily and the amount of data changing on each client filesystem, Amanda constructs a plan of backup levels for each run; a full backup is level 0, and incremental backups are levels 1 through 9. Every filesystem gets at least one full backup within each backup cycle (an Amanda configuration parameter). As the figure shows, the amount of data backed up is almost the same in each run, resulting in consistent backup windows.

Figure 2. The amount of data backed up for each daily run

Amanda uses platform tools for backup, data compression, and encryption. It uses tools such as GNU tar, filesystem dump, and Schily tar for backup, and can use any such tool that is available on the client. Amanda stores data on the backup media in the tool's own format, which makes it possible to restore data without using Amanda at all. All Amanda backup media are readable with the dd and mt commands; in fact, the Amanda media file header contains the command to restore the media as a text string.

Here's how to use dd to display an Amanda tape file header:

# dd if=<amanda_media_dev> bs=32k count=1
AMANDA: FILE 20060228 natasha /boot  lev 1 comp N program /bin/gtar

To restore this backup, position the tape at the start of the file and run:

dd if=<tape> bs=32k skip=1 |   /bin/gtar -f...  -
1+0 records in
1+0 records out

Amanda can back up to tapes, disks, optical media, media changers, and RAIT. RAIT stands for redundant array of tapes--in fact, it works for any media volume. In concept it is similar to RAID: Amanda stripes the backup data across multiple media volumes and writes the parity information to another volume. RAIT with a two-volume set is the same as mirroring backup data across two media volumes; for example, you can back up to disk (for immediate recovery needs) and to tape (for archival needs) simultaneously.

Amanda allows a lot of configuration flexibility in specifying how to perform backups, down to the granularity of individual files. For each backup unit, users can specify the type of compression and encryption, whether to perform them on the Amanda client or the server, and which network to use for backups.

To reduce the complexity introduced by this flexibility, configuration tools are available for first-time Amanda users. Amanda supports both data consistency mechanisms for LAMP application backup: snapshots as well as replication. The Amanda server is scalable and can back up hundreds of LAMP application servers in each backup run. The commands that prepare the data for consistent backups can run as a pre-backup action before the Amanda backup. Because Amanda can use any platform command for backing up applications, it is easy to use different data protection mechanisms for different LAMP application servers in the same Amanda configuration.

Amanda developers are working on an application API that will make it easier to add new backup programs for applications. Some Amanda users back up LAMP application data, including MySQL databases, to a temporary directory in a filesystem; a later Amanda backup run then grabs that directory, as sketched below. This method allows quick recovery from the temporary filesystem as well as recovery from Amanda media for longer-term needs. Amanda users and developers use the Amanda wiki to document the project and ways of incorporating Amanda into overall IT processes.
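
For example, a nightly cron job on the LAMP server might dump the databases into a staging directory shortly before the Amanda run, and a disklist entry on the Amanda server backs that directory up like any other filesystem. This is a sketch: the host name, paths, and the comp-user-tar dumptype (from Amanda's sample configuration) are placeholders.

# Cron job on the LAMP application server:
mysqldump --opt --single-transaction --all-databases \
    > /var/backups/mysql/full_backup.sql

# Corresponding entry in the Amanda server's disklist file:
lamp1.example.com /var/backups/mysql comp-user-tar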

Recovery Process

To recover data from the backup archive, use the Amanda recovery tools: amrecover and amrestore. The amrecover tool lets users browse the index database and choose the backups to restore; any machine with the Amanda client software installed can use it. It is always advisable to restore backups to a temporary directory first. From there, you can copy application files to their correct locations on the LAMP application server, and restore MySQL and PostgreSQL databases from the mysqldump and pg_dump backup files, respectively. Remember to perform a full backup restoration only when the database and application processes are not running.

Here's how to perform a complete restoration of database database1 from the mysqldump backup file mysql_backup:

# mysql database1 < mysql_backup

Here's how to perform a complete restoration of a Postgres database from pg_dump using backup files starting with pg_backup:

# cat pg_backup.* | psql postgresdb

To restore a PostgreSQL database from the raw backup files, re-create the PostgreSQL cluster data directory from the restored files. It is advisable to make a copy of the existing contents before restoring files from the backup, because the pg_xlog subdirectory can contain WAL segments that were never archived. The recovery process uses the recovery.conf file in the cluster data directory; this file specifies the location of the archived WAL logs. Here is an example recovery.conf entry:

restore_command = 'cp /backup/postgres/WAL/%f %p'

Incremental Restores

You can perform incremental restores from the MySQL binary logfiles that Amanda restores to the temporary directory. An incremental restore can cover the range from a start time to an end time, or from a starting log position to an ending log position in the binary logs. The log positions can be used to skip over operator errors when replaying the backup files.

Here's an incremental restore of all MySQL database changes performed up to noon on May 1, 2006:

# mysqlbinlog --stop-datetime="2006-05-01 12:00:00" backup_logs.[0-9]* |
     mysql -u <user name> -p<password>
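
To replay only a range of log positions instead--for example, to stop just before an erroneous statement--mysqlbinlog accepts positions as well (a sketch; the logfile name and positions are placeholders):

# mysqlbinlog --start-position=4 --stop-position=7243 backup_logs.000042 |
     mysql -u <user name> -p<password>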

Perform PostgreSQL incremental restores by using settings in the recovery.conf file in the cluster data directory. You can specify an end timestamp or an end transaction ID for the recovery. When the postmaster process restarts, it uses the restore_command setting to retrieve the archived WAL segments up to the specified stop point--either the date/time stamp or the specific transaction ID.
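
For example, a recovery.conf that stops the replay at a particular point in time might look like this (a sketch; the timestamp and transaction ID are placeholders):

restore_command = 'cp /backup/postgres/WAL/%f %p'
recovery_target_time = '2006-05-01 12:00:00'
# Or stop at a specific transaction instead:
# recovery_target_xid = '1100842'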

It is critical to test the data recovery procedure for LAMP applications on a regular basis.

Backup Security

Security is important for any data protection process. The data must be secure on the backup media, which often means using encryption. Track the encryption keys and keep them in a secure location that is itself backed up. Amanda's ability to encrypt data either on the backup client where the LAMP application runs or on the Amanda server provides more options for securing the data. Use secure communication during the backup, with tools such as OpenSSH or through network tunneling.

The MySQL and PostgreSQL database backup and restore users should have only the minimum privileges required to back up and restore the database. For example, the minimum privileges for a backup user for the mysqlhotcopy and mysqldump tools are SELECT, RELOAD, and LOCK TABLES.
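
To create a MySQL backup account with just those privileges (a sketch; the account name and password are placeholders):

mysql> GRANT SELECT, RELOAD, LOCK TABLES ON *.*
    ->     TO 'backup'@'localhost' IDENTIFIED BY 'secret';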

Conclusions

Several open source data protection tools are available for LAMP applications, and they can help you create secure backups. The increased use of LAMP technologies for web applications, data warehousing applications, and enterprise applications makes it necessary to protect the customer data these applications use.

The many new and exciting open source developments in the data protection field--such as creating backups while maintaining database referential integrity, and using storage available over the internet as backup media--allow the creation of just the right LAMP backup solution for an individual site's needs.


Paddy Sreenivasan is the Vice President of Engineering and a co-founder of Zmanda, Inc.

