Virtual machines (VMs), due to their encapsulation and standardization of virtual hardware components, enable quick recovery on a diverse set of hardware platforms. This means quicker time to recovery and a significant drop in cost for implementation of a disaster recovery (DR) plan.
In increasingly complex and heterogeneous server environments, the time and cost of disaster recovery planning have climbed. Virtualizing servers and storage increases flexibility, fault tolerance, and ease of recovery in the event of a disaster. This article covers how server virtualization lowers costs, shortens recovery time, and reduces system-administrator stress.
There are currently several implementations of virtual machine technology. This article addresses the VM technology that virtualizes the underlying hardware and provides a virtual machine monitor (VMM) to handle allocating resources. For this type of VM technology, each VM sees exactly the same hardware. This is true even if the underlying hardware of the system running the VMs changes.
What does this mean for disaster recovery? It guarantees that if the VM powers up in a different location (a cold or hot DR site) where the hardware underlying the virtual machine server is different, the VM will need no configuration or plug-and-play changes. The only potential changes will be to IP addressing and subnetting.
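As a rough illustration of the one change that may still be needed, the sketch below remaps a VM's address from a production subnet to the same host offset in a DR subnet. It uses only Python's standard ipaddress module. The function name and the example subnets are hypothetical and do not come from any VMware tool; a real DR plan might assign addresses quite differently.

```python
import ipaddress


def remap_to_dr_subnet(address: str, prod_net: str, dr_net: str) -> str:
    """Return the address in dr_net that has the same host offset as
    `address` has in prod_net. Hypothetical helper for illustration only."""
    prod = ipaddress.ip_network(prod_net)
    dr = ipaddress.ip_network(dr_net)
    host = ipaddress.ip_address(address)
    if host not in prod:
        raise ValueError(f"{address} is not in {prod_net}")
    if prod.prefixlen != dr.prefixlen:
        raise ValueError("production and DR subnets must be the same size")
    # Offset of the host from the start of the production subnet.
    offset = int(host) - int(prod.network_address)
    return str(dr.network_address + offset)


# Example subnets are assumptions, not values from the article.
print(remap_to_dr_subnet("10.1.20.15", "10.1.20.0/24", "192.168.50.0/24"))
# prints 192.168.50.15
```

Keeping the host offset constant is only one possible convention; it has the advantage that a single rule covers every VM in the subnet.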
Two components affect the time to recovery (TTR). The first is the setup of a virtual infrastructure at the hot/cold recovery site. This includes setup of the virtual infrastructure servers, storage devices (local/NAS/SAN), and networking.
The second is recovery after a disaster. This includes loading the VM disk images on the storage devices and the configurations on the virtual infrastructure servers. Typical restore time for a 10GB system using virtual disks, after loading the images on the storage devices, is less than five minutes. This window includes registration of the system and powering on the VM.
For this second component, the time to place the data on storage at the recovery site depends on several factors:
For all three items, the amount of time is the same as for physical machines. After recovering the data, the virtualization technology actually speeds the recovery. Here are the typical recovery steps after data recovery ends:
With the exception of potential IP addressing and VLAN changes, there are no VM changes. You don't need to change Linux's lilo.conf or Microsoft's boot.ini or registry settings when performing a DR process in a virtual infrastructure.
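To make the recovery phase of TTR concrete, here is a minimal back-of-the-envelope sketch. It adds the time to place the disk images on storage to the time to register and power on each VM. The five-minute figure is the per-system upper bound quoted above; the restore rate and the assumption that VMs are brought up one after another are hypothetical, so treat the result as a rough ceiling rather than a measurement.

```python
def estimate_ttr_minutes(total_image_gb, restore_rate_gb_per_min,
                         vm_count, minutes_per_vm=5.0):
    """Rough recovery-phase estimate: data restore plus per-VM bring-up.

    minutes_per_vm defaults to the article's "less than five minutes"
    bound for registering and powering on one VM. VMs are assumed to be
    brought up sequentially, which is a pessimistic assumption.
    """
    if restore_rate_gb_per_min <= 0:
        raise ValueError("restore rate must be positive")
    data_minutes = total_image_gb / restore_rate_gb_per_min
    bring_up_minutes = vm_count * minutes_per_vm
    return data_minutes + bring_up_minutes


# Ten 10GB VMs restored at a hypothetical 2 GB per minute.
print(estimate_ttr_minutes(total_image_gb=100, restore_rate_gb_per_min=2,
                           vm_count=10))
# prints 100.0
```

The data-restore term is the same for physical and virtual machines; the saving comes from the second term, which replaces hardware reconfiguration with registration and power-on.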
The quickest and costliest recovery method uses a snapshot or mirroring setup between your main data site and your recovery site, typically between two SAN systems. This is common at larger sites with multiple data centers in different geographic locations, where each corporate data site snapshots or mirrors its data to the other.
Many hot/cold sites will charge a fee based on the level of service that you require.
Typically, recovery sites operate on a "first come, first served" basis. This can mean an unacceptable wait time, depending on your needs. When you weigh the level of service that your DR plan needs against the budget that outsourced DR site solutions require, the savings from virtualization can be substantial, even if you use outsourced DR sites.
There are several factors common to service-level agreements (SLAs). Consider the following in relation to your virtual infrastructure:
There are several good virtual machine implementations available, including:
I work for VMware, so I'll present examples from my experience.
Let's take an example of a corporate site and a recovery site, comparing physical and virtual implementations for servers. For this discussion, we'll assume that the recovery site requires the customer to purchase or provide the recovery hardware and that this has happened before a disaster occurs.
For a very simple case, let's consider 10 servers in a physical environment at a corporate data center (site P1) with a recovery to 10 servers at a recovery site (site R1). For this implementation, we need 10 servers at the recovery site.
When performing recovery, we have two options: bare-metal restore or typical tape-backup restore. In either case, we must address OS types (which may require different software), plug-and-play issues (if the recovery site hardware is different), and re-licensing issues.
The recovery site costs roughly as much as the original site, and recovery has two components: data recovery (typically from tape) and hardware reconfiguration.
If we take the same number of servers (10 VMs) running on one VMware ESX Server at the corporate data center (site P2), we now require one server at the remote site (site R2). For cost comparisons, there is a significantly smaller budget requirement for the recovery phase than the physical implementation above. In this case, there is only one physical machine to set up and install.
For performing recovery, we need only the virtual disks for each machine and the virtual configuration file for each machine. We don't need an operating system, backup/restore software, or physical configuration.
After restoring the data, we need only register each VM and power it on. Even without going into detail, the cost for the recovery site is significantly lower than for a comparable physical implementation: cost(P1+R1) > cost(P2+R2).
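The inequality can be illustrated with a small worked example. The server counts match the scenario above (10 physical servers per site versus one ESX Server host per site). The per-server prices are placeholders invented for illustration, not figures from any vendor, and the sketch ignores licensing, storage, networking, and labor.

```python
def deployment_cost(primary_servers, recovery_servers, cost_per_server):
    """Hardware cost of a primary site plus its recovery site."""
    return (primary_servers + recovery_servers) * cost_per_server


# Hypothetical prices: a commodity server and a larger host able to run 10 VMs.
PHYSICAL_SERVER_COST = 5_000
VM_HOST_COST = 20_000

cost_p1_r1 = deployment_cost(10, 10, PHYSICAL_SERVER_COST)  # sites P1 + R1
cost_p2_r2 = deployment_cost(1, 1, VM_HOST_COST)            # sites P2 + R2

print(cost_p1_r1, cost_p2_r2, cost_p1_r1 > cost_p2_r2)
# prints 100000 40000 True
```

Under these assumed prices, the inequality holds as long as one VM host costs less than the 10 physical servers it replaces.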
VMware offers other key items that increase the DR capabilities of its virtualization products. The first is the VMware VirtualCenter server, which can manage all of the VMs in an ESX Server and GSX Server environment. It can identify the current capacity of each server and the types of VMs that drive CPU utilization to its peaks. This utilization figure differs from what you'll see from within a VM: VirtualCenter shows actual utilization rather than what the VM thinks it is using.
The second item is the VMware VMotion software, which allows you to move a running VM from a heavily loaded ESX Server to a less heavily loaded one. This enables on-demand computing to maximize resources, allowing systems to migrate over high-speed (typically gigabit) network connections, including certain categories of wide area networks (WANs).
The third item is a technology that VMware implements with an Application Programming Interface (API). This allows you to add a write cache to a VM disk (called a REDO log) so that the disk is quiescent and the write cache receives all of the changing data to the disk. In effect, this allows you to take snapshots of a running VM that you can then transfer to a remote system or remote recovery site without bringing down the original production system.
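The idea behind a REDO log can be sketched with a toy model. This is not VMware's API or on-disk format; the class and method names are hypothetical, and the commit step is an assumption about how the cached writes are eventually folded back into the disk. The sketch only shows why the base disk stays unchanged, and therefore safe to copy, while the VM keeps writing.

```python
class RedoLogDisk:
    """Toy model of a virtual disk with an optional write cache (REDO log)."""

    def __init__(self, base_blocks):
        self.base = dict(base_blocks)  # block number -> contents
        self.redo = None               # None means no REDO log is attached

    def add_redo_log(self):
        if self.redo is not None:
            raise RuntimeError("a REDO log is already attached")
        self.redo = {}

    def write(self, block, data):
        # With a REDO log attached, writes never touch the base disk.
        if self.redo is not None:
            self.redo[block] = data
        else:
            self.base[block] = data

    def read(self, block):
        # The newest data wins: check the REDO log before the base disk.
        if self.redo is not None and block in self.redo:
            return self.redo[block]
        return self.base[block]

    def copy_base(self):
        """Copy the quiescent base disk, e.g. to ship to a recovery site."""
        if self.redo is None:
            raise RuntimeError("attach a REDO log first so the base is quiescent")
        return dict(self.base)

    def commit_redo_log(self):
        """Fold the cached writes into the base disk and detach the log."""
        if self.redo is None:
            raise RuntimeError("no REDO log is attached")
        self.base.update(self.redo)
        self.redo = None


disk = RedoLogDisk({0: b"boot", 1: b"data-v1"})
disk.add_redo_log()
disk.write(1, b"data-v2")      # the running VM keeps writing
snapshot = disk.copy_base()    # the copy still sees the old contents
print(snapshot[1], disk.read(1))
# prints b'data-v1' b'data-v2'
disk.commit_redo_log()
```

The running VM always reads its latest data, while the copy taken for the recovery site reflects the disk as it was when the REDO log was added.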
The fourth item is the VMware P2V software, which enables you to convert a physical machine (running a specific operating system) to a virtual machine. This updates the HAL, device drivers, and other key components to create a virtual bootable replica of the physical machine. The P2V software can help to prevent data loss on soon-to-fail hardware.
We've set up several businesses with these systems. For example, Oak Associates, an independent equity investment manager for institutions and individuals, has used virtual infrastructure to improve their DR capabilities. This technology helped them to lower costs, speed recovery, and simplify their DR implementation. Here are some excerpts from the case study.
Scott Hill, senior technology officer for Oak Associates, says, "If something were to happen to this site, we'd just go to our disaster recovery site, and all of our VMware host machines are already up and running. With ordinary backups, you have the data, but it can take a long time to restore it and make it usable. With VMware, it's easy to shut down a machine and make a copy of it. I can back up the whole machine in three minutes."
Jeff Szastak, technology officer for Oak Associates, said using VMware has dramatically reduced hardware costs, saving the company from buying more servers for the disaster recovery site. "For our original disaster recovery system using a SAN, we were doing everything at both sites," Szastak says. "We were replicating everything, which means that if I purchased a brand new server for the disaster recovery site, it didn't actually run anything unless there was a failure. We were purchasing twice as much equipment as we really needed. With VMware, I don't have to have the same equipment there. I can use my existing server hardware there and bring in newer, higher-performing servers here."
John Y. Arrasjid is currently a senior member of the VMware Professional Services Organization as a consulting architect.
Copyright © 2009 O'Reilly Media, Inc.