People
People are the most important part of the early stages of your disaster recovery plan because of the effort that goes into planning, assessing, writing processes, selecting the technology and testing it.
The amount of money you put toward technology will determine the level of people resources needed to monitor, fail over and fail back. Once the initial stages have been fully researched and a process selected, that need should move in the opposite direction from your technology spending: high if you use a low-tech solution, low if you employ a more high-tech solution. The primary reason is the level of automation and monitoring that is usually part of the solution. Unfortunately, a high-tech solution often requires more experienced people to implement it. In any case, a failover situation is not something to be taken lightly, so a high level of expertise needs to exist regardless of the solution.
Keep in mind that a less automated approach means more people time is spent on the failover process and much more on the failback process. Also, a more complex failover, such as a multi-database or multi-server failure, may require more people. One person can only do so much, and accounting for that scenario when you implement your plan will pay off.
Process
There are several procedures and processes you should address and document in your DR plan. Writing it all down is very time-consuming for your people, but it is very valuable in a time of crisis. Documenting the procedures once does not complete the task: as your environment changes, you must continue to update them.
Here are some items to include in your process:
- Plan and assess your needs and budget.
- Create an escalation list of staff members to contact at critical points of failure.
- Create a priority list of your servers, possibly down to the database level, ranked so you know which ones to address first in the case of a multi-component failure.
- Establish SLAs with your user community that provide realistic guidelines for when a system goes down. For instance, how much time will it take to recover based on your needs and budget? You should do that across all of your servers because if there is a widespread problem, multiple groups will probably be involved.
- Define roles and responsibilities so it is clear who is responsible for what. When a crisis does arise, there should be no arguing over who handles any particular issue.
- Create an audit/change log of all servers so you can go back and see what items have been updated at a server and database level.
- Break down failover procedures based on your technology solution into the following groups:
  - One database
  - Multiple databases
  - Entire instance
  - Entire server
  - Grouped application servers -- applications that work as a team (Web servers, app servers, database servers, etc.)
  - Entire datacenter
- Break down failback procedures based on your technology solution into the same groups as the failover procedures.
- Assess testing procedures: how often, what is involved and what actually constitutes a valid test.
- Schedule a DR review process, which may be once a quarter or once a year, at which time you should assess your plan to ensure it still meets your overall business needs.
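The priority list, SLA and escalation items above can be kept as structured data rather than prose, so that the recovery order for a multi-component failure is looked up instead of debated. The following is only a minimal sketch: the component names, recovery targets, contacts and the `recovery_order` helper are all hypothetical and not part of any product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str         # hypothetical server or database name
    priority: int     # 1 = recover first
    rto_minutes: int  # hypothetical SLA recovery target
    contact: str      # hypothetical first entry on the escalation list


def recovery_order(failed, inventory):
    """Return the failed components sorted by priority rank, then by tightest SLA."""
    by_name = {c.name: c for c in inventory}
    unknown = [n for n in failed if n not in by_name]
    if unknown:
        # A component missing from the plan is itself a finding for the DR review.
        raise KeyError(f"not in DR inventory: {unknown}")
    return sorted((by_name[n] for n in failed),
                  key=lambda c: (c.priority, c.rto_minutes))


# Illustrative inventory only; real entries come from your own assessment.
INVENTORY = [
    Component("orders-db", 1, 30, "dba-oncall"),
    Component("reporting-db", 3, 480, "bi-team"),
    Component("web-01", 2, 60, "ops-oncall"),
]

if __name__ == "__main__":
    for c in recovery_order(["reporting-db", "web-01", "orders-db"], INVENTORY):
        print(f"{c.priority}. {c.name}: target {c.rto_minutes} min, call {c.contact}")
```

Raising an error for a component that is not in the inventory is a deliberate choice here: it surfaces gaps in the documented plan during testing rather than during a real failure.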
Technology
After the cost of people who will plan, implement and monitor your failover solution, the technology factor is the most expensive component. Just how expensive it is depends on the solution you select for failover. There are several ways of handling failover, from the simple to complex and from inexpensive to very expensive.
The following table lists the different options that are available:
Solution | Cost | Complexity | Failover | Failback |
---|---|---|---|---|
Hardware Clustering | High | High | Fast | Fast |
Software Clustering | High | High | Fast | Fast |
Replication | Medium | Medium | Medium with manual processing | Slow with manual processing |
Continuous Data Protection | Medium | Medium | Medium | Slow |
Log Shipping | Low | Low | Medium | Slow |
Backup and Restore | Low | Low | Slow | Slow |
Database Mirroring | Low | Low | Fast, but only at the database level | Fast, but only at the database level |
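One way to use the table is to filter it by the budget and failover speed your assessment calls for. The sketch below copies the ratings from the table and shortlists matching options; the `shortlist` function and the numeric rankings are assumptions made for illustration, and the qualifiers in the table ("with manual processing", "only at the database level") are dropped for simplicity, so check them against the table before deciding.

```python
# Ordinal rankings assumed for comparison purposes only.
RANK = {"Low": 1, "Medium": 2, "High": 3}
SPEED = {"Slow": 1, "Medium": 2, "Fast": 3}

# (solution, cost, complexity, failover, failback), copied from the table.
OPTIONS = [
    ("Hardware Clustering", "High", "High", "Fast", "Fast"),
    ("Software Clustering", "High", "High", "Fast", "Fast"),
    ("Replication", "Medium", "Medium", "Medium", "Slow"),
    ("Continuous Data Protection", "Medium", "Medium", "Medium", "Slow"),
    ("Log Shipping", "Low", "Low", "Medium", "Slow"),
    ("Backup and Restore", "Low", "Low", "Slow", "Slow"),
    ("Database Mirroring", "Low", "Low", "Fast", "Fast"),
]


def shortlist(max_cost, min_failover):
    """Return solutions at or below max_cost whose failover is at least min_failover."""
    return [
        name
        for name, cost, _complexity, failover, _failback in OPTIONS
        if RANK[cost] <= RANK[max_cost] and SPEED[failover] >= SPEED[min_failover]
    ]


if __name__ == "__main__":
    print(shortlist("Low", "Medium"))
```

With a low budget and a requirement of at least medium failover speed, this leaves Log Shipping and Database Mirroring as candidates to evaluate further.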
Summary
Before you can have an effective disaster recovery plan, assess your needs and the budget you can allocate toward your failover solution. Based on those needs, select the appropriate technology solution and begin to wrap your processes around it. Be sure your staff is properly trained on the technology so implementation is done correctly. They must understand the failover process and also how to fail back to your primary servers when needed. Take the time to understand the true business need and then develop your strategy to meet it.