A few years back, when I joined a company, I saw that there was no proper documentation, no standard operating procedures (SOPs) were available, and the whole network was quite messed up. So I decided to document everything.
This document is a draft version I made. It outlines all the procedures and steps that should be taken when any disaster or downtime occurs on the network. [It is an incomplete version, as it was the only draft I was able to retrieve from my old email. I will try to complete it soon, and I will try to add my DRP for the Mini ISP and Cable.Net environment, so hang on.]
HERE WE GO . . . . . . . . .
NEW ALLIED ELECTRONICS INDUSTRIES (Pvt) Ltd. IT DEPT.
DISASTER RECOVERY PLAN ( D.R.P)
as a part of
BUSINESS CONTINUITY PLAN ( B.C.P )
NEW ALLIED ELECTRONICS INDUSTRIES (PVT) LTD.
The material below was not written by me; it was copied from a book. I don’t exactly remember the link, but I will add it soon. I only edited it and made it shorter, as required.
What is Disaster Recovery?
Disaster Recovery (DR) is, or should be, part of your Business Continuity plan. It is defined as the way of recovering from a disturbance to, or a destructive incident in, your daily network operations. In the context of Information Systems and Technology, this means that if an incident completely destroys data, slows down productivity, or causes any other major interruption of your operations or your business, the process of reverting to normal operations with minimum outage from that incident is called Business Continuity. Disaster Recovery is, or should be, a part of that process. You could say that Business Continuity and Disaster Recovery go hand in hand, but they do vary depending on the area and subject. For example, if your WAN connection goes offline, your business units can no longer communicate with each other via email or the internet, although each local unit can still operate and continue to work. This scenario would definitely be outlined in your Business Continuity Plan. However, if your server room burns down in one location, rebuilding the server room and the data housed in it would be Disaster Recovery.
Why is Disaster Recovery Needed?
A lot of people may ask themselves: “Why would we need a ‘guide’ for Disaster Recovery? If a Domain Controller (DC) has a critical failure, we just install another one.” This might seem to work at first, and even for a longer period in small organizations, but in the long run there would be problems and a lot of error messages. Correct recovery is crucial to ensure a stable Active Directory (AD) environment. The speed at which problems appear grows exponentially if there are multiple locations of various sizes across different time zones and countries.
Design Your Active Directory
In most corporations and large organizations, there are people with job titles such as “Network Architect”, “Windows Server Configuration Owner” or “Network Designer“. These people do not have these titles just for fun. In large organizations, there is an actual need for people whose sole purpose is to design or optimize the networking topology according to how technology progresses.
There are always new ways of doing things and new designs surfacing in the IT world, and those people need to stay on top of their respective fields.
Disaster Recovery for Active Directory
We have understood that DR is an important part of a Business Continuity plan. But now we can go further and say that DR for AD is only a part of a Disaster Recovery plan, and not the whole plan by itself. You are correct if you think that you should have different DR guides for different things.
It is important to take the standpoint that the person who performs the recovery has little or no knowledge of the system. If you roll out your own hardened and customized version of Windows 2003, some things might differ during the installation, and someone who has no clear guide will install a system that differs from your actual DC install guidelines. This can cause incompatibility or result in an improperly functioning system later on. This happens, say, when you have specific policies that are applied to DCs and, during the install process, the policies are selected in a manner different from what the DC policy dictates.
You might think that this situation will never arise, but Hurricane Katrina in the U.S. and the tsunami that struck Thailand, India, and other countries prove that it can. Situations may arise when a knowledgeable person is not around at the time of crisis, so the guide needs to be as clear as possible. It may also be possible that the person doing the actual recovery is an external IT consultant or a junior IT staff member because the senior and trained staff are not available. In this case, the person handling the recovery may not be familiar with your environment at all.
AD is a great system, but it is also very complex. Performing correct DR is therefore crucial. If AD forms a part of, or is the backbone of, your network and IT infrastructure, a proper guide to bringing it back online in the event of an incident needs to be as clear and concise as possible.
The Business Continuity plan, and the DR guides, especially the AD DR guides, should be practiced and tested at regular intervals. This effectively means that once a year or so, you need to test that your guides are working and that they will actually bring your business back online. In order to test all kinds of scenarios, building a test environment—preferably virtualized because it gives you much more flexibility such as rollbacks and snapshots—is a necessity.
It may be difficult to convince the top management that your systems could actually fail, but replicating your systems, or even just a crucial portion of your server infrastructure, and testing that would definitely be acceptable to them.
Documentation seems to be a problem in many companies and is usually the component in a project that is most often overlooked. Every time that either a new employee starts or an external contractor is hired for an AD related project, instead of getting a binder with proper documentation, he or she is assigned a buddy who explains the systems and infrastructure. Then, the first new task is to write the documentation that has been missing for the last X years. However, after the first week he or she realizes there is not enough information and when they ask for it, they get some vague pointers on where to look.
Unfortunately, the usual cycle is that documentation is left for later stages in the project and over time gets forgotten, or information is passed on by word of mouth or as a collection of links to websites instead. Over time, the missing or incomplete documentation becomes a costly burden to the organization, because knowledge is lost and, since it was never written down, is impossible to back up. The eventual creation of this documentation, which wouldn’t have taken that much time to begin with, is a lengthy and expensive process.
Documentation is not really that hard to do, but it can be hard to convince your project or program manager to allocate the extra time needed to complete it. Usually, the question will arise as to why this needs to be done now and cannot be done later. A good answer to this kind of questioning is to explain to him or her that at a later time the information will no longer be fresh and remembered, or that it is necessary for backtracking problems. I have found that both of these work very well and that managers will generally give you time to document properly. If, however, you don’t get the time, please make sure that you obtain written confirmation of the project or program manager’s acceptance that there is no way of knowing what has been done, and no time to write proper documentation.
Getting documentation done is actually quite easy. It comprises two steps, and once you have done this a couple of times, it will flow easily and you will produce documentation that your manager will actually be proud to show around.
First open notepad or any text editor and write, in short points, what you do, every step. In some cases, I just copy and paste the command, or the output, or both, into a line and keep going. Once you have completed the task, take a standard company template and format it into four sections. The outline is shown in the following table:
|Document part or section||Description|
|Presentation page||A plain page containing nothing but the title of the document, the department, and the name of the author. A version table at the bottom of the page is optional.|
|Index||A proper index table. This should be on its own page and will make it look more professional.|
|Purpose||This describes, in a short paragraph, what this document is about.|
|Content||All of the actions you took with detailed descriptions. Screenshots are a big winner here. Also make sure you separate different subjects with headers.|
If you write a document about what group policies you are currently applying, then any change needs to be reflected in that document for it to be up-to-date.
Documentation plays a big part in disaster recovery. Sitting with a freshly recovered domain, not knowing some of the settings that were applied earlier and that now prevent things from working, can cost you dearly; it may even cost you your job!
When writing your DR, please make sure that you have a printed copy in each location and at least one offsite copy per location. In some companies, it is standard practice for the domain or Enterprise admin at least to have a printed copy at home or on a USB key with him or her at all times. It is also good practice to have a printed copy or an electronic copy in the location’s safe so that it can be retrieved very quickly.
Write your documents regarding your infrastructure as clearly as possible, and do not make any assumptions about who will be reading the documents. It could very well be a summer worker or a trainee, although very often companies rely on professional DR-specialized companies. Some of these companies not only do regular, twice a year, complete DR in an isolated environment, but also sometimes provide you with warm sites to get your infrastructure back up and running more quickly. However, you never know what the disaster situation will be and if it is bad, you will want to ensure that everything possible is provided in the instructions.
Design and Implement a Disaster Recovery Plan for Your Organization
Implementing a Disaster Recovery guide in an organization that has never had one, or has had one that is outdated, may seem like an easy task. But it is not, as there are many hurdles that need to be overcome in the Disaster Recovery process. So, an accurate and proper method of implementation is very important. This chapter is designed to help you take that approach and get the whole process of Disaster Recovery implemented as fast as possible.
A lot of people assume that a Disaster Recovery guide (DRG) only needs to explain reasonably well what needs to be done to get systems back online. This is absolutely wrong. The first question that this assumption raises is: why would one only superficially touch the subject when writing a guide anyway? The second point is that one never knows who will do the actual recovery. This statement is something that quite a few administrators I know have smiled at. The most compelling counter-arguments, however, are that someone technical is always around and that a non-technical person is unlikely to perform the recovery.
While both arguments have their validity, the risk of a non-technical person restoring one of the mission-critical systems and clicking the wrong button in the process, is just too high. Even if it takes a few more hours to write a proper guide, it can save days during system recovery.
The key to a successful and well-implemented DRG is motivation. If there is no motivation from the management and no motivation from the actual technical personnel, then it is not possible to develop a well-implemented and functional DRG. The all-too-common problem, though, is that the motivation usually comes in the form of an incident where a DRG would have helped but was not available.
Create a Business Continuity Plan
Business Continuity Plans are, as mentioned earlier, high-level documents and procedures. These should always accompany Disaster Recovery guides. A BCP can be created for the Active Directory as well, and the sample in Appendix can help us get started. But in order to create one, we need to have a clear view of our infrastructure and what impact any outage has on our business. The key thing that needs to be done is to define the acceptable downtime and recovery time.
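One way to make "acceptable downtime" and "recovery time" concrete is to record both values per system and flag any system whose expected recovery takes longer than the business can tolerate. The following is only a minimal Python sketch: the system names, the hour values, and the names `RecoveryTarget` and `unachievable_targets` are hypothetical and are not taken from the original plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryTarget:
    """Downtime figures for one system (all values here are illustrative)."""
    system: str
    acceptable_downtime_hours: float  # longest outage the business accepts
    recovery_time_hours: float        # time the recovery procedure is expected to take

    def is_achievable(self) -> bool:
        # The target is realistic only if recovery fits inside the accepted outage.
        return self.recovery_time_hours <= self.acceptable_downtime_hours


# Hypothetical example figures; real numbers must come from the business.
TARGETS = [
    RecoveryTarget("Domain Controller", 4.0, 2.0),
    RecoveryTarget("Mail Server", 8.0, 6.0),
    RecoveryTarget("Internet Proxy", 2.0, 3.0),
]


def unachievable_targets(targets):
    """Return the names of systems whose recovery would exceed the accepted downtime."""
    return [t.system for t in targets if not t.is_achievable()]


print(unachievable_targets(TARGETS))
```

Any system reported by such a check needs either a faster recovery procedure or a renegotiated downtime figure before the BCP is finalized.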
The communications department should also be involved in this process so that the right communications channels and responsibilities are used and defined. Communications, within the company and with external entities, can be crucial in the event of a disaster if an organization has responsibilities to investors or is in collaboration with partners. Setting and defining the right channels and processes for company personnel helps to mitigate the outage because users will then know that there is an issue and that the IT department is working on it. They won’t bombard you with phone calls complaining that they cannot work properly.
The second important thing, though no less critical, is to define a call tree. We need to have a complete contact list and an escalation path clearly defined in our BCP. The communications department also needs to be involved in this.
The call tree is a diagram with different levels of escalation, with the responsible person and phone number listed. With this, it is easy for someone to follow the chain of command and understand who needs to give the go-ahead for a certain action.
The following diagram shows the call tree for LG N.A.E as an example: [image not available now]
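A call tree can also be kept as plain data, so that the chain of command can be looked up without the diagram. The sketch below is an assumption-laden illustration in Python: the role names, the escalation order, and the `<phone number>` placeholders are hypothetical and do not reproduce the actual LG N.A.E call tree.

```python
# Each role maps to the role it escalates to; None marks the top of the tree.
# Roles and their order are illustrative assumptions, not the real call tree.
ESCALATES_TO = {
    "IT Administrator": "IT Manager",
    "IT Engineer": "IT Manager",
    "IT Manager": "Director",
    "Director": None,
}

# Placeholder contact details; a real call tree lists office and emergency numbers.
CONTACTS = {role: "<phone number>" for role in ESCALATES_TO}


def escalation_path(role):
    """Return the chain of roles to call, starting at `role` and ending at the top."""
    path = []
    current = role
    while current is not None:
        if current not in ESCALATES_TO:
            raise KeyError(f"unknown role: {current}")
        if current in path:
            raise ValueError(f"escalation loop detected at: {current}")
        path.append(current)
        current = ESCALATES_TO[current]
    return path


for step in escalation_path("IT Administrator"):
    print(step, CONTACTS[step])
```

Keeping the tree as data makes it easy to print a fresh copy whenever a role or phone number changes.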
During an outage or disaster, the communications department should take responsibility for communicating the issue to the entire workforce, and not just the technical staff. For example, the information bulletin could state that the IT department is aware of the problem and is working on solving it, and also give a rough estimate of the time within which the problem is expected to be fixed and normal operations resumed.
The BCP needs to be clearly understandable and well written, because in the event of a disaster, confusing instructions can hardly be helpful. Once the final draft is ready, it would be best to have the communications department or technical writer(s) go over it to ensure an easily-readable yet professional-looking BCP.
Present it to the Management (Part 1 and 2)
This is a step that should be done by someone who has good presentation skills and an in-depth knowledge of the BCP that was designed. It is also a “two-part step” because the project has to get started before the final draft can be approved. In order to clear this process with the management, the importance and the consequences of the BCP have to be communicated to them in a non-threatening manner.
Often, people who were deeply involved in the design of the BCP and the DRP failed in making it official due to their lack of presentation skills and “social connectivity”. Explaining in detail what we are trying to achieve and why it is crucial for the organization is essential. Once the process has been cleared and has received the go-ahead for creation of the BCP, we must proceed to the next step, and then come back to this step later.
Ultimately, it is in the best interest of the organization to have a proper DRP. Obtaining management clearance, and therefore being able to make the BCP and DRP an official standard in the organization, can open a lot of doors for you in the acceptance department. Whenever you hear complaints regarding the implementation, or disagreements in terms of content or testing, you can point to the directive and say: “take your complaints up to the next level“. Nine times out of ten, the discussion ends at that point.
Define Roles and Responsibilities
This step is an important one because the people who have been delegated responsibilities are also accountable for them. This might not be what some people want, so the roles and responsibilities have to be discussed with the staff to ensure that they understand the implications of them.
A clear list of contacts and their roles in the BCP and DRG should be drawn up. This is not a step to be rushed. Make sure that everyone involved, including the managers, knows what they are supposed to be doing when push comes to shove.
Also important here is the on-call role. Someone from the IT department should always be contactable. Rotation of this role, as well as adequate compensation for this duty, needs to be clearly defined. The on-call person needs to have a clear understanding of what steps to take when something happens, and how he or she can determine whether this needs to be escalated or not.
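A rotation is easiest to define clearly when it can be computed rather than remembered. The following Python sketch assumes a simple weekly round-robin; the staff names, the start date, and the function name `on_call_for` are hypothetical, and a real rota would also have to handle holidays and swaps.

```python
from datetime import date

# Hypothetical rota: names and start date are placeholders.
ROTA_START = date(2024, 1, 1)  # a Monday, so each shift runs Monday to Sunday
STAFF = ["Administrator A", "Administrator B", "Administrator C"]


def on_call_for(day, staff=STAFF, start=ROTA_START):
    """Return who is on call on `day`, rotating one person per week."""
    if not staff:
        raise ValueError("the on-call rota needs at least one person")
    weeks_elapsed = (day - start).days // 7
    return staff[weeks_elapsed % len(staff)]


print(on_call_for(date(2024, 1, 10)))
```

Counting weeks from a fixed start date, rather than using the calendar week number, keeps the rotation even across year boundaries.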
Once everyone is on board and clear about their responsibilities, we need to put this into a visual form, a call tree. Many people, especially a lot of technical staff, complain about presenting things visually. However, a lot of professionals agree that a visual representation helps immensely in understanding a process. When you then read the text regarding that representation, you will most likely understand and memorize the process steps more easily.
To get a clear picture of what roles and responsibilities should be included in the BCP of LG N.A.E, see the following table. This example gives an overview of who should be included.
|Director Information Officer||Office phone and emergency number|
|IT Manager||Office phone and emergency number|
|IT Engineer / Designer(s)||Office phone and emergency number|
|IT Administrator(s)||Office phone and emergency number|
|Branch Technicians or Specialist(s)||Office phone and emergency number|
|Branch System Specialist(s)||Office phone and emergency number|
|Internal Communications||Office phone and emergency number|
|External Communications||Office phone and emergency number|
Ensure that Everyone is Aware of Locations of the DRP
This has happened twice in companies that I worked with, such as Fariya Networks. They had invested a lot of money in a DRP process and tested it once. They passed with flying colors, but the man in charge (in this example, me) subsequently left the company. The DRP was put on ice because no one took responsibility and, even worse, the whole plan got “lost”.
At some places, when I first asked for the BCP and DRP, I got a blank face and the answer: “Well, we have it somewhere”. Eventually, someone dug up a draft version from their archived inbox. After two weeks of searching, I found the actual plan in an obscure and forgotten place on their intranet. Not really a good thing.
Please make sure that the location of the DRP is well known. Make a section in your IT pages in your intranet, print it out, and hand it to everyone, and always mail the latest version to the people involved. An off-site, updated copy of the DRP and all its related documents, along with copies of software that is running in your organization, is absolutely critical. The process of keeping the DRP off-site in printed form and possibly also in electronic form is likely going to be an enormous time and money saver. This way, many copies will be around in case of an emergency.
Define the Order of Restoration for Different Systems (Internet Servers / Domain Controller / ADC / Mail Servers, then Add One Server, etc.)
The contents to be recovered and their order of recovery should be clearly defined in the DRP and the BCP. (This means first the root DC in the hub site, then the first Domain Controller, then the second, then one at a regional site, and so on.) Also, to ensure internet connectivity, you must have backup lines and proxy servers ready.
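The restoration order described above can be written down as "what must be restored before what" and then turned into a sequence automatically. This is a minimal Python sketch using the standard library's `graphlib` module (Python 3.9 or later). The DC chain follows the order given in the text; the assumption that the mail server depends only on the root DC, and the names `DEPENDS_ON` and `restore_order`, are illustrative.

```python
from graphlib import TopologicalSorter

# Maps each system to the systems that must be restored before it.
# The mail server dependency is an assumption made for this example.
DEPENDS_ON = {
    "first DC": {"root DC (hub site)"},
    "second DC": {"first DC"},
    "regional site DC": {"second DC"},
    "mail server": {"root DC (hub site)"},
}


def restore_order(depends_on):
    """Return a restoration sequence in which every prerequisite comes first."""
    # static_order() raises graphlib.CycleError if the dependencies loop.
    return list(TopologicalSorter(depends_on).static_order())


for position, system in enumerate(restore_order(DEPENDS_ON), start=1):
    print(position, system)
```

Systems that do not depend on each other, such as the mail server and the later DCs in this example, may appear in any relative order, so the DRP should still state explicitly which of them takes priority.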
Go back to “Presentation to Management”
This is the final step. Once everything is implemented, documented, and tested, go back to the management and tell them that the task is complete. Show them numbers for recovery times, pie charts of possibilities, and maximum outage numbers. Once they are convinced that money was not wasted, get it all approved and standardized.
You should be well known by then as “the man” for disaster recovery and your job, in case of an emergency, just got much, much easier.
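The recovery-time figures for such a presentation can be summarized from the results of the DR tests. The sketch below is a hypothetical Python example: the system names and the measured hours are made up for illustration, and `summarize_recovery_times` is not part of any existing tool.

```python
from statistics import mean

# Hypothetical recovery times, in hours, measured during DR test runs.
MEASURED_HOURS = {
    "Domain Controller": [2.0, 3.0, 2.5],
    "Mail Server": [6.0, 4.0],
}


def summarize_recovery_times(measured):
    """Return the average and the worst (maximum) recovery time per system."""
    return {
        system: {"average_hours": mean(times), "maximum_hours": max(times)}
        for system, times in measured.items()
    }


for system, figures in summarize_recovery_times(MEASURED_HOURS).items():
    print(system, figures)
```

The maximum figure is usually the one management cares about, since it shows the worst outage observed during testing.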
In this presentation, we went through all the steps and processes required to get a DRP implemented successfully. Knowing the correct processes, even if they seem strange and out of place, and then applying them can save a lot of additional work, and possibly your job.
If you have a trained team and a plan that illustrates every step of the way, your downtime will be minimal and if the downtime is caused by something that you had no control over, such as a natural disaster or someone with a screwdriver in the wrong place, then your management and your company will know what they invested the time, effort and money into.
This is by no means a complete guide to implementing a DRP, but it should definitely point you in the right direction and take you a good way there.