Monday, October 15, 2012

VMware Site Recovery Manager - Accessing Test Network during Disaster Recovery Drills!

Like most of my blog posts, this topic came from a question raised by one of my customers during an SRM Plan & Design engagement. I did not find a blog or a document that covers it, so I thought of documenting it on vXpress to help the community use this solution if they face a similar situation.

Well, the topic is pretty much self-explanatory; however, let me go ahead and dissect it for those who are wondering what a TEST Network is with respect to SRM.

The most popular feature of VMware SRM is that it allows you to perform DR drills using the Test Recovery option, which lets you test your DR-site Virtual Machines, Applications, Networks and the Workflows which you define in the Recovery Plans. These Recovery Plans are created during the configuration of SRM; they are modern-day DR run-books which execute as soon as you run a Test Recovery or an Actual Recovery from the SRM Console. Let's look at the difference between the TEST & RECOVERY buttons highlighted in RED in the screenshot below:-

Once you have created a Recovery Plan, which defines the workflow that needs to be executed when you press either of those buttons, you are ready to perform either a

a) Test Drill - Just a test of your DR site virtual machines, to see if your DR solution is actually working. In this process, the Production Virtual Machines keep running on the Primary Site, while copies of these machines on the DR site are mounted on the ESXi servers and powered on in a snapshot mode (this snapshot is deleted when you clean up the test recovery, so no changes made to the DR VMs during testing are saved). The replication of data, whether Storage Based or vSphere Replication (host based), is not impacted by this Test Drill at any time.

b) Recovery - This button, if used, means you actually had a bad day at the office... It means you met a disaster and finally decided that your production site is Down (due to a fire, power outage, earthquake, flood etc). Once you press this button and agree to the warnings, you force the DR machines to power on based on your Recovery Plan and start operations from your DR Site.

Now there is a minor difference between the two cases. In case of Recovery your primary VMs are down, hence you power on your secondary VMs to continue business operations. The Secondary Site network can be an extended network from the Primary site or can be a different subnet as well. You would not have duplicate Host Name or IP conflicts since the primary machines are DOWN.

In case of the Test Drill, since the Primary machines are still UP, you power on the DR machines in an ISOLATED TEST NETWORK. This can be created either by choosing the AUTO option while defining DR and Test Networks in the Recovery Plan, or by provisioning an ISOLATED VLAN with IP addresses which can be assigned to these test machines so that testing can be performed.
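The difference between the two modes can be sketched as a small lookup. This is only a toy model to illustrate the reasoning, not SRM's API: the names `NetworkMapping`, `network_for`, the `bubble-` prefix and the port group names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NetworkMapping:
    """Hypothetical model of one network mapping in a Recovery Plan."""
    recovery_network: str        # port group used for an actual Recovery
    test_network: str = "AUTO"   # "AUTO" = let the tool create an isolated bubble network


def network_for(mapping: NetworkMapping, mode: str, plan_name: str = "plan") -> str:
    """Return the network a DR VM would be attached to for the given mode."""
    if mode == "recovery":
        # Primary VMs are down, so the DR VMs can safely join the real DR network.
        return mapping.recovery_network
    if mode == "test":
        # Primary VMs are still up, so the DR copies must stay isolated.
        if mapping.test_network == "AUTO":
            return f"bubble-{plan_name}"   # illustrative name for an auto-created network
        return mapping.test_network        # pre-provisioned isolated VLAN
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    auto = NetworkMapping(recovery_network="DR-LAN")
    vlan = NetworkMapping(recovery_network="DR-LAN", test_network="VLAN-999-TEST")
    print(network_for(auto, "recovery"))
    print(network_for(auto, "test", "plan-a"))
    print(network_for(vlan, "test"))
```

The point of the sketch is simply that the same VM lands on a different network depending on which button was pressed, and that a Test never resolves to the production-facing DR network.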

So far I hope it was easy to understand and implement...

Now, since the product has this capability of Test DR Drills, you would want to test your Recovery Plans, which include Virtual Machines, Operating Systems, Data, VM Interoperability etc, all of which can be powered on in a bubble environment and tested as and when needed. This can be done even when your production is up and running, so this is COOL. However, you need to understand that every element you need for the test must be a part of this ISOLATED network; anything outside this network cannot be tested or brought into this trust zone, in order to avoid DNS conflicts which could lead to data loss/corruption etc. For example, if you are testing a 3 tier application which has a Web VM (virtualized and protected via SRM), an application VM (virtualized and protected via SRM), and a database which is PHYSICAL and is not protected via SRM, then you cannot really test the application completely, as the physical database cannot run in the Test Mode like VMware Virtual Machines.

Even if you have the capability in your database to run in a snapshot mode, it is not recommended to include that DB in your Test environment unless you are changing the DB networking to the isolated Test Network. Do not create any routes between your Test network and LAN, as this can cause trouble which is irreparable.
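A quick sanity check for the "no routes to the LAN" rule could look like the sketch below. It assumes you can export the list of route destinations configured for the test segment as CIDR strings; the function name and the sample subnets are made up for illustration.

```python
import ipaddress


def routes_leaking_to_lan(test_routes, lan_subnets):
    """Return the test-network routes whose destination overlaps any LAN subnet.

    test_routes: iterable of CIDR strings configured on the isolated test segment
    lan_subnets: iterable of CIDR strings for the production / DR LAN
    """
    lans = [ipaddress.ip_network(s) for s in lan_subnets]
    leaks = []
    for dest in test_routes:
        dest_net = ipaddress.ip_network(dest)
        # A default route (0.0.0.0/0) overlaps everything and is flagged too.
        if any(dest_net.overlaps(lan) for lan in lans):
            leaks.append(dest)
    return leaks


if __name__ == "__main__":
    # Hypothetical addressing: the test VLAN is 192.168.50.0/24, the LAN is 10.20.30.0/24.
    leaks = routes_leaking_to_lan(
        test_routes=["192.168.50.0/24", "10.20.0.0/16"],
        lan_subnets=["10.20.30.0/24"],
    )
    print("Routes that would break isolation:", leaks)
```

An empty result is what you want to see before letting anyone power on test VMs.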

Phewwww.... Alright, now that you will follow the right rules, let's talk about accessing this test network. Let's say you are able to test these Applications, VM instances etc and you want your testers to access this environment from your Primary Site (in most cases this is where the application teams, users etc would be sitting). You have a couple of options here:-

a) Jumpstart Terminal Server - You can provision a W2K8 R2 VM on the DR site with an RDS (aka terminal server) license and allow your testers to access this machine and use the web browser to access the application. This VM can be used without a Terminal Server License if you do not need multiple Testers to access this VM via RDP. This VM would be provisioned with 2 vNICs: one connected to your isolated TEST Network port group and the other to your DR Site LAN. Needless to say, your Primary site users should have access to the secondary site LAN via MPLS cloud etc.

b) VMware View Desktops - VDI is another way of making this possible, since you can provision desktops in this network PG, create a separate pool for DR testers and allow them to connect when needed.

c) vSphere Client Access - You can allow the Testers to log in to the DR site vCenter with limited access; they can then directly launch the console of the Test Virtual Machines and play around. This should be very well planned and tested to avoid any unauthorized access.

d) VMRC Weblink - You can generate a Virtual Machine Remote Console weblink and give it to the Testers to use in case they need to; however, this will also give them direct access to the virtual machine files and data, which you may or may not want to share.
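For option (a), the one thing worth verifying up front is that the jump VM really has one leg in each network. Here is a minimal sketch of that check, assuming you already know the IP addresses assigned to its vNICs; `check_jump_host` and the sample subnets are hypothetical.

```python
import ipaddress


def check_jump_host(nic_ips, test_subnet, dr_lan_subnet):
    """Return a list of problems with a dual-homed jump VM (empty list = looks fine)."""
    test_net = ipaddress.ip_network(test_subnet)
    lan_net = ipaddress.ip_network(dr_lan_subnet)
    addrs = [ipaddress.ip_address(ip) for ip in nic_ips]

    problems = []
    if len(addrs) != 2:
        problems.append(f"expected exactly 2 vNICs, found {len(addrs)}")
    if not any(a in test_net for a in addrs):
        problems.append("no vNIC on the isolated test network")
    if not any(a in lan_net for a in addrs):
        problems.append("no vNIC on the DR site LAN")
    return problems


if __name__ == "__main__":
    # Hypothetical addressing for illustration only.
    print(check_jump_host(["192.168.50.10", "10.50.1.10"],
                          test_subnet="192.168.50.0/24",
                          dr_lan_subnet="10.50.1.0/24"))
```

Keep in mind that the jump VM is meant to be a place testers log in to, not a router; it should not forward traffic between its two vNICs.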

I am sure you can think of other ways as well, but remember to decide on and freeze a method during the planning phase, so that you can test your deployment in a pilot before going live in the production environment.

Here are a few screenshots from a PPT which I prepared for explaining this scenario.

SRM Setup between Primary & DR Site using vSphere Replication or Storage Array Based Replication

Performing a Test Recovery which will continue the Storage Replication and Bring the DR Machines up in a Test Network in a Snapshot Mode

The Primary Site has gone down and the Recovery is executed. The business has failed over to the DR Site and the Virtual Machines are connected to the DR Network

Well, I know this might bring up more questions in your mind and if it does then feel free to use the comment column and I will be happy to discuss these options. Choose the best for your DR environment and I can ensure that you would never face any issues whatsoever.


  1. What an informative article. I have been looking for disaster recovery services for our business for what seems like forever. This article was very helpful in helping me understand. Thanks so much for the post.

  2. @brielle.. I am glad this helped.. Feel free to reach out if you need help around the topic..


  3. We've completed our SRM installation and done several bubble tests, each time adding a component for client connectivity. Due to the volume of testers required for each exercise (50+) we're considering running the test as a recovery so we can get host to host connectivity. I could run all VMs on a single host but the performance would be unacceptable. I'd be interested in hearing from anyone who has run a recovery without a failback (data replication back to production is fortunately out of scope). I think this is our solution but I'm being told by the consultant it can't be done (he didn't know about the lack of AD authentication without a DC, so I'm not giving his opinion much weight). Thanks

  4. Hey Kat,

    That's a really good question and a valid concern. As I have mentioned in the article above, SRM Test Recovery can really limit you to just testing the sanity of the OS on the DR Site during test fail-overs. The Bubble Network restricts you from end to end testing, i.e. from VM to Application, database and end user.

    Especially if your applications (running in VMs) need resources outside the VMware environment (e.g. an Oracle Database Cluster on physical nodes), it is not possible for you to test your App Connectivity with the DB or perform any test transactions.

    I have implemented a solution for a VMware client, where we have actually given them the flexibility of testing DR END to END. What favored us was the OS on the virtual machines. They were running non-Windows (Linux/Unix) based OS platforms, hence during a DR test we would actually perform a test fail-over on the DR Network, but change the hostname of the servers as well, in order to ensure that we isolate the DR VMs from production.

    Then we connected to the Oracle Database, which was in Read/Write mode (for the test period in the DR site). They could do this at the Database layer by pausing replication from the protected site to the DR site.

    Once the DR Site is up, they would have users test all the applications end to end and also perform dummy transactions. Once the testing was successful, they would run a cleanup and also discard the DB log and continue the replication.

    Now, what I always tell my clients is that you should always go for at least one Actual Recovery rather than only tests, as that would check the robustness of your recovery plan and how well you have done your Business Impact Analysis.

    Hope this helps.

  5. Hey Sunny.. thanks for the response. I'm fortunate in that all of the applications I need to recover are virtualized and part of existing SRM recovery plans. The more I research the more I realize I am going to have to execute the plans in Recovery mode to be able to test end-to-end. Our April exercise is audited so it's critical all systems perform as if it were production. My challenge is that I don't want to execute a failback (the CIO would never OK it as part of a test) but neither do I want to break what has been, so far, replication without errors. The first step in our exercise is to terminate the WAN connectivity between prod and DR. The good news is I can execute the recovery at that point and it won't shut down production. I'm stumped, though, on how to terminate the recovery without the failback piece. Any thoughts are greatly appreciated. Kat

  6. Kat, once you initiate a Recovery from the SRM console, the SRM engine will try to shut down the Production Site as the first step. Since you would split the sites, the DR SRM/vCenter will not be able to talk to the primary site, hence it would not be able to issue any commands to shut down protected VMs at the primary site before they are recovered on the DR Site.

    You have to be careful here. Since you do not want the Protected VMs to be powered off, ensure you test this part in a lab environment before you finalize your implementation doc.

    Secondly, once you have failed over, a failback is the way to get everything back the way it was before. But since this is a test, you would not want to fail back the changes, and that is understandable.

    To achieve this, once you have successfully failed over and are done with the testing, you can gracefully shut down the DR site VMs and un-register all the VMs.

    Now is the time, when you would have to break every relation between the primary and DR site and create everything from scratch. Kill the replication, delete the Site relation, delete the recovery plans and you might even have to re-install SRM and re-initialize the DBs.

    Unfortunately, you would have to go with this disruptive change to get everything back to the way it was. If you have scripting skills in your team, this all can be achieved in minutes.. however it is important that you have your installation & configurations documents updated and ready to be used once you are done with the Recovery which is actually a TEST....