vXpress: Disaster Recovery


Sunday, December 16, 2012

Using scripts to automate VMware Site Recovery Manager workflows & recovery steps!

While I am trying to keep up with projects, new VMware products, training and new initiatives for the VMware community, I thought of quickly coming back to the audience to talk about something which I have discussed before in one of my posts and to share my experience around it.

Well, I am talking about an article which I wrote on Guest IP Customization in VMware SRM. I would suggest you read it first, so that what we are trying to achieve with this post makes more sense. Here is the link - Things to Know About Guest IP Customization in VMware Site Recovery Manager!

Coming back to where I left off. We know that certain operating systems, mostly Unix based, might not support OS customization, because of which we are unable to use the Guest IP Customization option in the SRM workflow. This feature allows us to insert a new IP address into a virtual machine using the OS APIs at the time of Test Recovery or Disaster Recovery workflow execution. If the OS does not support IP customization, you have to change the IP address manually on the recovered virtual machine at the DR site, which increases the overall RTO (Recovery Time Objective) for resuming services at the DR site. This may not matter much in a Test Drill, but the manual method is prone to human error and can be painful to execute at the time of a disaster.

VMware SRM has a great feature which can be used to achieve the desired automation and agility when recovering VMs with the required IP settings for Guest OSes which do not support customization. If you look at the recovery workflow, you will see that you can add pre-power on and post-power on commands to it. These commands can call scripts kept either in a central repository on the SRM server or on the recovered virtual machine itself, and they run when you execute the recovery plan which recovers the virtual machines in question. The screenshot below shows what this window looks like in the SRM recovery workflow:-

In this post we will learn how to use the option "COMMAND ON RECOVERED VM". Here is the situation.

SCENARIO==> You have a virtual machine running Oracle Linux which is protected using SRM. Once this VM is recovered on the DR site during a test recovery or disaster recovery, you want the IP address and other settings to change as soon as the VM comes up. Guest IP Customization will not work on Oracle Linux; if you try that method you get the following error - "Error: The Guest operating system "oracleLinux64Guest" is not supported." How will we solve this issue?

SOLUTION==> To solve this issue we will use the Command on Recovered VM feature of the SRM workflow. Here is how you navigate to this window:-

1	Click Recovery Plans in the left pane, and select the recovery plan which has the VM in question.
2	Click the Virtual Machines tab.
3	Right-click the virtual machine and click Configure.
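4	Select Pre-Power On Steps or Post Power On Steps in the left pane, and click Add. (A command that runs inside the recovered VM needs the guest and VMware Tools to be up, so use Post Power On Steps for this scenario.)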
5	Select Command on Recovered VM.
6	In the Name text box, type a name for the step, for example "Command to call IP Address Change Script".
7	In the Content text box, type the command for the step to run: /bin/sh /etc/network-scripts/changeip.sh (note the space after /bin/sh; I will share the content of the script changeip.sh in a moment)
8	(Optional) Modify the Timeout setting.
9	Click OK to add the step to the recovery plan.
10	Click OK to reconfigure the virtual machine to run the command before or after it powers on. Here is the screenshot from my server:-

Here is what my script changeip.sh looks like:-

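#!/bin/bash
#################################################################
# Injecting network files for DR environment
# Overwrite the live network settings with the pre-staged DR copy,
# preserving mode, ownership and timestamps (-p).
/bin/cp -p /etc/network-scripts/network /etc/sysconfig/network
# Reboot (runlevel 6) so the VM comes up with the new settings.
/sbin/init 6
#End of script.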

Let me describe this script. On the virtual machine at my primary site I have already placed a file with the new IP settings at the following location - /etc/network-scripts/network. This file is replicated to the DR VM image automatically as part of my replication plan, whether storage based or vSphere Replication.

The script copies this 'network' file, which holds all the new settings, over /etc/sysconfig/network, replacing the existing IP settings of the Linux machine with the ones defined in our DR network file. After that the VM reboots and comes up with the new settings.
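If you want something a little more defensive than the script above, here is a minimal sketch. It is only an illustration and not the script I run in production: the function name inject_network_file, the .pre-dr backup suffix and the --apply switch are assumptions made for this example, while the two paths are the same ones used above.

```shell
#!/bin/bash
# Hypothetical hardened variant of changeip.sh (a sketch, not the original script).

# Copy the pre-staged DR file over the live one, keeping a one-time backup.
inject_network_file() {
    local src="$1" dst="$2"
    # Stop if the DR file was not replicated or is empty.
    if [ ! -s "$src" ]; then
        echo "DR network file missing or empty: $src" >&2
        return 1
    fi
    # Keep the original settings once, so a failback can restore them.
    if [ -f "$dst" ] && [ ! -e "$dst.pre-dr" ]; then
        cp -p "$dst" "$dst.pre-dr" || return 1
    fi
    cp -p "$src" "$dst"
}

# Only touch the system (and reboot) when called with --apply.
if [ "${1:-}" = "--apply" ]; then
    inject_network_file /etc/network-scripts/network /etc/sysconfig/network \
        && /sbin/init 6
fi
```

With this variant the Content text box of the SRM step would hold something like /bin/bash /etc/network-scripts/changeip.sh --apply, and the reboot only happens if the copy succeeded.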

Similar to this, you can do a bunch of things by calling such scripts on the recovered VMs, such as:-

a) Change any other OS settings such as name, IP, network settings etc.
b) Run cron jobs or scheduled tasks.
c) Run configuration files to make changes in an application configuration.
d) Do wonders ;-)

Now when you run this recovery plan, your VM will power on at the DR site and, as soon as VMware Tools is up on the VM, the SRM workflow will call this script, which changes the network settings of the VM automatically, just as you scripted. Automation and agility at its best. :-)

You might have questions about the rights needed to run this script. Remember that VMware Tools is used here, so the script runs with the same rights as the user account which was used to install VMware Tools, normally domain admin on Windows and root on Linux.

Well, that's all for this time. I hope this helps you get things going with SRM and use the amazing options it provides to automate stuff. In my upcoming article, I will talk about how to use the "Command on SRM Server" option.

Do let me know if you have any questions or comments. Hope this helps..

Thursday, October 25, 2012

Things to Know About Guest IP Customization in VMware Site Recovery Manager!

With the introduction of Site Recovery Manager 5.0, the customization workflows of the product improved tenfold compared to the previous versions of SRM. Guest IP customization in SRM 5.0 was a great revolution as it helped reduce the RTO significantly. In the earlier versions of SRM, it used to take a long time for virtual machines to get a new IP address during a Test Recovery or Actual Recovery, since this was achieved using utilities like sysprep.

However, with SRM 5.0 the Guest IP customization feature allows you to inject a new IP address into the virtual machine powering on at the DR site within a few seconds. This is done by using the guest OS APIs to push the new IP address as soon as the VM is ready to be powered on. In fact, to achieve this the virtual machine is first powered on briefly to inject the IP, and then powers on again with the new settings according to the power-on priorities and dependencies which you have created in the recovery plan.

The amazing thing is that this can be configured for hundreds of virtual machines using a pre-configured XML file, or through a convenient GUI option if you have a smaller environment. To learn how to configure Guest IP customization, refer to this blog article from VMware Blogs.

Now that you understand the feature, it's important that you apply it in an environment which can support it. I say this because Guest IP customization does not work on all the Guest OSes which are supported on the vSphere platform; it only works on Guest OSes which support guest customization. Though most Guest Operating Systems do support it, it is better to check beforehand so that you do not face any roadblocks during the implementation. You can get the list of Guest OSes which support Guest Customization on the following link.

If you end up with a Guest OS which is not supported for customization, you will get the following error message in the Recovery Workflow, as shown in the screenshot below:-

Error: The Guest operating system "oracleLinux64Guest" is not supported.

The value in quotes is the OS which does not support customization. If your recovery workflow has a step to customize the IP of this OS, the recovery plan will STOP with this error. The virtual machine will end up registered at the DR site, but in a powered OFF state.

You can manually power on this VM at the DR site, however its IP address will remain unchanged from the primary site, which could result in a catastrophe (for example, a duplicate IP on the network if the primary machine is still running), hence please be careful.

Now, let's talk about the remedies and how we can take care of such situations until the Guest OS advances and starts supporting the customization option. I believe there are 2 methods to take care of this issue.

Method 1 - We can make this change a manual step, which means that we simply add a Message Step to the recovery plan for such virtual machines. This message step is added before the Power On step in the workflow. The message states that the virtual machine vNIC needs to be disconnected at power on and that the Administrator needs to change the virtual machine IP address manually at the time of TEST or actual RECOVERY. You should also disable the IP Customization step for this machine in the workflow so that the workflow executes successfully.

Method 2 - The second method is more ADMIN friendly, as I am going to ask you to script this change and add it as a Post Power On script in the Recovery Workflow. For this you need 2 Ethernet configuration files, one for the Primary site and the other for the DR site. Please research and create these files on the basis of the operating system which you have. This will mostly be a Unix flavored OS, as most Windows Guest OSes support customization. Once you have these files ready, add a script on the startup of the OS to replace the existing network settings with the new ones at the time of Failover, Failback or Test.
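To make Method 2 a bit more concrete, here is a minimal sketch of such a script for a Linux guest. Everything specific in it is an assumption for illustration: the function name activate_site_config, the file names ifcfg-eth0.primary and ifcfg-eth0.dr, and the idea of passing the site name as an argument. Adapt the file names and locations to whatever your operating system actually uses.

```shell
#!/bin/bash
# Sketch of Method 2: keep one interface file per site and activate the right one.
# File names and the staging directory layout are hypothetical.

activate_site_config() {
    local site="$1" staging="$2" target="$3"
    # Accept only the two known sites.
    case "$site" in
        primary|dr) ;;
        *) echo "unknown site: $site" >&2; return 2 ;;
    esac
    local src="$staging/ifcfg-eth0.$site"
    # Refuse to overwrite the live file with a missing or empty one.
    if [ ! -s "$src" ]; then
        echo "missing config: $src" >&2
        return 1
    fi
    cp -p "$src" "$target"
}
```

A Failover or Test step could then call, for example, activate_site_config dr /etc/network-scripts /etc/sysconfig/network-scripts/ifcfg-eth0 and restart networking or reboot, while a Failback step would pass primary instead.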

Hope this will help you to tackle such a situation if you come across one.

Monday, October 15, 2012

VMware Site Recovery Manager - Accessing Test Network during Disaster Recovery Drills!

Like most of my blog posts, this topic also started as a question raised by one of my customers during an SRM Plan & Design engagement. I did not find a blog or a document which speaks about this topic, hence I thought of documenting it on vXpress to help the community use this solution if they face a similar situation.

Well, the topic is pretty much self explanatory, however let me go ahead and dissect it for those who are wondering what a TEST Network is with respect to SRM.

The most popular feature of VMware SRM is that it allows you to perform DR drills using the Test Recovery option, which lets you test your DR side Virtual Machines, Applications, Networks and the Workflows which you define in the Recovery Plans. These Recovery Plans are created during the configuration of SRM and they are modern day DR run-books which execute as soon as you run a Test Recovery or Actual Recovery from the SRM Console. Let's look at the difference between TEST & RECOVERY, highlighted in RED in the screenshot below:-

Once you have created a Recovery Plan, which defines the workflow to be executed when you press either of those buttons, you are ready to perform either a

a) Test Drill - Just a test of your DR site virtual machines, to see if your DR solution is actually working. In this process, the Production Virtual Machines keep running on the Primary Site, while copies of these machines on the DR side are mounted on the ESXi servers and powered on in a snapshot mode (this snapshot is deleted when you clean up the test recovery, so that you do not save any changes on the DR VMs while testing). The replication of data, whether Storage Based or vSphere Replication (host based), is not impacted by this Test Drill at any time.

b) Recovery - If you use this button, it means you actually had a bad day at the office... You met a disaster and finally decided that your production site is down (due to a fire, power outage, earthquake, flood etc). Once you press this button and agree to the warnings, you force the DR machines to power on based on your Recovery Plan and start operations from your DR Site.

Now there is a minor difference between the two cases. In case of Recovery your primary VMs are down, hence you power on your secondary VMs to continue business operations. The Secondary Site network can be an extended network from the Primary site or a different subnet. You will not have duplicate host name or IP issues since the primary machines are DOWN.

In case of the Test Drill, since the Primary machines are still UP, you power on the DR machines in an ISOLATED TEST NETWORK. This can be created either by choosing the AUTO option while defining DR and Test Networks in the Recovery Plan, or by provisioning an ISOLATED VLAN with IP addresses which can be assigned to these test machines so that testing can be performed.

So far I hope it was easy to understand and implement...

Now, since the product has this capability of Test DR Drills, you would want to test your Recovery Plans, which include Virtual Machines, Operating Systems, Data, VM Interoperability etc, all of which can be powered on in a bubble environment and tested as and when needed. This can be done even when your production is up and running, so this is COOL. However, you need to understand that every element you need for a test must be a part of this ISOLATED network. Anything outside this network cannot be tested or included in this trust zone, in order to avoid DNS conflicts which could lead to data loss, corruption etc. For example, if you are testing a 3 tier application which has a Web VM (virtualized and protected via SRM), an application VM (virtualized and protected via SRM), and a database which is PHYSICAL and not protected via SRM, then you cannot really test the application completely, as the physical database cannot run in Test Mode like VMware Virtual Machines.

Even if your database has the capability to run in a snapshot mode, it is not recommended to include that DB in your Test environment unless you are changing the DB networking to the isolated Test Network. Do not create any routes between your Test network and LAN, as this can cause irreparable trouble.

Phewwww.... Alright, now that you will follow the right rules, let's talk about accessing this test network. Let's say you are able to test these Applications, VM instances etc and you want your testers to access this environment from your Primary Site (in most cases this is where the application teams, users etc would be sitting). You have a couple of options here:-

a) Jumpstart Terminal Server - You can provision a W2K8 R2 VM on the DR site with an RDS (aka terminal server) license and allow your testers to access this machine and use its web browser to access the application. This VM can be used without a Terminal Server license if you do not need multiple testers to access it via RDP. This VM is provisioned with 2 vNICs, one connected to your isolated TEST Network port group and the other to your DR Site LAN. Needless to say, your Primary site users should have access to the secondary site LAN via MPLS cloud etc.

b) VMware View Desktops - VDI is another way of making this possible, since you can provision desktops in this network port group, create a separate pool for DR testers and allow them to connect when needed.

c) vSphere Client Access - You can allow the testers to log in to the DR site vCenter with limited access, so that they can directly launch the console of the Test Virtual Machines and play around. This should be very well planned and tested to avoid any unauthorized access.

d) VMRC Weblink - You can generate a Virtual Machine Remote Console weblink and give it to the testers to use in case they need to, however this will also give them direct access to the virtual machine files and data, which you may or may not want to share.

I am sure you can think of other ways as well, but remember to decide on and freeze a method during the planning phase, so that you can test your deployment in a pilot before going live in the production environment.

Here are a few screenshots from a PPT which I prepared for explaining this scenario.

SRM Setup between Primary & DR Site using vSphere Replication or Storage Array Based Replication

Performing a Test Recovery which will continue the Storage Replication and Bring the DR Machines up in a Test Network in a Snapshot Mode

The Primary Site has gone down and the Recovery is executed. The business has failed over to the DR Site and the Virtual Machines are connected to the DR Network

Well, I know this might bring up more questions in your mind, and if it does then feel free to use the comment column and I will be happy to discuss these options. Choose the one that best fits your DR environment and plan it well, and you should be able to avoid most issues.

Thursday, September 6, 2012

Using vSphere Replication with VMware Site Recovery Manager !!

Today was SRM day at the office, so I thought let's get all the discussions together and put them in an article which would help others as well. I will do it in a question and answer format to make it easier for the readers of this article...

Note: - This is primarily focused on how VMware Site Recovery Manager along with vSphere Replication helps customers create a BCP/DR solution for them with some amazing and easy features available in this product from VMware.

----------------------------------------------------------------------------------------------------------------------------------

Question - Will SRM 5 work with different hardware at the DC and DR site? As I understand, in case of failover with SRM VMs cold-boot. So there shouldn’t be a hardware compatibility issue.

Answer - You can have different models of x86 server hardware at the 2 sites, as the virtualization layer of vSphere makes it seamless for the virtual machines to boot up on the DR Site and then fail back as well.

The identical hardware question comes up when you are implementing Storage Array based replication for SRM to use. In this case, your storage arrays should be identical for the replication technology to successfully replicate LUNs from the Protected to the Recovery Site and back.

However, if you are using vSphere Replication, you would not have to worry about that as SRM then uses the host based replication instead of Storage array based replication.

----------------------------------------------------------------------------------------------------------------------------------

Question - If total replication data size is around 20 TB how can we achieve this using host based replication? It is huge data.

Answer - That's a great question and probably a challenge which any organization would face, whether they use storage array based replication or vSphere Replication. Now if you are using vSphere Replication, there are 2 ways to solve this issue:-

a) Use the Full Replication option, which literally means that you start the replication on each VM and let it finish before you start creating and configuring your recovery plans in the SRM interface. For a total of 20 TB of data this could take days (depending on the available bandwidth, distance and latency) to complete. I do not recommend this method to customers who are using vSphere Replication, as they have an easier way of making these images available on the Recovery (DR) Site.

b) Let's talk about the easier way now. Instead of replicating the entire data over the wire, we can use the physical couriering method, which saves time and bandwidth. It is as simple as taking a backup / clone of the Virtual Machines which need to be protected on the Protected Site and dumping this copy on a USB drive / tape, or multiple drives / tapes in your case. Create an MD5 checksum of these images to ensure consistency (optional step) and ship them across to the Recovery Site. Now you can seed these images on the target LUNs. Once you are done with this, configure replication on the Protected Site and point it to the seeds, as vSphere Replication gives you that option during configuration, and you are done. Now vSphere Replication will run its magic and replicate just the delta changes to the Recovery Site, which is a minimal amount of data compared to the first approach.
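For the optional MD5 step, a minimal sketch using the standard md5sum utility could look like this. The function names write_manifest and verify_manifest, the *.vmdk glob and the manifest file are assumptions for illustration; run the first function where you create the seed copies and the second one after the drives arrive at the Recovery Site.

```shell
#!/bin/bash
# Sketch: checksum the seed copies before shipping, verify them after arrival.
# The *.vmdk glob and the manifest location are assumptions for illustration.

# Protected Site: write one MD5 line per image into a manifest file.
# The manifest path must be absolute, because we change directory first.
write_manifest() {
    local image_dir="$1" manifest="$2"
    ( cd "$image_dir" && md5sum -- *.vmdk > "$manifest" )
}

# Recovery Site: returns non-zero if any shipped image differs from the manifest.
verify_manifest() {
    local image_dir="$1" manifest="$2"
    ( cd "$image_dir" && md5sum --quiet -c "$manifest" )
}
```

Because the manifest stores relative file names, the same manifest can be checked against the copies wherever they are mounted at the Recovery Site.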

----------------------------------------------------------------------------------------------------------------------------------

Question - Suppose we replicate 10 VMs to the DR site and 5 go down at the DC. In this case do we have to bring the DR VMs up manually, or does it happen automatically?

Answer - The SRM solution primarily provides a framework to design your BCP/DR environment, hence you would use it when you either see a disaster coming or face a disaster which has already destroyed your primary site (Protected Site). It allows you to do either a planned migration (if you know that a disaster is coming) or a site failover (if the damage is already done). Hence, it is not a VM level recovery solution, but a site level recovery solution. However, in your case, you can create 2 recovery plans with 5 VMs each and execute only one recovery plan instead of failing over the entire site. You can even have a recovery plan for each VM and execute the one which you want to recover on the DR site in case of a disaster. That said, I would recommend that we first look at high availability solutions such as VMware HA, FT or third party clustering solutions for recovering from VM failures. Only if it is impossible to recover the VMs at a site should we look at SRM, as a site failure solution.

----------------------------------------------------------------------------------------------------------------------------------

Question - Suppose the customer has a VM in which we do some configuration, like configuring the IP of the DB server so that it can be in sync with the DB server. Now if that VM goes down, it will come up at DR and try to contact the DB server which is in the DC. In that case, how can we utilize the SRM feature effectively?

Answer - SRM gives you the automation to recover virtual machines at the Recovery Site, however it is important that you provide all the components which are required by the applications running on those virtual machines. Taking the database example, you should either have the database available with the same IP address on the DR site or, if the database has a different IP address, change that setting in the ODBC connection after the VM powers on, either with a script (VB Script, for example) or manually. If you script it, the SRM Recovery Plan workflow can accommodate that script and execute it for you. For the IP addresses of the virtual machines themselves, we can use Guest IP customization, which makes the IP changes on the fly using APIs on Windows. You can use Bulk IP customization if you want to change the IPs of a number of virtual machines in one go, and this can all be configured while setting up SRM for the first time. Regarding the DB, you can either virtualize it so that SRM can replicate it and make it available at the recovery site, or use DB replication technologies to have the database available at the recovery site.
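The answer above talks about VB Script and ODBC on Windows. For a Linux guest, the same idea (point the application at the DR database after power on) could be sketched as below. This is an assumed analogue, not something from the original discussion: the function name update_db_host and the idea that the database address lives in a plain text configuration file are hypothetical, and the \b word boundary relies on GNU sed.

```shell
#!/bin/bash
# Sketch: swap the database server address in an application config file.
# Function name and config file layout are hypothetical; requires GNU sed.

update_db_host() {
    local file="$1" old="$2" new="$3"
    local pattern
    # Escape the dots so the old IP is matched literally, not as a wildcard.
    pattern=$(printf '%s' "$old" | sed 's/\./\\./g')
    # Replace whole addresses only (\b) and keep the original as FILE.bak.
    sed -i.bak "s/\b${pattern}\b/${new}/g" "$file"
}
```

Called from a Post Power On step as, for example, update_db_host /path/to/app.conf 10.0.0.5 192.168.1.5 (path and addresses are placeholders), it leaves a .bak copy behind so the change can be reverted at failback.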

----------------------------------------------------------------------------------------------------------------------------------

These features really differentiate vSphere Replication from other host based replication technologies and help customers implement DR which they were unable to do before.

Alright, I know that might open a "can of worms" and make you ask more questions. If that's the case, feel free to use the comment field and we can have some more discussions around this topic.

In addition to this, I would highly recommend the following links from VMware Technical Marketing teams who have done a great job to deep dive into the vSphere Replication technology and discussed "Behind the Scenes" of this feature :-

Advanced vSphere Replication Options for Single VM Replication Performance - by Ken Werneburg

How to Track Replication for Bandwidth Estimation?

Hosting.com shares data on bandwidth, sync times, and RPOs with vSphere Replication
