Sunday, December 16, 2012

Using scripts to automate VMware Site Recovery Manager workflows & recovery steps!

While I try to keep up with projects, new VMware products, training and new initiatives for the VMware community, I thought of coming back to you to revisit something which I discussed in one of my earlier posts and share my experience around it.

Well, I am talking about an article which I wrote on Guest IP Customization in VMware SRM. I would suggest you read that first, so that what we are trying to achieve in this post makes more sense. Here is the link -  Things to Know About Guest IP Customization in VMware Site Recovery Manager!

Coming back to where I left off. We know that certain operating systems, mostly Unix based, might not support OS customization, because of which we are unable to use the Guest IP Customization option in the SRM workflow. This feature allows us to insert a new IP address into a virtual machine using the OS APIs at the time of Test Recovery or Disaster Recovery workflow execution. If the OS does not support IP customization, you have to change the IP address manually on the recovered virtual machine at the DR site, which adds to the overall RTO (Recovery Time Objective) for resuming services there. This may not hurt a test drill much, but the manual method is prone to human error and can be painful to execute at the time of a disaster.

VMware SRM has a great feature which can be used to achieve the desired automation and agility when recovering VMs with the required IP settings for guest OSes which do not support customization. If you look at the recovery workflow, you will see that you can add pre-power on and post-power on commands to it. These commands can call scripts kept either in a central repository on the SRM server or on the recovered virtual machine itself. They run whenever you execute the recovery plan which recovers the virtual machines in question. The screenshot below shows what this window looks like in the SRM recovery workflow:-

In this post we will learn how to use the option "COMMAND ON RECOVERED VM". Here is the situation.

SCENARIO==> You have a virtual machine protected using SRM which is running Oracle Linux. Once this VM gets recovered on the DR site during test recovery or disaster recovery, you want the IP address and other settings to change as soon as the VM comes up on the DR site. Since Guest IP Customization will not work on Oracle Linux you would get the following error if you try to use that method - "Error: The Guest operating system "oracleLinux64Guest" is not supported." How will we solve this issue?

SOLUTION==> To solve this issue we would use the Command on Recovered VM feature of SRM workflow. Here is how you navigate to this window:-

Click Recovery Plans in the left pane, and select the recovery plan which has the VM in question.
Click the Virtual Machines tab.
Right-click the virtual machine and click Configure.
Select Pre-Power On Steps or Post Power On Steps in the left pane, and click Add.
Select Command on Recovered VM.
In the Name text box, type a name for the step.
"Command to call IP Address Change Script."

In the Content text box, type the commands for the step to run.
/bin/sh<space>/etc/network-scripts/  (I will share the content of the script in a moment)

(Optional) Modify the Timeout setting.
Click OK to add the step to the recovery plan.
Click OK to reconfigure the virtual machine to run the command before or after it powers on.

Here is the screenshot from my server:-

Here is what my script looks like:-

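# Injecting network files for DR environment
# Replace the live network settings with the pre-staged DR copy
/bin/cp -p /etc/network-scripts/network /etc/sysconfig/network
# Reboot so the VM comes up with the new settings
/sbin/init 6
#End of script.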

Let me describe this script. I have already placed a file with the new IP settings on the virtual machine at my primary site, at the following location - /etc/network-scripts/network. This file is replicated to the DR VM image automatically as a part of my replication plan, whether storage based or vSphere Replication.

The script copies this 'network' file, which holds all the new settings, to /etc/sysconfig/, replacing the existing network settings of the Linux machine with the ones we defined in our DR network file. After that the VM reboots and comes up with the new settings.
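The script above trusts that the DR file exists and overwrites the live configuration without keeping a copy. A slightly more defensive variant might look like the sketch below. The function name apply_dr_network, the ".pre-dr" backup suffix and the commented-out call at the end are my own illustrative choices and not something SRM requires; only the two paths come from the script above.

```shell
#!/bin/sh
# Sketch: swap in the DR network file, keeping a backup of the original.
# apply_dr_network and the ".pre-dr" suffix are hypothetical names.

apply_dr_network() {
    src="$1"   # pre-staged file holding the DR settings
    dst="$2"   # live configuration file to be replaced

    # Refuse to continue if the DR file is missing or empty.
    if [ ! -s "$src" ]; then
        echo "DR network file $src is missing or empty" >&2
        return 1
    fi

    # Back up the current settings once, so a rerun does not
    # overwrite the original backup with the DR copy.
    if [ -f "$dst" ] && [ ! -e "$dst.pre-dr" ]; then
        cp -p "$dst" "$dst.pre-dr" || return 1
    fi

    cp -p "$src" "$dst"
}

# On the recovered VM the call would use the paths from the script above,
# rebooting only if the copy succeeded:
# apply_dr_network /etc/network-scripts/network /etc/sysconfig/network && /sbin/init 6
```

Chaining the reboot with && means the VM is not restarted when the copy fails, which leaves the machine in a state you can still inspect.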

Similar to this script you can do a bunch of things by calling such scripts on the recovered VM's, such as:-

a) Change any other OS settings such as name, IP, network settings etc.
b) Run cron jobs or schedule tasks.
c) Run configuration files to make changes in an application configuration.
d) Do wonders ;-)

Now when you run this recovery plan, your VM will power on at the DR site, and as soon as VMware Tools is up on the VM, the SRM workflow will call this script, which changes the network settings of the VM automatically, just as you scripted. Automation and agility at its best. :-)

You might have questions about the rights needed to run this script. Remember that VMware Tools is used here, so the script runs with the same rights as the user account which was used to install VMware Tools, normally domain admin on Windows and root on Linux OSes.

Well, that's all for this time. I hope this helps you get things going on SRM and use the amazing options which SRM provides to automate stuff. In my upcoming article, I will talk about how to use the "Command on SRM server" option.

Do let me know if you have any questions or comments. Hope this helps..

Sunday, December 9, 2012

A blog series on "Basics of VMware Virtualization" - Share your opinion!!

The idea for this small survey came out of numerous instances where college freshers, people in the industry, professionals from other technology sectors etc. have asked me about learning VMware technology.

I have answered these questions in a different way every time. For someone who was not as lucky as me to join VMware and learn the technology from some amazing colleagues and mentors, how do I create a platform from which they can start from scratch and work their way up to be on par with Certified Professionals? As tempting as it looks for someone who wants to learn, this will take a tremendous amount of work and late nights for me as an author.

In order to ensure that this idea makes sense and would benefit the VMware community of tomorrow, I would request you to take this small survey with just 2 questions and let me know your opinion.

Your response would matter a lot to me.

Here is the link:-

Thanks once again!!

Wednesday, November 7, 2012

Reclaiming Waste Capacity using vCOPS - How to Calculate Waste & Usage Accurately!

This article comes out of a discussion which I had around a month ago with one of my colleagues, and then with a customer, about the Capacity Management feature of vCenter Operations Manager (vCOPS).

If you are new to vCOPS, I would recommend reading my other posts on vCOPS which would make you familiar with this topic:

vCenter Operations Manager - Solving Performance, Capacity and Configuration Problems!!

Right Sizing vCenter Operations Manager vAPP For Efficient Performance !!

I believe this topic is worth writing about as it might help you understand how Capacity IQ, which is now rolled up into vCenter Operations Manager (vCOPS), calculates the usage of resources for a Virtual Machine or, for that matter, any object in vCenter. The resource usage ultimately helps the tool monitor capacity utilization over a period of time. This capacity utilization feeds the calculation of 2 minor badges:-

a) Reclaimable Waste, and
b) Density

These two values then roll up into a main badge known as "EFFICIENCY". A score of 100 on efficiency means that you are using the virtual infrastructure in the most appropriate way. As that score drops, you know that you have virtual machines which are either over-sized, which leads to waste of resources, or under-sized, which leads to performance issues due to resource contention.

Below is a screenshot which shows how the efficiency badge and the sub-badges show in the vCenter Operations Dashboard.

At the end of the day, efficiency is the most important piece of information which the Capacity Management feature of vCOPS provides. From the perspective of an IT buyer, it becomes a tool which helps you to ensure that you do not waste any resources in your infrastructure by following primitive methods of resource allocation to servers and applications.

Hence, this allows you to right size your infrastructure as you operate and manage it. 

For example, a new application which needs to be deployed in your infrastructure needs a Windows 2008 R2 VM with 4 vCPU and 16 GB of RAM, as per the application owner. This might be a practice which the application owner is carrying forward from the world of physical servers. However, with virtualization it is quite possible that the VM will never use the allocated capacity. The challenge is how to capture this data and present it back to the application owner.

vCOPS has the answer - Once this machine is created and the server goes into production, vCOPS would start monitoring this virtual machine on a regular basis and would capture data around utilization of CPU & Memory. After a period of 30 to 45 days, vCOPS would understand the capacity utilization patterns of this virtual machine. After this, a report in vCOPS about Reclaimable Waste will easily tell you about all the virtual machines which are over-sized on CPU or Memory. On the basis of this report you can reclaim the resource and save a lot of money for your organization by increasing the efficiency of the hardware. 

Having said this, it is important that you have the correct settings to monitor the capacity utilization and usage patterns of your virtual infrastructure. In a business environment where the servers work between 9 AM and 6 PM, Monday to Friday, it is important that you capture the utilization patterns during this period to calculate the reclaimable waste and density of the Virtual Machine. If you set the monitoring days and time to 24/7 in such a scenario, you will end up capturing a lot of skewed data which does not reflect the real business cycles. This will ultimately show a very low efficiency and a huge amount of reclaimable waste which does not really exist.
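To see why the monitoring window matters, here is a rough back-of-the-envelope sketch. It is not how vCOPS computes its badges internally; the function name weekly_average and the sample figures (80% CPU during business hours, 5% otherwise) are assumptions chosen only to illustrate the averaging effect.

```shell
#!/bin/sh
# Sketch: average utilization over a whole week versus business hours only.
# weekly_average and all sample numbers are hypothetical.

weekly_average() {
    busy_pct="$1"    # average utilization during business hours (%)
    idle_pct="$2"    # average utilization outside business hours (%)
    busy_hours="$3"  # business hours per week
    total_hours=168  # 24 hours x 7 days

    idle_hours=$((total_hours - busy_hours))
    # Time-weighted average, truncated to a whole percent.
    echo $(( (busy_pct * busy_hours + idle_pct * idle_hours) / total_hours ))
}

# 9 AM to 6 PM, Monday to Friday = 9 hours x 5 days = 45 hours
echo "Business-hours average: 80%"
echo "24/7 average: $(weekly_average 80 5 45)%"
```

With these sample figures the 24/7 average comes out at roughly 25%, so a VM which is genuinely busy during the working day would look heavily over-sized if the idle nights and weekends were counted.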

To avoid such problems, follow the settings on the screenshot mentioned below and you should be good to go.

This would ensure that you capture the right data and process it into valid information which will help you manage capacity in your Virtual Infrastructure. It's important that no decision regarding capacity is taken in haste. Rather, you should customize the settings for monitoring capacity on the basis of your own environment and then let the tool run for a period of 4 to 6 weeks before you start looking into the results and making changes for the betterment of your Virtual Infrastructure.

Hope this will help you learn more about what vCOPS can do for you and how you can do such tasks accurately.

Sunday, November 4, 2012

Right Sizing vCenter Operations Manager vApp For Efficient Performance !!

The most important discussion point to come out of a recent engagement on vCenter Operations Manager was the sizing of the vCOPS vApp. There were other interesting discussions as well, and I will write about those too; however, I thought this would be a good start towards understanding what vCOPS can do for you and how the solution should be sized for best results.

If you are new to vCOPS and need to understand the basics, you should refer to one of my previous articles about vCOPS - vCenter Operations Manager - Solving Performance, Capacity and Configuration Problems!!

After going through that post you would know that vCOPS is available as a packaged vAPP which consists of two Virtual Machines. These machines are called:-

Analytics Virtual Machine and UI Virtual Machine

The diagram below shows the architecture of this vAPP:-

In this article I will not explain each of these components as it is clearly defined in a free VMware MyLearn training on the following link - VMware vCenter Operations Manager Fundamentals [V5.X]. I would highly recommend this training to anyone who wants to learn about vCOPS to understand the basics of this solution.

Now, coming back to sizing these virtual machines. 

Analytics VM - At a high level, you need to understand that the Analytics VM is the one which does most of the work and also receives all the data in the form of raw metrics. All the algorithms for Performance Analytics run in this virtual machine, and their results are stored in separate databases which are hosted on this virtual machine.

UI (User-Interface) VM - This is where the Capacity IQ is hosted along with the Admin User-Interface, vSphere User-Interface and the Custom User-Interface. The Capacity related data is stored in a database which is a part of this virtual machine.

Now, the question which arises here is how we size these virtual machines in terms of CPU & Memory resources, and how much storage should be assigned to them, to ensure that all the components of this application run successfully and perform all the tasks which you expect them to.

Since we know that vCOPS extends its offering to vSphere and Non vSphere environments, we need to look at sizing requirements from both perspectives:-

a) Based on the number of VMs – Works well for a vSphere-centric Environment.

b) Based on the number of Metrics – Works well when adding non vSphere adapters

Let's look at the numbers for a vSphere Environment for CPU, RAM & Storage

In case you are using a collector to collect metrics data from Non-vSphere environment, then you would need to calculate and add CPU, Memory & Disk resources on the basis of the recommended numbers below:-

I hope this will help you to size your vCOPS vAPP appropriately at the time of deployment. Right sizing will ensure that you do not face any issues with this application while it's monitoring the performance, showcasing capacity and ensuring standard configurations and compliance in your vSphere & Non-vSphere environment.

Tuesday, October 30, 2012

vCenter Operations Manager - Solving Performance, Capacity and Configuration Problems!!

I have been writing about the cloud infrastructure products of VMware all this while. I believe it is important to look at the management side of things as well. No doubt Virtualization makes things easier for an organization and its IT department; however things are good till the time they are small and easy to manage. As the confidence of such organizations increases towards Virtualization, you would notice a VM sprawl, which might someday defeat the purpose of Virtualization & consolidation.

A sprawl leads to unpredictable behavior of the Virtual Infrastructure, performance bottlenecks, waste of expensive resources, complex troubleshooting procedures and issues around other day to day admin activities. I believe any administrator of a medium/large virtual infrastructure would agree with the brief description of the pain points which I have listed above. If you categorize these problems broadly, they fall under 3 major categories:-

I - Performance Problems - This will include pain points such as:-

§  How is my overall infrastructure performing?
§  How hard are my physical ESXi servers working?
§  Am I looking at a potential problem which might break my infrastructure?
§  Is there a way I can predict issues and solve them before they are noticed by End Users?
§  Which areas do I look at to troubleshoot any existing issues?
§  Is it a storage issue or is it the hypervisor?

&, the list is never ending..........

II - Capacity Problems - Let's have a look at the pain points of capacity:-

§  Do I have enough Physical capacity (CPU, MEMORY, NETWORK and STORAGE) to support my virtual machines?
§  When do I need to buy more hardware?
§  Am I wasting any resources?
§  Have I sized the virtual machines appropriately?
§  How should I answer questions raised by my CXOs about capacity forecasts, buying decisions etc.?

& many more.....

III - Configuration & Compliance Problems - A more critical problem area, let's see some issues here:-

§  I am asked to follow HIPAA, SOX or PCI compliance policies for my servers, am I compliant?
§  My infrastructure is too big. How do I control changes in my Virtual Infrastructure?
§  Is there a way to maintain common standards?

The more you dig here... the more issues you would find...

Well, the list above is only a subset of the issues which we face while managing the Virtual Infrastructure. As a reader of this article and an administrator, you would probably be able to add 10 more unique issues to this list. However, we still have a bigger issue on hand. I call it the BIGGEST issue. Questions are always raised around:-

- There are so many tools in the market which claim that they can help me with such issues. Which one should I choose?

- What should I do with my existing tools? That's a huge investment which I have already made.

- Can I get a Single Pane of Glass to solve all such issues? (The most common one - People are fascinated with a single pane, I don't know why?)

and another long list of questions.........

I hope you are with me so far and not lost in the issues which you are facing in your own virtual infrastructure, because now I am going to tell you how you can solve them. I must tell you that the solution I am going to talk about has been around for a good number of years; however, I have taken my own sweet time to start believing in it as it has matured. Now that I see this solution working for large enterprises, I guess this is a good time for you to look into it for solving issues related to Performance, Capacity, Configuration and Compliance in your virtual infrastructure.

As the headline of my post suggests, I am talking about vCenter Operations Manager a.k.a. vCOPS. As an introduction, I would say that this VMware solution has been stitched together in the recent past by plucking out the best components from various industry standard tools which have been around for a long time. Though there are a number of functions available in this solution, I will talk about the major life-savers here:-

Let me explain each one of them in a simpler manner:-

Patented Performance Analysis - A set of 9 patented algorithms which look at performance as a behavior and not a threshold. The engine learns the behavior of your infrastructure by monitoring all the performance metrics. It learns the normal behavior and only alerts you if it observes abnormal behavior which could lead to a potential problem. Overall it gives you the HEALTH of the infrastructure and helps you take the right decisions in real time. This came in as a part of the Integrien acquisition back in 2010.

Purpose Built Capacity Planning & Analysis - This is VMware Capacity IQ, which is rolled up into this suite. Those who know the power of Capacity IQ would know that it can pinpoint things like over-sized & under-sized VMs, wasted resources, time remaining & capacity remaining to provision new workloads, and potential RISKS associated with the capacity of your physical infrastructure, something I have not seen any other tool available today do. It will also help you do tasks such as Capacity Trending & Forecasting for better and more accurate buying decisions.

Automated Configuration & Compliance - This ability of the suite is provided by vCenter Configuration Manager, which again has been around for a while. It was a part of the IONIX product portfolio; later VMware picked it up from EMC to weave it into vCOPS and complete the picture. This is one of the strongest solutions which I have witnessed for compliance and configuration management, and it is capable of working across virtual and physical infrastructure, across OS platforms and across server architectures.

I hope this gives you some insight on VMware vCenter Operations Manager. I know I am leaving you with a few thoughts around what else this solution can do and how it actually does what it promises to. There would be questions around financial implications and licensing models as well. I will leave you with a few links which will help you learn more about vCOPS and at the same time I will come back with a few more articles which will help you use this solution effectively in your infrastructures.

Before I share those links, here is an interesting fact which might impress you, if you are still not impressed by vCenter Operations Manager:-

The latest version of vCOPS, called vCOPS 5.6 (launched at VMworld Barcelona), has the capability to work across multiple hypervisors and multiple cloud platforms (private or public), and allows you to build Self-Healing mechanisms using vCenter Orchestrator as the Workflow Orchestration Engine. Except for Capacity Management, this solution can extend into your Non-VMware infrastructure for Performance, Compliance & Configuration Management. (Storage, Networks, Other monitoring tools, HP UX, Solaris, Applications - Exchange, Oracle, SQL, Amazon, Azure, XEN, Hyper-V etc)

I sure see a revolution coming our way.. Get your seat belts on & sit tight :-)

Here are the links which you can use to learn more:- 

VMware vCenter Operations Manager Fundamentals [V5.X] - This free eLearning course covers how to install and configure vCenter Operations Manager as well as how to use its many robust features.

Other Technical Resources and Links - Link to demos, videos, documentation etc.

Pricing & Packaging - All you need to know about vCOPS licensing, pricing etc.

Hope this helps... If you liked this article, kindly share with others and let the knowledge spread.......

Thursday, October 25, 2012

Things to Know About Guest IP Customization in VMware Site Recovery Manager!

With the introduction of Site Recovery Manager 5.0, the customization workflows of the product improved tenfold compared to the previous versions of SRM. SRM 5.0 Guest IP customization was a great revolution, as it helped reduce the RTO significantly. In the earlier versions of SRM, it used to take a long time for Virtual Machines to get a new IP address during a Test Recovery or Actual Recovery, since this was achieved by using utilities like sysprep.

However, with SRM 5.0 the Guest IP customization feature allows you to inject a new IP address into the virtual machine powering on at the DR site within a few seconds. This is done by using the guest OS APIs to push the new IP address as soon as the VM is ready to be powered on. In fact, to achieve this, the virtual machine is first briefly powered on to inject the IP, and then it powers on again with the new settings as per the power on priorities and dependencies which you have created in the recovery plan.

The amazing thing is that this can be configured for hundreds of virtual machines by using a pre-configured xml file, or through a convenient GUI option if you have a smaller environment. To learn how to configure Guest IP customization, refer to this blog article from VMware Blogs.

Now that you understand the feature, it's important that you use it in an environment which can support it. I say this because Guest IP customization does not work on all the guest OSes which are supported on a vSphere platform. It will only work on guest OSes which support Guest IP customization. Though this feature is supported by most guest operating systems, it is better to check beforehand to ensure that you do not face any roadblocks during the implementation. You can get the list of guest OSes which support Guest Customization on the following link.

In case you get into a situation where you have a Guest OS which is not supported for customization, you would get the following error message in the Recovery Workflow as shown in the screenshot below:-

Error: The Guest operating system "oracleLinux64Guest" is not supported. The value in the quotes is the OS which does not support customization. If your recovery workflow has a step to customize the IP of such an OS, the recovery plan will STOP with this error. The Virtual Machine ends up registered at the DR site, but in a powered OFF state.

You can manually power on this VM in the DR site, however the IP address of this machine will remain unchanged from the primary site which could result in a catastrophe, hence please be careful.

Now, let's talk about the remedies and how we can take care of such situations till the time the guest OS advances and starts supporting the customization option. I believe we can use 2 methods to take care of this issue.

Method 1 - We can make this change a manual step, which means that we simply add a Message Step to the recovery plan for such Virtual Machines. This message step is added before the Power On step in the workflow. The message states that the Virtual Machine vNIC needs to be disconnected at power on and that the virtual machine IP address needs to be changed manually by the Administrator at the time of TEST or Actual RECOVERY. You should also disable the IP Customization step for this machine in the workflow so that the workflow executes successfully.

Method 2 - The second method is more ADMIN friendly, as I am going to ask you to script this change and add it to the Post Power On script option in the Recovery Workflow. For this you need 2 Ethernet configuration files, one for the Primary and the other for the DR site. Please research and create these files on the basis of the operating system which you have. This will mostly be a Unix flavored OS, as most Windows guest OSes support customization. Once you have these files ready, add a script on the startup of the OS to replace the existing network settings with the new ones at the time of Failover, Failback or Test.
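As a rough illustration of Method 2, the sketch below picks one of two pre-staged configuration files depending on the site. Everything specific in it is an assumption for illustration: the helper name pick_network_file, the staging directory, the file names, the "site" argument, and the Red Hat style target path in the commented-out call. Adapt all of these to your own operating system.

```shell
#!/bin/sh
# Sketch: choose between two pre-staged Ethernet configuration files.
# The directory layout, file names and the "site" argument are hypothetical.

pick_network_file() {
    site="$1"         # "primary" or "dr", supplied by whoever calls the script
    staging_dir="$2"  # directory holding both pre-staged files

    case "$site" in
        primary) echo "$staging_dir/ifcfg-eth0.primary" ;;
        dr)      echo "$staging_dir/ifcfg-eth0.dr" ;;
        *)       echo "unknown site: $site" >&2; return 1 ;;
    esac
}

# During a failover or a test the recovery step would pass "dr",
# and during a failback it would pass "primary", for example:
# cp -p "$(pick_network_file dr /root/net-staging)" /etc/sysconfig/network-scripts/ifcfg-eth0
```

Passing the site in as an argument keeps the script identical on both sides, so the same file can be replicated along with the VM and reused for failback.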

Hope this will help you to tackle such a situation if you come across one. 

Friday, October 19, 2012

Using vSphere Replication for Protecting Databases with VMware Site Recovery Manager

I recently came across a question which involved the utilization of vSphere Replication to replicate databases from Protected Site to Recovery Site with VMware Site Recovery Manager as the DR engine. This query had 2 parts to it:- 

  • One, whether vSphere Replication Supports DB replication, &
  • Is it a good option to use vSphere Replication for DB Protection.

To begin with, you can definitely replicate Virtual Machines using vSphere Replication as a part of the VMware SRM 5.x solution. vSphere Replication does not care about what application you are running inside the Virtual Machine, hence it replicates any and every virtual machine which you configure for replication using this method. This is because the replication is not done at the VMFS file system layer; instead it is done at the VMkernel layer by using a vSCSI filter.

Now, since we are planning to replicate a database, we might want to consider a few things which might help the replication engine to have a crash consistent data stream on the DR virtual machine. 

Since databases can always create consistency issues when you replicate them using storage based or host based replication, most database vendors have their own solutions for replicating the database at the application level. This is definitely the safest option, along with the more traditional option of log shipping. However, today I will discuss other methods which might prove successful in your environment and help you save money on DB replication licensing & management overheads.

"With SRM 5.0 and now with SRM 5.1 we heavily rely on VMware Tools for consistency at the OS and application level.  VMware Tools has the ability to issue commands to the operating system such as to set up VSS snapshots.  With 5.1 we have the ability to do a little more than we have in the past, and ask the OS to flush application writers as well as make the OS itself quiescent.  This means for things like databases, messaging platforms, and other applications that have VSS writers, we can ensure a higher level of application recoverability.  When using vSphere Replication we can flush all the writers for the apps and the OS ensuring data consistency for the image used for recovery." - Reference - Ken Werneburg's SRM 5.1 and vSphere Replication as a Standalone Feature.

The only show stopper for us in this case is that the OS instance of the Virtual Machine running the database must support VSS (Windows only). If your DB is indeed on Windows, you can enable quiescing using VSS (Volume Shadow Copy Service) while configuring vSphere Replication on a virtual machine. The screenshot below shows that option highlighted in red; click on that drop-down and you will have the option to select VSS.

If you are not on Windows, then you should look at alternate solutions, such as Storage Array Based Replication or Log Shipping. In all cases, I would set up a test environment and double-check the solution, as this is not just technology dependent but also environment dependent.

Hope this helps you take the right decision.

Thursday, October 18, 2012

Best Practices around using RDMs in vSphere!

One of VMware's partner engineers raised this query on an internal group. He wanted to understand and learn the best practices, or the Do's and Don'ts, of using RDM (Raw Device Mapping) LUNs in a vSphere environment.

I hope that you, as a reader, understand what an RDM is and what role it plays in a vSphere environment. In case you are not familiar with RDM, kindly refer to the following document - vSphere 5.x Storage Guide - and read about Raw Device Mapping (RDM).

During this discussion we will consider the following requirements which we need to meet :-
  • We need to provision RDMs for more than 20 VMs, and the size of the disks will vary from 1TB to 19TB.
  • RDM has been chosen in order to configure MSCS on VMs running MS Exchange, MSSQL, File Servers etc.

A usual topic of discussion is choosing between RDM and VMDK. Since that decision has already been made here, there is not much to worry about. We are already following the best practices at the application layer by choosing RDMs instead of VMDKs. Now, since we are dealing with LUNs mapped to your virtual machines, there are a few things we should take care of:-
  1. Choosing between Physical Compatibility Mode & Virtual Compatibility Mode for RDM – A physical RDM is driven more by the storage array and controlled by the virtual machine. The VMkernel has limited or no role to play; it literally becomes a postman who delivers IOs from the OS to the LUN (just like an OS running on a physical server saving data on a storage LUN). This restricts you from using VM level snapshots and other file locking technologies of the VMkernel. Since you are talking about disk sizes of more than 2 TB, please ensure you are on VMFS 5 and use Physical compatibility mode only, as VMFS 5 does not support RDM with Virtual compatibility mode for LUNs greater than 2 TB minus 512 bytes. With Physical compatibility mode, on the other hand, you can use RDMs of up to 64 TB. (VMFS 5 required)
  2. Ensure you have the correct Multipathing and failover settings
  3. Please ensure you are zoned appropriately. Test heavily before pushing things into production
  4. Follow the MSCS on vSphere guide without fail to avoid any last-minute surprises
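The size limits in point 1 can be turned into a quick check. The sketch below only encodes the two numbers quoted above (2 TB minus 512 bytes for Virtual compatibility mode, 64 TB for Physical compatibility mode on VMFS 5); the helper name rdm_mode and its output labels are made up for illustration, and it assumes a shell with 64-bit integer arithmetic.

```shell
#!/bin/sh
# Sketch: pick an RDM compatibility mode from the LUN size in bytes,
# using the limits quoted in point 1. rdm_mode is a hypothetical helper.

rdm_mode() {
    size_bytes="$1"
    virtual_max=$((2 * 1024 * 1024 * 1024 * 1024 - 512))  # 2 TB - 512 bytes
    physical_max=$((64 * 1024 * 1024 * 1024 * 1024))      # 64 TB

    if [ "$size_bytes" -le "$virtual_max" ]; then
        echo "virtual-or-physical"
    elif [ "$size_bytes" -le "$physical_max" ]; then
        echo "physical-only"
    else
        echo "too-large"
    fi
}

one_tb=$((1024 * 1024 * 1024 * 1024))
echo "1 TB LUN:  $(rdm_mode "$one_tb")"
echo "19 TB LUN: $(rdm_mode $((19 * one_tb)))"
```

For the requirement above, the 1 TB disks fit either mode on size alone, while the 19 TB disks can only be presented in Physical compatibility mode.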

Well this should help you do the right things with RDM's. Ensure you go through the document which I mentioned before and you should be good to go.

Wednesday, October 17, 2012

"Target disk UUID validation failed" Error while configuring vSphere Replication on a Pre-seeded VMDK

vSphere Replication gives you the option of performing host-based replication of a virtual machine from one data-center to another over a network link. This feature was introduced with vCenter Site Recovery Manager 5.x. It allows customers to use vSphere Replication as the primary replication engine to replicate data from one site to another and then use the SRM engine to automate the entire DR process.

Since you have the option to set replication per virtual machine, you can also pre-seed the VMDK files of a virtual machine on a LUN in the target datastore (by restoring a full image from a backup). This saves time and replication bandwidth, since you do not have to replicate all the data over the WAN; you only replicate the changes from the Primary Site to the DR Site by syncing both images.

Most customers who use the pre-seeding method register the restored VMs on the DR site and power them on to check whether the backup is good and whether they can pre-seed that image. Once such a VM is registered and powered on, you will be asked whether this VM "was copied" or "was moved". If you proceed with the default option of "was copied", the UUID of the VMDKs would change to a random value.

Now when you try to set up the first-time sync using the vSphere Replication configuration wizard, the configuration fails with the following error: "Target disk UUID validation failed".

This error comes up because, when the replication engine compares the VMDK descriptor files of the source and destination VMDKs, it finds that they have different UUIDs. This causes the replication configuration and the first-time sync to fail.

To solve this issue, you can simply use the ESXi shell or a PuTTY (SSH) session to get the UUID from the VMDK descriptor file of the Primary Site VM. Keep this UUID noted, as you need to replace the UUID in the target VMDK descriptor with this source UUID. Once done, you will be able to set up the replication again using the same seed VMDK without any issues.

Here is what a UUID looks like in a VMDK descriptor:-

ddb.uuid = "60 00 C2 94 dd 43 63 90-18 77 3f 23 6d 8e f0 22" 

Please ensure you do this for all the disks (VMDKs) attached to the virtual machine in question. Also make sure you have a backup available before you play around with this, in case you do not have hands-on experience.
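
If you would rather not edit the descriptor by hand, the same swap can be scripted. The following is a minimal sketch in Python that works on the descriptor text only; the function names and the sample descriptor lines are made up for illustration, and it assumes the descriptor contains a single `ddb.uuid = "..."` line like the one shown above. You would still copy the descriptor off the datastore, keep a backup, and put the edited file back yourself.

```python
import re

# Matches a line such as: ddb.uuid = "60 00 C2 94 ..."
UUID_LINE = re.compile(r'^(ddb\.uuid\s*=\s*)"([^"]*)"', re.MULTILINE)


def read_uuid(descriptor_text):
    """Return the UUID string found in a VMDK descriptor."""
    match = UUID_LINE.search(descriptor_text)
    if match is None:
        raise ValueError("no ddb.uuid line found in descriptor")
    return match.group(2)


def replace_uuid(descriptor_text, new_uuid):
    """Return a copy of the descriptor with its UUID replaced by new_uuid."""
    if UUID_LINE.search(descriptor_text) is None:
        raise ValueError("no ddb.uuid line found in descriptor")
    # A function is used as the replacement so new_uuid is inserted literally.
    return UUID_LINE.sub(
        lambda m: '{}"{}"'.format(m.group(1), new_uuid),
        descriptor_text,
        count=1,
    )


if __name__ == "__main__":
    # Sample descriptor fragments, invented for this example.
    source = 'ddb.uuid = "60 00 C2 94 dd 43 63 90-18 77 3f 23 6d 8e f0 22"\n'
    target = 'ddb.uuid = "60 00 C2 9f 00 11 22 33-44 55 66 77 88 99 aa bb"\n'

    fixed = replace_uuid(target, read_uuid(source))
    print(fixed)
```

Run it once per disk, pairing each target descriptor with the descriptor of the matching source disk.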

Monday, October 15, 2012

VMware Site Recovery Manager - Accessing Test Network during Disaster Recovery Drills!

Like most of my blog posts, this topic was also a question raised by one of my customers during an SRM Plan & Design engagement. I did not find a blog or a document which speaks about this topic, and hence I thought of documenting it on vXpress to help the community use this solution if they face a similar situation.

Well, the topic is pretty much self-explanatory; however, let me go ahead and dissect it for those who are wondering what the TEST Network is with respect to SRM.

The most popular feature of VMware SRM is that it allows you to perform DR drills using the Test Recovery option, which lets you test your DR-side Virtual Machines, Applications, Networks and the Workflows which you define in the Recovery Plans. These Recovery Plans are created during the configuration of SRM, and they are the modern-day DR Run-books which execute as soon as you run a Test Recovery or an Actual Recovery from the SRM Console. Let's look at the difference between TEST & RECOVERY, highlighted in RED in the screenshot below:-

Once you have created a Recovery Plan, which defines the workflow that needs to be executed when you press either of those buttons, you are ready to perform either a

a) Test Drill - Just a test of your DR site virtual machines, to see if your DR solution is actually working. In this process, the Production Virtual Machines keep running on the Primary Site, while copies of these machines on the DR side are mounted on the ESXi servers and are powered on in a snapshot mode (this snapshot is deleted when you clean up the test recovery, so that you do not save any changes on the DR VMs while testing). The replication of data, whether Storage Based or vSphere Replication (host based), is not impacted by this Test Drill at any time.

b) Recovery - This button, if used, means you actually had a bad day at the office... It means you met a disaster and finally decided that your production site is down (due to a fire, power outage, earthquake, floods etc.). Once you press this button and agree to the warnings, you force the DR machines to power on based on your Recovery Plan and start operations from your DR Site.

Now there is a minor difference in both the cases. In case of Recovery your primary VM's are down, hence you power on your secondary VM's to continue business operations. The Secondary Site network can be an extended network from the Primary site or can be a different sub-net as well. You would not have duplicate Host Names or IP issues since the primary machines are DOWN.

In case of the Test Drill, since the Primary machines are still UP, you power on the DR machines in an ISOLATED TEST NETWORK. This can be created either by choosing the AUTO option while defining DR and Test Networks in the Recovery Plan, or by provisioning an ISOLATED VLAN with IP addresses which can be assigned to these test machines so that testing can be performed.

So far I hope it was easy to understand and implement...

Now, since the product has this capability of Test DR Drills, you would want to test your Recovery Plans, which include Virtual Machines, Operating Systems, Data, VM Interoperability etc., all of which can be powered on in a bubble environment and tested as and when needed. This can be done even when your production is up and running, so this is COOL. However, you need to understand that all the elements you need for a test must be a part of this ISOLATED network; anything outside this network cannot be tested or included in this trust zone, to avoid DNS conflicts which could lead to data loss/corruption etc. For example, if you are testing a 3-tier application which has a Web VM (virtualized and protected via SRM), an application VM (virtualized and protected via SRM), and a database which is PHYSICAL and is not protected via SRM, then you cannot really test the application completely, as the physical database cannot run in the Test Mode like VMware Virtual Machines.

Even if you have the capability in your database to run in a snapshot mode, it is not recommended to include that DB in your Test environment unless you are changing the DB networking to the isolated Test Network. Do not create any routes between your Test network and LAN, as this can cause trouble which is irreparable.
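
To make the 3-tier example concrete, here is a small Python sketch that lists which tiers of an application cannot take part in a test drill. The `Tier` structure and its field names are hypothetical and exist only for this illustration; SRM does not expose anything by these names.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    """One tier of an application (hypothetical model for illustration)."""
    name: str
    virtualized: bool
    srm_protected: bool


def untestable_tiers(tiers):
    """Return the names of tiers that SRM cannot bring up in the test bubble."""
    return [t.name for t in tiers if not (t.virtualized and t.srm_protected)]


if __name__ == "__main__":
    # The 3-tier application described above.
    app = [
        Tier("web", virtualized=True, srm_protected=True),
        Tier("application", virtualized=True, srm_protected=True),
        Tier("database", virtualized=False, srm_protected=False),
    ]
    missing = untestable_tiers(app)
    if missing:
        print("Cannot fully test; tiers outside the test bubble:", missing)
    else:
        print("All tiers can be tested in the isolated network")
```

Running such a check over your application inventory during the planning phase shows early which applications can only be partially tested.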

Phewwww.... Alright, now since you would follow the right rules, let's talk about accessing this test network. Let's say you are able to test these Applications, VM instances etc. and you want your testers to access this environment from your Primary Site (in most cases this is where the application teams, users etc. would be sitting). You have a couple of options here:-

a) Jumpstart Terminal Server - You can provision a W2K8 R2 VM on the DR site with an RDS (aka terminal server) license and allow your testers to access this machine and use the web browser to access the application. This VM can be used without a Terminal Server license if you do not want multiple testers to access it via RDP. This VM would be provisioned with 2 vNICs: one connected to your isolated TEST Network port group and the other to your DR Site LAN. Needless to say, your Primary Site users should have access to the secondary site LAN via an MPLS cloud etc.

b) VMware View Desktops - VDI is another way of making this possible, since you can provision desktops in this network port group, create a separate pool for DR testers and allow them to connect when needed.

c) vSphere Client Access - You can allow the testers to log in to the DR site vCenter with limited access, so that they can directly launch the console of the Test Virtual Machines and play around. This should be very well planned and tested to avoid any unauthorized access.

d) VMRC Weblink - You can generate a Virtual Machine Remote Console weblink and give it to the testers to use in case they need to; however, this will also give them direct access to the virtual machine files and data, which you may or may not want to share.

I am sure you can think of other ways as well, but remember to decide on and freeze a method during the planning phase, so that you can test your deployment in a pilot before going live in the production environment.

Here are a few screenshots from a PPT which I prepared for explaining this scenario.

SRM Setup between Primary & DR Site using vSphere Replication or Storage Array Based Replication

Performing a Test Recovery which will continue the Storage Replication and Bring the DR Machines up in a Test Network in a Snapshot Mode

The Primary Site has gone down and the Recovery is executed. The business has failed over to the DR Site and the Virtual Machines are connected to the DR Network

Well, I know this might bring up more questions in your mind, and if it does then feel free to use the comments section and I will be happy to discuss these options. Choose the one that best fits your DR environment and plan it well, and your test drills should run without surprises.

Wednesday, October 10, 2012

vShield Endpoint Now Available to vSphere Customers!!

With the introduction of vSphere 5.1, all the editions of vSphere (Essentials Plus or higher) have the vShield Endpoint component bundled with them. This basically means that you no longer have to shell out dollars to use the functionality of Endpoint. It enables you to offload the anti-virus tasks to a service virtual machine, which runs on each ESXi server, so that the scanning for malicious activities and data happens on this service VM. This protects your virtual machines against virus attacks and other malicious activities. It also avoids the Storage, CPU or RAM bottlenecks which might be seen in the environment due to traditional anti-virus scans using an anti-virus agent inside each virtual machine.

As mentioned before, with the release of vSphere 5.1, Endpoint functionality is available at no extra cost to customers with valid SnS contract for Essentials Plus or higher. vSphere 5.1.x, 5.0.x and 4.0 U3 customers can download Endpoint from the respective vSphere download pages. No Endpoint license is needed.

Once you have the Endpoint service VM, you can use vShield Manager to configure it for all the ESXi servers in your data-center. Now, you would need to go to your anti-virus vendor and get the version of the anti-virus product which supports the Endpoint appliance. This will allow you to migrate from the primitive methodology of anti-virus scans and make your virtual infrastructure more robust, secure and efficient.

The diagram below gives you a visualization of how this works using Trend Micro Deep Security:-

Courtesy: Trend's Website

Below is the list of the popular Antivirus vendors who have already developed a solution around vShield Endpoint:-

On the Roadmap (Source: Google Search)

> Symantec Endpoint
> F-Secure
> Sophos
> Lumension

I can see that most of the existing and new security vendors would develop around Virtualization, as they all understand that their products need to adopt the agility of Virtualization and Cloud as well. Looking at the benefits, this is a more futuristic approach to providing endpoint security in a data-center. I can see this change taking us towards the era of Anti-Virus as a Service (AVaaS), wherein security vendors would provide customized endpoint products to data-centers and end users as a commodity service.

Another contribution to the Cloud from VMware. Kudos!!

Update to Article    Monday, December 3rd, 2012

As per the latest market update, Symantec today announced availability of its first anti-malware software protection that supports VMware's security architecture known as vShield, becoming the latest anti-malware vendor to do so following similar moves by Trend Micro, Kaspersky Lab and McAfee, among others.
Symantec Endpoint Protection (SEP) 12.1.2 can be used to scan, detect, block and remediate against anti-malware....

More can be read here -