Introduction to Site Recovery Manager

From vmWIKI

Originating Author

Michelle Laverick




Version: vCenter SRM 5.0

Before I embark on the book proper I want to outline some of the new features in SRM. This will be of particular interest to previous users, as well as to new adopters, as they can see how far the product has come since the previous release. I also want to talk about what life was like before SRM was developed. As with all forms of automation, it's sometimes difficult to see the benefits of a technology if you have not experienced what life was like before its onset. I also want at this stage to make it clear what SRM is capable of and what its technical remit is. It's not uncommon for VMware customers to look at other technologies such as vMotion and Fault Tolerance (FT) and attempt to construct a disaster recovery (DR) use case around them. While that is entirely plausible, care must be taken not to build solutions that use these technologies in ways that have not been tested or are not supported by VMware.

What’s New in Site Recovery Manager 5.0

To begin, I would like to flag what's new in the SRM product. This will form the basis of the new content in this book. This information is especially relevant to people who purchased my previous book, as these changes are what made it worthwhile for me to update that book to be compatible with SRM 5.0. In the sections that follow I list what I feel are the major enhancements to the SRM product. I've chosen not to include a changelog-style list of every little modification. Instead, I look at new features that might sway a customer or organization into adopting SRM. These changes address flaws or limitations in the previous product that may have made adopting SRM difficult in the past.

vSphere 5.0 Compatibility

This might seem like a small matter, but when vSphere 5 was released some of the advanced management systems were quickly compatible with the new platform—a situation that didn't happen with vSphere 4. I think many people underestimate what a huge undertaking vSphere 5 actually is from a development perspective. VMware isn't as big as some of the ISVs it competes with, so it has to be strategic in where it spends its development resources. Saturating the market with product release after product release can alienate customers who feel overwhelmed by too much change too quickly. I would prefer that VMware take its time with product releases and properly QA the software rather than roll out new versions injudiciously. The same people who complained about any delay would have complained that it was a rush job had the software been released sooner. Most of the people who seemed to complain the most viciously about the delays in vSphere 4 were contractors whose livelihoods depended on project sign-off; in short, they were often looking out for themselves, not their customers. Most of my big customers didn't have immediate plans for a rollout of vSphere 5 on the day of General Availability (GA), and we all know it takes time and planning to migrate from one version of any software to another. Nonetheless, it seems VMware's product management has been effective in shaking this up, with the new release of SRM 5.0 coming in on time at the station.

vSphere Replication

One of the most eagerly anticipated new features of SRM is vSphere Replication (VR). This enables customers to replicate VMs from one location to another using VMware as the primary engine, without the need for third-party storage-array-based replication. VR will be of interest to customers who run vSphere in many branch offices, and yet still need to offer protection to their VMs. I think the biggest target market may well be the SMB sector, for whom expensive storage arrays, and even more expensive array-based replication, are perhaps beyond budget. I wouldn't be surprised to find that the Foundation SKUs reflect this fact and will enable these types of customers to consume SRM in a cost-effective way.

Of course, if you're a large enterprise customer who already enjoys the benefits of EMC MirrorView or NetApp SnapMirror, this enhancement is unlikely to change the way you use SRM. But with that said, I think VR could be of interest to enterprise customers; it will depend on their needs and situations. After all, even in a large enterprise it's unlikely that all sites will be using exactly the same array vendor in both the Protected and Recovery Sites. So there is a use case for VR to enable protection to take place between dissimilar arrays. Additionally, in large environments it may take more time than is desirable for the storage team to enable replication on the right volumes/LUNs; with VR, VMware admins are empowered to protect their VMs when they see fit.

It’s worth saying that VR is protocol-neutral—and that this will be highly attractive to customers migrating from one storage protocol to another—so VR should allow for replication between Fibre Channel and NFS, for example, just like customers can move a VM around with VMware’s Storage vMotion regardless of storage protocol type. This is possible because, with VR, all that is seen is a datastore, and the virtual appliance behind VR doesn’t interface directly with the storage protocols that the ESX host sees. Instead, the VR appliance communicates to the agent on the ESX host that then transfers data to the VR appliance. This should allow for the protection of VMs, even if local storage is used—and again, this might be very attractive to the SMB market where direct attached storage is more prevalent.

Automated Failback and Reprotect

When SRM was first released it did not come with a failback option. That's not to say failback wasn't possible; it just took a number of steps to complete the process. I've done innumerable failovers and failbacks with SRM 1.0 and 4.0, and once you have done a couple you soon get into the swing of them. Nonetheless, an automated failback process is a feature that SRM customers have had on their wish lists for some time. Instructions to manage the storage arrays are encoded in what VMware calls Storage Replication Adapters (SRAs). Previously, the SRA only automated the testing and running of SRM's Recovery Plans. But now the SRAs support the instructions required to carry out a failback routine. Prior to this, the administrator had to use the storage vendor's management tools to manage replication paths.

Additionally, SRM 5.0 ships with a process that VMware is calling Reprotect Mode. Prior to the reprotect feature it was up to the administrator to clear out stale objects in the vCenter inventory and re-create objects such as Protection Groups and Recovery Plans. The new reprotect feature goes a long way toward speeding up the failback process. With this improvement you can see VMware is making the VM more portable than ever before.

Most VMware customers are used to being able to move VMs from one physical server to another with vMotion within the site, and an increasing number would like to extend this portability to their remote locations. This is currently possible with long-distance live migration technologies from the likes of EMC and NetApp, but these require specialized technologies that are distance-limited and bandwidth-thirsty and so are limited to top-end customers. With an effective planned migration from SRM and a reprotect process, customers would be able to move VMs from site to site. Clearly, the direction VMware is taking is driven more toward managing the complete lifecycle of a VM, and that includes the fact that datacenter relocations are part of our daily lives.

VM Dependencies

One of the annoyances of SRM 1.0 and 4.0 was the lack of a grouping mechanism for VMs. In previous releases all protected VMs were added to a list, and each one had to be moved by hand to a series of categories: High, Low, or Normal. There wasn’t really a way to create objects that would show the relationships between VMs, or groupings. The new VM Dependencies feature will allow customers to more effectively show the relationships between VMs from a service perspective. In this respect we should be able to configure SRM in such a way that it reflects the way most enterprises categorize the applications and services they provide by tiers. In addition to the dependencies feature, SRM now has five levels of priority order rather than the previous High, Low, and Normal levels. You might find that, given the complexity of your requirements, these offer all the functionality you need.
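The tier-plus-dependency model described above is essentially a topological-sort problem. The following sketch illustrates the idea in Python; the VM names, tiers, and dependencies are invented for illustration, and this is a sketch of the concept, not SRM's actual implementation:

```python
from graphlib import TopologicalSorter  # Python 3.9+ standard library

# Hypothetical inventory: priority tiers (1 powers on first, 5 last),
# plus per-VM dependencies. All names here are invented.
priority = {"db01": 1, "app01": 2, "app02": 2, "web01": 3}
depends_on = {"app01": ["db01"], "app02": ["app01"], "web01": ["app02"]}

def power_on_order(priority, depends_on):
    """Tiers run one after another; within a tier, dependencies on
    VMs in the same tier decide the order."""
    order = []
    for tier in sorted(set(priority.values())):
        tier_vms = [vm for vm, p in priority.items() if p == tier]
        ts = TopologicalSorter()
        for vm in tier_vms:
            # Cross-tier dependencies are dropped here: the tier loop
            # already guarantees earlier tiers power on first.
            ts.add(vm, *[d for d in depends_on.get(vm, []) if d in tier_vms])
        order.extend(ts.static_order())
    return order

print(power_on_order(priority, depends_on))
# → ['db01', 'app01', 'app02', 'web01']
```

Note that cross-tier dependencies can safely be ignored within a tier, because the tiers themselves already run sequentially.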

Improved IP Customization

Another great area of improvement comes in the management of IP addresses. In most cases you will find that two different sites will have entirely different IP subnet ranges. According to VMware research, nearly 40% of SRM customers are forced to re-IP their VMs. Sadly, it's a minority of customers who have, or can get approval for, a "stretched VLAN" configuration where both sites believe they make up the same continuous network, despite being in entirely different geographies. One method of making sure that VMs with a 10.x.y.z address continue to function in a 192.168.1.x network is to adopt the use of Network Address Translation (NAT) technologies, such that VMs need not have their IP address changed at all.

Of course, SRM has always offered a way to change the IP address of Windows and Linux guests using the Guest Customization feature with vCenter. Guest Customization is normally used in the deployment of new VMs to ensure that they have unique hostnames and IP addresses when they have been cloned from a template. In SRM 1.0 and 4.0, it was used merely to change the IP address of the VM. Early in SRM a command-line utility, dr-ip-exporter, was created to allow the administrator to create many guest customizations in bulk using a .csv file to store the specific IP details. While this process worked, it wasn't easy to see how the original IP address related to the recovery IP address. And, of course, when you came to carry out a failback process all the VMs would need to have their IP addresses changed back to their original Protected Site addresses. For Windows guests the process was particularly slow, as Microsoft Sysprep was used to trigger the re-IP process. With this new release of SRM we have a much better method of handling the whole re-IP process—which will be neater and quicker and will hold all the parameters within a single dialog box on the properties of the VM. Rather than using Microsoft Sysprep to change the IP address of the VM, much faster scripting technologies like PowerShell, WMI, and VBScript can be used. In the longer term, VMware remains committed to investing in technologies both internally and with its key partners. That could mean there will be no need to re-IP the guest operating system in the future.
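To illustrate the kind of bulk mapping that a tool like dr-ip-exporter works with, here is a small sketch in Python; the column names and layout are invented for illustration and do not match the tool's real .csv format:

```python
import csv
import io

# Hypothetical two-way address mapping between the Protected Site and
# the Recovery Site. Column names are invented and do not match the
# real dr-ip-exporter format.
mapping_csv = """vm,protected_ip,recovery_ip
web01,10.0.1.10,192.168.1.10
db01,10.0.1.20,192.168.1.20
"""

def load_mapping(text, failback=False):
    """Return {vm: target_ip}. With failback=True the mapping reverses,
    driving the re-IP on the return trip to the Protected Site."""
    target = "protected_ip" if failback else "recovery_ip"
    return {row["vm"]: row[target] for row in csv.DictReader(io.StringIO(text))}

print(load_mapping(mapping_csv))
# → {'web01': '192.168.1.10', 'db01': '192.168.1.20'}
```

Keeping both addresses in one row addresses the pain point above: the same file shows how each original IP relates to its recovery IP, and can drive the re-IP in either direction.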

A Brief History of Life before VMware SRM

To really appreciate the impact of VMware's SRM, it's worth pausing for a moment to think about what life was like before virtualization and before VMware SRM was released. Until virtualization became popular, conventional DR meant dedicating physical equipment at the DR location on a one-to-one basis. So, for every business-critical server or service there was a duplicate at the DR location. By its nature, this was expensive and difficult to manage—the servers were only there as standbys waiting to be used if a disaster happened. For people who lacked those resources internally, it meant renting rack space at a commercial location, and if that included servers as well, it often meant the hardware being used was completely different from that at the primary location. Although DR is likely to remain a costly management headache, virtualization goes a long way toward reducing the financial and administrative penalties of DR planning. In the main, virtual machines are cheaper than physical machines. We can have many instances of software—Windows, for example—running on one piece of hardware, reducing the amount of rack space required for a DR location. We no longer need to worry about dissimilar hardware; as long as the hardware at the DR location supports VMware ESX, our precious time can be dedicated to getting the services we support up and running in the shortest time possible.

One of the most common things I've heard in courses and at conferences from people who are new to virtualization is this:

We’re going to try virtualization in our DR location, before rolling it out into production.

This is often used as a cautious approach by businesses that are adopting virtualization technologies for the first time. Whenever this is said to me I always tell the individual concerned to think about the consequences of what he’s saying. In my view, once you go down the road of virtualizing your DR, it is almost inevitable that you will want to virtualize your production systems. This is the case for two main reasons. First, you will be so impressed and convinced by the merits of virtualization anyway that you will want to do it. Second, and more important in the context of this book, is that if your production environment is not already virtualized how are you going to keep your DR locations synchronized with the primary location?

There are currently a couple of ways to achieve this. You could rely solely on conventional backup and restore, but that won't be very slick or very quick. A better alternative might be to use some kind of physical to virtual conversion (P2V) technology. In recent years many of the P2V providers, such as Novell and Leostream, have repositioned their offerings as "availability tools," the idea being that you use P2V software to keep the production environment synchronized with the DR location. These technologies do work, and there are some merits to adopting this strategy—say, for services that must, for whatever reason, remain on a physical host at the "primary" location. But generally I am skeptical about this approach. I subscribe to the view that you should use the right tools for the right job; never use a wrench to do the work of a hammer. From its very inception and design you will discover flaws and problems, because you are using a tool for a purpose for which it was never designed. For me, P2V is P2V; it isn't about DR, although it can be reengineered to do this task. I guess the proof is in the quality of the reengineering. On top of this you should know that in the long term, VMware has plans to integrate its VMware Converter technology into SRM to allow for this very functionality. In the ideal VMware world, every workload would be virtualized. In 2010 we reached a tipping point where more new servers were virtual machines than physical machines. However, in terms of percentage it is still the case that, on average, only 30% of most people's infrastructure has been virtualized. So, at least for the mid-term, we will still need to think about how physical servers are incorporated into a virtualized DR plan.

Another approach to this problem has been to virtualize production systems before you virtualize the DR location. By doing this you merely have to use your storage vendor's replication or snapshot technology to pipe the data files that make up a virtual machine (VMX, VMDK, NVRAM, log, Snapshot, and/or swap files) to the DR location. Although this approach is much neater, it introduces a number of problems, not least of which is getting up to speed with your storage vendor's replication technology and ensuring that enough bandwidth is available from the Protected Site to the Recovery Site to make it workable. Additionally, this introduces a management issue. In large corporations the guys who manage SRM may not necessarily be the guys who manage the storage layer. So a great deal of liaising, and sometimes cajoling, would have to take place to make these two teams speak and interact with each other effectively.

But putting these very important storage considerations to one side for the moment, a lot of work would still need to be done at the virtualization layer to make this sing. These "replicated" virtual machines need to be "registered" on an ESX host at the Recovery Site, and associated with the correct folder, network, and resource pool at the destination. They must be contained within some kind of management system from which they can be powered on, such as vCenter. And to power on the virtual machine, the metadata held within the VMX file might need to be modified by hand for each and every virtual machine. Once powered on (in the right order), their IP configuration might need modification. Although some of this could be scripted, it would take a great deal of time to create and verify those scripts. Additionally, as your production environment started to evolve, those scripts would need constant maintenance and revalidation. For organizations that make hundreds of virtual machines a month, this can quickly become unmanageable. It's worth saying that if your organization has already invested a lot of time in scripting this process and making a bespoke solution, you might find that SRM does not meet all your needs. This is a kind of truism. Any bespoke system created internally is always going to be more finely tuned to the business's requirements. The problem then becomes maintaining it, testing it, and proving to auditors that it works reliably.

It was within this context that VMware engineers began working on the first release of SRM. They had a lofty goal: to create a push-button, automated DR system to greatly simplify the process. Personally, when I compare it to the alternatives that came before it, I'm convinced that out of the plethora of management tools added to the VMware stable in recent years, VMware SRM is the one with the clearest agenda and remit. People understand and appreciate its significance and importance. At last we can finally use the term virtualizing DR without it actually being a throwaway marketing term.

If you want to learn more about this manual approach to DR, VMware has written a VMbook about virtualizing DR called A Practical Guide to Business Continuity & Disaster Recovery with VMware Infrastructure. It is free and available online here:

I recommend reading this guide, perhaps before reading this book. It has a much broader brief than mine, which is narrowly focused on the SRM product.

What Is Not a DR Technology?

In my time of using VMware technologies, various features have come along that people often either mistake for or try to engineer into being a DR technology—in other words, they try to make a technology do something it wasn't originally designed to do. Personally, I'm in favor of using the right tools for the right job. Let's take each of these technologies in turn and try to make a case for their use in DR.


vMotion

In my early days of using VMware I would often hear my clients say they intended to use vMotion as part of their DR plan. Most of them understood that such a statement could only be valid if the outage was in the category of a planned DR event, such as a power outage or the demolition of a nearby building. VMware and the network and storage vendors have been postulating the concept of long-distance vMotion for some time. In fact, one of the contributors to this book, Chad Sakac of EMC, had a session at VMworld San Francisco 2009 about this topic. Technically, it is possible to do vMotion across large distances, but the technical challenges are not to be underestimated or taken lightly given the requirements of vMotion for shared storage and shared networking. We will no doubt get there in the end; it's the next logical step, especially if we want to see the move from an internal cloud to an external cloud become as easy as moving a VM from one ESX host in a blade enclosure to another. Currently, to do this you must shut down your VMs and cold-migrate them to your public cloud provider.

But putting all this aside, I think it's important to say that VMware has never claimed that vMotion constitutes a DR technology, despite the FUD that emanates from its competitors. As an indication of how misunderstood both vMotion and the concept of what constitutes a DR location are, one of these clients told me that he could carry out vMotion from his primary site to his Recovery Site. I asked him how far away the DR location was. He said it was a few hundred feet away. This kind of wonky thinking and misunderstanding will not get you very far down the road of an auditable and effective DR plan. The real usage of vMotion currently is being able to claim a maintenance window on an ESX host without affecting the uptime of the VMs within a site. Once coupled with VMware's Distributed Resource Scheduler (DRS) technology, vMotion also becomes an effective performance optimization technology. Going forward, it may indeed become easier to carry out a long-distance vMotion of VMs to avoid an impending disaster, but much will depend on the distance and scope of the disaster itself. Other things to consider are the number of VMs that must be moved, and the time it takes to complete that operation in an orderly and graceful manner.

VMware HA Clusters

Occasionally, customers have asked me about the possibility of using VMware HA technology across two sites. Essentially, they are describing a "stretched cluster" concept. This is certainly possible, but it suffers from the technical challenges that confront geo-based vMotion: access to shared storage and shared networking. There are certainly storage vendors that will be happy to assist you in achieving this configuration; examples include NetApp with its MetroCluster and EMC with its VPLEX technology. The operative word here is metro. This type of clustering is often limited by distance (say, from one part of a city to another). So, as in my anecdote about my client, the distances involved may be too short to be regarded as a true DR location. When VMware designed HA, its goal was to be able to restart VMs on another ESX host. Its primary purpose was merely to "protect" VMs from a failed ESX host, which is far from being a DR goal. HA was, in part, VMware's first attempt to address the "eggs in one basket" anxiety that came with many of the server consolidation projects we worked on in the early part of the past decade. Again, VMware has never made claims that HA clusters constitute a DR solution. Fundamentally, HA lacks the bits and pieces to make it work as a DR technology. For example, unlike SRM, there is really no way to order its power-on events or to halt a power-on event to allow manual operator intervention, and it doesn't contain a scripting component to allow you to automate residual reconfiguration when the VM gets started at the other site. The other concern I have is when customers try to combine technologies in a way that is not endorsed or QA'd by the vendor. For example, some folks think about overlaying a stretched VMware HA cluster on top of their SRM deployment. The theory is that they can get the best of both worlds. The trouble is that the requirements of stretched VMware HA and SRM are at odds with each other.
In SRM the architecture demands two separate vCenters managing distinct ESX hosts. In contrast, VMware HA requires that the two or more hosts that make up an HA cluster be managed by just one vCenter. Now, I dare say that with a little bit of planning and forethought this configuration could be engineered. But remember, the real usage of VMware HA is to restart VMs when an ESX host fails within a site—something that most people would not regard as a DR event.

VMware Fault Tolerance

VMware Fault Tolerance (FT) was a new feature of vSphere 4. It allowed for a primary VM on one host to be "mirrored" by a secondary VM on another ESX host. Everything that happens on the primary VM is replayed in "lockstep" on the secondary VM on the other ESX host. In the event of an ESX host outage, the secondary VM will immediately take over the primary's role. A modern CPU chipset is required to provide this functionality, together with two 1GbE vmnics dedicated to the FT Logging network that is used to send the lockstep data to the secondary VM. FT scales to allow for up to four primary VMs and four secondary VMs per ESX host, and when it was first released it was limited to VMs with just one vCPU. VMware FT is really an extension of VMware HA (in fact, FT requires HA to be enabled on the cluster) that offers much better availability than HA, because there is no "restart" of the VM. As with HA, VMware FT has quite stringent requirements, including shared networking and shared storage—along with additional requirements such as bandwidth and network redundancy. Critically, FT requires very low-latency links to maintain the lockstep functionality, and in most environments it will be cost-prohibitive to provide the bandwidth to protect the same number of VMs that SRM currently protects. The real usage of VMware FT is to provide a select number of VMs within a site with a much better level of availability than that currently offered by VMware HA.

Scalability for the Cloud

As with all VMware products, each new release introduces increases in scalability. Quite often these enhancements are overlooked by industry analysts, which is rather disappointing. Early versions of SRM allowed you to protect a few hundred VMs, and SRM 4.0 allowed the administrator to protect up to 1,000 VMs per instance of SRM. That forced some large-scale customers to create “pods” of SRM configurations in order to protect the many thousands of VMs that they had. With SRM 5.0, the scalability numbers have jumped yet again. A single SRM 5.0 instance can protect up to 6,000 VMs, and can run up to 30 individual Recovery Plans at any one time. This compares very favorably to only being able to protect up to 1,000 VMs and run just three Recovery Plans in the previous release. Such advancements are absolutely critical to the long-term integration of SRM into cloud automation products, such as VMware’s own vCloud Director. Without that scale it would be difficult to leverage the economies of scale that cloud computing brings, while still offering the protection that production and Tier 1 applications would inevitably demand.

What Is VMware SRM?

Currently, SRM is a DR automation tool. It automates the testing and invocation of disaster recovery (DR), or as it is now called in the preferred parlance of the day, "business continuity" (BC), for virtual machines. Actually, it's more complicated than that. For many, DR is a procedural event. A disaster occurs and steps are required to get the business functional and up and running again. On the other hand, BC is more of a strategic concern, focused on the long-term prospects of the business post-disaster, and it should include a plan for how the business might one day return to the primary site or carry on in another location entirely. Someone could write an entire book on this topic; indeed, books have been written along these lines, so I do not intend to ramble on about recovery time objectives (RTOs), recovery point objectives (RPOs), and maximum tolerable downtimes (MTDs)—that's not really the subject of this book. In a nutshell, VMware SRM isn't a "silver bullet" for DR or BC, but a tool that facilitates decision processes planned well before a disaster occurs. After all, your environment may only be 20% or 30% virtualized, and there will be important physical servers to consider as well.

This book is about how to get up and running with VMware's SRM. I started this section with the word currently. Whenever I do that, I'm giving you a hint that either technology will change or I believe it will. Personally, I think VMware's long-term strategy will be to lose the "R" in SRM and for the product to evolve into a Site Management utility. This will enable people to move VMs from the internal/private cloud to an external/public cloud. It might also assist in datacenter moves from one geographical location to another—for example, because a lease on the datacenter might expire, and either it can't be renewed or it is too expensive to renew.

With VMware SRM, if you lose your primary or Protected Site the goal is to be able to go to the secondary or Recovery Site, click a button, and find your VMs being powered on at the Recovery Site. To achieve this, your third-party storage vendor must provide an engine for replicating your VMs from the Protected Site to the Recovery Site—and your storage vendor will also provide a Storage Replication Adapter (SRA), which is installed on your SRM server.

As replication or snapshots are an absolute requirement for SRM to work, I felt it was a good idea to begin by covering a couple of different storage arrays from the SRM perspective. This will give you a basic run-through on how to get the storage replication or snapshot piece working—especially if you are like me and you would not classify yourself as a storage expert. This book does not constitute a replacement for good training and education in these technologies, ideally coming directly from the storage array vendor. If you are already confident with your particular vendor’s storage array replication or snapshot features you could decide to skip ahead to Chapter 7, Installing VMware SRM. Alternatively, if you’re an SMB/SME or you are working in your own home lab, you may not have the luxury of access to array-based replication. If this is the case, I would heartily recommend that you skip ahead to Chapter 8, Configuring vSphere Replication (Optional).

In terms of the initial setup, I will deliberately keep it simple, starting with a single LUN/volume replicated to another array. However, later on I will change the configuration so that I have multiple LUNs/volumes with virtual machines that have virtual disks on those LUNs. Clearly, managing replication frequency will be important. If we have multiple VMDK files on multiple LUNs/volumes, the parts of the VM could easily become unsynchronized or even missed altogether in the replication strategy, thus creating half-baked, half-complete VMs at the DR location. Additionally, at a VMware ESX host level, if you use VMFS extents but fail to include all the LUNs/volumes that make up those extents, the extent will be broken at the recovery location and the files making up the VM will be corrupted. So, how you use LUNs and where you store your VMs can be more complicated than this simple example will first allow. This doesn't even take into account the fact that different virtual disks that make up a VM can be located on different LUNs/volumes with radically divergent I/O capabilities. Our focus is on VMware SRM, not storage. However, with this said, a well-thought-out storage and replication structure is fundamental to an implementation of SRM.
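The "half-baked VM" risk can be caught with a simple inventory check before you rely on a Recovery Plan. The following sketch uses invented data structures rather than any real SRM or vSphere API:

```python
# Hypothetical inventory: which datastore each virtual disk lives on,
# and which datastores are actually replicated to the Recovery Site.
vm_disks = {
    "web01": ["ds-replicated-01"],
    "db01":  ["ds-replicated-01", "ds-local-02"],  # straddles two datastores
}
replicated = {"ds-replicated-01"}

def protection_report(vm_disks, replicated):
    """Classify each VM as protected, unprotected, or the dangerous
    'partial' case where only some disks would arrive at the DR site."""
    report = {}
    for vm, stores in vm_disks.items():
        hits = sum(1 for ds in stores if ds in replicated)
        if hits == len(stores):
            report[vm] = "protected"
        elif hits == 0:
            report[vm] = "unprotected"
        else:
            report[vm] = "partial"
    return report

print(protection_report(vm_disks, replicated))
# → {'web01': 'protected', 'db01': 'partial'}
```

Any "partial" result means a disk would simply be missing at the recovery location, which is exactly the failure mode described above.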

What about File Level Consistency?

One question you will (and should) ask is what level of consistency the recovery will have. This is very easy to answer: the same level of consistency you would have had if you had not virtualized your DR. Through the storage layer we could be replicating the virtual machines from one site to another synchronously. This means the data held at both sites is going to be of a very high quality. However, what is not being synchronized is the memory state of your servers at the production location. This means if a real disaster occurs, that memory state will be lost. So, whatever happens there will be some kind of data loss, unless your storage vendor has a way to "quiesce" the applications and services inside your virtual machine.

So, although you may well be able to power on virtual machines in a recovery location, you may still need to use your application vendor’s tools to repair these systems from this “crash-consistent” state; indeed, if these vendor tools fail you may be forced to repair the systems with something called a backup. With applications such as Microsoft SQL and Exchange this could potentially take a long time, depending on whether the data is inconsistent and on the quantity to be checked and then repaired. You should really factor this issue into your recovery time objectives. The first thing to ensure in your DR plan is that you have an effective backup and restore strategy to handle possible data corruption and virus attacks. If you rely totally on data replication you might find that you’re bitten by the old IT adage of “Garbage in equals garbage out.”

Principles of Storage Management and Replication

In Chapter 2, Getting Started with Dell EqualLogic Replication, I will document in detail a series of different storage systems. Before I do that, I want to write very briefly and generically about how the vendors handle storage management, and how they commonly manage duplication of data from one location to another. By necessity, the following section will be very vanilla and not vendor-specific.

When I started writing the first edition of this book I had some very ambitious (perhaps outlandish) hopes that I would be able to cover the basic configuration of every storage vendor and explain how to get VMware’s SRM communicating with them. However, after a short time I recognized how unfeasible and unrealistic this ambition was! After all, this is a book about VMware’s SRM. Still, storage and replication (not just storage) is an absolute requirement for VMware’s SRM to function, so I would feel remiss if I did not at least outline some basic concepts and caveats for those for whom storage is not their daily meat and drink.

Caveat #1: All Storage Management Systems Are the Same

I know this is a very sweeping statement that my storage vendor friends would widely disagree with. But in essence, all storage management systems are the same; it’s just that storage vendors confuse the hell out of everyone (and me in particular) by using their own vendor-specific terms. The storage vendors have never gotten together and agreed on terms. So, what some vendors call a storage group, others call a device group and yet others call a volume group. Likewise, for some a volume is a LUN, but for others volumes are collections of LUNs.

Indeed, some storage vendors think the word LUN is some kind of dirty word, and storage teams will look at you like you are from Planet Zog if you use the word LUN. In short, download the documentation from your storage vendor, and immerse yourself in the company’s terms and language so that they become almost second nature to you. This will stop you from feeling confused, and will reduce the number of times you put your foot in inappropriate places when discussing data replication concerns with your storage guys.

Caveat #2: All Storage Vendors Sell Replication

All storage vendors sell replication. In fact, they may well support three different types, plus a fourth legacy type inherited from a previous development or acquisition—and, of course, each will have its own unique trademarked product name! Some vendors will not implement or support all their types of replication with VMware SRM; you may have a license for replication type A, but your vendor only supports types B, C, and D. This may force you to upgrade your licenses, firmware, and management systems to support type B, C, or D. Indeed, in some cases you may need a combination of features, forcing you to buy types B and C, or C and D. In fairness to the storage vendors, as SRM has matured many vendors have come to support all the different types of replication, mainly in response to competitors doing the same.

In a nutshell, it could cost you money to switch to the right type of replication. Alternatively, you might find that although the type of replication you have is supported, it isn’t the most efficient from an I/O or storage capacity perspective. A good example of this situation is with EMC’s CLARiiON systems. On the CLARiiON system you can use a replication technology called MirrorView. In 2008, MirrorView was supported by EMC with VMware’s SRM, but only in a synchronous mode, not in an asynchronous mode. However, by the end of 2008 this support changed. This was significant to EMC customers because of the practical limits imposed by synchronous replication. Although synchronous replication is highly desirable, it is frequently limited by the distance between the Protected and Recovery Sites. In short, the Recovery Site is perhaps too close to the Protected Site to be regarded as a true DR location. At the upper level, synchronous replication’s maximum distance is in the range of 400–450 kilometers (roughly 250–280 miles); however, in practice the real-world distances can be as small as 50–60 kilometers (31–37 miles). The upshot of this limitation is that without asynchronous replication it becomes increasingly difficult to class the Recovery Site as a genuine DR location. Distance is clearly relative; in the United States these limitations become especially significant, as recent hurricanes have demonstrated, but in my postage-stamp-sized country they are perhaps less pressing!
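To see why distance matters so much for synchronous replication, a rough back-of-the-envelope sketch helps. The figures below are illustrative assumptions, not vendor specifications: light in optical fiber travels at roughly 200,000 km/s, and every synchronous write must wait for a round trip to the remote array before it is acknowledged (real arrays add switching and protocol overhead on top of this).

```python
# Back-of-the-envelope propagation delay for synchronous replication.
# Assumption (illustrative only): light in fiber travels ~200,000 km/s,
# i.e. 200 km per millisecond, and each write waits for a full round trip.
FIBER_SPEED_KM_PER_MS = 200.0

def round_trip_ms(distance_km: float) -> float:
    """Return the raw round-trip propagation delay in milliseconds."""
    return 2 * distance_km / FIBER_SPEED_KM_PER_MS

for km in (60, 450):
    print(f"{km} km -> {round_trip_ms(km):.2f} ms added to every write")
```

Even before any equipment overhead, a 450 km link adds several milliseconds to every single write, which is why real-world synchronous deployments tend to sit at much shorter distances.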

If you’re looking for another example of these vendor-specific support differences, HP EVAs are supported with SRM; however, you must have licenses for HP’s Business Copy feature and its Continuous Access technology for this feature and technology to function properly. The Business Copy license is only used when snapshots are created while testing an SRM Recovery Plan. The Continuous Access license enables the replication of what HP rather confusingly calls vdisks in the storage groups.

Caveat #3: Read the Manual

Storage management systems have lots of “containers,” which hold other containers, and so on. This means the system can be managed very flexibly. You can think of this as being a bit like Microsoft’s rich and varied group structure options in Active Directory. Beware that sometimes this means storage replication is limited to a particular type of container or level. As a result, you or your storage team must determine very carefully how you will group your LUNs, to ensure that you only replicate what you need to and that your replication process doesn’t, in itself, cause corruption through mismatched replication schedules. Critically, some storage vendors have very specific requirements about the relationships among these various containers when used with VMware SRM. Additionally, some storage vendors impose naming requirements on these objects and their snapshots. If you deviate from these recommendations you might find that you can’t even get SRM to communicate with your storage correctly. In a nutshell, it’s the combination of the right type of replication and the right management structures that will make it work—and you can only know that by consulting the documentation provided by your storage vendor. In short, RTFM!

Now that we have these caveats in place, I want to map out the structures of how most storage vendors’ systems work, and then outline some storage planning considerations. I will initially use non-vendor-specific terms. Figure 1.1 is a diagram of a storage array that contains many drives.

Here is an explanation of the callouts in the figure.

A. This is the array you are using. Whether this is Fibre Channel, iSCSI, or NFS isn’t dreadfully important in this case.

B. This shows that even before allowing access, many storage vendors allow disks in the array to be grouped. For example, NetApp refers to this grouping as a disk aggregate, and this is often your first opportunity to set a default RAID level.

C. This is another group, referred to by some vendors as a storage group, device group, or volume group.

D. Within these groups we can have blocks of storage, and most vendors do call these LUNs. With some vendors they stop at this point, and replication is enabled at group type C indicated by arrow E. In this case every LUN within this group is replicated to the other array—and if this was incorrectly planned you might find LUNs that did not need replicating were being unnecessarily duplicated to the recovery location, wasting valuable bandwidth and space.


Figure 1.1 A storage array with many groupings

E. Many vendors allow LUNs/volumes to be replicated from one location to another, each with its own independent schedule. This offers complete flexibility, but there is the danger of inconsistencies occurring between the data sets.

F. Some storage vendors allow for another subgroup. These are sometimes referred to as recovery groups, protected groups, contingency groups, or consistency groups. In this case only the LUNs contained in group F are replicated to the other array; LUNs not included in subgroup F are not replicated. If you like, group C is the rule, but group F represents an exception to the rule.

G. This is a group of ESX hosts that are allowed access to either group C or group F, depending on what the array vendor supports. These ESX hosts will be added to group G by their Fibre Channel WWN, iSCSI IQN, or IP address/hostname. The vendors that develop an SRA (the software that allows SRM to communicate with the storage layer) to work with VMware’s SRM often have their own rules and regulations about the creation of these groupings; for instance, they may state that no group F can be a member of more than one group C at any time. Breaking such rules can result in the SRA failing to return all the LUNs expected back to the ESX hosts. Some vendors’ SRAs automatically allow the hosts to access the replicated LUNs/volumes at the Recovery Site array and others do not—in which case you may have to allocate these units of storage to the ESX hosts prior to doing any testing.

This grouping structure can have some important consequences. A good example is when you place virtual machines on multiple LUNs. VMware generally recommends this for performance reasons, as it allows different spindles and RAID levels to be adopted. But if it is incorrectly planned, you could cause corruption of the virtual machines.
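The planning check implied here can be sketched in a few lines. The sketch below is hypothetical—the LUN, group, and VM names are invented, and real inventories would come from your array and vSphere tooling—but it shows the rule being described: every virtual disk of a VM should sit in replication groups that share one schedule.

```python
# Hypothetical planning check (names are illustrative, not any vendor's API):
# flag VMs whose virtual disks land in replication groups with mismatched
# schedules, since those disks could be recovered out of sync with each other.
lun_to_group = {"LUN10": "GroupA", "LUN11": "GroupA", "LUN20": "GroupB"}
group_schedule_min = {"GroupA": 0, "GroupB": 15}  # 0 = synchronous

vm_disks = {
    "web01": ["LUN10", "LUN11"],  # both disks in GroupA -> consistent
    "sql01": ["LUN10", "LUN20"],  # disks span GroupA and GroupB -> at risk
}

def at_risk_vms(vm_disks, lun_to_group, schedules):
    """Return VMs whose disks are replicated on more than one schedule."""
    risky = []
    for vm, luns in vm_disks.items():
        intervals = {schedules[lun_to_group[lun]] for lun in luns}
        if len(intervals) > 1:
            risky.append(vm)
    return risky

print(at_risk_vms(vm_disks, lun_to_group, group_schedule_min))  # ['sql01']
```

A real environment would feed this kind of check from the array's management API and the vSphere inventory, but the underlying rule is the same.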

In this case, I have created a simple model where there is just one storage array at one location. Of course, in a large corporate environment it is very likely that multiple arrays exist, offering multiple tiers of disk with variable qualities of disk IOPS. Frequently, these arrays can themselves be grouped together to create a collection of arrays that is managed as one unit. A good example of this is the Dell EqualLogic Group Manager, in which the first array creates a group to which many arrays can be added. In the Dell EqualLogic SRA configuration the “Group IP” is used, as opposed to the specific IP address of a particular array.

In Figure 1.2 the two virtual disks that make up the virtual machine (SCSI 0:0 and SCSI 0:1) have been split across two LUNs in two different groups. The replication schedule for one group runs every 15 minutes, whereas the other replicates with no lag at all. In this case we could potentially end up with corruption of log files, date stamps, and file creation times, as the virtual machine’s operating system would not be recovered in the same state as the file data.

We can see another example of this in Figure 1.3 if you choose to use VMFS extents. As you may know, ESX has the capability to add space to a VMFS volume that is either running out of capacity or breaking through the 2TB limitation on the maximum size of a single VMFS volume. This is achieved by “spanning” a VMFS volume across multiple blocks of storage or LUNs. Although in ESX 5 the maximum size of a single VMFS version 5 volume has increased to 64TB, you might still have extents created by previous installations of ESX.

Here the problem is more insidious: the virtual machine’s files are in fact stored on two separate LUNs in two separate groups, but the impression from the vSphere client would be that the virtual machine is held on a single VMFS datastore. Unless you were looking very closely at the storage section of the vSphere client you might not notice that the virtual machine’s files were being spanned across two LUNs in two different groups. This wouldn’t just cause a problem with the virtual machine; more seriously, it would completely undermine the integrity of the VMFS extent. That said, VMFS extents are generally frowned upon by the VMware community at large, though they are occasionally used as a temporary band-aid to fix a problem in the short term. I would ask you this question: How often in IT does a band-aid remain in place weeks, months, or years beyond the time frame we originally agreed? However, I do recognize that some folks are given such small volume sizes by their storage teams that they have no option but to use extents in this manner, often because of quite harsh policies imposed by the storage team in an effort to save space. The reality is that if the storage admins only give you 50GB LUNs, you find yourself asking for ten of them to create a 500GB extent! If you do, then fair enough, but please give due diligence to making sure all the LUNs that comprise a VMFS extent are being replicated. A lack of awareness here could mean you create an extent that includes a LUN that isn’t even being replicated; the result would be a corrupted VMFS volume at the destination. My only message is to proceed with caution; otherwise, catastrophic situations could occur. Of course, if you are using the new VR technology this issue is significantly diminished, and indeed the complexity of having to use extents could be mitigated by adopting VR in this scenario.


Figure 1.2 A VM with multiple virtual disks (SCSI 0:0 and SCSI 0:1) stored on multiple datastores, each with a different replication frequency


Figure 1.3 In this scenario the two virtual disks are held on a VMFS extent.
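The due-diligence check for extents described above can also be sketched simply. Again, the datastore and LUN names below are invented for illustration; in practice the extent membership would come from the vSphere client or your array tooling.

```python
# Hypothetical sketch of the extent check described above: make sure every
# LUN backing a VMFS extent is in the replicated set. If even one LUN is
# missed, the whole spanned volume is corrupt at the destination.
extent_luns = {"datastore1": ["LUN1", "LUN2", "LUN3"]}
replicated = {"LUN1", "LUN2"}  # LUN3 was missed in the replication setup

def broken_extents(extent_luns, replicated):
    """Return datastores whose extents include an unreplicated LUN."""
    return [ds for ds, luns in extent_luns.items()
            if not set(luns) <= replicated]

print(broken_extents(extent_luns, replicated))  # ['datastore1']
```

Any datastore this flags would arrive at the Recovery Site as an unusable VMFS volume, which is precisely the failure mode the text warns about.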

Clearly, there will be times when you feel pulled in two directions. For ultimate flexibility, one group with one LUN allows you to control the replication cycles. If you take this strategy, though, beware of virtual machine files spanned across multiple LUNs and VMFS extents, because different replication cycles would cause corruption. Beware also that the people using vSphere—say, your average server guy who only knows how to make a new virtual machine—may have little awareness of the replication structure underneath. Alternatively, if you go for many LUNs contained in a single group, beware that this offers less flexibility; if you’re not careful you may include LUNs that do not need replicating, or limit your capacity to replicate at the frequency you need.

These storage management issues are going to be a tough nut to crack, because no one strategy will suit everyone. But I imagine some organizations could have three groups designed with replication in mind: one might use synchronous replication, and the other two might replicate at intervals of 30 minutes and 60 minutes; the frequency depends greatly on your recovery point objectives. Such an organization would then create virtual machines on the VMFS volumes that are being replicated with the right frequency for their recovery needs. I think enforcing this strategy would be tricky: How would our virtual machine administrators know the correct VMFS volumes on which to create the virtual machines? Fortunately, in vSphere we are able to create folders that contain volumes, and set permissions on those folders. In this way it is possible to guide the people who create virtual machines to store them in the correct locations.

One method would be to create storage groups in the array management software that map to different virtual machines and their functionality, with the VMFS volume names reflecting those different purposes. Additionally, in VMware SRM we can create Protection Groups that map directly to these VMFS volumes and their storage groups in the array. The simple diagram in Figure 1.4 illustrates this proposed approach.

In this case, I could have two Protection Groups in VMware SRM: one for the boot/data VMFS volumes for Exchange, and one for the boot/data VMFS volumes for SQL. This would also allow for three types of SRM Recovery Plan: a Recovery Plan to fail over just Exchange, a Recovery Plan to fail over just SQL, and a Recovery Plan to fail over all the virtual machines.


Figure 1.4 Protection Groups approach
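The layout in Figure 1.4 can be expressed as a simple mapping. The sketch below is purely illustrative—the names are invented, and SRM itself is of course configured through its own interfaces, not like this—but it captures the relationship between datastores, Protection Groups, and Recovery Plans described above.

```python
# Illustrative model of the Protection Group / Recovery Plan layout
# (all names are invented for the sketch).
protection_groups = {
    "PG-Exchange": ["VMFS-Exchange-Boot", "VMFS-Exchange-Data"],
    "PG-SQL": ["VMFS-SQL-Boot", "VMFS-SQL-Data"],
}

recovery_plans = {
    "RP-Exchange-Only": ["PG-Exchange"],
    "RP-SQL-Only": ["PG-SQL"],
    "RP-Everything": ["PG-Exchange", "PG-SQL"],
}

def datastores_in_plan(plan, recovery_plans, protection_groups):
    """List every VMFS datastore a given Recovery Plan would fail over."""
    return [ds for pg in recovery_plans[plan] for ds in protection_groups[pg]]

print(datastores_in_plan("RP-Everything", recovery_plans, protection_groups))
```

The point of the model is that each Recovery Plan simply aggregates Protection Groups, so the failover scope of any plan falls directly out of the datastore-to-group mapping.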


Well, that’s it for this brief introduction to SRM. Before we dive into SRM itself, I want to spend the next four chapters looking at the configuration of this very same storage layer, to make sure it is fit for use with the SRM product. I will cover the vendors alphabetically (EMC, HP, NetApp) to avoid being accused of vendor bias. In time I hope that other vendors will step forward to add additional PDFs covering the configuration of their storage systems too. Please don’t see these chapters as utterly definitive guides to these storage vendors’ systems. This is an SRM book, after all, and the emphasis is squarely on SRM. If you are comfortable with your particular storage vendor’s replication technologies you could bypass the next few chapters and head directly to Chapter 6, Getting Started with NetApp SnapMirror. Alternatively, you could jump to the chapter that reflects your storage array and then head off to Chapter 6. I don’t expect you to read the next four chapters unless you’re a consultant who needs to be familiar with as many different types of replication as possible, or you’re a masochist. (With that said, some folks say that being a consultant and being a masochist are much the same thing…)