Configuring the Protected Site

Originating Author

Michelle Laverick

Configuring the Protected Site

Version: vCenter SRM 5.0

Now that the core SRM product is installed, it's possible to progress through the post-configuration stages. Each stage depends on the previous one being completed correctly, so be careful about making changes once the components have been interlinked. Essentially, the post-configuration stages constitute a "workflow." The first step is to pair the two sites together, which creates a relationship between the Protected Site (NYC) and the Recovery Site (NJ). Then we can create inventory mappings that enable the administrator to build relationships between the folders, resource pools or clusters, and networks of the Protected Site and the Recovery Site. These inventory mappings ensure that VMs are recovered to the correct location in the vCenter environment. At that point, it is possible to configure the array managers. At this stage you make the sites aware of the identities of your storage systems at both locations; SRM will interrogate the arrays and discover which datastores have been marked for replication. The last two main stages are to create Protection Groups and to create Recovery Plans. You cannot create Recovery Plans without first creating Protection Groups which, as their name implies, point to the datastores that you have configured for replication. The Protection Groups use the inventory mappings to determine the location of what VMware calls "placeholder VMs." These placeholder VMs are used in Recovery Plans to indicate when and where VMs should be recovered, and they allow for advanced features such as VM dependencies and scripting callouts. I will be going through each step in detail, walking you through the configuration all the way, so that by the end of the chapter you should really understand what each stage entails and why it must be completed.

Connecting the Protected and Recovery Site SRMs

One of the main tasks carried out in the first configuration of SRM is to connect the Protected Site SRM to the Recovery Site SRM. It’s at this point that you configure a relationship between the two, and really this is the first time you indicate which is the Protected Site and which is the Recovery Site. It’s a convention that you start this pairing process at the Protected Site. The reality is that the pairing creates a two-way relationship between the locations anyway, and it really doesn’t matter from which site you do this. But for my own sanity, I’ve always started the process from the protected location.

When doing this first configuration, I prefer to have two vSphere client windows open: one on the protected vCenter and the other on the recovery vCenter. This way, I get to monitor both parts of the pairing process. I did this often in my early use of SRM so that I could see in real time the effect of changes in the Protected Site on the Recovery Site. Of course, you can simplify things greatly by using the linked mode feature in vSphere. Admittedly, because SRM's new views show both the Recovery and Protected Sites at the same time, the benefits of linked mode for SRM itself are somewhat limited; however, I think linked mode can still be useful for your general administration. For the moment, I'm keeping the two vCenters separate so that it's 100% clear that one is the Protected Site and the other is the Recovery Site (see Figure 9.1).

As you might suspect, this pairing process clearly means the Protected Site SRM and Recovery Site SRM will need to communicate with each other to share information. It is possible to have the same IP range used at two different geographical locations; this networking concept is called "stretched VLANs." Stretched VLANs can greatly simplify the pairing process, as well as the networking of virtual machines when you run tests or invoke your Recovery Plans. If you have never heard of stretched VLANs, it's well worth brushing up on them and considering their usage to facilitate DR/BC. The stretched VLAN configuration, as we will see later, can actually ease the administrative burden when running test plans or invoking DR for real. Other methods of simplifying communications, especially when testing and running Recovery Plans, include the use of network address translation (NAT) systems or modifying the routing configuration between the two locations. These approaches can remove the need to re-IP the virtual machines as they boot in the DR location. We will look at this in more detail in subsequent chapters.

Configuring-protected-site- (01).jpg

Figure 9.1 The Protected Site (New York) is on the left; the Recovery Site (New Jersey) is on the right.

This pairing process is sometimes referred to as "establishing reciprocity." In the first release of SRM the pairing process was strictly one-to-one; the structure of SRM 1.0 prevented hub-and-spoke configurations where one site is paired to many sites, as well as many-to-many pairing relationships. Back in SRM 4.0, VMware introduced support for a shared-site configuration where one DR location can provide resources for many Protected Sites. However, in these early stages I want to keep with the two-site configuration.

Installing the SRM and vCenter software on the same instance of Windows can save you a Windows license. However, some people might consider this approach to increase their dependence on vCenter as the management system. If you like, there is a worry or anxiety about creating an "all-eggs-in-one-basket" scenario. If you follow this rationale to its logical extreme, your management server will have many jobs to do, such as acting as the

• vCenter server

• Web access server

• Converter server

• Update Manager server

My main point, really, is that if the pairing process fails, it probably has more to do with IP communication, DNS name resolution, and firewalls than anything else. IP visibility from the Protected to the Recovery Site is required to set up SRM.
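Since these are the usual culprits, it can be worth a quick scripted check from the SRM server before you even open the Configure Connection Wizard. Below is a minimal Python sketch, assuming a hypothetical remote hostname; the port list is also an assumption (8095 is the port commonly documented for SRM's SOAP traffic), so verify it against your own installation documentation.

```python
# Pre-pairing sanity check from the SRM server: DNS resolution plus TCP
# reachability. The hostname and ports below are placeholders/assumptions.
import socket

REMOTE_HOSTS = ["vcnj.corp.com"]   # hypothetical Recovery Site vCenter/SRM
PORTS = [80, 443, 8095]            # 8095: commonly documented SRM SOAP port; verify

for host in REMOTE_HOSTS:
    try:
        ip = socket.gethostbyname(host)
        print(f"{host} resolves to {ip}")
    except socket.gaierror as err:
        print(f"DNS lookup failed for {host}: {err}")
        continue
    for port in PORTS:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(3)
            state = "open" if sock.connect_ex((ip, port)) == 0 else "blocked/closed"
            print(f"  {host}:{port} -> {state}")
```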

Personally, I always recommend dedicated Windows instances for the SRM role, and in these days of Microsoft licensing allowing multiple instances of Enterprise and Datacenter Editions on the same hypervisor, the cost savings are not as great as they once were.

When connecting the sites together you always log in to the Protected Site and connect it to the Recovery Site. This starting order dictates the relationship between the two SRM servers.

1. Log in with the vSphere client to the vCenter server for the Protected Site SRM (New York).

2. In the Sites pane, click the Configure Connection button shown in Figure 9.2. Alternatively, if you still have the Getting Started tab available, click the Configure Connection link.

Configuring-protected-site- (02).jpg

Figure 9.2 The status of the New York Site is “not paired” until the Configure Connection Wizard is run.

Notice how the site is marked as being “local,” since we logged in to it directly as though we are physically located at the New York location. If I had logged in to the New Jersey site directly it would be earmarked as local instead.

3. In the Configure Connection dialog box enter the name of the vCenter for the Recovery Site, as shown in Figure 9.3.

When you enter the vCenter hostname, use lowercase letters; the vCenter hostname must be entered exactly the same way during pairing as it was during installation (for example, either fully qualified in all cases or not fully qualified in all cases). Additionally, although you can use either a name or an IP address during the pairing process, be consistent: Don't mix IP addresses and FQDNs, as this only confuses SRM. As we saw earlier during the installation, despite entering port 80 to connect to the vCenter system, communication actually takes place on port 443.
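If you want a belt-and-braces way to check your consistency before typing, a small script can normalize both forms of the name and compare them. This is purely illustrative; the hostnames are made up, and the check simply reflects whatever your DNS returns.

```python
# Illustrative check: normalize the name used at install time and the name
# you're about to type into the pairing dialog, and warn if they differ.
import socket

def normalized(name: str) -> str:
    """Lowercase the name and expand it to an FQDN where DNS allows."""
    return socket.getfqdn(name.strip()).lower()

installed_as = "VCNJ.CORP.COM"   # hypothetical: value entered at installation
pairing_as = "vcnj"              # hypothetical: value about to be entered now

if normalized(installed_as) != normalized(pairing_as):
    print("Warning: the names differ once normalized; pairing may fail.")
else:
    print("Names agree:", normalized(pairing_as))
```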

Configuring-protected-site- (03).jpg

Figure 9.3 Despite the use of port 80 in the dialog box, all communication is redirected to port 443.

Again, if you are using the untrusted auto-generated certificates that come with a default installation of vCenter you will receive a certificate security warning dialog box, as shown in Figure 9.4. The statement "Remote server certificate has error(s)" is largely an indication that the certificate is auto-generated and untrusted. It doesn't indicate a fault in the certificate itself; rather, it reflects the certificate's status.

4. Specify the username and password for the vCenter server at the Recovery Site.

Again, if you are using the untrusted auto-generated certificates that come with a default installation of SRM you will receive a certificate security warning dialog box. This second certificate warning is to validate the SRM certificate, and is very similar to the previous dialog box for validating the vCenter certificate of the Recovery Site. So, although these two dialog boxes look similar, they are issuing warnings regarding completely different servers: the vCenter server and the SRM server of the Recovery Site. Authentication between sites can be difficult if the Protected and Recovery Sites are in different domains and there is no trust relationship between them. In my case, I opted for a single domain that spanned both the Protected and Recovery Sites.

5. At this point the SRM wizard will attempt to pair the sites. The Complete Connections dialog box will show you the progress of this task, as will the Recent Tasks pane of the Protected Site vCenter, as shown in Figure 9.5.

6. At the end of the process you will be prompted to authenticate the vSphere client against the remote (Recovery) site. If you have two vSphere clients open at the same time on both the Protected and Recovery Sites you will receive two login dialog prompts, one for each SRM server. Notice how in the dialog box shown in Figure 9.6 I'm using the full NT domain-style login of DOMAIN\Username. This dialog box appears each time you load the vSphere client and select the SRM icon.

Configuring-protected-site- (04).jpg

Figure 9.4 Dialog box indicating there is an error with the remote server certificate

Configuring-protected-site- (05).jpg

Figure 9.5 Pairing the sites (a.k.a. establishing reciprocity)

At the end of this first stage you should check that the two sites are flagged as being connected for both the local site and the paired site, as shown in Figure 9.7.

Additionally, under the Commands pane on the right-hand side you will see that the Break Connection link is the reverse of the pairing process. It’s hard to think of a use case for this option. But I guess you may at a later stage unpair two sites and create a different relationship. In an extreme case, if you had a real disaster the original Protected Site might be irretrievably lost. In this case, you would have no option but to seek a different site to maintain your DR planning. Also in the Commands pane you will find the option to export your system logs. These can be invaluable when it comes to troubleshooting, and you’ll need them should you raise an SR with VMware Support. As you can see, SRM has a new interface, and even with vCenter linked mode available this new UI should reduce the amount of time you spend toggling between the Protected and Recovery Sites. Indeed, for the most part I only keep my vCenters separated in this early stage when I am carrying out customer demonstrations; it helps to keep the customer clear on the two different locations.

Configuring-protected-site- (06).jpg

Figure 9.6 Entering login credentials for the Recovery Site vCenter

Configuring-protected-site- (07).jpg

Figure 9.7 The sites are connected and paired together; notice how communication to the vCenter in the Recovery Site used port 443.

From this point onward, whenever you load the vSphere client for the first time and click the Site Recovery Manager icon you will be prompted for a username and password for the remote vCenter. The same dialog box appears on the Recovery Site SRM. Although the vSphere client has the ability to pass through your user credentials from your domain logon, this currently is not supported for SRM, mainly because you should use totally different credentials at the Recovery Site anyway. For most organizations this would be a standard practice—two different vCenters need two different administration stacks to prevent the breach of one vCenter leading to a breach of all others.

Configuring Inventory Mappings

The next stage in the configuration is to configure inventory mappings. This involves mapping the resources (clusters and resource pools), folders, and networks of the Protected Site to the Recovery Site. Ostensibly, this happens because we have two separate vCenter installations that are not linked by a common data source. This is true despite the use of linked mode in vSphere. The only things that are shared between two or more vCenters in linked mode are licensing, roles, and the search functionality. The remainder of the vCenter metadata (datacenters, clusters, folders, and resource pools) is still locked inside the vCenter database driven by Microsoft SQL, Oracle, or IBM DB2.

When your Recovery Plan is invoked for testing or for real, the SRM server at the Recovery Site needs to know your preferences for bringing your replicated VMs online. Although the recovery location has the virtual machine files by virtue of third-party replication software, the metadata that comprises the vCenter inventory is not replicated. It is up to the SRM administrator to decide how this “soft” vCenter data is handled. The SRM administrator needs to be able to indicate what resource pools, networks, and folders the replicated VMs will use. This means that when VMs are recovered they are brought online in the correct location and function correctly. Specifically, the important issue is network mappings. If you don’t get this right, the VMs that are powered on at the Recovery Site might not be accessible across the network.

Although this "global default" mapping process is optional, the reality is that you will use it. If you wish, you can manually map each individual VM to the appropriate resource pool, folder, and network when you create Protection Groups; the Inventory Mappings Wizard merely speeds up this process and allows you to set your default preferences. Manually configuring the network, folder, and resource pool for each virtual machine at the Recovery Site would be very burdensome in a location with even a few hundred virtual machines. Later in this book we will look at these per-virtual-machine inventory mappings as a way to deal with virtual machines that have unique settings. In a nutshell, think of inventory mappings as a way to manage virtual machine settings as though they were groups, and the per-VM methods as though you were managing them as individual users.
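To make the groups-versus-users analogy concrete, here is a small, purely illustrative Python model of the idea: site-wide default mappings that every protected VM inherits, with per-VM overrides layered on top for the exceptions. None of these names come from the SRM API; they are invented for the sketch.

```python
# Purely illustrative model of inventory mappings: site-wide defaults
# (the "group policy") plus optional per-VM overrides (the "users").
default_mappings = {
    "resource_pool": {"NYC\\DB": "NYC_DR\\DB", "NYC\\Web": "NYC_DR\\Web"},
    "folder": {"NYC VMs": "NYC_DR VMs"},
    "network": {"NYC-101": "NJ-101"},
}

per_vm_overrides = {
    "special-app01": {"network": "NJ-Isolated"},  # one VM with unique needs
}

def recovery_placement(vm_name, rpool, folder, network):
    """Resolve where a protected VM would land at the Recovery Site."""
    placement = {
        "resource_pool": default_mappings["resource_pool"].get(rpool),
        "folder": default_mappings["folder"].get(folder),
        "network": default_mappings["network"].get(network),
    }
    placement.update(per_vm_overrides.get(vm_name, {}))  # overrides win
    return placement

print(recovery_placement("web01", "NYC\\Web", "NYC VMs", "NYC-101"))
print(recovery_placement("special-app01", "NYC\\Web", "NYC VMs", "NYC-101"))
```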

It is perfectly acceptable for certain objects in the inventory mappings to have no mapping at all. After all, there may be resource pools, folders, and networks that do not need to be included in your Recovery Plan. So, some things do not need to be mapped to the Recovery Site, just like not every LUN/volume in the Protected Site needs replicating to the Recovery Site. For example, test and development virtual machines might not be replicated at all, and therefore the inventory objects that are used to manage them are not configured. Similarly, you may have “local” virtual machines that do not need to be configured; a good example might be that your vCenter and its SQL instance may be virtualized. By definition, these “infrastructure” virtual machines are not replicated at the Recovery Site because you already have duplicates of them there; that’s part of the architecture of SRM, after all. Other “local” or site-specific services may include such systems as anti-virus, DNS, DHCP, Proxy, Print, and, depending on your directory services structure, Active Directory domain controllers. Lastly, you may have virtual machines that provide deployment services—in my case, the UDA—that do not need to be replicated at the Recovery Site as they are not business-critical, although I think you would need to consider how dependent you are on these ancillary virtual machines for your day-to-day operations. In previous releases, such objects that were not included in the inventory mapping would have the label “None Selected” to indicate that no mapping had been configured. In this new release, VMware has dispensed with this label. Remember, at this stage we are not indicating which VMs will be included in our recovery procedure. This is done at a later stage when we create SRM Protection Groups. Let me remind you (again) of my folder, resource pool, and network structures (see Figure 9.8, Figure 9.9, and Figure 9.10).

Configuring-protected-site- (08).jpg

Figure 9.8 My vSwitch configuration at the Protected and Recovery Sites

Configuring-protected-site- (09).jpg

Figure 9.9 My resource pool configuration at the Protected and Recovery Sites

Configuring-protected-site- (10).jpg

Figure 9.10 My VM folder configuration at the Protected and Recovery Sites

The arrows represent how I will be “mapping” these resources from the Protected Site to the Recovery Site. SRM uses the term resource mapping to refer to clusters of ESX hosts and the resource pools within.

Finally, it’s worth mentioning that these inventory mappings are used during the reprotect and failback processes. After all, if VMs have been failed over to specific folders, resource pools, and networks, when a failback occurs, those VMs must be returned to their original locations at the Protected Site. No special configuration is required to achieve this—the same inventory mappings used to move VMs from the Protected to the Recovery Site are used when the direction is reversed.

Configuring Resource Mappings

To configure resource mappings, follow these steps.

1. Log on with the vSphere client to the Protected Site’s vCenter.

2. Click the Site Recovery icon.

3. Select the Protected Site (in my case, this is New York), and then select the Resource Mapping tab.

4. Double-click your resource pool or the cluster you wish to map, or click the Configure Mapping link as shown in Figure 9.11.

Configuring-protected-site- (11).jpg

Figure 9.11 In the foreground is the Mapping for DB dialog box where the resource pool in New York is mapped to the NYC_DR\DB resource pool in New Jersey.

Notice how the “Mapping for…” dialog box also now includes the new option to create a new resource pool if it’s needed. Remember that the use of resource pools is by no means mandatory. You can run all your VMs from the DRS-enabled cluster, if you prefer. Once you understand the principle of inventory mappings this becomes a somewhat tedious but important task of mapping the correct Protected Site vCenter objects to the Recovery Site vCenter objects.

Configuring Folder Mappings

In my early days of using SRM, I used to take all the VMs from the Protected Site and dump them into one folder called "Recovery VMs" on the Recovery Site's vCenter. I soon discovered how limiting this would be in a failback scenario. I now recommend duplicating the folder and resource pool structure at the Recovery Site so that it matches the Protected Site. This offers more control and flexibility, especially when you begin the failback process. I would avoid the casual and cavalier attitude of dumping virtual machines into a flat-level folder.

As you can see in Figure 9.12, I have not bothered to map every folder in the Protected Site to every other folder in the Recovery Site. I’ve decided I will never be using SRM to failover and failback VMs in the Infrastructure or Test & Dev VM folder. There’s little point in creating a mapping if I have no intention of using SRM with these particular VMs.

Configuring-protected-site- (12).jpg

Figure 9.12 My folder inventory mappings. Only the folders and resource pools that SRM will need in order to protect the VMs must be mapped.

Configuring Network Mappings

By default, when you run a test Recovery Plan the Recovery Site SRM will auto-magically put the replicated VMs into a bubble network that isolates them from the wider network using an internal vSwitch. This prevents possible IP and NetBIOS conflicts in Windows. Try to think of this bubble network as a safety valve that allows you to test plans with a guarantee that you will generate no conflicts between the Protected Site and the Recovery Site. So, by default, the network mappings you set here are only used in the event of triggering your Recovery Plan for real. If I mapped this "production" network to the "internal" switch, no users would be able to connect to the recovered VMs. Notice in Figure 9.13 how I am not mapping the VM Network or Virtual Storage Appliance port group to the Recovery Site. This is because the VMs that reside on those networks deliver local infrastructure resources that I do not intend to include in my Recovery Plan.

Networking and DR can be more involved than you might first think, and much depends on how your network is set up. When you start powering on VMs at the Recovery Site they may be on totally different networks, requiring different IP addresses and DNS updates to allow for user connectivity. The good news is that SRM can control and automate this process. One very easy way to simplify this for SRM is to implement stretched VLANs, where two geographically different locations appear to be on the same VLAN/subnet. However, you may not have the authority to implement this, and unless it is already in place it is a major change to your physical switch configuration, to say the least. It's worth making it clear that even if you do implement stretched VLANs you may still have to create inventory mappings because of port group differences. For example, there may be a VLAN 101 in New York and a VLAN 101 in New Jersey. But if the administrative team in New York calls their port groups on a virtual switch "NYC-101" and the team in New Jersey calls theirs "NJ-101," you would still need a port group mapping in the Inventory Mappings tab.
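If both sites follow a predictable naming convention like this, you can sanity-check your proposed port group mappings before entering them in the UI. The sketch below pairs port groups by their trailing VLAN number; the names and the matching rule are assumptions about your convention, not anything SRM does itself.

```python
# Hypothetical convention-checker: pair port groups across sites by their
# trailing VLAN number. All names here are invented for illustration.
import re

protected_pgs = ["NYC-101", "NYC-102", "NYC-VSA"]  # made-up Protected Site names
recovery_pgs = ["NJ-101", "NJ-102", "NJ-Mgmt"]     # made-up Recovery Site names

def vlan_suffix(pg_name):
    match = re.search(r"(\d+)$", pg_name)
    return match.group(1) if match else None

recovery_by_vlan = {vlan_suffix(pg): pg for pg in recovery_pgs if vlan_suffix(pg)}

for pg in protected_pgs:
    target = recovery_by_vlan.get(vlan_suffix(pg))
    print(f"{pg} -> {target or 'no match: map manually or leave unmapped'}")
```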

Configuring-protected-site- (13).jpg

Figure 9.13 Map only the port groups that you plan to use in your Recovery Plan.

Configuring-protected-site- (14).jpg

Figure 9.14 Network mappings can include different switch types if needed.

Finally, in my experience it is possible to map between the two virtual switch types, Distributed and Standard vSwitches (see Figure 9.14). This does allow you to run a lower-level SKU of the vSphere 5 product in the DR location; for example, you could be using Enterprise Plus in the Protected Site and the Advanced version of vSphere 5 in the Recovery Site. People might be tempted to do this to save money on licensing. However, I think it is fraught with unexpected consequences, and I do not recommend it. For example, an eight-way VM licensed for Enterprise Plus in the Protected Site would not start in the Recovery Site, and a version of vSphere 5 that doesn't support DRS clustering and the initial placement feature would mean having to map specific VMs to specific ESX hosts. So you certainly can map DvSwitches to SvSwitches, and vice versa; to SRM, port groups are just labels and it just doesn't care. But remember, if a VM is mapped from a DvSwitch to an SvSwitch it may lose functionality that only the DvSwitch can provide.

Assigning Placeholder Datastores

As we will see later in this chapter, an important part of the wizard used for creating Protection Groups is selecting a destination for placeholders at the Recovery Site. This is a VMFS or NFS volume at the recovery location. When you create a Protection Group at the production site, SRM creates a VMX file and the other smaller files that make up the virtual machine on the placeholder datastore selected in the wizard at the Recovery Site. It then preregisters these placeholder VMX files to an ESX host at the Recovery Site. This registration process also allocates the virtual machine to the default resource pool, network, and folder as set in the inventory mappings section. Remember, your real virtual machines are being replicated to a LUN/volume on the storage array at the Recovery Site. You can treat these placeholders as ancillary VMs used just to complete the registration process required to get the virtual machine listed in the Recovery Site's vCenter inventory. Without the placeholder VMs, there would be no object to select when you create Recovery Plans.

If you think about it, although we are replicating our virtual machines from the Protected Site to the Recovery Site, the VMX file does contain site-specific information, especially in terms of networking. The VLAN and IP address used at the recovery location could differ markedly from the protected location. If we just used the VMX as it was in the replicated volume, some of its settings would be invalid (port group name and VLAN, for example), but others would not change (amount of memory and CPUs).

The main purpose of the placeholder VMX files is to help you see, visually, where in the vCenter inventory your virtual machines will reside prior to executing the Recovery Plan. This allows you to confirm up front whether your inventory mappings are correct. If a virtual machine does not appear at the Recovery Site, it's a clear indication that it is not protected. It would have been possible for VMware to create the virtual machines at the Recovery Site at the point of testing the Recovery Plan, but doing it this way gives the operator an opportunity to fix problems before even testing a Recovery Plan.
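To show what this preregistration amounts to under the covers, the following sketch uses pyVmomi (VMware's Python SDK for the vSphere API) to register a VMX file from a placeholder datastore into the Recovery Site inventory. To be clear, SRM performs this step for you; this is only to illustrate the mechanics, and the connection details, file path, and single-datacenter lookups here are assumptions for a simple lab.

```python
# Illustration only: SRM does this registration itself. A pyVmomi sketch of
# registering a placeholder VMX; all names and paths are hypothetical.
import ssl
from pyVim.connect import SmartConnect, Disconnect

ctx = ssl._create_unverified_context()  # lab-only: accept self-signed certs
si = SmartConnect(host="vcnj.corp.com", user="corp\\administrator",
                  pwd="password", sslContext=ctx)
try:
    content = si.RetrieveContent()
    dc = content.rootFolder.childEntity[0]    # assumes a single datacenter
    cluster = dc.hostFolder.childEntity[0]    # assumes a single cluster
    # A placeholder is just a VMX plus a few small ancillary files on the
    # placeholder datastore -- it contains no virtual disks.
    task = dc.vmFolder.RegisterVM_Task(
        path="[SRM_Placeholders] web01/web01.vmx",  # hypothetical path
        name="web01",
        asTemplate=False,
        pool=cluster.resourcePool,  # SRM would use the mapped resource pool
        host=None)                  # let vCenter pick a host in the cluster
    print("Registration task submitted:", task)
finally:
    Disconnect(si)
```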

So, before you begin configuring the array manager or Protection Groups, you should create a small, 5–10GB volume on the storage array of your choice and present it to all the ESX hosts that will perform DR functions. For example, on my EMC NS-120 array I created a 5GB LUN visible to my Recovery Site ESX hosts (esx3/esx4) using EMC's Virtual Storage Console, formatted it with VMFS, and gave it a friendly volume name of SRM_Placeholders. It's a good practice to keep the placeholder datastores relatively small, distinct, and well named to stop people from storing real VMs on them. If you wish, you could use datastore folders together with permissions to stop this from happening.

It’s worth stating that if you ever want to run your Recovery Plan (failover) for real, either for planned migration or for disaster recovery, you would need a placeholder datastore at the Protected Site as well for returning to the production location as part of any reprotect and automated failback procedure. This has important consequences if you want to easily use the new automatic failback process or reprotect features. I’d go so far as to say that you might as well create a placeholder volume at both locations at the very beginning.

This placeholder datastore needs to be presented to every ESX host in the cluster that would act as a recovery host in the event of DR. The datastore could be used across clusters if you so wished, so long as it was presented to all the hosts in the site that need access to it. For me, each cluster represents an allocation of memory, CPU, network, and storage. In my case, I created placeholder datastores at New Jersey used in the process of protecting VMs in New York, and similarly I created placeholder datastores at New York used in the process of protecting VMs in New Jersey. In most cases you will really need only one placeholder datastore per cluster. As I knew at some stage I would need to carry out failover and failback processes in SRM, it made sense to set up these placeholder datastores at this stage, as shown in Figure 9.15.
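You can verify this presentation yourself rather than waiting for SRM to complain. Assuming pyVmomi and a simple inventory layout (the names and credentials below are placeholders), this sketch reports which hosts in each cluster can see the placeholder datastore.

```python
# Verify the placeholder datastore is presented to every host in each
# recovery cluster. Connection details and names are assumptions.
import ssl
from pyVim.connect import SmartConnect, Disconnect

ctx = ssl._create_unverified_context()  # lab-only: accept self-signed certs
si = SmartConnect(host="vcnj.corp.com", user="corp\\administrator",
                  pwd="password", sslContext=ctx)
try:
    dc = si.RetrieveContent().rootFolder.childEntity[0]  # assumes one datacenter
    for cluster in dc.hostFolder.childEntity:            # assumes clusters only
        for esx in cluster.host:
            visible = [ds.name for ds in esx.datastore]
            status = "sees" if "SRM_Placeholders" in visible else "MISSING"
            print(f"{cluster.name}/{esx.name}: {status} SRM_Placeholders")
finally:
    Disconnect(si)
```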

Remember, the smallest VMFS volume you can create is 1.2GB. If the volume is any smaller than this you will not be able to format it. The placeholder files do not consume much space, so small volumes should be sufficient, although you may wish to leverage your storage vendor’s thin-provisioning features so that you don’t unnecessarily waste space—but hey, what’s a couple of gigabytes in the grand scheme of things compared to the storage footprint of the VMs themselves? On NFS you may be able to have a smaller size for your placeholder datastore; much depends on the array—for example, the smallest volume size on my NetApp FAS2040 is 20GB.

Configuring-protected-site- (15).jpg

Figure 9.15 Placeholder datastores should be on nonreplicated datastores available to all hosts in the datacenter or clusters where SRM is in use.

It really doesn’t matter what type of datastore you select for the placeholder VMX file. You can even use local storage; remember, only temporary files are used in the SRM process. However, local storage is perhaps not a very wise choice. If that ESX host goes down, is in maintenance mode, or is in a disconnected state, SRM would not be able to access the placeholder files while executing a Recovery Plan. It would be much better to use storage that is shared among the ESX hosts in the Recovery Site. If one of your ESX hosts was unable to access the shared storage location for placeholder files, it would merely be skipped, and no placeholder VMs would be registered on it. The size of the datastore does not have to be large; the placeholder files are the smaller files that make up a virtual machine, they do not contain virtual disks.

But you might find it useful to either remember where they are located, or set up a dedicated place to store them, rather than mixing them up with real virtual machine files. It is a good practice to use folder and resource pool names that reflect that these placeholder virtual machines are not "real." In my case, the parent folder and resource pool are called "NYC_DR" at the New Jersey Recovery Site. Once the placeholder datastore has been created, you can configure SRM at the Protected Site and use it to create the "shadow" VMs in the inventory.

1. In SRM, select the Protected Site; in my case, this is the New York site.

2. Select the Placeholder Datastores tab (see Figure 9.16).

3. Click the Configure Placeholder Datastore link.

4. In the subsequent dialog box, select the datastore(s) you created.

Configuring-protected-site- (16).jpg

Figure 9.16 The Placeholder Datastores tab

The dialog box in Figure 9.16 does allow you to add multiple placeholder datastores for each cluster that you have. The choice is yours: one placeholder datastore for all your clusters, or one placeholder datastore for each cluster in vCenter. Your choice will very much depend on your storage layer and the policies within your organization. For example, if you are using IP-based storage it will be very easy to present an iSCSI or NFS volume across many VMware clusters. If you're using Fibre Channel, this could involve some serious work with zoning and masking at the switch and storage management layer. It may be your storage team's policy that each ESX host in a VMware cluster represents a block of storage or a "pod" that cannot be presented to other hosts outside the cluster.

If you look closely at the screen grab you can see that from the New York Site (Local), I am browsing the datastores in the New Jersey vCenter. From there I can locate the datastore I called "NYC_SRM_Placeholders" as the location for the placeholder files. I configured a similar setup at the New Jersey location to facilitate the new automatic failback and reprotect features in SRM.

Configuring Array Managers: An Introduction

The next essential part of SRM post-configuration involves enabling the array manager piece of the product. The array manager is often just a graphical front end for supplying variables to the SRA. Of course, I'm assuming you have a storage array that is supported for use with SRM. It may be that you don't, and you would prefer to use VMware's vSphere Replication (VR) instead. If that's the case, I recommend turning to Chapter 10, Recovery Site Configuration, where I cover VR configuration.

If you do have a storage array, it's in the Array Manager pane that you inform SRM what engine you are using to replicate your virtual machines from the Protected Site to the Recovery Site. In this process, the SRA interrogates the array to discover which LUNs are being replicated, and enables the Recovery Site SRM to "mirror" your virtual machines to the recovery array. You must configure each array at the Protected Site that will take part in the replication of virtual machines. If a new array is added at a later stage it must be configured here. The array manager will not show every LUN/volume replicated on the storage array—just the ones used by your ESX hosts. The SRA works this out by looking at the files that make up the VM and only reporting LUNs/volumes which are in use by VMs on ESX hosts. This is why it's useful, once you have set up the replication part of the puzzle, to populate LUNs/volumes with VMs.

Clearly, the configuration of each array manager will vary from one vendor to the next. As much as I would like to be vendor-neutral at all times, it’s not possible for me to validate every array manager configuration because that would be cost- and time-prohibitive.

However, if you look closely at the screen grabs for each SRA that I’ve included in this book you can see that they all share two main points. First, you must provide an IP address or URL to communicate with the storage array, and second, you must provide user credentials to authenticate with it. Most SRAs will have two fields for two IP addresses; this is usually for the first and second storage controllers which offer redundant connections into the array, whether it is based on Fibre Channel, iSCSI, or NFS. Sometimes you will be asked to provide a single IP address because your storage vendor has assumed that you have teamed your NIC interfaces together for load balancing and network redundancy. Different vendors label these storage controllers differently, so if you’re familiar with NetApp perhaps the term storage heads is what you are used to, or if it’s EMC CLARiiON you use the term storage processor. Clearly, for the SRA to work there must be a configured IP address for these storage controllers and it must be accessible to the SRM server.
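Given that requirement, a quick reachability test from the SRM server against both controllers can save a failed attempt at the array manager wizard. This sketch reuses the storage processor addresses from this chapter's CLARiiON example; the management port is an assumption, so check your vendor's SRA documentation for the port it actually uses.

```python
# Confirm the SRM server can reach both storage controllers before you
# fill in the SRA dialog. The port is an assumed management port.
import socket

controllers = {"SPA": "172.168.3.79", "SPB": "172.168.3.78"}  # from this chapter's CLARiiON example
MGMT_PORT = 443  # assumption: verify against your vendor's SRA documentation

for name, ip in controllers.items():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(3)
        reachable = sock.connect_ex((ip, MGMT_PORT)) == 0
        print(f"{name} ({ip}):", "reachable" if reachable else f"no answer on {MGMT_PORT}")
```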

As I stated in Chapter 7, Installing VMware SRM, there is no need now to restart the core SRM service (vmware-dr) when you install or upgrade an SRA. Of course, your environment can and will change over time, and there is room for mistakes. Perhaps, for instance, in your haste you installed the SRA into the Protected Site SRM server, but forgot to perform the same task at the Recovery Site. For this reason, VMware has added a Reload SRAs link, shown in Figure 9.17, under the SRAs tab in the Array Manager pane. If you do install or update an SRA it’s worth clicking this button to make sure the system has the latest information.

Before beginning with the array manager configuration, it is worthwhile to check if there are any warnings or alerts in either the Summary tab or the SRAs tab, as this can prevent you from wasting time trying to configure the feature where it would never be successful. For example, if there is a mismatch between the SRAs installed at either the Protected or the Recovery Site you would receive a warning status on the affected SRA, as shown in Figure 9.18. This information displayed in the SRAs tab of the affected system can also tell you information about supported arrays and firmware.

Similarly, if your SRA has specific post-configuration requirements, and you subsequently fail to complete them, this can cause another status error message. For example, the message "The server fault 'DrStorageFaultCannotLoadAdapter'" was caused by my installing the IBM SystemStorage SRA and not completing the configuration with the IBMSVCRAutil.exe program. The moral of the story is to not unnecessarily install SRAs that you don't need. I did because I'm a curious fellow; however, that curiosity often leads to learning something new that I can pass on to my customers and colleagues.

Configuring-protected-site- (17).jpg

Figure 9.17 With the Reload SRAs link, the SRM administrator doesn’t have to restart the core vmware-dr service for changes to take effect.

Configuring-protected-site- (18).jpg

Figure 9.18 To avoid false alarms, ensure that the SRA is installed on all the SRM servers before reloading the SRAs.

Most SRAs work the same way: You supply IP information and user authentication details to the wizard. By supplying the IP address and authentication details for both the Protected and Recovery Sites, you allow SRM to automate processes that would normally require the interaction of the storage management team or the storage management system. This is used specifically in SRM when a Recovery Plan is tested: The ESX hosts' HBAs in the recovery location are rescanned, and the SRA from the storage vendor grants them access to the replicated LUNs/volumes to allow the test to proceed. However, this functionality does vary from one storage array vendor to another. For example, with some arrays these privileges allow for the dynamic creation and destruction of temporary snapshots, as is the case with the EMC Celerra or NetApp filers. With other vendors someone on the storage team would have to grant access to the LUN and snapshot for this to be successful, as is the case with the EMC CLARiiON.
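For the curious, the manual equivalent of that rescanning step looks like this in pyVmomi. SRM and the SRA drive this for you during a test; the sketch simply shows the underlying operations, and the connection details and flat inventory layout are assumptions.

```python
# Manual equivalent of what SRM/SRA automate at test time: rescan each
# recovery host's HBAs, then VMFS, so newly presented replica LUNs or
# snapshots become visible. Connection details are assumptions.
import ssl
from pyVim.connect import SmartConnect, Disconnect

ctx = ssl._create_unverified_context()  # lab-only: accept self-signed certs
si = SmartConnect(host="vcnj.corp.com", user="corp\\administrator",
                  pwd="password", sslContext=ctx)
try:
    dc = si.RetrieveContent().rootFolder.childEntity[0]  # assumes one datacenter
    for cluster in dc.hostFolder.childEntity:            # assumes clusters only
        for esx in cluster.host:
            storage = esx.configManager.storageSystem
            storage.RescanAllHba()  # discover LUNs the SRA has just presented
            storage.RescanVmfs()    # then discover VMFS volumes on them
            print(f"Rescanned {esx.name}")
finally:
    Disconnect(si)
```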

You might think that allowing this level of access to the storage layer would be deeply political; indeed, it could well be. However, in my discussions with VMware and those people who were among the first to try out SRM, this hasn't always been the case. In fact, many storage teams are more than happy to give up this control if it means fewer requests for manual intervention from the server or virtualization teams. You see, many storage guys get understandably irritated if people like us are forever ringing them up to ask them to carry out mundane tasks such as creating a snapshot and then presenting it to a number of ESX hosts. The fact that we as SRM administrators can do that safely and automatically without their help takes this burden away from the storage team so that they can have time for other tasks. Unfortunately, for some companies this still might be a difficult pill for the storage team to swallow unless the remit of the SRA is fully explained to them beforehand. If there has been any annoyance for the storage team, it has often been caused by the poor and hard-to-find documentation from the storage vendors, which has left some SRM administrators and storage teams struggling to work out the requirements to make the vendor's SRA function correctly.

Anyway, what follows is a blow-by-blow description of how to configure the array manager for the main storage vendors. If I were you, I would skip to the section heading that relates to the specific array vendor that you are configuring, because as I’ve said before, one array manager wizard is very similar to another. Array manager configuration starts with the same process, regardless of the array vendor.

1. Log on with the vSphere client to the Protected Site’s vCenter—in my case, this is vcnyc.corp.com.

2. Click the Site Recovery icon.

3. Click the Array Managers icon.

4. Click the Add Array Manager button, as shown in Figure 9.19.

Once the array manager configuration has been completed and enabled, you will see in the Recent Tasks pane that it carries out four main tasks for each vCenter that is affected (see Figure 9.20).

Configuring-protected-site- (19).jpg

Figure 9.19 The Add Array Manager button that allows you to input your configuration specific to your SRA

Configuring-protected-site- (20).jpg

Figure 9.20 Updating the array manager configuration or refreshing it will trigger events at both the Protected and Recovery Sites.

Configuring Array Managers: Dell EqualLogic

To configure the array manager for the Dell EqualLogic, resume with these steps.

5. In the Add Array Manager dialog box, enter a friendly name for this manager, such as “Dell Array Manager for Protected Site”.

6. Select Dell EqualLogic PS Series SRA as the SRA Type, as shown in Figure 9.21.

7. Enter the IP address of the group at the Protected Site in the IP Address field; in my case, this is my New York EqualLogic system with the IP address of 172.168.3.69.

8. Supply the username and password for the Dell EqualLogic Group Manager.

9. Complete this configuration for the Partner Group; in my case, this is 172.168.4.69, the IP address of the Group Manager in New Jersey, as shown in Figure 9.22.

Configuring-protected-site- (21).jpg

Figure 9.21 Dell uses the concept of groups as collections of array members. You may wish to use a naming convention reflecting these group names.

Configuring-protected-site- (22).jpg

Figure 9.22 Configuration of the Protected Site (local group connection parameters) and Recovery Site (partner group replication parameters)

These dialog boxes occasionally require you to scroll down in order to see all the fields.

10. Click Next and then Finish. Once the array manager configuration for the Protected Site is added, it should also add the array manager configuration for the Recovery Site, as shown in Figure 9.23.

The next step is to enable the configuration. If you have used SRM before you will recognize this is a new step in the array manager configuration. It’s designed to give the SRM administrator more control over the array pairs than was previously possible. If you do not enable the pairing you will be unable to successfully create Protection Groups.

Configuring-protected-site- (23).jpg

Figure 9.23 The array manager configuration for both sites. You may want to use a naming convention reflecting the Dell EqualLogic group names.

11. To enable the configuration select the Array Pairs tab on the array configuration object and click the Enable link under the Actions column (see Figure 9.24).

Occasionally, I’ve had to click Enable twice. This appears to be an issue with the way SRM refreshes this page. Once the array manager configuration is in use by Protection Groups it cannot be disabled. Similarly, once a Protection Group is being used by a Recovery Plan it cannot be removed until it is not referenced in a Recovery Plan. This will complete the Remote Array Manager column with the name of the array configuration for the Recovery Site. If you look under the Devices tab you should see the volumes you are replicating to the Recovery Site. Notice in Figure 9.25 how the device or volume is local to the New York Site. Also notice how the blue arrow indicates the volume is being replicated to the remote location of New Jersey. This arrow changes direction when you carry out an automated failback process, with the Reprotect button inverting the replication direction.

Configuring-protected-site- (24).jpg

Figure 9.24 Enabling the configuration of the Dell EqualLogic

Configuring-protected-site- (25).jpg

Figure 9.25 SRM’s new interface shows the replication direction, and is useful when monitoring failover and failback procedures.

Configuring Array Managers: EMC Celerra

EMC has one SRA that covers both the Unisphere range of arrays and the newer VNX series of systems, together with "enabler" software for particular types of replication. So, regardless of the generation you possess, you should be able to install and configure it. Installing the EMC SRA VNX Replicator is a relatively simple affair. In this section, I will walk you through the configuration of the EMC Celerra with VMware SRM.

With EMC Celerra systems the SRM server will communicate to the Celerra at the Protected Site (New York) to collect volume information. It's therefore necessary to configure a valid IP address for the SRM server to allow this to occur, or to allow routing/inter-VLAN communication if your SRM and VSA reside on different networks. This is one of the challenges of installing your SRM and vCenter on the same instance of Windows. Another workaround is to give your SRM server two network cards: one used for general communication and the other used specifically for communication to the Celerra. If there is no communication between the SRA and the Celerra you will receive an error message. Before you begin, it's a good idea to confirm that you can ping the Celerra Control Station IP of both the Protected Site array and the Recovery Site array from the Protected Site (New York) SRM server.

To configure the array manager for the EMC Celerra, resume with these steps.

5. In the Add Array Manager dialog box, enter a friendly name for this manager, such as “EMC Celerra for Protected Site”.

6. Select EmcSra as the SRA Type (see Figure 9.26).

Configuring-protected-site- (26).jpg

Figure 9.26 If you have many Celerra systems you may want to develop a naming convention that allows you to uniquely identify them.

7. Enter the IP address of the Control Station at the Protected Site in the IP Address field—in my case, this is my New York Celerra system with the IP address of 172.168.3.77.

If you are unsure of the IP address of the Control Station for your system, you can locate it in the Unisphere management pages under System Information, as shown in Figure 9.27.

8. Supply the username and password for the Control Station (see Figure 9.28). These dialog boxes occasionally require you to scroll down in order to see all the fields.

Configuring-protected-site- (27).jpg

Figure 9.27 Selecting the Celerra from the pull-down list and clicking the System button will show you the Control Station IP address.

Configuring-protected-site- (28).jpg

Figure 9.28 If you have NFS mount points as well as iSCSI these may be listening on different IP ports.

9. Click Next and then Finish. Once the array manager configuration for the Protected Site is added, you should also add the array manager configuration for the Recovery Site, as shown in Figure 9.29.

The next step is to enable the configuration, as shown in Figure 9.30. If you have used SRM before you will recognize this is a new step in the array manager configuration. It’s designed to give the SRM administrator more control over the array pairs than was previously possible. If you do not enable the pairing you will be unable to successfully create Protection Groups.

10. To enable the configuration select the Array Pairs tab on the array configuration object and click the Enable link under the Actions column.

Occasionally, I’ve had to click Enable twice. This appears to be an issue with the way SRM refreshes this page. Once the array manager configuration is in use by Protection Groups it cannot be disabled.

This will complete the Remote Array Manager column with the name of the array configuration for the Recovery Site. If you look under the Devices tab you should see the volumes you are replicating to the Recovery Site. Notice how the device or volume is local to the New York Site. Also notice how the blue arrow indicates the volume is being replicated to the remote location of New Jersey. This arrow changes direction when you carry out an automated failback process, with the Reprotect button inverting the replication direction (see Figure 9.31).

Configuring-protected-site- (29).jpg

Figure 9.29 Although some array managers ask for the Recovery Site’s IP and authentication details, you still must configure the Recovery Site SRA.

Configuring-protected-site- (30).jpg

Figure 9.30 Enabling the configuration of the EMC Celerra

Configuring-protected-site- (31).jpg

Figure 9.31 The SRM interface shows the replication direction, and is useful when monitoring failover and failback procedures.

Configuring Array Managers: EMC CLARiiON

EMC has one SRA that covers both the Unisphere range of arrays and the newer VNX series of systems, together with "enabler" software for particular types of replication. So, regardless of the generation you possess, you should be able to install and configure it. Installing the EMC SRA VNX Replicator is a relatively simple affair. In this section, I will walk you through the configuration of the EMC CLARiiON with VMware SRM.

With EMC CLARiiON systems the SRM server will communicate to the CLARiiON at the Protected Site (New York) to collect volume information. It's therefore necessary to configure a valid IP address for the SRM server to allow this to occur, or to allow routing/inter-VLAN communication if your SRM and VSA reside on different networks. This is one of the challenges of installing your SRM and vCenter on the same instance of Windows. Another workaround is to give your SRM server two network cards: one used for general communication and the other used specifically for communication to the CLARiiON. If there is no communication between the SRA and the CLARiiON you will receive an error message. Before you begin, it's a good idea to confirm that you can ping the IP addresses of the CLARiiON SP A and SP B ports of both the Protected Site array and the Recovery Site array from the Protected Site (New York) SRM server.

To configure the array manager for the EMC CLARiiON, resume with these steps.

5. In the Add Array Manager dialog box, enter a friendly name for this manager, such as “EMC Clariion for Protected Site”.

6. Select EMC Unified SRA as the SRA Type, as shown in Figure 9.32.

Configuring-protected-site- (32).jpg

Figure 9.32 If you have many CLARiiON systems you might want to develop a naming convention that allows you to uniquely identify them.

7. Enter the IP address of the storage processors (SPA and SPB) at the Protected Site in the IP Address field—in my case, this is my New York CLARiiON system with the IP addresses 172.168.3.79 and 172.168.3.78.

If you are unsure of the IP address of the storage processors for your system, you can locate it in the Unisphere management pages under System Information.

8. Supply the username and password for the CLARiiON together with the IP address for the SPA and SPB (see Figure 9.33).

Configuring-protected-site- (33).jpg

Figure 9.33 The IP address for SPA and SPB on the New York CLARiiON

These dialog boxes occasionally require you to scroll down in order to see all the fields.

9. Click Next and then Finish. Once the array manager configuration for the Protected Site is added, you should also add the array manager configuration for the Recovery Site, as shown in Figure 9.34.

The next step is to enable the configuration, as shown in Figure 9.35. If you have used SRM before you will recognize this is a new step in the array manager configuration. It’s designed to give the SRM administrator more control over the array pairs than was previously possible. If you do not enable the pairing you will be unable to successfully create Protection Groups.

10. To enable the configuration select the Array Pairs tab on the array configuration object and click the Enable link under the Actions column.

Occasionally, I’ve had to click Enable twice. This appears to be an issue with the way SRM refreshes this page. Once the array manager configuration is in use by Protection Groups it cannot be disabled.

This will complete the Remote Array Manager column with the name of the array configuration for the Recovery Site. If you look under the Devices tab you should see the volumes you are replicating to the Recovery Site. Notice how the device or volume is local to the New York Site. Also notice how the blue arrow indicates the volume is being replicated to the remote location of New Jersey. This arrow changes direction when you carry out an automated failback process, with the Reprotect button inverting the replication direction (see Figure 9.36).

Configuring-protected-site- (34).jpg

Figure 9.34 Although some array managers ask for the Recovery Site’s IP and authentication details, you must configure the Recovery Site SRA.

Configuring-protected-site- (35).jpg

Figure 9.35 Enabling the configuration on the EMC CLARiiON

Configuring-protected-site- (36).jpg

Figure 9.36 SRM can show the replication direction, and is useful when monitoring failover and failback procedures.

Configuring Array Managers: NetApp FSA

To configure the array manager for the NetApp FSA, resume with these steps.

5. In the Add Array Manager dialog box, enter a friendly name for this manager, such as “NetApp Array Manager for Protected Site”.

6. Select NetApp Storage Replication Adapter as the SRA Type, as shown in Figure 9.37.

7. Enter the IP address of the group at the Protected Site in the IP Address field—in my case, this is my New York NetApp system with the IP address of 172.168.3.89 (see Figure 9.38). I used the same IP address for the system as the NFS IP filter for NAS. This may not be the case in larger production systems where the management traffic is placed on separate network interfaces.

Configuring-protected-site- (37).jpg

Figure 9.37 The NetApp SRA uses a single configuration for all its supported storage protocols.

8. Supply the username and password for the NetApp filer.

These dialog boxes occasionally require you to scroll down in order to see all the fields. Most customers like to have separate networks for management and data traffic. This is mainly for security reasons, but performance can also be a concern. Many storage admins will use the management network to copy their own data around, such as software packages, service packs, and firmware updates. When the SRA interrogates the NetApp system, it may find a bunch of interfaces using various address ranges. And when SRM interrogates vCenter, it may find a bunch of ESX VMkernel interfaces using various address ranges. So it's entirely possible that when SRM needs to mount an NFS datastore (either the SnapMirror destination volume in a real failover, or a FlexClone of that volume in a test failover), it may choose an IP address range you didn't intend, such as the management network. NetApp added the NFS filter to ensure that the SRA only reports the desired addresses back to SRM, which means SRM can only choose the IP network you specify. You can actually specify multiple IP addresses if you need to; just separate them with a comma—for example, 192.168.3.88,192.168.3.87. In my case, I have a much simpler configuration where my management network and my NFS network are the same set of teamed NICs in the filer.

Configuring-protected-site- (38).jpg

Figure 9.38 Entering the IP address of the group at the Protected Site

9. Click Next and then Finish. Once the array manager configuration for the Protected Site is added, you should also add the array manager configuration for the Recovery Site (see Figure 9.39).

The next step is to enable the configuration. If you have used SRM before you will recognize this is a new step in the array manager configuration. It’s designed to give the SRM administrator more control over the array pairs than was previously possible. If you do not enable the pairing you will be unable to successfully create Protection Groups.

10. To enable the configuration select the Array Pairs tab on the array configuration object and click the Enable link under the Actions column (see Figure 9.40).

Occasionally, I’ve had to click Enable twice. This appears to be an issue with the way SRM refreshes this page. Once the array manager configuration is in use by Protection Groups it cannot be disabled.

This will complete the Remote Array Manager column with the name of the array configuration for the Recovery Site. If you look under the Devices tab you should see the volumes you are replicating to the Recovery Site. Notice how the device or volume is local to the New York Site. Also notice how the blue arrow indicates the volume is being replicated to the remote location of New Jersey. This arrow changes direction when you carry out an automated failback process, with the Reprotect button inverting the replication direction (see Figure 9.41).

Configuring-protected-site- (39).jpg

Figure 9.39 If you have multiple arrays, consider a naming convention that allows you to uniquely identify each system.

Configuring-protected-site- (40).jpg

Figure 9.40 Enabling the configuration in NetApp FSA

Configuring-protected-site- (41).jpg

Figure 9.41 SRM can show the replication direction, and is useful when monitoring failover and failback procedures.

Creating Protection Groups

Once you are satisfied with your array manager configuration you’re ready to carry on with the next major step: configuring Protection Groups. Protection Groups are used whenever you run a test of your Recovery Plan, or when DR is invoked for real. Protection Groups are pointers to the replicated vSphere datastores that contain collections of virtual machines that will be failed over from the Protected Site to the Recovery Site. The Protection Groups’ relationships to ESX datastores can be one-to-one. That is to say, one Protection Group can contain or point to one ESX datastore. Alternatively, it is possible for one Protection Group to contain many datastores—this can happen when a virtual machine’s files are spread across many datastores for disk performance optimization reasons or when a virtual machine has a mix of virtual disks and RDM mappings. In a loose way, the SRM Protection Group could be compared to the storage groups or consistency groups you may create in your storage array. However, what actually dictates the membership of a Protection Group is the way the virtual machines utilize the datastores.
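You can see this "the VMs dictate the membership" rule for yourself by grouping the VMs in your inventory according to the set of datastores their files span. The pyVmomi sketch below does exactly that; the vCenter name matches the one used in this chapter, while the credentials and layout are assumptions.

```python
# Group VMs by the set of datastores their files span; a VM spanning several
# replicated datastores pulls all of them into the same Protection Group.
import ssl
from collections import defaultdict
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab-only: accept self-signed certs
si = SmartConnect(host="vcnyc.corp.com", user="corp\\administrator",
                  pwd="password", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    spans = defaultdict(list)
    for vm in view.view:
        datastores = tuple(sorted(ds.name for ds in vm.datastore))
        spans[datastores].append(vm.name)
    view.DestroyView()
    for datastores, vms in spans.items():
        print(f"{datastores}: {vms}")
finally:
    Disconnect(si)
```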

TIP: When you create your first Protection Group you might like to have the vSphere client open on both the Protected Site vCenter and the Recovery Site vCenter. This will allow you to watch in real time the events that happen on both systems. Of course, if you are running in linked mode you will see this happening if you expand parts of the inventory.

To configure Protection Groups follow these steps.

1. Log on with the vSphere client to the Protected Site’s vCenter (New York).

2. Click the Site Recovery icon.

3. Select the Protection Groups pane and click the Create Protection Group button, as shown in Figure 9.42.

New to this release is the ability to create folders for Protection Groups, which allows you to lay out your Protection Groups more easily if you have a significant number of them.

4. In the Create Protection Group dialog box (whether you are using VR or array-based replication), if you have more than one array manager, select the one associated with this Protection Group, as shown in Figure 9.43. Then select the pairing of arrays contained within the array manager configuration.

5. Click Next. This should enumerate all the volumes discovered on the arrays in question. If you select the volume names, you should see the VMs contained within those ESX datastores (see Figure 9.44).

Configuring-protected-site- (42).jpg

Figure 9.42 You can now create both Protection Groups and Protection Group folders.

Configuring-protected-site- (43).jpg

Figure 9.43 The EMC Celerra array manager configuration. You may have many array pairs, each hosting many datastores protected by replication.

Configuring-protected-site- (44).jpg

Figure 9.44 Dell EqualLogic datastore containing a number of virtual machines

6. In the Create Protection Group Name and Description dialog box, enter a friendly name and description for your Protection Group. In my case, I'm creating a Protection Group called "Virtual Machines Protection Group." Click Finish.

At this point, a number of events will take place. First, as the Protection Group is being created the icon of the Protection Group changes, and its status is marked as “Configuring Protection,” as shown in Figure 9.45. Second, at the Recovery Site vCenter you will see the task bar indicate that the system is busy “protecting” all virtual machines that reside in the datastore included in the Protection Group (see Figure 9.46).

Meanwhile, the Recovery Site's vCenter will begin registering the placeholder VMX files in the correct location in the inventory, as shown in Figure 9.47.

Configuring-protected-site- (45).jpg

Figure 9.45 When Protection Groups are first created their status is modified to “Configuring Protection.”

Configuring-protected-site- (46).jpg

Figure 9.46 During the creation of Protection Groups each affected VM has a task associated with it.

Configuring-protected-site- (47).jpg

Figure 9.47 The Recovery Site’s vCenter begins registering the placeholder VMX files in the correct location in the inventory.

As you can see, each Protect VM event has a "Create virtual machine" event. SRM isn't so much creating a new VM as it is registering placeholder VMs in the Recovery Site.

You will also have noticed that these "new" VMs are being placed in the correct resource pool and folder. If you select one of the placeholder files you can see it takes up only a fraction of the storage of the original VM. You should also see that these placeholders have been given their own unique icon in the vCenter inventory at the Recovery Site. This icon is new to this release of SRM; previously, the placeholder VMs just had the standard "boxes in boxes" icon, which made them difficult to identify. Even with the new-style icon, as shown in Figure 9.48, I still recommend a separate resource pool and/or folder structure so that you can keep these ancillary placeholders separate and distinct from the rest of your infrastructure.

If you browse the storage location for these placeholders you can see they are just "dummy" VMX files (see Figure 9.49). As I mentioned before, VMware SRM occasionally refers to these placeholder VMs as "shadow" VMs. In the Virtual Machines and Templates view at the Recovery Site's vCenter, the VMs have been allocated to the correct folder. SRM knows which network, folder, and resource pool to put the recovery VMs into because of the default inventory mapping settings we specified earlier.
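In effect, the inventory mappings act as a translation table from Protected Site objects to Recovery Site objects. The following simplified Python sketch illustrates the lookup; all of the paths and names are made up, and SRM naturally stores these mappings internally rather than in anything like this form:

# Simplified picture of inventory mappings: protected-site objects on
# the left, recovery-site objects on the right. All names are invented.
inventory_mappings = {
    "network":       {"vlan10":          "NJ_vlan10"},
    "folder":        {"NYC_DC/Web VMs":  "NJ_DC/Recovery Web VMs"},
    "resource_pool": {"NYC_Cluster/Web": "NJ_Cluster/Recovery"},
}

def place_placeholder(vm):
    """Return where the placeholder VM should be registered, or raise."""
    placement = {}
    for kind in ("network", "folder", "resource_pool"):
        source = vm[kind]
        try:
            placement[kind] = inventory_mappings[kind][source]
        except KeyError:
            # This is the situation behind the "unresolved devices" error.
            raise ValueError(f"no {kind} mapping for {source!r}")
    return placement

fs01 = {"network": "vlan10",
        "folder": "NYC_DC/Web VMs",
        "resource_pool": "NYC_Cluster/Web"}
print(place_placeholder(fs01))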

Configuring-protected-site- (48).jpg

Figure 9.48 Creation of placeholder VMs with the new lightning bolt icon, which should make them easier to distinguish in the vCenter inventory

Configuring-protected-site- (49).jpg

Figure 9.49 Placeholder VMs are created in the datastore specified in the Placeholder tab on the properties of each site.

You should know that if you create a template and store it on a replicated datastore it will become protected as well. This means templates can be recovered and be part of Recovery Plans (covered in Chapter 10) just like ordinary VMs. Templates are not powered on when you run a Recovery Plan, because they can't be powered on without first being converted back to a virtual machine. As you can see, these placeholder VMs are very different from the VMs you normally see registered in vCenter. If you try to edit them like any other VM you will be given a warning (shown in Figure 9.50) that this is not a recommended action.

Configuring-protected-site- (50).jpg

Figure 9.50 The warning dialog box that appears if you try to edit the placeholder VMs listed in the Recovery Site

WARNING: Deleting Protection Groups at the Protected Site vCenter reverses this registration process. When you delete a Protection Group, it unregisters and destroys the placeholder files created at the Recovery Site. This does not affect the replication cycle of the virtual machines that are governed by your array's replication software. Be very cautious when deleting Protection Groups. The action can have unexpected and unwanted consequences if the Protection Groups are "in use" by a Recovery Plan. This potential problem is covered later in the book; to understand it at this point would require additional details regarding Recovery Plans that we have not yet discussed. For now, it's enough to know that if you delete Protection Groups the placeholders get deleted too, and all references to those VMs in the Recovery Plan are removed as well!

Failure to Protect a Virtual Machine

Occasionally, you might find that when you create a Protection Group the process fails to register one or more virtual machines at the Recovery Site. It's important not to overreact to this situation, as the causes are usually trivial configuration issues that are very easy to remedy. The most common cause is either bad inventory mappings, or a VM that falls outside the scope of your inventory mappings. In this section I will give you a checklist of settings to confirm, which will hopefully fix these problems for you. They amount to the kind of initial troubleshooting you may carry out when you configure SRM for the first time.

Bad Inventory Mappings

This is normally caused by a user error in the previous inventory mapping process. A typical failure to protect a VM is shown in Figure 9.51. The error is flagged on the Protected Site with a yellow exclamation mark on both the Protection Group and the virtual machines that failed to be registered.

Configuring-protected-site- (51).jpg

Figure 9.51 A VM failing to be protected because the VM Network port group was not included in the inventory mappings

As a consequence, you will also see errors in the Tasks & Events tab for the affected VMs. The classic clue that a VM has a bad inventory mapping is the “Unable to protect VM <VM name> due to unresolved devices” message shown in Figure 9.52.

This error is usually caused by the virtual machine settings being outside the scope of the inventory mapping settings defined previously, and therefore the Protection Group doesn’t know how to map the virtual machine’s current folder, resource pool, or network membership to the corresponding location at the Recovery Site. A good example is networking, which I just described above.

In the inventory mapping process, I did not provide any inventory mappings for the VM Network port group. I regarded this as a local network containing local virtual machines that did not require protection. The virtual machine named "fs01" was accidentally patched into this network, and therefore did not get configured properly in the Recovery Plan. In the real world this could have been an oversight; perhaps I meant to set an inventory mapping for vlan10 but forgot to. In this case, the problem wasn't my virtual machine but my bad configuration of the inventory mapping.

Another scenario is that the inventory mapping is intended to handle the default case, where the rule is always X. A number of virtual machines within the Protection Group could have their own unique settings; after all, one size does not fit all. SRM can allow for exceptions to those rules when a virtual machine has its own particular configuration that falls outside the defaults, just as with users and groups.

Configuring-protected-site- (52).jpg

Figure 9.52 The unresolved devices error that usually indicates a problem with inventory mappings

If you have this type of inventory mapping mismatch it will be up to you to decide on the correct course of action to fix it. Only you can decide whether the virtual machine or the inventory mapping is at fault. You can resolve this mismatch in a few different ways.

• Update your inventory mappings to include objects that were originally overlooked.

• Correct the virtual machine settings to fall within the default inventory mapping settings.

• Customize the VM with its own unique inventory mapping. This means you can have rules (inventory mappings) and exceptions to the rules (custom VM settings): a VM is either covered by the default inventory mapping or given its own per-VM settings.

If you think the inventory mapping is good, and you just have an exception, you can right-click the icon in the Protection Group, select Configure Protection in the menu that opens, and supply per-VM inventory settings. If you have a bigger problem, such as a large number of VMs failing to be protected because of a bad inventory mapping configuration, you can fix the mapping itself and then use Configure All to retry the protection process.
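The rule-plus-exception behavior boils down to a two-level lookup: per-VM settings are consulted first, with the site-wide inventory mapping as the fallback. Here is a hedged Python sketch of that precedence, with invented names throughout:

# Illustrative lookup order: per-VM overrides (Configure Protection)
# win over the default inventory mapping; if neither matches, the VM
# stays unprotected, as with "fs01" in the example above.
default_network_mapping = {"vlan10": "NJ_vlan10"}
per_vm_overrides = {"fs01": {"VM Network": "NJ_vlan10"}}

def recovery_network(vm_name, port_group):
    override = per_vm_overrides.get(vm_name, {})
    if port_group in override:
        return override[port_group]                   # exception to the rule
    if port_group in default_network_mapping:
        return default_network_mapping[port_group]    # the rule
    return None                                       # unprotected: fix the mapping or the VM

print(recovery_network("fs01", "VM Network"))   # NJ_vlan10, via the override
print(recovery_network("web01", "VM Network"))  # None -> needs attention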

I would say the most common reason for this error is that you have deployed a new VM from a template, and the template is configured for a network not covered by the inventory mapping. Another cause concerns the use of SvSwitches. It's possible to rename the port groups of an SvSwitch to a different label. This can cause problems for both the inventory mapping and the affected VMs; as a consequence, when the Protection Groups are created for the first time the protection process fails because the inventory mapping was using the old name.

Placeholder VM Not Found

Another error can occur if someone foolishly deletes the placeholder that represents a VM at the Recovery Site, as shown in Figure 9.53. It is possible to manually delete a placeholder VM, although you do get the same warning message as you would if you tried to edit the placeholder settings. Nonetheless, these placeholder objects are not protected from deletion. If a rogue vCenter administrator deletes a placeholder you will see a yellow exclamation mark on the Protection Group, together with a "Placeholder VM Not Found" error message.

The quickest way to fix this problem is to choose either the Restore All link or the Restore Placeholder link in the Protection Group interface. The Restore All option rebuilds all the placeholders within the Protection Group, whereas Restore Placeholder fixes just one selected placeholder in the list.

Configuring-protected-site- (53).jpg

Figure 9.53 The “Placeholder VM Not Found” error message caused by accidental deletion of the placeholder in the inventory

VMware Tools Update Error—Device Not Found: CD/DVD Drive 1

Occasionally, a single VM within a Protection Group can display an error of its own. For example, in Figure 9.54 the VM named "db01" has the error message "Device Not Found: CD/DVD drive 1." This error is relatively benign and does not stop execution of the plan.

This issue was created by a faulty VMware Tools update using Update Manager. The CD-ROM mounted was for a Linux distribution, where an automatic mount and update of VMware Tools had failed. Update Manager was unable to unmount the .iso file at /usr/lib/vmware/isoimages/linux.iso, because the auto-execution of VMware Tools does not work the same way in Linux as it does in Windows. With Linux, all that happens is that the .iso file is mounted as a CD-ROM device; it is up to the administrator to extract the .tgz package and install VMware Tools in the guest system. This error was resolved by right-clicking the affected VM and, under the Guest menu, selecting "End VMware Tools install," which triggered an unmount of the VMware Tools .iso image.

Configuring-protected-site- (54).jpg

Figure 9.54 The old chestnut of connected CD/DVD drives can cause a benign error to appear on the Protection Group.

Delete VM Error

Occasionally, you will want to delete a VM that might also be a member of a Protection Group. The correct procedure for doing this is to unprotect the VM, which will then unregister its placeholder VMX file, and as a consequence remove it from any Recovery Plan. Of course, there’s nothing to stop someone from ignoring this procedure and just deleting the VM from the inventory. This would result in an “orphaned” object in the Protection Group and Recovery Plan, as shown in Figure 9.55.

To fix these VMs, select the affected VM and click the Remove Protection button.
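To see why the order of operations matters, here is a toy Python sketch, with an invented object model rather than the real SRM API, contrasting the supported unprotect-then-delete flow with a raw deletion that leaves an orphan behind:

# Toy model only: contrast unprotecting a VM before deletion with
# deleting it directly from the inventory, which orphans its entries.
protection_group = {"fs01", "web01"}
recovery_plan    = {"fs01", "web01"}
placeholders     = {"fs01", "web01"}

def unprotect(vm):
    # The supported route: removing protection unregisters the
    # placeholder and drops the VM from any Recovery Plan.
    protection_group.discard(vm)
    placeholders.discard(vm)
    recovery_plan.discard(vm)

def delete_from_inventory(vm, inventory):
    # A raw vCenter delete knows nothing about SRM's bookkeeping.
    inventory.discard(vm)

inventory = {"fs01", "web01"}
unprotect("fs01"); delete_from_inventory("fs01", inventory)   # clean removal
delete_from_inventory("web01", inventory)                     # leaves an orphan
orphans = protection_group - inventory
print(orphans)   # {'web01'} -> shows up as the error in Figure 9.55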

It’s Not an Error, It’s a Naughty, Naughty Boy!

If you can forgive the reference to Monty Python's The Meaning of Life, the confusing yellow exclamation mark on a Protection Group can be benign. It can actually indicate that a new virtual machine has been created that is covered by the Protection Group. As I may have stated before, simply creating a new virtual machine on a replicated LUN/volume does not automatically mean it is protected and enrolled in your Recovery Plan. I will cover this in more detail in Chapter 11, Custom Recovery Plans, as I examine how SRM interacts with a production environment that is constantly changing and evolving.

Hopefully with these "errors" you can begin to see the huge benefit that inventory mapping offers. Remember, inventory mappings are optional; if you chose not to configure them, then when you created a Protection Group every virtual machine would fail to be registered at the Recovery Site. This would create tens or hundreds of virtual machines with a yellow exclamation mark, and each one would have to be mapped by hand to the appropriate network, folder, and resource pool.

Configuring-protected-site- (55).jpg

Figure 9.55 The error when a VMware administrator deletes a protected VM without first unprotecting it in SRM

Summary

As you have seen, one of the biggest challenges in SRM in the post-configuration stages is network communication. Not only must your vCenter/SRM servers be able to communicate with one another from the Protected Site to the Recovery Site, but the SRM server must also be able to communicate with your array manager. In the real world, this will be a challenge that may only be addressed by sophisticated routing, NATing, inter-VLAN communication, or by giving your SRM server two network cards to speak to both networks.

It’s perhaps worth saying that the communication we allow between the SRM and the storage layer via the vendor’s SRA could be very contentious with the storage team. Via the vSphere client you are effectively managing the storage array. Historically, this has been a manual task purely in the hands of the storage team (if you have one), and they may react negatively to the level of rights that the SRM/SRA needs to have to function under a default installation. To some degree we are cutting them out of the loop. This could also have a negative impact on the internal change management procedures used to handle storage replication demands in the business or organization within which you work. This shouldn’t be something new to you.

In my research, I found a huge variance in companies' attitudes toward this issue, with some seeing it as a major stumbling block and others seeing it as one that could be overcome as long as senior management fully backed the implementation of SRM; in other words, the storage team would be forced to accept this change. At the opposite extreme, the people who deal with the day-to-day administration of storage were quite grateful to have their workload reduced, and noted that the fewer people involved in the decision-making process, the quicker their precious virtual machines would be online.

Virtualization is a very political technology. As "virtualizationists" we frequently make quite big demands on our network and storage teams that can be deemed very political. I don't see automating your DR procedures as being any less political. We're talking about one of the most serious decisions a business can take with its IT: invoking its DR plan. The consequences of that plan failing are perhaps even more political than a virtualization project that goes wrong.

Of course, it is totally impossible for me to configure every single storage vendor's arrays and then show you how VMware SRM integrates with them, but hopefully I've given you at least a feel for what goes on at the storage level with these technologies, together with insight into how SRM configuration varies depending on your storage vendor's technology. I hope you now have enough knowledge both to communicate your needs to the storage guys and to understand what they are doing at the storage level to make all this work. In the real world, we tend to live in boxes: I'm a server guy, you're a storage guy, and he's a network guy. Quite frequently we live in ignorance of what each guy is doing. Ignorance and DR make for a very heady brew.

Lastly, I hope you can see how important inventory mappings and Protection Groups are going to be in the recovery process. Without them a Recovery Plan would not know where to put your virtual machines in vCenter (folder, resource pool, and network) and would not know on which LUN/volume to find those virtual machine files. In the next chapter we will look at creating and testing Recovery Plans. I’m going to take a two-pronged approach to this topic. Chapter 10 gets you up and running, and Chapter 11 takes Recovery Plans up to their fully functional level. Don’t worry; you’re getting closer and closer to hitting that button labeled “Test my Recovery Plan.”