Veeam Backup & Replication v5 and ESX VAAI with Equallogic Array

By admin, October 20, 2010 4:39 pm

The countdown is on: only 6 hours left. It is probably the most anticipated release in the VMware ESX backup world in 2010!


 

I’ve posted a question in Veeam’s forum asking the following, and I gained a lot of knowledge about how Veeam backup works over SAN as well as about Equallogic in general.

Can Veeam v5 tell the EQL array to take a snapshot locally first using VAAI (i.e., super fast), then send that completed snapshot to the Veeam backup server for de-dupe and compression, and finally save it to local storage on the backup server?

=====================================
For example, we have three kinds of backups running at 6 AM daily, all AT THE SAME TIME.

The VMs are all on an Equallogic SAN VMFS volume.

1. Acronis True Image backup inside each VM (i.e., file level; backup time is 5-10 mins per VM)
2. Veeam Backup over SAN (i.e., block level; backup time is 1-5 mins per VM)
3. Equallogic snapshot (i.e., block level; backup time is 1-5 seconds for the whole array)

Will this actually create any problems? I mean LOCKING problems due to concurrent access to the same volume?

But beginning with Equallogic firmware version 5.0, the PS Series array firmware supports the VMware vStorage APIs for Array Integration (VAAI) for VMware vSphere 4.1 and later. The following new ESX functions are supported:

• Hardware Assisted Locking - Provides an alternative means of protecting VMFS cluster file system metadata, improving the scalability of large ESX environments sharing datastores.

So this shouldn’t be a problem any more.

Any hints or recommendations?

 

I feel very happy now, as VAAI can finally be used to greatly improve snapshot performance and shorten the backup window. Together with Veeam’s use of the vStorage APIs, I am pretty sure the backup time is going to break records shortly; I will report back after I install B&R v5 tomorrow.

One more thing: I did try using ASM/VE to back up and restore a VM once for testing. I forget whether it was with VAAI or whether the firmware was 4.3.7, but it was quite slow, and one thing I also don’t like is backing up snapshots onto the same array!

1. Array space is too expensive to place snapshots on, at least for VMs. I am OK with taking array snapshots at the LUN/volume level only, though, as protection against an array-failure kind of disaster.

2. It’s not safe to place VM snapshots on the same array: what if the array crashes?

3. As an EQL user, everyone knows about the “dirty bit” problem (i.e., once a block is written, there is no way to get that space back). In other words, deleted/empty space is a great waste. Until EQL releases the thin/thick space-reclaim feature in the coming FW 5.x version, I don’t think the technology is mature or ready enough to place VM snapshots on an EQL volume. FYI, 3PAR is the only vendor with true thin space reclaim; HDS’s reclaim is a “fake” one. Search Google for a 3PAR engineer commenting on HDS’s similar technology and you will see what I mean.

4. Remember that EQL uses a 16 MB page (stripe) size? It means that even a 1 KB change within a block causes the EQL array to copy the whole 16 MB page, so your snapshot is going to grow very, very huge. What a waste! (A rough worked example follows right after this list.) I really don’t understand why EQL designed their stripe size to be 16 MB instead of, say, 1 MB; is it because 16 MB gives much better performance?

5. Another bad thing: even if you buy a PS6500E with lots of cheap space, you still can’t use ASM/VE to back up a VM on a PS6000XV volume and place the snapshots on the PS6500E. The snapshots HAVE TO stay on the same array or pool as the PS6000XV, so there seems to be no solution.
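
To put rough numbers on point 4 (assuming the 16 MB page figure above is right): if a nightly job changes 1,000 blocks of 1 KB each, scattered across 1,000 different 16 MB pages, the array has to preserve 1,000 × 16 MB, roughly 15.6 GB, of snapshot reserve to protect about 1 MB of actual changed data. The same 1,000 small writes landing within a single page would only cost 16 MB, so how fast the snapshot grows depends almost entirely on how scattered the writes are.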

That’s why we finally selected Veeam B&R v5 Enterprise Edition, and that’s how I arrived here and encountered all the great storage geeks!
=====================================

Some of the very useful feedback:

The EQL snapshot is done at the EQL hardware level. The VMware snapshot is done at the VMware level, hardware-assisted by VAAI but not performed on the hardware itself. (But if that is exactly what you want, Dell provides a cool tool named Auto-Snapshot Manager VMware Edition, which allows you to trigger a hardware snapshot at exactly the same time a vCenter snapshot is triggered; good for Volume Shadow Copy consistency.)

The Veeam snapshot is done with vCenter Server or ESX/i directly, using VAAI or not (depends on your firmware and ESX/i version). Don’t mix up hardware snapshots and hardware-assisted snapshots; they are NOT the same.

 

That is correct, but within a very small time window VAAI kicks in: when the snapshot is triggered, VAAI provides the locking mechanism. But then again, don’t mix it up with a hardware snapshot; VAAI can NOT trigger SAN-vendor-specific hardware snapshots.

You can do quite a lot with an Equallogic SAN if you want to. There are many ways to trigger hardware snapshots if you want them triggered, so I suggest you take a look at:

a) Auto Snapshot Manager VMware Edition
b) latest Host Integration Tools (H.I.T.)

 

Correct, but Veeam/VMware-triggered snapshots, when doing backups at night, would not really grow that much, because the time window for doing the backup is extremely small (when using a high-speed LAN and CBT). Therefore I personally have no problem with it, and even if it does become a problem, Veeam has sophisticated mechanisms that allow you to safely abort the backup operation if the snapshot grows too large.

 

 

Update: Official Answer from Equallogic

Good morning, 

So, the question is: does VMware’s ESX v4.1 VAAI API allow you to have one huge volume, versus the standard recommendation of more, smaller volumes, while still maintaining the same performance?

The answer is NO.  

Reason: the same reasons that made it a good idea before still remain. You are still bound by how SCSI works: each volume has a negotiated command tag queue depth (CTQ), and VAAI does nothing to mitigate this. Also, until every ESX server accessing that mega-volume is upgraded to ESX v4.1, SCSI reservations will still be in effect, so periodically one node will lock that one volume and ALL other nodes will have to wait their turn. Multiple volumes also allow you to be more flexible with our storage tiering capabilities: VMFS volumes, RDMs, and Storage Direct volumes can be moved to the most appropriate RAID member.

i.e., you could create storage pools with SAS, SATA, or SSD drives, then place volumes in the appropriate pool based on the I/O requirements of each VM.
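
(A side note from me, not part of Equallogic’s reply: if you want to check whether a big volume is actually queue-bound, esxtop on the ESX service console should show it. Press u for the disk device view and watch the DQLEN, ACTV and QUED columns for that LUN; if commands are constantly sitting in QUED, the volume is hitting its queue depth and is a good candidate for splitting up.)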

  

So do you mean that if we are running ESX 4.1 on all ESX hosts, we can safely use one big volume instead of several smaller ones from now on?

Re: 4.1: No. The same overall issue remains. When all ESX servers accessing a volume are at 4.1, then the one previous bottleneck of SCSI reservations, and only that issue, is removed. All the other issues I mentioned still remain. Running one mega-volume will not produce the best performance and, long term, will be the least flexible option possible. It would be similar in concept to taking an eight-lane highway down to one lane.

 

In order to fully remove the SCSI reservation, you need VAAI, so the combination of ESX v4.1 and array FW v5.0.2 or greater will be required.
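
To confirm that VAAI is actually active on a host once you are on ESX 4.1 and firmware 5.0.2, the hardware-acceleration primitives are exposed as advanced settings (this is from my notes on the 4.1 advanced options, so treat it as a sketch); each command should return 1 when the corresponding primitive is enabled:

esxcfg-advcfg -g /VMFS3/HardwareAcceleratedLocking
esxcfg-advcfg -g /DataMover/HardwareAcceleratedMove
esxcfg-advcfg -g /DataMover/HardwareAcceleratedInit

The first one is the hardware assisted locking (ATS) primitive that replaces SCSI reservations for VMFS metadata locking; the other two cover full copy and block zeroing.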

As a side note, here’s an article which discusses how VMware uses SCSI reservations.  

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1005009

Here’s a brief snippet from the KB.

There are two main categories of operation under which VMFS makes use of SCSI reservations.

The first category is for VMFS datastore-level operations. These include opening, creating, resignaturing, and expanding/extending a VMFS datastore.

The second category involves the acquisition of locks. These are locks related to VMFS-specific metadata (called cluster locks) and locks related to files (including directories). Operations in the second category occur much more frequently than operations in the first category. The following are examples of VMFS operations that require locking metadata:
    * Creating a VMFS datastore
    * Expanding a VMFS datastore onto additional extents
    * Powering on a virtual machine
    * Acquiring a lock on a file
    * Creating or deleting a file
    * Creating a template
    * Deploying a virtual machine from a template
    * Creating a new virtual machine
    * Migrating a virtual machine with VMotion
    * Growing a file, for example, a Snapshot file or a thin provisioned Virtual Disk

 

Follow these steps to resolve/mitigate potential sources of the reservation: 

a. Try to serialize the operations on the shared LUNs; if possible, limit the number of operations on different hosts that require a SCSI reservation at the same time.

b. Increase the number of LUNs and try to limit the number of ESX hosts accessing the same LUN.

c. Reduce the number of snapshots, as they cause a lot of SCSI reservations.

d. Do not schedule backups (VCB or console-based) in parallel from the same LUN.

e. Try to reduce the number of virtual machines per LUN. See vSphere 4.0 Configuration Maximums and ESX 3.5 Configuration Maximums.

f. Check what targets are being used to access the LUNs.

g. Check that you have the latest HBA firmware across all ESX hosts.

h. Make sure the ESX hosts are running the latest BIOS (to avoid conflicts with HBA drivers).

i. Contact your SAN vendor for information on SP timeout values, performance settings, and storage array firmware.

j. Turn off third-party agents (storage agents) and RPMs not certified for ESX.

k. MSCS RDMs (the active node holds a permanent reservation). For more information, see ESX servers hosting passive MSCS nodes report reservation conflicts during storage operations (1009287).

l. Ensure the correct Host Mode setting on the SAN array.

m. LUNs removed from the system without rescanning can appear as locked.

n. When SPs fail to release the reservation, either the request did not come through (hardware, firmware, or pathing problems), or third-party apps running on the service console did not send the release, or busy virtual machine operations are still holding the lock.

Note: The use of SATA disks is not recommended in high-I/O configurations, or when the above changes do not resolve the problem while SATA disks are in use. (i.e., using SAS 10K or 15K disks, or even SSDs, should help greatly!)
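
If you want to see whether reservation conflicts are already happening, a quick grep of the vmkernel log on the ESX service console (my own shortcut, not part of the KB text) will show them:

grep -i "reservation conflict" /var/log/vmkernel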

 

An updated review from InfoWorld: “New EqualLogic firmware takes a load off VMware”

Fine-Tune Windows Server 2008 R2 TCP Settings for Equallogic iSCSI SAN

By admin, October 20, 2010 11:32 am

To show global TCP parameters:
netsh int tcp show global

1. How to enable and disable TCP Chimney Offload (aka TCP offload) in Windows Server 2008 R2:
netsh int tcp set global chimney=enabled
netsh int tcp set global chimney=disabled

To determine whether TCP Chimney Offload is working, type “netstat -t”; connections that show “Offloaded” have the offload feature active.
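
For reference, the output looks roughly like this (the addresses are made up for illustration; the Offload State column shows either Offloaded or InHost):

  Proto  Local Address          Foreign Address        State           Offload State
  TCP    192.168.1.50:49200     192.168.1.10:3260      ESTABLISHED     Offloaded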

2. How to enable and disable RSS in Windows Server 2008 R2:
netsh int tcp set global rss=enabled
netsh int tcp set global rss=disabled

3. Disable TCP autotuninglevel in Windows Server 2008 R2 for a performance gain with iSCSI:
netsh interface tcp set global autotuninglevel=disabled
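
To put it back to the Windows default later, the same setting accepts normal:

netsh interface tcp set global autotuninglevel=normal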

 

Update Jan-24:

I simply enabled everything and found there is no difference in Equallogic iSCSI IOMeter performance.

TCP Global Parameters
———————————————-
Receive-Side Scaling State          : enabled
Chimney Offload State               : enabled
NetDMA State                        : enabled
Direct Cache Acess (DCA)            : enabled
Receive Window Auto-Tuning Level    : normal
Add-On Congestion Control Provider  : ctcp
ECN Capability                      : enabled
RFC 1323 Timestamps                 : disabled

 

 

DCA should be enabled on multi-core processors whenever RSS is enabled. It allows NetDMA clients to indicate that destination data is targeted at a particular CPU cache, which is what we want for high-performance I/O like iSCSI.
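
If DCA shows as disabled, the standard netsh global switch should turn it on (it still needs chipset and BIOS support to actually take effect):

netsh int tcp set global dca=enabled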

ECN is Explicit Congestion Notification. It is a little more complex, as it tweaks the TCP protocol when sending a SYN, and it is mostly honored by routers and firewalls. I’d just leave it at the default setting and be done with it.
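
For completeness, the switch to change it explicitly takes enabled or disabled:

netsh int tcp set global ecncapability=enabled
netsh int tcp set global ecncapability=disabled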