Thoughts about Dell Management Plug-In for VMware vCenter (DMPVV)

By admin, April 9, 2012 12:12 am

Honestly, I had to deploy DMPVV multiple times before I got it right.


1. You really need to read the Dell OpenManage Software Compatibility Matrix before installing any OM software, because the BIOS, iDRAC and Lifecycle Controller firmware must first be upgraded to the minimum versions stated in the guide.

2. DMPVV DOES NOT NEED a DHCP server

I even set up a W2K8 R2 DHCP server, but then found there is a menu option to configure a fixed IP for the appliance.

3. DMPVV cannot start in vCenter, complaining about some weird permission problem, something like Access Denied!

The problem was that I registered the vCenter in DMPVV using its IP address, but used the host name when I logged in with the vSphere client; after I changed the host name to the IP, everything worked. Probably my DNS isn't working properly; anyway, IP is fine in my case.

4. The Connection Profile doesn't work; it turned out you have to turn on Remote Enablement during the installation of OMSA on ESX.

The reason you see that error message about "OMSA is not installed" could be that when you installed OMSA, you didn't install it with the -c option, which installs the "Remote Enablement" component of OMSA. Our appliance talks to OMSA through its Remote Enablement layer. Without a successful connection to OMSA, the iDRAC connection will fail too, as we correlate the correct iDRAC IP with the server by getting the iDRAC IP from OMSA first. Please reinstall OMSA with the -c option and that should solve your issue. Once you pass the connection test from the connection profile, please make sure to run an inventory from the Job Queue by clicking "Run Now".

That indeed was the fix. The -c switch is listed in the user guide's command line reference, but there is no explanation of why it needs to be there, which is why I hadn't re-run the OMSA 6.4 installer with it. Perhaps the OMSA team could fold the -c switch into the -x (express) switch, so that Remote Enablement is automatically included? Also, according to the OMSA 6.4 install manual, the -x switch runs the express setup with all options included and ignores any other switches; apparently this is not true.

For OMSA 7.0

Run the following command to perform an express install with Remote Enablement parameters:
sh linux/supportscripts/srvadmin-install.sh -c -x

-c is for Remote Enablement
-x is for Express

Then start the applicable services by running the following command:
sh linux/supportscripts/srvadmin-services.sh start
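If the connection profile still fails afterwards, it's worth confirming that Remote Enablement actually came up before blaming the appliance. A quick sanity check on the ESX host (a sketch: srvadmin-services.sh also accepts a status argument, and the netstat line assumes the ESX 4.x service console; WS-MAN, which the appliance uses to talk to OMSA, normally listens on port 443):

# List the OMSA services and confirm the DSM SA services are running
sh linux/supportscripts/srvadmin-services.sh status
# Confirm something is listening on the WS-MAN port
netstat -tln | grep 443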

5. License Disappeared

Sometimes, after a reboot or a DMPVV reset to factory defaults, the license disappears and I have to re-deploy the whole thing. Well, the last successful re-install only took me 5 minutes, as I have done it over 6 times. :)

6. CANNOT contact iDRAC (SOLVED, but UNSOLVED for the time being)

My iDRAC subnet is on a separate switch, so the COS (Service Console) obviously can't reach it. This was by design, as I want to physically separate all network segments, and the iDRAC segment is not routable. I knew that for DMPVV to work by default, the COS and DRAC networks have to be on the same segment, which is NOT SECURE as far as I'm concerned. Why doesn't DMPVV give us an option to specify the subnet for iDRAC and add another network adapter for this purpose?

During my research, I also found that by pressing Alt+F2 and logging in as readonly with the default admin password, you can perform some network troubleshooting such as ping and tracepath.

Anyway, I still can't figure out a way to route the traffic from DMPVV to iDRAC via the vCenter server without using an L3 router or firewall device. Is it possible to use route add on the vCenter Windows server to redirect the DMPVV traffic to iDRAC? If you know, please let me know.
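For what it's worth, route add on the vCenter box alone shouldn't be enough: Windows would also have to forward packets between its two NICs, and the appliance itself would need a route pointing at vCenter. A sketch of the idea only, with all addresses hypothetical (192.168.10.0/24 = appliance/COS network, 192.168.20.0/24 = iDRAC network, 192.168.10.5 = vCenter's NIC on the appliance side), and the iDRACs would also need a gateway set so replies can come back:

rem On the vCenter Windows server: enable IP forwarding (takes effect after a reboot)
reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters /v IPEnableRouter /t REG_DWORD /d 1 /f

# On the appliance (assuming you can add a route at all; the readonly console may not allow it):
route add -net 192.168.20.0 netmask 255.255.255.0 gw 192.168.10.5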

So I was not able to test the firmware upgrade feature, but I am 100% sure it utilizes iDRAC's USC firmware update feature to fetch firmware from ftp.dell.com and then performs the upgrade in the background; it's the same as rebooting the server and pressing F10 for USC.

7. Service Temporarily Unavailable

The DMPVV web server kept crashing, probably because I gave it 1GB of RAM (reduced from 3GB); after I changed it to 2GB, it stopped crashing, but loading the host page is still extremely slow, around 2 minutes. Oh, DMPVV is a resource eater, fully using its 2GB of RAM and 100% of a CPU when it's connecting to the host.

The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later.

Apache/2.2.3 (CentOS) Server at xxx.xxx.xxx.xxx Port 443

8. Proxy to get firmware

This is the same issue as point 6 above; see, it was going to bite back anyway. There is no direct Internet connection from within DMPVV, so you have to use a proxy server to download firmware. A better way would be to add a 3rd network adapter connected to your external port; I hope this can be changed in the next release.

Finally, some good reading (3 parts in total) can be found on the Virtual Life Style web site.

One of the coolest features I want to highlight is the PXE-less provisioning of the hypervisor to a physical server. This uses a combination of the Lifecycle Controller and iDRAC to deploy an installation ISO to the server. And since it is really tightly integrated with the VMware stack, the host is added to vCenter and configured using Host Profiles automatically, resulting in a true zero-touch deployment of a server. How cool!

In fact, there is a how-to video on "Auto Discovery & Hypervisor Deployment – Dell Management Plug-In for VMware vCenter".

One last word to add: I do think Dell Management Plug-In for VMware vCenter (DMPVV) is simply a proxy between the ESX hosts and vCenter. Still, it should be made free, as all the features of DMPVV can be achieved by combining different Dell server management products; DMPVV is just a fancy toy that wraps all of them into a single product.

FYI, 12th-generation servers such as the R720/R620 don't have to use OMSA, as DMPVV is completely agent-less on them and no longer depends on OMSA agents within the ESX hosts.

PowerEdge R710 Firmware Update Causes PCI Interrupt Conflict, H700 RAID Disappeared!

By admin, April 8, 2012 11:33 pm

Dell Management Plug-in for vCenter (DMPVV) v1.5 was released on April 5th with 1 free host license, so I couldn't wait to test this management software. I read through the prerequisites and found that many components, such as the BIOS, iDRAC, RAID controller and Lifecycle Controller, are required to be updated to the latest versions before DMPVV can work properly.

So the next obvious thing was to go ahead and update my PowerEdge R710 server. Of course, there are many ways to update PowerEdge firmware while running ESX; Repository Manager is one of them, probably the best one, but today I found the connection to ftp.dell.com extremely fast (about 30-50M/s), so I took the shortcut and used F10 USC (Unified Server Configurator) to perform the update via FTP.

I pulled the firmware catalog (again, 2 months out of date, but new enough for DMPVV) and selected everything (BIOS, iDRAC, H700, Broadcom, LC) except iDRAC (v1.8), as I wanted to do that last, while I was still using the iDRAC Remote Console to perform the updates.

The whole thing took about 1 hour to complete the 12 steps. I noticed something funny when the last reboot started: the R710 complained about a Plug and Play Configuration Error, and the H700 RAID menu was completely gone! I booted into the BIOS and found the H700 was no longer listed in the available storage devices. What?

OK, was it something to do with the old iDRAC version again? I then used USC to update iDRAC to v1.8 and lost the console when iDRAC restarted. After iDRAC came back, I couldn't log in to the iDRAC web GUI. Damn! I called up the data center and asked the staff to change the iDRAC password (Ctrl + E); guess what? I still couldn't log in!

Babe! This left me no choice but to physically travel to the data center. After 30 minutes, I was sitting in front of the LCD beside my rack, scratching my head, unable to figure out why my H700 had disappeared and why I couldn't log in to iDRAC.

Called local Pro-Support (HK) and, as usual (this really makes me think NOT TO PURCHASE ANY MORE PRO-SUPPORT LEVEL SERVICE IN THE FUTURE), got useless advice: pull out the DRAC card and insert it again. They finally ordered a replacement H700 (to arrive in 1-2 hours) and support staff to come within 3-4 hours.

While chatting with the Pro-Support staff, I googled a bit and found someone with EXACTLY THE SAME SYMPTOM as mine! The problem: updating the two extra Quad Port 5709c Broadcom cards reset them to defaults with their option ROM Enabled (strangely, the on-board Broadcom LOMs remained Disabled). This ROM CONFLICTS with the H700's ROM, leading to PCI interrupt conflicts and blocking the H700 RAID configuration ROM from showing up! After disabling the Broadcom ROMs one by one (I had to do it 8 times), the H700 was back online, and the local Pro-Support staff told me they had never encountered such a thing before. OK, I must say that over the last 10 years I have solved my own hardware problems more often than they have; I can do a MUCH BETTER JOB and have more knowledge than they do, so why am I paying a premium for the Pro Support service level? @#$@#!!!@!!! No more! A loss for Dell!

For the strange iDRAC problem, the solution was to reset it to factory defaults and reconfigure the IP and password again.
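For the record, the same reset can be scripted with local racadm from the host OS instead of Ctrl + E, if racadm is installed. A sketch using the legacy iDRAC6 syntax (the IP/netmask/gateway values and password are examples; index 2 is the default root user slot):

# Wipe the iDRAC configuration back to factory defaults (the iDRAC will restart)
racadm racresetcfg
# Reconfigure a static IP (example addresses)
racadm config -g cfgLanNetworking -o cfgNicIpAddress 192.168.20.120
racadm config -g cfgLanNetworking -o cfgNicNetmask 255.255.255.0
racadm config -g cfgLanNetworking -o cfgNicGateway 192.168.20.1
# Set a new password on the default root user (index 2)
racadm config -g cfgUserAdmin -o cfgUserAdminPassword -i 2 MyNewPass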

So this was my 3rd nightmare experience performing a firmware upgrade on a PowerEdge 11th-generation server. I never thought a NIC ROM would conflict with the RAID card's ROM, and never thought iDRAC would block access after a firmware upgrade.

Later, I was able to reproduce the same symptom on another PowerEdge R710 with the same hardware configuration: the H700 disappeared, no more Ctrl + R menu, due to the Broadcom 5709 ROM conflicting with the H700 ROM!

The good thing after all the trouble is that I was finally able to use the Dell Management Plug-in for vCenter to discover the host and do things as it should.

One thing I noticed: R710 BIOS v6.0.7 has a new virtualization feature called SR-IOV, good for 10Gb/s cards and DCB. I don't have those, but it will be useful if I upgrade to 10Gb/s later.

Last but not least, I found vMotion no longer worked after the host exited Maintenance Mode; it complained the CPUs are not compatible with the source host when I tried to vMotion the VMs back so that I could upgrade the next ESX host. It turned out the latest BIOS v6.0.7 also updated the CPU microcode to Step B, so it's not the same as v2.0.9. So how was I going to migrate my VMs to this new host and perform the upgrade on my existing ESX host? Luckily, I had the luxury of powering down those few VMs for a short period and cold migrating them to the next host, so problem solved, haha... Ultimately, it would be great if my client had a 3-node cluster, of course.

Finally, I forgot that iDRAC is tightly integrated with the Lifecycle Controller, so I should upgrade iDRAC first, then the BIOS with all the other components such as the H700 RAID and Broadcom NICs; that is, assuming your iDRAC still allows you to log in to the web GUI, huh?!

In fact, I updated all the iDRAC firmware again to v1.85 today by uploading the .bin file directly via iDRAC's web GUI from a Windows host; it's much simpler and cleaner that way.

One more thing to take care of: you can't upgrade the iDRAC while you are still connected to the remote console using the USC method. It will perform the upgrade but report some kind of conflict, so use the alternative method above and you'll avoid this problem.

On a Windows system, it's a much easier job: just apply the firmware updates one by one in the same session, and after everything has been updated, reboot the server once.
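Dell Update Packages for Windows can also be run unattended, which makes the one-after-another approach scriptable. A sketch with a hypothetical package name (/s runs the DUP silently and /l writes a log; hold the reboot until the end):

rem Run each DUP silently, logging as you go (package names are examples)
R710_BIOS_WIN_X.Y.Z.EXE /s /l=C:\dell\bios.log
rem ...repeat for the other DUPs, then reboot once at the end
shutdown /r /t 0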

Anyhow, I would suggest using Dell Repository Manager for your next BIOS upgrade on PowerEdge 11th-generation servers.

How to Fit a 2.5″ Crucial M4 128GB SSD into a Dell OptiPlex 990 SFF with the Original 3.5″ Disk Attached

By admin, April 8, 2012 6:23 pm

I bought the OptiPlex 990 SFF 2.5″ HD kit (for 2 x 2.5″ hard disks, with a special split SATA power cable) from Dell almost 3 months ago, thinking of putting an SSD in my desktop for VMware Workstation VM storage.

I’ve been on/off thinking if I should buy myself another 2.5″ 500GB disk and clone everything over using Acronis just to use this 2.5″ HD Kit, but the problem is the original 3.5″ 500GB WD still carries a 5 years warranty, so why waste it?

If I keep the original 3.5″, then the 990 SFF doesn't have enough space to fit in any more hard disks. So the real headache is: how am I going to put that 2.5″ Crucial M4 SSD in my tiny 990 SFF case?

I’ve tried to put it everywhere, like above the CPU fan blower (yes, I knew it’s a hot zone), but it doesn’t fit somehow, looked again, luckily I found the perfect spot, it’s the bottom left corner, it fits 100% with no more or less space left, haha!

As the power supply is 90% efficient, it's not hot at all, and the Crucial SSD doesn't produce much heat either, as there are no moving mechanical parts involved. Tested with HWiNFO64, the temperature sensors show the same readings with or without this SSD added.

Another mission impossible has been successfully accomplished.

[Photo: the SSD in the bottom left corner of the 990 SFF case]

I moved the VMDKs over from my USB 3.0 LaCie disk to the Crucial SSD and fired up VMware Workstation v8.02. Wow! Windows 8 Preview boots in 6 seconds; I can't believe how fast it is!

Finally, I tested with the two most popular desktop SSD benchmark tools, and both show good results.

IOMeter also shows 4,200 IOPS under 4K, 60% random, 65% read, although strangely I can easily get 7,200 on the R710; probably the 990 SFF has much lower bandwidth on its on-board 6Gbps SATA connection. That's why it's called a desktop PC instead of a server. Still, at 4,200 IOPS it beats a single EqualLogic PS6000XV (14 x 15K SAS in RAID10) at only 0.5% of the cost. Sounds fantastic, right? Then why buy EqualLogic any more? Ha... if you run it a bit longer, you will see why your EqualLogic box pays off. Longer means 12-24 hours: you will see the single SSD drop to 1/10 of its IOPS while the EqualLogic sustains them all the way like a champ!
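If you want to reproduce a similar load without IOMeter's GUI, here is a rough, Linux-flavored fio equivalent; a sketch only (this job is 100% random rather than IOMeter's 60/40 random/sequential mix, and the target path is a placeholder):

# 4K blocks, 65% read / 35% write, queue depth 32, 5-minute run
fio --name=m4test --filename=/mnt/ssd/fio.test --size=4g \
    --bs=4k --rw=randrw --rwmixread=65 --ioengine=libaio \
    --iodepth=32 --runtime=300 --time_based --group_reporting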

Anyway, it’s more than enough for VMware workstation and I am totally satisfied with the result!

[Screenshots: CrystalDiskMark and AS SSD benchmark results]

To See is To Believe: Geminoid F in Hong Kong

By admin, April 8, 2012 5:57 pm

Geminoid F is currently in Hong Kong for exhibition.

I saw it from 1 meter away today; the facial expressions of Geminoid F are amazing. I would say it looks at least 65% real to me, especially when it sings.

Geminoid F reminds me that, years back, I wanted to be an expert in the electronics and mechanical engineering fields; making something like this was my ultimate goal back then. :)

[Photo: Geminoid F]

Dell Client System Update for Optiplex 990 SFF

By admin, April 7, 2012 4:29 pm

I used to go to Dell’s web site and download the firmware update one by one for my Optiplex 990 SFF.

When I asked Dell local support if there is a tool that can automatically scan my PC and download all the appropriate firmware updates together at once, the answer was that there is none (the standard answer, as usual).

So, as usual, I never believe such BS. I googled around and located this tool: Dell Client System Update.

It’s funny that if you work for Dell and your post is technical support for desktop, you should know this tools by heart, but the reality is always the opposite, sigh.

[Screenshot: Dell Client System Update]

Dell OpenManage Essentials (OME)

By admin, April 7, 2012 12:15 pm

Well, it’s the annual upgrade season, I have finally performed all the firmware updates on our server and storage, it’s kind of a nightmare (another article follow up shortly). Every time I upgrade the firmware on Poweredge, Powervault, PowerConnect or Equaulogic, it’s a “Press and Pray” session, YAKE!!!

So the old rule of thumb applies: if nothing is wrong with your machine, or you don't need that particular feature, DON'T DO IT!

Anyway, along the way, while performing all kinds of upgrades, I found Dell has quietly released two new tools.

Dell OpenManage Essentials and Dell OpenManage Power Center (which lets you measure data center power usage; it's not free, but has a 60-day trial). From what is described, Dell OpenManage Essentials is the next-generation monitoring and management tool, so you don't need that crappy DMC (oh... dead after only 2 revisions, as expected). Well, IT Assistant is fine though (long live ITA, I'm still using it; the latest version is v8.9).

In addition, it seems you can also integrate Repository Manager with OME to update firmware on ESX, so why do we pay for the Dell Management Plug-in for vCenter after all? Well, you may argue it's a unified solution that lets you do everything without leaving vCenter and looks cool, but who cares? As long as we get the job done, whatever is cost effective comes first in my book!

Btw, why does Dell have so many system management tools for the same goal, all overlapping each other? There is an Overview if you are interested. It took me almost forever to really get familiar with each of them: OpenManage, iDRAC, Lifecycle Controller, F10 USC, Repository Manager, DMC, IT Assistant, DMP for VC, and now Dell OpenManage Essentials. Can't Dell produce one for all? I really do hope OME is the final one that has everything integrated and WORKING (hopefully).

Dell Hardware Management Products
•Integrated Dell Remote Access Controller (iDRAC) with Lifecycle Controller (LC)
•Dell Chassis Management Controller (CMC) for blade servers
•Dell OpenManage Server Administrator (OMSA)
•Dell OpenManage Client Instrumentation (OMCI)
•Basic management utilities with IPMI

Dell Consoles
•Dell OpenManage Essentials (OME)
•Dell Management Console (DMC)
•Dell IT Assistant (ITA)
•Dell Remote Access Configuration Tool (DRACT)
•Dell OpenManage Power Center (OM PC)

Dell Services
•Managed Services
•Professional Services
•SaaS Management

Dell Tools and Utilities
•Update Utilities
–Dell Repository Manager (DRM)
–Dell OpenManage Server Update Utility (SUU)
–Dell OpenManage Systems Build and Update Utility (SBUU)
–Dell Update Packages (DUP)
–Dell Client Configuration Toolkit (CCTK)

•Customer Scripts and Processes
–Dell OpenManage Deployment Toolkit (DTK)
–RACADM
–IPMI

Tool Integration With Third Party Consoles
•Microsoft System Center Operations Manager (SCOM) Server Management Pack Suite
•Dell Lifecycle Controller Integration (DLCI) pack for Microsoft System Center Configuration Manager (ConfigMgr)
•Dell Server PRO Management Pack for Microsoft System Center Virtual Machine Manager (SCVMM)
•Dell Management Plug-in for VMware vCenter
•BMC Software

Connections With Third Party Consoles
•Dell OpenManage Connection for Computer Associates Network and Systems Management (CA NSM)
•Dell Smart Plug-in (SPI) for HP Operations Manager for Windows
•Dell OpenManage Connection for IBM Tivoli Netcool/OMNIBus
•Dell OpenManage Connection for HP OpenView NNM
•Dell OpenManage Connection for Tivoli Enterprise Console

Firefox iDrac Certificate Problem: sec_error_reused_issuer_and_serial

By admin, April 7, 2012 11:27 am

I found I could no longer use Firefox to access my iDRAC page after upgrading the iDRAC 6 firmware to the latest v1.85.

The particular message is “sec_error_reused_issuer_and_serial”.

Googled it a bit and removed cert8.db and key3.db; some iDRAC pages worked, some didn't, as the old and new iDRAC certificates share the same issuer and serial number.
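For reference, those two files live in the Firefox profile directory; with Firefox closed, you can clear them from a shell instead of hunting through Certificate Manager (Linux path shown, and the profile folder name varies; on Windows they sit under %APPDATA%\Mozilla\Firefox\Profiles). Firefox recreates both on the next start, though you lose all stored exceptions and client certificates:

# Close Firefox first, then remove the certificate and key stores
cd ~/.mozilla/firefox/*.default
rm cert8.db key3.db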

Finally, found this from the iDrac 6 v1.85 release note:

* iDRAC default certificate expire date changed to 2023, to get this updated certificate clear the “Preserve Configuration flag” option while updating iDRAC firmware through GUI. Make sure you delete cache from the GUI (IE as well as Firefox).

The Firefox web browser might encounter an error if the certificate contains the same serial number as another certificate. Use this link or the following procedure to resolve it.

Workaround:

Delete your old exception and use temporary exceptions for subsequent visits to the iDRAC page.

To delete your old exception:
1. In the Firefox window, click the "Firefox" button and then click "Options."
For Windows XP, click Tools and then “Options.”
For Linux OS, click “Edit” and then “Preferences.”

2. Select the “Advanced” panel.

3. Click on the Encryption tab.

4. Click “View Certificates” to open the Certificate Manager window.

5. In the Certificate Manager window click the “Servers” tab.

6. Identify the item that corresponds to the site that generates the error.
Note: The Certificate Authority (CA) for that server – the CA name appears above the site name.

7. Click on the server certificate that corresponds to the site that generates the error and press “Delete.”

8. Click OK when prompted to delete the exception.

9. Click the "Authorities" tab, select the item that corresponds to the CA that you noted earlier, and then press "Delete."

10. Click OK when prompted to delete the exception.

To add a temporary exception to allow access to the page:

When you go to the iDRAC page, you will be presented with an Untrusted error.
Click on the "I Understand the Risks" link at the bottom of the error.
Click on "Add Exception…" to open the Add Security Exception window.
Click "Get Certificate" to fill in the Certificate Status section of the Add Security Exception window.
Un-check the "Permanently store this exception" item.
Click "Confirm Security Exception" to close the Add Security Exception window.

The iDRAC page will now load.

Dell Management Plug-In for VMware vCenter (v1.5) is FREE!

By admin, April 5, 2012 4:40 pm

Yes, it’s FINALLY Free (for 1 host only though) and the manual can be found here, and it’s available for download now (It’s a 2GB zip and I have to use Filezilla directly as the browser download always breaks and can’t resume), but that’s more than enough, simply manually switch host one after another as all I want is to update the firmware of Poweredge servers running ESX. :)

I really do hope Dell will make the Dell Management Plug-In for VMware vCenter completely free of charge. It can only be used on Dell PowerEdge servers, not IBM's or HP's, so its users are essentially Dell customers anyway, and it's a management tool that everyone needs!


Dell Management Plug-In for VMware vCenter Feature List

Inventory Detail: complete PowerEdge server details

  • Memory – Quantity and Type
  • NIC
  • PSU
  • Processors
  • RAC
  • Warranty Info
  • Server- and Cluster-level Views

BIOS and Firmware Update: deploy BIOS and firmware

  • Baselines and Templates
  • Updates staged from VMware vCenter

Built-in Deployment Wizard: Dell servers show up as bare-metal servers

  • Set configs for BIOS and firmware updates

Profile Settings and Templates

  • RAID
  • Server Name
  • IP Address

Hypervisor Templates: for ESX and ESXi 4.1 and later releases

Alert Control Level

  • Set Manual Approval Requirement
  • Allow the Tool to Automatically Remediate

Online Warranty Info: server warranty info via VMware vCenter

  • Service Provider
  • Warranty Type
  • Service Dates on Server or Cluster Level

Cluster Level: overview of Dell servers

  • High Level Summary
  • Expanded View
  • Firmware
  • Warranty
  • Power
  • Reports
    • Sortable
    • Filterable
    • Exportable to CSV format

Veeam vs vRanger, The Battle is On, Round 1, KO!

By admin, April 5, 2012 4:15 pm

Round 1, KO!

I saw Anton from Veeam KO vRanger last year on VMTN. Although it wasn't professional, I like to see the pros and cons of both products, and consumers have the right to know the truth about the negative side of each product.

Today I came across several articles posted by vRanger citing Veeam's drawbacks and pitfalls and calling Veeam "a small company from Russia" (that's a bit over the line and mean, really). Well, I would say they are pretty much true, and I am not sure if those issues have been taken care of in B&R v6, as I am still using v5. Anton, if you see this, please let me know whether v6 has addressed the following.

Btw, I know you react really quickly to Veeam posts around the web. I am really not sure how you find all of them so quickly; is it by magic? :)

Criticism of Veeam from vRanger can also be found here and there.

Don’t worry Anton, I have confident in Veeam and I am sure you don’t mind this can be discussed in public.

How a Poorly-Designed Architecture for Data Backup will Undermine a Virtual Environment — A Close Look at Veeam by Jason Mattox
Posted by kellyp on Jun 1, 2010 1:00:00 AM

At Vizioncore, we do not often cite our competitors by name in public. Our philosophy is that it is our job to provide expertise on virtual management requirements and the capabilities offered by Vizioncore for addressing those requirements.

However, members of our team also do have occasion to look in depth at competitive products. When the result is a fact-based assessment of how a competitor's approach contrasts with Vizioncore, it seems to serve the larger community to put the information in public. The purpose of this is to help members of the community to better understand the real differences in approach and to better appreciate the value built into the Vizioncore product portfolio.

In this case, the competitor that we looked at is a small company called Veeam. Veeam offers an all-in-one product for backup, replication and recovery of VM images. They are privately held, based in Russia, and report having about 6K customers as of now. This compares to Vizioncore’s 20K+ customer level as announced in March 2010, with Vizioncore operating as a wholly-owned subsidiary of Quest Software. Quest is a public company, obligated to provide audited reporting of company financials.

The comparison and contrast between Veeam’s implementation of image-based backup and restore and vRanger Pro 4.5 appears below. This analysis is written by Jason Mattox, one of the co-inventors of the original vRanger Pro product. Jason continues to provide guidance and direction to new versions of Vizioncore technology products, including vRanger Pro 4.5.

We hope that you find Jason’s comments and insights educational. In a part 2 of this posting, we will offer more details on how vRanger Pro 4.5 and the Data Protection Platform (DPP) foundation on which it is built, contrasts with Veeam’s approach.

**************************************************

I have had the opportunity to take a deep, first-hand look at the Veeam 4.1.1 product over the last few weeks. My personal opinion – admittedly biased – is that they have a product built on a poor foundation. The problems with their architecture – and the potential result of the data protection not operating well and actually undermining an organization’s virtual environment – include the following:

Pseudo service-based architecture: You install the product, it installs itself as a service, and you think, "okay, good, it's a service-based architecture." But it's not. Here is a simple test you can do on your own to prove the product is not a full service-based architecture. Start a restore job. Then log off Windows; the product will ask you: "Are you sure?" This is because if you log off Windows, it will cancel your running restores, since they are not running through the service. Another test you can try is to attempt a backup and a restore at the same time; you cannot. If the product were a true service-based architecture, your backup and restore jobs would just be submitted to the service, and the product wouldn't care about the two functions running at the same time.

Lack of data integrity in the backup archive: Create a job in their product that contains, for example, 20 VMs. When you back up all the VMDKs, Veeam puts all of the backup data into a single file. Also, when you run incremental backups, they update this single large file directly.

When you have a single large file that needs to be updated, the chance of corruption is high. Database manufacturers know this; products like SQL and Exchange write all their changes to log files first, and then, on a controlled event, they post the changes to the single large file, the DB. Veeam does not implement this best practice, but rather updates a single 30, 40 or even 500 GB file directly, instead of staging the data next to the file and then posting the data to the file once successful.

This is their Synthetic Full implementation – the entire basis for their product – and why we object to it so strenuously in terms of the risk that it introduces into customer environments.

Their argument in favor of Synthetic Full appears to have been that it enables backups to be faster. We believe that there are other, better methods available for speeding backup which do not risk the integrity of the backup repository. Methods including Active Block Mapping (ABM), now shipping in vRanger Pro 4.5. In beta test environments, our testers have reported that vRanger Pro backup is far faster than Veeam. However, your mileage will vary and we welcome more test reports from organizations testing both products.

Another argument in favor of Synthetic Full which has been offered by Veeam, is that it helps speed restore. Again, we agree with the goal but not with the method used to get there.

In vRanger Pro, we offer a Synthetic Restore process which has been in the product for some time. Our restore has been faster than Veeam’s for as long as we’ve been aware of Veeam. Our performance on restore was also improved in the 4.5 release, to be even faster than before.

Problems with updating a single file in the backup repository: Those of you familiar with database implementations – and the very good reasons for staging updates rather than writing them directly – will understand some of these problems immediately. This approach is especially problematic for image-based backup, and I’d like to offer some reasons as to why:

Tape space requirements – Because the original file is updated with every backup pass, the entire file must be written to tape every time. There is no method offered for moving just the new data to tape. This makes the 'sweep-to-tape' process lengthy, and increases the number of tape cartridges required significantly. Tape management is, likewise, more difficult. The process of locating tapes, scanning to find data, and performing restore is, likewise, more difficult and lengthy.

Problems working with Data Domain de-duplication storage and similar storage appliances – Because the original file is amended with every backup pass, the appliance cannot be efficient in de-duplicating and replicating the backup data.

Finding and restoring individual VMs from the backup job – Because the backup file includes more than one VM, it is not named intuitively to enable easy browse and restore of the VMs required by the admin.

Overhead in the process of creating and managing simultaneous backup and recovery jobs – it's just harder to do: In their product, if you create a single backup job of, let's say, 30 VMs, they will back up one VM at a time. To perform more backup jobs at one time, you must create more jobs. For each job, you must step through the entire backup wizard, which is time-consuming. The same holds true for restore jobs: for each VM, you must step through the entire restore wizard to create and submit a job to restore a VM. This isn't that bad most of the time, but for disaster recovery scenarios, or in situations in which entire ESX servers must be rebuilt, this simply isn't that practical.

Feature called de-dupe is really something else: De-dupe in their product is not true de-dupe, but is perhaps better described as a template-based backup. Here's what they do: they define a base VM – this being a set of files or blocks typically found in a VMDK – and they use this as the comparison for their full backup of the VM. For example, if you have two Windows guests, then they do not have to back up the Windows configuration, because it is already in the base VM template.

However, there are some important limitations of their approach which include:

Their de-dupe is only good within the backup job. The more jobs you create, the less beneficial the de-dupe is because blocks are duplicated between and among backup jobs. If you need to create more backup jobs to gain better backup performance across multiple VMs, then the de-duplication benefit goes down.

Their de-dupe is defined with a base VM – and does not change with the configuration of the guests. If you have two Exchange servers being protected in the same job, then all of the blocks for the Exchange configuration will be included twice – even if they are identical.

Our own implementation of de-duplication is pending delivery later this year. We have developed true global, inline deduplication designed to offer maximum de-duplication benefits. It’s in test now. Our architecture, which includes keeping backup files intact and untouched once written into the repository, has been a key in enabling our de-duplication to function with true de-dupe capabilities.

Lack of platform scalability: To scale out their product in virtual application mode, LAN-free mode, ESXi network or ESXi replication, they have to install their product many times. To make it possible to manage all of the deployments, they offer an API layer and provide an ASP.net web page so that customers can check job status for their many installs. This console does not allow you to create or edit jobs, but is a monitor. They call this their Enterprise console.

ESXi replication is network-exhaustive:   In their implementation for ESXi, their product reads from the vStorage API over the network uncompressed to their console, then it writes the resulting data over to the other target ESXi host over the vStorage API uncompressed.

What’s wrong with this? In the first place, the vStorage API was not designed for the WAN link; it was designed for backups which were meant for the LAN. The other issue is that the traffic is uncompressed; WAN links are not cheap so compression is a key feature that’s needed. Also, if you look at the resources needed for this, just a single replication job can consume 50-80% CPU of a 2 CPU VM. So if you think about how you would scale this out from a bandwidth and installation point of view, this doesn’t seem practical.

Use of unpublished VMware API calls:  If you have ever used the Datastore browser from the vCenter client, this process uses an internal API that’s not exposed to 3rd parties called the NFC. Here is what they have done: they are impersonating the vCenter client and using the internal NFC API to work with VMs.

So, here’s the risk:  VMware may trace a reported problem with a VM back to a 3rd party product that is using an unpublished vCenter API by impersonating the vCenter client. Will VMware be okay with this? Might VMware get a little strange with you and their ability to support you and your environment?

If you want to verify this for yourself, just look in the logs of their product for "NFC", and look at the target datastore for files that are not VMware-related. Ask them how they transfer and modify files in the datastore that are not your normal VMware files.

Why the stakes are high in virtual environments: Virtual environments are some of the fastest-growing and most dynamic environments in the world. As virtual servers continue to gain momentum in terms of their adoption rate, administrators are presented with the big challenge of keeping ever-expanding virtual resources monitored and under efficient management. At Vizioncore, we want to enable this momentum to continue by offering data protection and management capabilities which are purpose-built for images, with foundational capabilities designed to ensure that protection methods are — and remain — affordable, resource-efficient, and easy to use and operate. No matter how large your virtual environment grows.

Update April 7, 2012

Got a reply from Veeam’s support, their suggest is to use Full Backup instead of Synthetic Backup method, kind of avoid answer my question directly, well, this part does suck anyway. After all. nobody want to copy that hundred GB of vbk everyday after the backup session over a 100Mbps or eve 1Gbps link. Another problem is Full Backup tends to keep a lot more retention copies than Synthetic, this uses more space as well.

During the first run of a forward incremental backup (or simply incremental backup), Veeam Backup & Replication creates a full backup file (.vbk). At subsequent backups, it only gets changes that have taken place since the last performed backup (whether full or incremental) and saves them as incremental backup files (.vib) next to the full backup.

Incremental backup is the best choice if company regulation and policies require you to regularly move a created backup file to tape or a remote site. With incremental backup, you move only incremental changes, not the full backup file, which takes less time and requires less tape. You can initiate writing backups to tape or a remote site in Veeam Backup & Replication itself, by configuring post-backup activities.
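To picture why this matters for offsiting: with forward incremental, only the small daily file crosses the link, whereas a .vbk that changes with every run has to be copied whole each time. An illustrative sketch of a forward incremental chain on disk (file names and sizes are made up):

# One full, created once, then small daily increments
Job2012-04-01.vbk    # full backup, e.g. 400 GB
Job2012-04-02.vib    # daily changes only, e.g. 10-20 GB; this is all that moves offsite
Job2012-04-03.vib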

Strange High Latency (Read) After Equallogic Firmware Upgrade (Solved!)

By admin, April 3, 2012 2:30 pm

I performed the firmware upgrade today on one of the PS6000XVs, to the latest v5.2.2.

Everything worked as it should: the VMs stayed solid as steel (no pings lost), the controller failover took about 10 seconds (i.e., pings to the group IP blacked out for about 10 seconds), and the whole upgrade took about 10 minutes to complete, as expected.

Caution: Controller failed over in member myeql-eql01

Caution: Firmware upgrade on member myeql-eql01 Secondary controller.
Controller Secondary in member myeql-eql01 was upgraded from firmware version Storage Array Firmware V5.0.2 (R138185)

Caution: Firmware upgrade on member myeql-eql01 Primary controller.
Controller Primary in member myeql-eql01 was upgraded from firmware version to version V5.2.2 (R229536)

However, various problems started to occur after the upgrade: mainly high TCP retransmits, high disk (read) latency, and a failed fan on the active controller module. Besides, the EQL battery temperature also went up by 5 degrees compared to its original state. (Something going on in the background is contributing to this rise, for sure.)

1. High TCP Retransmit (SOLVED)

The IOMeter benchmark dropped by almost 90% and high TCP retransmits started to occur. I re-installed MEM on the ESX hosts and rebooted; still the same.

Then I rebooted the PowerConnect 5448 switches one by one, and this solved the problem completely. But why does an EqualLogic firmware upgrade require the switch gear to be rebooted? Was something cached in the switch, ARP, MAC? I really don't know; maybe this is the time to say, "Wow, it worked! It's magic!"

2. High Disk Read Latency (Remains UNSOLVED)

This PS6000XV used to have below-6ms latency; it's now 25-30ms on average. The funny thing is that whenever the IOPS are extremely high, in the 9,000 range (I use IOMeter to push my array to its max), the latency becomes really low, in the 5ms range.

Vice versa: whenever the IOPS are extremely low, in the 5 to 70 range after I stop IOMeter, the latency jumps sky high, into the 120-130ms range.

All of this was observed using the latest SAN HQ v2.2 Live View tool; I like it a lot!

All the individual volume latencies added together still come to 5-6ms, so where is the extra 20-something milliseconds of HIDDEN latency coming from?

Contacted US EQL support, as local support had no clue whatsoever, as usual. He told me it could be due to a metadata remapping process going on in the background after the firmware upgrade, and that I needed to wait somewhere from a few hours up to 24 for it to get back to normal. To be honest, I've never heard of such a thing, nor can I google anything about it (i.e., disk metadata needing to be remapped after a firmware upgrade).

OK, things are still the same after almost 48 hours, so I doubt this is the problem; ps aux shows no such process going on in the array.

Remember, my controller temperature also went up by almost 25%, indicating something is working the storage processor heavily. So could this be an additional indicator that my PS6000XV is still doing some kind of background metadata integrity checking, boosting that checking process whenever it senses the IOPS are low, which is why we see the high latency?

Anyway, this problem remains a mystery. I don't have any actual performance issue, and it can be ignored for the time being; I think only time will tell, once the background disk I/O thing completes its job and latency hopefully returns to normal.

In fact, I hate to say this, but I strongly suspect it's a bug in EqualLogic firmware v5.2.2. If you have any idea, please drop me a line. Thanks.

3. Fan of Active Controller Module Failed (SOLVED)

When the active controller failed over, Module 0 Fan 1 went bad. It turned out to be a FALSE ALARM; the fan returned to normal after a 2nd manual failover.

Oh... the ONLY good thing out of the whole firmware upgrade is that the TCP retransmit rate is now 0% for 99.9999% of the time, and I do sense the IOPS are 10% higher than before as well.

I saw a single spike of 0.13% only once in the past 24 hours. Um... IT'S JUST TOO GOOD TO BE TRUE, and it SOUNDS TOO SUSPICIOUS to me, as the TCP retransmit rate used to be in the 0.2% range all the time.

Update May 1, 2012

The mystery has FINALLY been SOLVED, after almost one complete month!

I disabled Delayed Ack on all my ESX hosts in the cluster. After rebooting the hosts one by one, I watched the high latency issue disappear for good! It's back to the normal 3.5-5.0ms range (after 12:30pm).
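For anyone chasing the same thing: on ESX/ESXi 4.1 the setting lives in the iSCSI initiator's advanced options (vSphere Client: Configuration, Storage Adapters, iSCSI Software Adapter, Properties, Advanced, un-tick DelayedAck), followed by a host reboot. There is also a CLI route; the line below is the syntax commonly quoted for 4.x, but verify it against VMware KB 1002598 for your build before trusting it, and substitute your own vmhba number:

# Disable Delayed Ack on the software iSCSI adapter (syntax to be verified per KB 1002598)
vmkiscsi-tool -W -a delayed_ack=0 -j vmhba33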

[SANHQ screenshot: latency back in the normal range after 12:30pm]

The high read latency problem was indeed due to Delayed Ack, which is enabled by default on ESX 4.1. As also stated by Don Williams (EQL's VMware specialist), Delayed Ack adds artificial (or fake) extra latency to your EqualLogic SAN's numbers; that's why we saw those false positives in SANHQ.

In other words, SANHQ was deceived by the fake latency numbers induced by Delayed Ack, which is what made this problem so strange.

It’s nothing to do with our switch setting, but still this doesn’t explain why EQL firmware v5.0.2 or before doesn’t have this problem, so it might still related to a firmware bug in v5.1.x or v5.2.x that triggered the high latency issues in those ESX/ESXi hosts with Delayed Ack enabled (by default).

Finally, IOMeter shows a 10-20% increase in IOPS after upgrading to firmware v5.2.2 (actually, with or without disabling Delayed Ack).

Again, I greatly appreciate the help from EqualLogic's technical support team. We did many direct WebEx sessions with them, and they were always patient and knowledgeable; they know their stuff, especially Ben, the EQL performance engineer, who also gave me an insightful lecture on using SANHQ. I've learned many useful little things that will help me troubleshoot my environment in the future!

And of course, thanks to Joe, who's responsible for EQL social media. It proves EqualLogic is a great company that cares about its customers, no matter how large or small; this really makes me feel warm. :)
