PowerEdge R710 Firmware Update Causes PCI Interrupt Conflict, H700 RAID Disappeared!
Dell Management Plugin for vCenter (DMPVC) v1.5 was released on April 5th with a free license for one host, so I couldn’t wait to test this management software. I read through the prerequisites and found out that many components, such as the BIOS, iDRAC, RAID controller, Lifecycle Controller, etc., have to be updated to the latest version before DMPVC can work properly.
So the next obvious thing was to go ahead and update my PowerEdge R710 server. Of course there are many ways to update the firmware on a PowerEdge running ESX. Repository Manager is one of them, probably the best one, but today I found the connection to ftp.dell.com was extremely fast (about 30-50M/s), so I took the shortcut and used the F10 USC (Unified Server Configurator) to perform the update via FTP.
I pulled the firmware catalog (again, it’s 2 months out of date, but new enough for DMPVC) and selected everything (BIOS, H700, Broadcom, LC) except iDRAC (v1.8), which I wanted to leave until last because I was using the iDRAC Remote Console to perform the updates.
The whole thing took about 1 hour to complete the 12 steps. I noticed something funny when the last reboot started: the R710 complained of a Plug and Play Configuration Error and the H700 RAID menu was completely gone! I booted into the BIOS and found the H700 was no longer listed among the available storage devices. What?
OK, was it something to do with the old iDRAC version again? I then used USC to update iDRAC to v1.8 and lost the console when iDRAC restarted. After iDRAC came back, I couldn’t log in to the iDRAC web GUI. Damn! I called the data center and asked the staff to change the iDRAC password (Ctrl + E). Guess what? I still couldn’t log in!
Babe! This left me no choice but to physically travel to the data center. After 30 minutes, I was sitting in front of the LCD beside my rack, scratching my head, unable to figure out why my H700 had disappeared and why I couldn’t log in to iDRAC.
I called local Pro-Support (HK) and, as usual, got useless advice (this really makes me think I should NOT PURCHASE ANY MORE PRO-SUPPORT LEVEL SERVICE IN THE FUTURE). They asked me to pull out the DRAC card and reinsert it, and finally ordered a replacement H700 (to arrive in 1-2 hours) and a support engineer to come within 3-4 hours.
While chatting with the Pro-Support staff, I googled a bit and found someone who had EXACTLY THE SAME SYMPTOM as mine! The problem is that updating the two extra quad-port Broadcom 5709c cards reset them to defaults with the ROM Enabled (the LOM, i.e. the on-board Broadcom, remained Disabled though, strange). This ROM CONFLICTS with the H700’s ROM, leading to PCI interrupt conflicts and blocking the H700 RAID configuration ROM from showing up! After I disabled the Broadcom ROM port by port (I had to do it 8 times), the H700 was back online.
The local Pro-Support staff told me they had never encountered such a thing before. OK, I must say that over the last 10 years I have solved more of my own hardware problems than they have. I can do a MUCH BETTER JOB and I know more than they do, so why am I paying a premium for the Pro Support service level? @#$@#!!!@!!! No more! I’ve lost faith in Dell!
For the strange iDRAC problem, the solution was to reset it to factory defaults and reconfigure the IP and password.
So this is my 3rd nightmare experience performing a firmware upgrade on a PowerEdge 11th generation server. I never thought a NIC ROM would conflict with a RAID card’s ROM, and never thought iDRAC would block access after a firmware upgrade.
Later, I was able to reproduce the same symptom on another PowerEdge R710 with the same hardware configuration, i.e., the H700 disappeared and there was no more Ctrl + R menu because the Broadcom 5709 ROM conflicts with the H700 ROM!
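Since the update silently flipped the boot ROM setting on every add-in NIC port, one way to catch this is to write down the settings before the update and compare them afterwards. Below is a minimal Python sketch of that comparison. The port names and the `changed_settings` helper are made up for illustration, and the values are assumed to be recorded by hand from the Broadcom configuration screens, not read from any Dell tool.

```python
# Hypothetical helper: compare hand-recorded option ROM settings
# taken before and after a firmware update.
def changed_settings(before, after):
    """Return {port: (old, new)} for every port whose setting differs."""
    return {
        port: (before.get(port), after.get(port))
        for port in sorted(set(before) | set(after))
        if before.get(port) != after.get(port)
    }

# Two quad-port cards give 8 ports; these names are illustrative only.
PORTS = [f"card{c}-port{p}" for c in (1, 2) for p in (1, 2, 3, 4)]

before = {port: "Disabled" for port in PORTS}
after = {port: "Enabled" for port in PORTS}  # reset to default by the update
before["LOM"] = after["LOM"] = "Disabled"    # on-board NIC stayed Disabled

for port, (old, new) in changed_settings(before, after).items():
    print(f"{port}: {old} -> {new}")
```

With the values above, the eight add-in ports are reported and the LOM is not, which matches the eight ROMs that had to be disabled again.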
The good thing after all the trouble is that I was finally able to use Dell Management Plugin for vCenter to discover the host and do things as it should.
One thing I noticed is that R710 firmware v6.0.7 has a new virtualization feature called SR-IOV, good for 10Gb/s cards for DCB, but I don’t have those. It will be useful if I upgrade to 10Gb/s later.
Last but not least, I found vMotion no longer worked after the host exited Maintenance Mode. It complained that the CPUs were not compatible with the source host when I tried to vMotion the VMs back so I could upgrade the next ESX host. It turned out the latest BIOS v6.0.7 also updated the CPU microcode to Step B, so the host no longer matches one still running v2.0.9. So how was I going to migrate my VMs to this new host and perform the upgrade on my existing ESX host? Luckily, I had the luxury of powering down those few VMs for a short period of time and cold migrating them to the next host, so problem solved, haha… Ultimately it would be great if my client had a 3-node cluster, of course.
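The decision I ended up making can be written down as a tiny Python sketch. It assumes, as a simplification based only on what happened here, that a BIOS version mismatch between the two hosts is what breaks vMotion. The `plan_migration` function and its return values are hypothetical and not part of any VMware or Dell API.

```python
# Hypothetical planning helper; BIOS versions are plain strings
# noted from each host, not queried from vCenter.
def plan_migration(source_bios, target_bios, can_power_off):
    """Pick a migration method for VMs moving between two hosts."""
    if source_bios == target_bios:
        # Same BIOS level: assume live vMotion still works.
        return "vmotion"
    # Mismatched BIOS (and CPU microcode): only a cold migration is left,
    # and that needs a short outage for the VMs.
    return "cold-migrate" if can_power_off else "blocked"

print(plan_migration("2.0.9", "6.0.7", can_power_off=True))
```

Running the same check before starting the first host would at least tell you in advance that a maintenance window for the VMs is needed.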
Finally, I forgot that iDRAC is tightly integrated with the Lifecycle Controller, so I should have upgraded iDRAC first, then the BIOS together with all the other components such as the H700 RAID and Broadcom NICs. That depends on whether your iDRAC still lets you log in to the web GUI, that is, huh?!
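The ordering lesson above can be expressed as a short Python sketch that splits the selected components into two batches: iDRAC on its own first, then everything else. The `update_batches` name and the component labels are illustrative only and do not correspond to USC or Repository Manager identifiers.

```python
# Hypothetical ordering helper based on the lesson above:
# iDRAC goes first, the remaining components follow in a second batch.
def update_batches(components):
    """Return [first_batch, second_batch] with iDRAC isolated up front."""
    first = [c for c in components if c == "iDRAC"]
    rest = [c for c in components if c != "iDRAC"]
    # Drop empty batches so callers only see work that actually exists.
    return [batch for batch in (first, rest) if batch]

selected = ["BIOS", "iDRAC", "H700", "Broadcom", "LC"]
for step, batch in enumerate(update_batches(selected), start=1):
    print(f"batch {step}: {', '.join(batch)}")
```

This is the reverse of what I did, where iDRAC was deliberately left until last.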
In fact, I updated all the iDRAC firmware again to v1.85 today by uploading the .bin file directly via iDRAC’s web GUI from a Windows host. It’s much simpler and cleaner to do it that way.
One more thing to take care of: you can’t upgrade iDRAC with the USC method while you are still connected through the remote console. It will perform the upgrade but report some kind of conflict, so using the alternative method above solves this problem.
On a Windows system, it’s a much easier job: just apply the firmware updates one by one in the same session, and reboot the server once after everything has been updated.
Anyhow, I would suggest using Dell Repository Manager to do the job for your next BIOS upgrade on Poweredge 11th generation servers.