Yesterday was one long fight: different parts started to fall apart within a few hours. First it was one of the ESX hosts, then the EqualLogic array, and finally Group Manager and iDRAC problems. People say Shxt happens, and that fits my case exactly!
One thing I’m glad about is that Dell fulfilled its promise this time and fixed everything within the 4-hour ProSupport contract window (hardware-wise, of course). The poor guy had to go to the NOC twice and worked until almost midnight, with me working remotely.
So let me list them in order:
1. ESX Host:
I suddenly received a host failure alert. vCenter showed the problem host as disconnected, and all the VMs on it were greyed out. The funny thing is that all the VMs were still pingable and functioned perfectly normally, as if nothing were wrong.
Telnet, SSH and even the console hung completely; there was no way to log in as root, and OpenManage wouldn’t load. Later I found out from the iDRAC system log that a 15K 146GB disk had failed in a RAID1 configuration.
Worse still, the replacement disk did not start to rebuild. Dell’s technician went into the MegaRAID BIOS utility and found he had to add the disk back manually. I suspect the problem is that the replacement disk is a Fujitsu whereas the faulty disk was a Hitachi, which is why they didn’t work together initially. (They should in theory, but in reality NO.)
At this stage, since there was no way to take the live VMs off the host or do a vMotion, I had no choice but to power down the host manually. Even stranger, HA didn’t kick in: none of the VMs restarted on other hosts in the cluster, even after 5 minutes.
The whole rebuild took about 15 minutes, thanks to RAID1. The rebuild status in OpenManage stayed at 33% the whole time, even after the disk light stopped blinking (meaning the rebuild had completed). Funny! After another reboot, the Optimal status could be verified in the MegaRAID BIOS and was reflected in OpenManage later, so it seems OpenManage takes time to fetch the status from the different hardware components.
I still have no clue why a faulty disk in a RAID1 set caused the ESX host to become non-responsive.
2. EqualLogic:
I received the following notice multiple times via email. Group Manager showed it as Information type, in green, but I sensed something must be wrong, so I called Dell EQL support. As expected, the local support knew nothing about it.
—————————————–
INFO event from storage array eql01
subsystem: SP
event: 14.2.22
time: Fri Jan 10 12:10:30 2014
I/Os containing bad blocks were read from drive 10 and successfully reconstructed in the last 8 minutes.
—————————————–
After approximately 6 hours, the following failure alert confirmed my earlier worry.
—————————————–
ERROR event from storage array eql01
subsystem: SP
event: 14.4.22
time: Fri Jan 10 20:11:15 2014
Disk drive 10 failed in RAID LUN 0.
—————————————–
So the previous notice was actually EQL’s predictive failure detection in action!!!
SANHQ also generated a similar alert:
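If you want to catch this kind of early warning automatically instead of trusting the green INFO colour, here is a minimal sketch in Python. It assumes the alert mails look exactly like the two pasted above; the function names and the "bad blocks" keyword match are my own guesses, not anything Dell documents.

```python
import re
from datetime import datetime

# Header line as seen in the mails above, e.g. "INFO event from storage array eql01"
HEADER_RE = re.compile(r"^(?P<severity>[A-Z]+) event from storage array (?P<array>\S+)$")
DRIVE_RE = re.compile(r"\bdrive (\d+)\b", re.IGNORECASE)

WARNING_MAIL = """INFO event from storage array eql01
subsystem: SP
event: 14.2.22
time: Fri Jan 10 12:10:30 2014
I/Os containing bad blocks were read from drive 10 and successfully reconstructed in the last 8 minutes."""

FAILURE_MAIL = """ERROR event from storage array eql01
subsystem: SP
event: 14.4.22
time: Fri Jan 10 20:11:15 2014
Disk drive 10 failed in RAID LUN 0."""


def parse_event(text):
    """Turn one alert mail body into a dict (layout assumed from the samples)."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    header = HEADER_RE.match(lines[0])
    if header is None:
        raise ValueError("not an EQL event mail: %r" % lines[0])
    event = {"severity": header.group("severity"), "array": header.group("array")}
    message = []
    for ln in lines[1:]:
        key, sep, value = ln.partition(": ")
        if sep and key in ("subsystem", "event", "time"):
            event[key] = value
        else:
            message.append(ln)  # anything else is the free-text message
    event["message"] = " ".join(message)
    if "time" in event:
        event["time"] = datetime.strptime(event["time"], "%a %b %d %H:%M:%S %Y")
    drive = DRIVE_RE.search(event["message"])
    event["drive"] = int(drive.group(1)) if drive else None
    return event


def is_early_warning(event):
    """Assumption: an INFO event that mentions bad blocks deserves attention."""
    return event["severity"] == "INFO" and "bad blocks" in event["message"].lower()
```

Feeding it WARNING_MAIL flags drive 10 as an early warning, while FAILURE_MAIL comes back as a plain ERROR event for the same drive. Hooking this up to a mailbox is left out on purpose.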
Warning conditions:
- 1/10/2014 8:10:50 PM to 1/10/2014 8:12:50 PM
- Warning: Member eql01 RAID Set Is Degraded
- Warning: Member eql01 RAID set is degraded because a disk drive failed or was removed.
- Warning: Member eql01 RAID More Spares Expected
- Warning: Member eql01 The current RAID configuration requires more spare drives then are currently available.
- Warning: Member eql01 has a failed drive in slot 10
With the replacement disk inserted, reconstruction started immediately and took about 1 hour to complete, again thanks to RAID1.
3. EQL Group Manager
As I needed to verify that the replaced EQL disk had successfully become a hot spare, I found out I could no longer log in to EQL Group Manager due to some strange Java error, whether in IE or Firefox. The installed Java version was v7 u45; I tried different versions until I figured out that only v7 u17 worked. My conclusion is that the EQL firmware plays a big role here: I am still on v5.2.2, so EQL probably hard-coded the requirement into their application. Anyway, the Java JRE version always causes nasty problems in my environment one way or another, so I’ve decided not to upgrade it for sure.
4. iDRAC
Back to the disconnected host with the faulty disk: I found I could no longer log in to the iDRAC web UI. IE worked but produced all sorts of problems, not to mention that the console didn’t show up at all with its ActiveX stuff. I even tried removing the iDRAC cert from the advanced options and rebooting the managed machine, which didn’t help at all. It turned out a simple content cache clear in Firefox solved the problem completely! Ridiculous, really!
If it still doesn’t work, do a soft reset with “racadm racreset soft”.
5. Veeam
Yes, it’s not finished yet. I also found that Veeam’s scheduled jobs had stopped working, as I am still using v5.0.1. There is a Veeam KB and an update (v5.0.2) for this issue, but I can’t explain why it worked for 3+ years and then suddenly stopped for no reason. So I removed all the old backups and created a new full backup. Time will tell by tomorrow morning, when I shall verify the scheduled job again.
Update: I had to install the update to solve the problem of the scheduled job not running. Also remember to close all the extra TCP/UDP ports that get re-enabled by the upgrade of the Veeam B&R program. (Potential risk: Veeam Agent, NFS and Windows shares in particular.)
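To double-check which ports are actually reachable after the upgrade, a quick TCP connect test from another machine is enough. Below is a rough sketch in Python. The host name in the usage comment is a placeholder of mine, and the port numbers are just the standard portmapper, SMB and NFS ports, not Veeam’s own port list (check the Veeam documentation for that). UDP ports cannot be tested this way.

```python
import socket


def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, or host not resolvable: treat as closed.
        return False


def open_ports(host, ports, timeout=2.0):
    """Return the subset of ports that accept a TCP connection."""
    return [p for p in ports if tcp_port_open(host, p, timeout)]


# Usage (hypothetical host name; 111 = portmapper, 445 = SMB, 2049 = NFS):
#   for port in open_ports("veeam-server.example", [111, 445, 2049]):
#       print("still open:", port)
```

A port that shows up here is reachable through the firewall from wherever you run the check, so run it from the network segment you are worried about.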
Updated:
Restarting the management agents on ESX may help:
- Log in to your ESX server as root (via su -) from either an SSH session or directly from the console of the server.
- Type “service mgmt-vmware restart” and press Enter.
Caution: Ensure Automatic Startup/Shutdown of virtual machines is disabled before running this command or you risk rebooting the virtual machines.
- Type “service vmware-vpxa restart” and press Enter.
- Type “logout” and press Enter to disconnect from the ESX host.