Strange High Latency (Read) After Equallogic Firmware Upgrade (Solved!)
Today I upgraded the firmware on one of the PS6000XVs to the latest v5.2.2.
Everything went as it should: the VMs stayed solid as steel (no pings lost), the controller failover took about 10 seconds (i.e., pings to the group IP had a 10-second blackout), and the whole upgrade took about 10 minutes to complete, as expected.
Caution: Controller failed over in member myeql-eql01
Caution: Firmware upgrade on member myeql-eql01 Secondary controller.
Controller Secondary in member myeql-eql01 was upgraded from firmware version Storage Array Firmware V5.0.2 (R138185)
Caution: Firmware upgrade on member myeql-eql01 Primary controller.
Controller Primary in member myeql-eql01 was upgraded from firmware version to version V5.2.2 (R229536)
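The failover blackout on the group IP can be measured from a ping log rather than estimated by eye. The snippet below is a minimal sketch: the sample data is made up, and the (timestamp, replied) format is an assumption about how you would record your own ping results.

```python
def longest_blackout_s(samples):
    """Return the longest outage in seconds.

    samples: list of (timestamp_s, replied) tuples in time order.
    An outage runs from the first lost ping to the next successful reply.
    An outage that never recovers within the log is ignored.
    """
    longest = 0.0
    outage_start = None
    for ts, replied in samples:
        if not replied and outage_start is None:
            outage_start = ts  # first lost ping of a new outage
        elif replied and outage_start is not None:
            longest = max(longest, ts - outage_start)
            outage_start = None
    return longest


# Hypothetical log: one ping per second, replies lost from t=3 to t=12
ping_log = [(float(t), not (3 <= t <= 12)) for t in range(15)]
print(longest_blackout_s(ping_log))
```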
However, various problems started to occur after the upgrade, mainly high TCP Retransmit, high disk read latency, and a failed fan on the active controller module. In addition, the EQL battery temperature went up by 5 degrees compared to its original state. (Something going on in the background must be contributing to this rise.)
1. High TCP Retransmit (SOLVED)
The IOMeter benchmark dropped by almost 90% and high TCP Retransmit started to occur. I re-installed MEM on the ESX hosts and rebooted them, but the result was still the same.
Then I rebooted the PowerConnect 5448 switches one by one, and this solved the problem completely. But why would an Equallogic firmware upgrade require the switches to be rebooted? Was something cached in the switch, such as ARP or MAC entries? I really don’t know; maybe this is when we say “Wow, It Worked~ It’s Magic!”
2. High Disk Read Latency (Remains UNSOLVED)
This PS6000XV used to have latency below 6ms; it’s now 25-30ms on average. The funny thing is that whenever IOPS is extremely high, in the 9,000 range (I use IOMeter to push my array to its max), the latency becomes really low, in the 5ms range.
Conversely, whenever IOPS is extremely low, in the 5 to 70 range after I stop IOMeter, the latency jumps sky high into the 120-130ms range.
All of these measurements were taken with the Live View tool in the latest SAN HQ v2.2, which I really like!
The latency of all individual volumes added together is still 5-6ms, so where is the extra 20-something ms of HIDDEN latency coming from?
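A quick way to sanity-check this kind of gap is to combine the per-volume readings into one IOPS-weighted average and compare it with the member-level figure. The sketch below is only an illustration: the sample numbers are made up, and treating a weighted average as the right way to combine volumes is my assumption, since I don't know exactly how SAN HQ aggregates them.

```python
def weighted_avg_latency_ms(volumes):
    """IOPS-weighted average latency in ms.

    volumes: list of (iops, latency_ms) tuples, one per volume.
    Returns 0.0 when there is no I/O at all.
    """
    total_iops = sum(iops for iops, _ in volumes)
    if total_iops == 0:
        return 0.0
    return sum(iops * latency for iops, latency in volumes) / total_iops


# Hypothetical per-volume readings: (read IOPS, read latency in ms)
sample_volumes = [(1200, 5.0), (300, 6.0), (500, 5.5)]
print(round(weighted_avg_latency_ms(sample_volumes), 2))
```

If the member-level latency is far above this combined figure, as it was here, the extra time is being added somewhere outside the per-volume numbers.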
I contacted US EQL support, as local support has no clue whatsoever, as usual. The engineer told me it could be due to a metadata remapping process running in the background after the firmware upgrade, and that I would need to wait anywhere from a few hours to 24 hours for it to return to normal. To be honest, I’ve never heard of such a thing, nor could I find anything about it on Google (i.e., that disk metadata needs to be remapped after a firmware upgrade).
OK, things are still the same after almost 48 hours, so I doubt this is the problem; ps aux shows no such process running on the array.
Remember that my controller temperature also went up by almost 25%, indicating something is working the storage processor hard. Could this be another good indicator that my PS6000XV is still doing some kind of background metadata integrity check, and that it boosts this check whenever it senses IOPS is low, which is why we see the high latency?
Anyway, this problem remains a mystery. I don’t have any performance issue, so it can be ignored for the time being. I think only time will tell: when the background disk I/O completes its job, latency will hopefully go back to normal.
In fact, I hate to say this, but I strongly suspect it’s a bug in Equallogic firmware v5.2.2. If you have any idea, please drop me a line, thanks.
3. Fan of Active Controller Module Failed (SOLVED)
When the active controller failed over, Module 0 Fan1 went bad. It turned out to be a FALSE ALARM; the fan returned to normal after a second, manual failover.
Oh…the ONLY good thing to come out of the whole firmware upgrade is that TCP Retransmit is now 0% for 99.9999% of the time, and I do sense IOPS is 10% higher than before as well.
I saw a single spike of 0.13% only once in the past 24 hours. Um… IT’S JUST TOO GOOD TO BE TRUE and SOUNDS TOO SUSPICIOUS to me, as TCP Retransmit used to be in the 0.2% range all the time.
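For reference, a retransmit percentage like the ones quoted above can be derived from two counters. This is a minimal sketch that assumes the figure is simply retransmitted segments divided by sent segments over a sampling interval; whether SAN HQ computes it exactly this way is not something I have confirmed, and the counter values are made up.

```python
def retransmit_pct(retransmitted_segments, sent_segments):
    """Percentage of sent TCP segments that were retransmissions."""
    if sent_segments <= 0:
        return 0.0  # nothing sent in this interval
    return 100.0 * retransmitted_segments / sent_segments


# Hypothetical counter deltas over one sampling interval
print(retransmit_pct(13, 10_000))
print(retransmit_pct(20, 10_000))
```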
Update May 1, 2012
The mystery has been SOLVED finally after almost 1 complete month!
I disabled Delayed Ack on all the ESX hosts in the cluster, and after rebooting the hosts one by one, I saw the high latency issue disappear for good! It’s back to the normal 3.5-5.0ms range. (after 12:30pm)
The high read latency problem was indeed due to Delayed Ack, which is enabled by default on ESX 4.1. As Don Williams (EQL’s VMware specialist) also stated, Delayed Ack adds artificial (fake) extra latency to your Equallogic SAN, which is why we saw those false positives in SANHQ.
In other words, SANHQ was deceived by the fake latency number induced by Delayed Ack, which led to the strangeness of this problem.
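To see why a fixed ACK delay could show up as high latency at low IOPS yet almost vanish at high IOPS, here is a toy model. It is a sketch under loud assumptions, not a description of how ESX or the array actually behaves: I/Os arrive at random (Poisson) at the given IOPS, a held-back ACK is released either when the next I/O arrives or when a timer expires, and the 100 ms timer value (ack_delay_s) is a hypothetical figure picked for illustration.

```python
import math


def extra_latency_ms(iops, ack_delay_s=0.1):
    """Expected added wait per I/O in ms under the toy model.

    The wait is min(timer, gap to next I/O). With exponential gaps at
    rate `iops`, its expectation is (1 - exp(-iops * timer)) / iops.
    ack_delay_s is a hypothetical timer value, not a measured one.
    """
    if iops <= 0:
        return ack_delay_s * 1000.0  # no traffic: the full timer elapses
    return (1.0 - math.exp(-iops * ack_delay_s)) / iops * 1000.0


for iops in (5, 70, 9000):
    print(iops, round(extra_latency_ms(iops), 2))
```

In this model the added wait is tens of milliseconds at 5 IOPS and roughly a tenth of a millisecond at 9,000 IOPS. The numbers do not match the SANHQ readings exactly, but the shape is the same: the quieter the array, the more the delay dominates.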
It has nothing to do with our switch settings. Still, this doesn’t explain why EQL firmware v5.0.2 and earlier didn’t have this problem, so it might still be related to a firmware bug in v5.1.x or v5.2.x that triggers the high latency issue on ESX/ESXi hosts with Delayed Ack enabled (the default).
Finally, IOmeter shows a 10-20% increase in IOPS after upgrading to firmware v5.2.2 (with or without Delayed Ack disabled).
Again, I greatly appreciate the help from Equallogic’s technical support team. We did many direct WebEx sessions with them, and they were always patient and knowledgeable and know their stuff, especially Ben, the EQL performance engineer. He also gave me an insightful lecture on using SANHQ, and I learnt many useful little things that will help me troubleshoot my environment in the future!
And of course, thanks to Joe, who’s responsible for EQL social media. It proves Equallogic is a great company that cares about its customers, no matter large or small, which really makes me feel warm.