Strange High Latency (Read) After EqualLogic Firmware Upgrade (Solved!)

By admin, April 3, 2012 2:30 pm

I performed a firmware upgrade today on one of the PS6000XVs, taking it to the latest v5.2.2.

Everything worked as it should: the VMs stayed solid as steel (not a single ping lost), the controller failover took about 10 seconds (i.e., pings to the group IP blacked out for about 10 seconds), and the whole upgrade took about 10 minutes to complete, as expected.

Caution: Controller failed over in member myeql-eql01

Caution: Firmware upgrade on member myeql-eql01 Secondary controller.
Controller Secondary in member myeql-eql01 was upgraded from firmware version Storage Array Firmware V5.0.2 (R138185) to version V5.2.2 (R229536)

Caution: Firmware upgrade on member myeql-eql01 Primary controller.
Controller Primary in member myeql-eql01 was upgraded from firmware version V5.0.2 (R138185) to version V5.2.2 (R229536)

However, various problems started to occur after the upgrade, mainly high TCP retransmits, high disk (read) latency, and a failed fan on the active controller module. Besides that, the EQL battery temperature went up by 5 degrees compared to its original state (something going on in the background is surely contributing to this rise).

1. High TCP Retransmit (SOLVED)

The IOMeter benchmark result dropped by almost 90% and high TCP retransmits started to occur. I re-installed MEM on the ESX hosts and rebooted; still the same.

Then I rebooted the PowerConnect 5448 switches one by one, and this solved the problem completely. But why should an EqualLogic firmware upgrade require the switch gear to be rebooted? Was something cached in the switches (ARP or MAC tables)? I really don't know; maybe this is the time we say "Wow, it worked! It's magic!"

2. High Disk Read Latency (UNSOLVED at first; see the May 1 update below)

This PS6000XV used to show latency below 6 ms; it's now 25-30 ms on average. The funny thing is that whenever IOPS is extremely high, in the 9,000 range (I use IOMeter to push my array to its max), the latency becomes really low, in the 5 ms range.

Vice versa, whenever IOPS is extremely low, in the 5 to 70 range after I stop IOMeter, the latency jumps sky high, into the 120-130 ms range.

All of these measurements were taken with the latest SAN HQ v2.2 Live View tool, which I really like!

All the individual volume latencies added together still come to only 5-6 ms, so where is the extra 20-something milliseconds of HIDDEN latency coming from?

I contacted US EQL support, as local support had no clue whatsoever, as usual. The engineer told me it could be due to a metadata remapping process running in the background after the firmware upgrade, and that I needed to wait anywhere from a few hours to 24 hours for it to return to normal. To be honest, I've never heard of such a thing, nor can I find anything about it on Google (i.e., that disk metadata needs to be remapped after a firmware upgrade).

OK, things are still the same after almost 48 hours, so I doubt this is the problem; ps aux shows no such process running on the array.

Remember that my controller temperature also went up by almost 25%, indicating something is working the storage processor heavily. So could this be another indicator that my PS6000XV is still doing some kind of background metadata integrity check whenever it senses IOPS is low, ramping that process up and producing the high latency we see?

Anyway, this problem remains a mystery. I don't have any actual performance issue, so it can be ignored for the time being, and I think only time will tell: once the background disk I/O thing completes its job, the latency will hopefully return to normal.

In fact, I hate to say this, but I strongly suspect it's a bug in EqualLogic firmware v5.2.2. If you have any ideas, please drop me a line. Thanks!

3. Fan of Active Controller Module Failed (SOLVED)

When the active controller failed over, Module 0 Fan 1 reported as bad. It turned out to be a FALSE ALARM; the fan returned to normal after a second manual failover.

Oh, the ONLY good thing to come out of the whole firmware upgrade is that TCP retransmit is now 0% for 99.9999% of the time, and I sense IOPS is about 10% higher than before as well.

I saw a single spike of 0.13% only once in the past 24 hours. Um... IT'S JUST TOO GOOD TO BE TRUE, and it sounds suspicious to me, as TCP retransmit used to be in the 0.2% range all the time.

Update May 1, 2012

The mystery has finally been SOLVED, after almost one complete month!

I disabled Delayed ACK on all the cluster's ESX hosts, and after rebooting the hosts one by one I watched the high latency issue vanish for good! Latency is back to the normal 3.5-5.0 ms range (after 12:30 pm).

[SAN HQ Live View screenshot: "fixed"]

The high read latency problem was indeed due to Delayed ACK, which is enabled by default on ESX 4.1. As Don Williams (EQL's VMware specialist) also stated, Delayed ACK adds artificial (or fake/extra) latency readings to your EqualLogic SAN, and that's why we saw those false positives in SANHQ.

In other words, SANHQ was deceived by the fake latency numbers induced by Delayed ACK, which is what made this problem so strange.
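
To get a feel for how Delayed ACK fakes out the reported numbers, here is a toy model (every figure below is my own assumption, not anything pulled from SAN HQ): the array's latency clock keeps running until the initiator ACKs the data, and a nearly idle iSCSI session only ACKs once its delayed-ACK timer fires, while a busy session ACKs almost immediately.

# Toy model only: assumed 5 ms true service time and a 100 ms delayed-ACK timer.
# With roughly Poisson arrivals, an ACK waits out the full timer only when no
# other packet arrives inside the window to trigger an immediate ACK.
awk 'BEGIN {
  svc_ms = 5; ack_ms = 100
  n = split("5 50 500 5000", iops_list, " ")
  for (i = 1; i <= n; i++) {
    iops   = iops_list[i]
    p_idle = exp(-iops * ack_ms / 1000)
    printf "%5d IOPS -> reported latency ~ %.1f ms\n", iops, svc_ms + ack_ms * p_idle
  }
}'

Nothing scientific about it, but it matches the shape of what SAN HQ showed: sky-high latency when the array is nearly idle, normal latency the moment you push it.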

It has nothing to do with our switch settings. Still, this doesn't explain why EQL firmware v5.0.2 and earlier didn't show the problem, so it might still be related to a firmware change (or bug) in v5.1.x/v5.2.x that triggers the high latency readings on ESX/ESXi hosts with Delayed ACK enabled (the default).

Finally, IOMeter shows a 10-20% increase in IOPS after upgrading to firmware v5.2.2 (with or without Delayed ACK disabled, actually).

Again, I greatly appreciate the help from EqualLogic's technical support team. We did many direct WebEx sessions with them, and they were always patient and knowledgeable; they really know their stuff, especially Ben, the EQL performance engineer, who also gave me an insightful lecture on using SANHQ. I've learned many useful little things that will help me troubleshoot my environment in the future!

And of course Joe, who is responsible for EQL social media. It proves EqualLogic is a great company that cares about its customers, no matter how large or small, and that really makes me feel warm. :)

24 Responses to “Strange High Latency (Read) After EqualLogic Firmware Upgrade (Solved!)”

  1. mel says:

    I was contemplating whether to upgrade the firmware on our bunch of production EQL 6010 arrays from 5.0.7 to 5.2.2. Now, after reading your post, I think I will hold back for a while. What tempts me to upgrade is that the secondary controller reboots for no apparent reason. It's a bug; too many bugs, one after another. I've had too many bad experiences with EQL bugs. Thanks for your post and advice; I've been reading your blog for a while, since we got our EQL a year ago. Dilemma now.

    Btw, did you perform a staged upgrade to the major release first before applying the minor release 5.2.2, or did you plunge directly from 5.0.2 to 5.2.2?

  2. admin says:

    I upgraded directly from v5.0.2 to v5.2.2, and yes, I waited for this major firmware release for over a year. :)

    To be fair, v5.0.2 didn't have any issues, not even minor ones, over the past 18 months. Generally speaking, I would say EQL firmware is quite reliable, and their US tech support certainly knows what they are talking about.

    The only serious firmware fault was v5.0.0; EqualLogic had to contact all of their clients around the globe to fix that issue within the week. It was quite scary at the time, but since then EQL firmware has been thoroughly tested, and I have confidence in their product.

    I also read today that someone upgraded their EQL firmware using the Group Manager GUI and suddenly the controller wouldn't fail over, so the controllers ended up with different firmware versions. That's bad! In fact, I experienced this once myself on v4.3.7. The solution? Quite easy: manually upgrade the firmware again using TFTP and the CLI. :)

    The reason I need firmware above v5.1 is for a future upgrade to vSphere 5.1.

    Anyway, I am not too worried about the high latency on the PS6000XV. Somehow it's below 20 ms now after 3 days, so probably the APLB (Automatic Performance Load Balancing) feature introduced in v5.1 is working hard in the background, even though this pool only has one member. Well, maybe consider it a bug... haha.

  3. Danny says:

    I just checked my 6000XV after reading your post, but I don't notice such problems. Maybe it's because I only have 8 VMs on 2 ESXi hosts, so the load is not high.

    Danny

  4. Niels says:

    We had the same problem with one unit when we upgraded from 5.0.2 to 5.1.2. The latency went from 4-5 ms to 40 ms and stayed static at that number. Increasing or decreasing the number of LUNs/VMs didn't change a thing.

    After opening a case, with a lot of troubleshooting but all in all useless advice from Dell, we decided to migrate the VMs off that unit, downgrade back to 5.0.8, and do some intensive testing.

    The latency increase went away as soon as we downgraded back to 5.0.8. Even under intensive testing we never reached the previously elevated read latency. After that we upgraded again to 5.1.2 and the latency stayed at its normal readings. We haven't tried any higher firmware since then. Reading this post about 5.2.2 does raise the question of whether we will face the same problem again.

    If you can, try downgrading and upgrading again and see if this will also resolve your problem. I have a feeling they changed something in their firmware regarding the latency formula.

  5. admin says:

    Niels, thanks for reporting a similar case; I shall wait for the next release instead of downgrading for the time being.

    Btw, it's back to below 20 ms, but still nowhere near the 5 ms it was before.

  6. Darking says:

    Hi.

    I saw the exact same change in read latency when I upgraded all my arrays to 5.1.2 (note: not 5.2.2).

    I never managed to find out what caused it, and support could not find anything wrong either.

    Darking

  7. admin says:

    It seems I am not the only one on the list.

    Could this hidden background latency be caused by APLB (Automatic Performance Load Balancing, introduced in v5.1)?

  8. Darking says:

    I would think so, and that was the support engineer's guess too.

    The volume latency is in the 5-10 ms range, and I only see the high latency at the group and array level.

    As you previously mentioned, under load the latency of the arrays goes down; I did extensive testing on this, and it seems to be true.

    I have also done a whole bunch of testing on my hosts, and I cannot detect any sort of latency issue at all, so I suspect it's merely the way the arrays report latency back to SANHQ, or that volume and array latency simply aren't comparable.

  9. admin says:

    I am still waiting for EQL L2 support's feedback on this issue, as they need to look at my EQL diags and SANHQ archive in detail, but I now tend to think it's a v5.x bug, as I am seeing more and more people with high latency after upgrading to v5.x.

    See this similar case on the Veeam forum:

    “Alex, what firmware version are you running on the EqualLogic? I think the latest 5.2.1 seems to be a bit buggy, coz after upgrading the latency under SAN HQ has gone through the roof and Veeam backup performance has dropped from 70-130MB/sec to around 3-5MB/sec average. Oddly this only appears to be from the Veeam (Win2k8 R2) server, ESXi 5.0 latest updates seems to run fine. “

    Also on VMTN.

    One group I had was running 4.3.8 for a long time and this one showed nearly NO latency. Till this weekend, where I upgraded this group to 5.2.1. Take a look at SANHQ yourself:

    You can see the significant rise at the end of the graph.

    This is weird, eh? Now, does SANHQ display wrong latency information with 5.2.1, or did SANHQ display wrong latency information with 4.3.8? Or does FW 5 CAUSE higher latency?

  10. Derek says:

    We too have seen the same thing with high read latency. We are running about 100 volumes on 2 PS300s with QLogic HBAs. All was fine when we were using 4.3.7; however, after updating to 5.1.2 the high latency readings in SAN HQ 2.2 started. When there is very low activity we even see latencies of up to 6000 ms! However, when we drill down to the individual volume readings it's more or less normal, ranging from 6 or 7 ms to 80 ms.
    We raised a case with EQL and they dug into it. Their conclusion was that it was probably SAN HQ. If we were really experiencing latencies of a couple of seconds like SAN HQ suggested, we would be having big problems. It's interesting to note, however, that we are using QLogic HBAs in all our physical and virtual servers. The old ones (4010s in physical servers) show the high latencies when there's low IO; the new ones (4062s) don't. The old ones even show average IO rates of 0 and latencies of 0. Might SAN HQ be misinterpreting this when calculating average values? We do have doubts about our overall performance though, especially with Exchange 2003, and are investigating that at the moment together with EQL.

  11. JoeSatDell says:

    Hi, it's Joe with Dell EqualLogic in the USA. Regarding your issue #2 and the high disk latency: if you are still having the issue, I would suggest re-opening your support case and asking to have it escalated to the EqualLogic Performance Engineering team to take a closer look.

    Thank you

  12. admin says:

    Joe, thanks for your message. Btw, my case (#407415) is currently being reviewed by John Reguera, an L2 EQL Enterprise Technical Support Consultant. He mentioned on Apr. 15th, "Any type of performance issue can be time consuming to resolve due to the complexity of the environment. Your patience is appreciated.", so I shall wait a bit longer.

    Anyway, I do think it’s a firmware bug in v5.1.x/v5.2.x as I can see more and more EQL customers having this similar high latency issue after upgrading the firmware to v5.1 or above.

    The good thing is that it may be a false alarm, as I see no performance issue at all; in fact, the latest firmware v5.2.2 delivers 10% more IOPS than before.

  13. admin says:

    Update: after talking to an EqualLogic Performance Engineer, he suggested I turn off Delayed ACK in ESX.

    I also found that Don suggested the same, and he actually mentioned the artificial latency (in SANHQ?).

    http://commweb-ps3.us.dell.com/support-forums/storage/f/3775/t/19427573.aspx

    Based on that data you don't have a SAN performance problem. I suspect you have Nagle and/or Delayed ACK enabled on the server. This can cause artificial latency because the server is holding on to packets and delaying ACKs on writes

    http://en.community.dell.com/support-forums/storage/f/3775/t/19434500.aspx

    Delayed ACK and LRO should definitely be turned off. Delayed ACK / LRO aren’t “broken” but in iSCSI environments they can add artificial latency to IO.

    Finally, I found another PDF, "Best Practices for VMware on EQL Conf 10-2010".

    On page 58, it says Delayed ACK adds:

    False Performance Reporting
    – File Copy in the COS or Guest OS is Deceptive
    – Queues Aren’t Used Fully
    › Latency, Rather Than Throughput Measured
    › Delayed ACK Latency Worse When TCP Session Isn’t Heavily Used

    But it also suggested "Should Always Be Enabled When It Can Be". I think that's a typo, right? It should always be Disabled.

    Finally, this still doesn't explain why the latency was normal prior to the FW v5.2.2 upgrade, even though Delayed ACK was already enabled back then.

  14. Sal says:

    Hi there,

    I have the EXACT same problem as you are reporting. We use our EQLs to support our MS SQL 2008 cluster environment.

    I upgraded our PS6000XV from 5.0.2 to 5.2.2 and everything seems great, except that in SAN HQ the READ latency is around 100 times higher on average than normal (normal being before the firmware upgrade).

    Also, just as you experienced, when the SAN is under load, e.g. transaction log backup, the average latency goes down substantially. This correlates with the theory that the read latency isn’t “actually” affected for normal operations.

    I think this is a bug in 5.2.2. One other thing my technician has told me to do is upgrade the Dell HIT to 4.0; currently I am running 3.4.2. Before I go ahead and take the downtime to do this, can anyone tell me whether this is pointless and won't solve my problem?

    What version of the Dell Host Integration Tools are you guys running?

    FYI, I have also opened a support case with Equallogic: CASE 423614

  15. admin says:

    I am using HIT 4.0 for Windows, but the problem is related to VMware ESX, as my Windows platform hardly uses the EQL box.

    Anyway, I do think it's a bug in FW v5.1.x/v5.2.x, but the funny thing is that Don (EQL's VMware specialist) posted on VMTN saying that enabling Delayed ACK causes artificial latency. This explains why we don't have any actual performance issue but only see the fake high latency in SANHQ.

    I shall schedule a time slot for disabling Delayed Ack in ESX and report back my findings later.

  16. Darking says:

    The whole Nagle / Delayed ACK thing is a wild goose chase.

    I was told exactly the same, and I turned it off on my entire VMware environment, and it had no effect at all.

    In my mind, something in the reporting of performance numbers changed in the firmware between 5.0.7 and 5.1.2 that causes SANHQ to report elevated values. As previously stated, I'm not seeing any latency issues on the volumes, nor anything on my VMware or other directly connected hosts.

    It's a shame they cannot figure out what it is, but I suppose they are limited by what statistics/diags the arrays show them.

  17. Darking says:

    Oh, and since Dell is looking into these, my support case number is Case #00321924; Tom Dewey (Technical Support Engineer) looked into my case.

  18. admin says:

    Update:

    The mystery has finally been SOLVED, after almost one complete month!

    For details, please refer to the update I’ve added to the end of this blog article.

  19. Darking says:

    Really odd that it didn't fix my issue; I suppose our loads might just be a bit different. :)

  20. admin says:

    Darking, I would suggest you re-open the case with EqualLogic support and escalate it to their Performance Engineers.

    Btw, I just found out that disabling Delayed ACK on ESX also improved IOPS (there's more on VMTN about this), and that Delayed ACK somehow gets re-enabled by default in ESXi 5, which is also mentioned somewhere on VMTN.

  21. Darking says:

    Yeah, I think I need to, because I just rechecked my entire environment, and everything but the Linux servers has TCP delayed ACK disabled, and I haven't really found any instructions for that anywhere.

  22. Don Williams says:

    Re: Derek and the 4010s with high latency at low IO. This one I can answer, since I've seen it for years. The QLogic cards tend to have the Nagle algorithm enabled. During low IO periods, IOs are held until more IOs are pending. The array is waiting for the ACK while the timer keeps running, so the reported latency goes up. During normal IO, Nagle isn't in effect. If you Google for Nagle delay you'll find a number of articles about it.

    Here's some info on how to check it and disable it using SANsurfer:

    http://www.jwertheimer.com/help/qlogic/iSCSI_Help/configuring_an_HBAs_firmware_values.htm

    Re: Delayed ACK. It's not a one-size-fits-all solution; there are so many variables with IO loads, etc.

    I can say that it has helped many customers.

    Also, you have to run vm-support after the reboot. I've noticed with ESXi v5 that it doesn't consistently set the value to "0". So if you saw no change, it may not actually be disabled, or only some of the LUNs will show it disabled.

    If you untar the vm-support file and go to the "commands" sub-directory, do a: # grep Delayed * | more

    You'll get a list of "nodes" that refer to the LUNs. Each line will end with ="0" or ="1". One (1) means enabled.

    Sometimes I've had to remove the discovery address, reboot, add it back in, then select "Advanced" and scroll down to verify that "Delayed ACK" doesn't have a check next to it.

    Rescan and then re-run vm-support to check the settings again.
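
    Put together, the check above looks roughly like this (the bundle and directory names are assumptions; they vary by ESX version and host):

    # after the reboot, generate a fresh support bundle on the host
    vm-support
    # untar it and look in the "commands" sub-directory
    tar xzf vm-support-*.tgz
    cd vm-support-*/commands
    # each LUN "node" line ends with ="0" (Delayed ACK disabled) or ="1" (enabled)
    grep Delayed * | more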

  23. Darking says:

    Thanks Don, I will have a look into this.

    Btw, I've seen related high latency issues with QLogic cards; in our case it was a special type of card that was the only one available for IBM pSeries 520s and 550s.

    They did not allow for any management of the HBA, and support ended up recommending we switch to normal Intel NICs instead, which worked perfectly fine for our needs. I guess it's one of the dangers of vendor-specific hardware: functionality can sometimes be limited compared to just buying a generic QLogic HBA. pSeries servers will not boot/enable with a non-IBM labelled adapter though (we tried!!).

  24. admin says:

    This is what was included in the latest EqualLogic Customer Connection (Feb 2013) Question of the Month:

    Q: In VMware ESX environments, why does EQL suggest disabling Delayed ACK and Large Receive Offload (LRO)?

    A: With a virtual environment, you want all data to be acknowledged and available quickly. When you use any receive offload technology, you are filling a big buffer before interrupting the CPU for processing. This “delay” can add artificial latency. As a result, ESXi v5.x will sometimes generate latency alarms on Datastores when Delayed ACK and/or LRO are enabled. In SANHQ you can see this effect. During very low I/O periods, the reported latency is high. Conversely, when under moderate load, the latency is low.
