Storage I/O Control (SIOC) Causing VM to Fail
Recently I encountered a strange situation: sharp at 2am, which is the backup window (the Acronis agent runs inside the VM), one of the VMs would occasionally stop functioning, and I had to reboot it in order to meet the RTO. CPU on the VM went to 100% for a few hours, and the machine gradually became non-responsive to ping.
However, I was still able to log in to the console, but I could not launch any program; a reboot worked fine though.
There were tons of red alert errors under the Event Log (System), most of them related to I/O problems; it looked as if a hard disk on the EqualLogic (EQL) SAN had bad blocks or similar.
Event ID: 333
An I/O operation initiated by the Registry failed unrecoverably. The Registry could not read in, or write out, or flush, one of the files that contain the system’s image of the Registry.
Event ID: 2019
The server was unable to allocate from the system nonpaged pool because the pool was empty.
Event ID: 50
{Delayed Write Failed} Windows was unable to save all the data for the file . The data has been lost. This error may be caused by a failure of your computer hardware or network connection. Please try to save this file elsewhere.
Event ID: 57
The system failed to flush data to the transaction log. Corruption may occur.
I couldn’t find the exact reason during the preliminary investigation, and an email exchange with EQL technical support returned nothing.
Event ID: 2019
Unable to read the disk performance information from the system. Disk performance counters must be enabled for at least one physical disk or logical volume in order for these counters to appear. Disk performance counters can be enabled by using the Hardware Device Manager property pages. The status code returned is in the first DWORD in the Data section.
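If you are chasing similar symptoms, the same events can be pulled straight from the command line on the affected Windows VM. A sketch using the built-in wevtutil tool (the event ID shown is just one of the IDs listed above; adjust as needed):

```shell
:: Dump the last 20 "Delayed Write Failed" events (Event ID 50)
:: from the System log, newest first, in readable text form
wevtutil qe System /q:"*[System[(EventID=50)]]" /f:text /c:20 /rd:true
```

Swapping the EventID in the XPath filter (333, 2019, 57, ...) makes it easy to see whether all the errors cluster inside the backup window.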
Then I suddenly noticed a vCenter alert about non-direct storage congestion on the volume where the VM resides. Right away I figured it was related to SIOC, and checking the I/O latency around 2AM confirmed it: SIOC had throttled the volume back once latency exceeded the 30ms congestion threshold during the backup window, and somehow that throttling caused the backup software (Acronis) and Windows to crash.
I disabled SIOC on that particular volume three days ago, and everything has run smoothly since, so it seemed I had solved the mystery.
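For reference, toggling SIOC on a datastore can also be scripted with PowerCLI instead of clicking through the vSphere Client. A sketch, assuming you are already connected to vCenter with Connect-VIServer; "EQL-Datastore01" is a hypothetical datastore name:

```shell
# Disable Storage I/O Control on the problem datastore (hypothetical name)
Get-Datastore -Name "EQL-Datastore01" |
    Set-Datastore -StorageIOControlEnabled:$false

# Verify the change took effect
Get-Datastore -Name "EQL-Datastore01" |
    Select-Object Name, StorageIOControlEnabled
```

The same Set-Datastore cmdlet can re-enable SIOC later with `-StorageIOControlEnabled:$true` once the root cause is confirmed.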
If you have encountered something like this, please do drop me a line, thanks!
Update: Dec 23, 2011
The same problem occurred again, so it is definitely not related to SIOC. The funny thing is that it happened at exactly the time the scheduled antivirus scan and the Acronis backup window started together. So I have moved the Acronis backup window to a later time, because I think these two I/O-intensive programs were competing with each other.
I do hope this is the root of the problem and will observe more.
Update: Jan 14, 2012
I think I have finally nailed the problem: no more crashes since Dec 23, 2011. Last night I observed a very interesting fact: the VM CPU went to 90% at 2am and stayed there for 15 minutes. Ah... I realized it is the weekly scheduled virus scan that was causing the huge I/O and latency; some of the services on this VM even stopped responding during the busy period.
So I’ve decided to remove the weekly scan completely, it’s useless anyway.
Update: Jan 18, 2013
The above change (removing the weekly scheduled virus scan) proves it was the cause of the problem after all.