Saturday, December 15, 2007

PowerPath 5.0.2 was indeed the problem

After downgrading to PowerPath 4.5, the problems disappeared.
We rebooted the SPs more than 30 times during this maintenance window.
My conclusion after this bad experience is that the EMC ELab support matrix
is wrong. We won't trust it again, and we will look to move
away from PowerPath and adopt MPxIO, after rigorous testing of course.

So there were two problems happening randomly, caused by PowerPath 5.0.2 under Solaris 10 update 3 with all EMC-recommended Solaris patches and kernel settings. They occurred more often under I/O load, which I generated with a script that used dd in a loop to read /dev/random and write to files on the LUNs. The two problems were:
- After the owning SP reboots (causing the LUNs to trespass) and comes back up,
the LUNs should trespass back to the owning SP, but they don't, although the OS log
messages say the owning SP reappeared on the fabric. At that point, it is as if PowerPath
were triggering a hang in the ssd driver, because even commands accessing the local drives
on the SunFire480 (Fibre Channel drives) hang, and I'm not able to kill them even with
kill -9. We have to reboot the OS to fix it, and even the reboot hangs, so we panic the OS at the OBP with the sync command. That also gives us a crash dump, which Sun analyzed, concluding that the hang was coming from the PowerPath driver.
- UFS errors related to UFS logging, while the first trespass occurs from the owning SP to the other one. You have to unmount the volume and mount it again to fix this.
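
For anyone who wants to generate a similar I/O load while testing trespass behaviour, here is a minimal sketch of the kind of load generator described above. It is not the original script (that one was a shell loop around dd); this is a Python stand-in, and the function name, file names, and default sizes are all made up for illustration. It reads from /dev/random by default, as in the test described above; /dev/urandom may be a better choice if /dev/random blocks on your system.

```python
import os


def write_random_files(target_dir, passes=1, files_per_pass=4,
                       block_size=8192, blocks_per_file=16,
                       source="/dev/random"):
    """Hypothetical load generator: copy random data into files under target_dir.

    target_dir is assumed to be a directory on a filesystem that lives on the
    LUN under test. Returns the total number of bytes written.
    """
    total = 0
    with open(source, "rb") as src:
        for _ in range(passes):
            for i in range(files_per_pass):
                # File names are reused on every pass, so disk usage stays bounded.
                path = os.path.join(target_dir, f"ioload_{i}.dat")
                with open(path, "wb") as dst:
                    for _ in range(blocks_per_file):
                        block = src.read(block_size)
                        dst.write(block)
                        total += len(block)
                    # Push the data to the device instead of leaving it in cache.
                    dst.flush()
                    os.fsync(dst.fileno())
    return total
```

Calling it in a loop with target_dir pointing at a mount point on the LUN (for example a hypothetical /mnt/lun0) keeps writes flowing while the SPs are rebooted. The fsync call is there so the writes actually reach the array rather than sitting in the page cache.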

I escalated the case with EMC support; they have a hard time admitting they have a problem.
They haven't yet updated the ELab to remove PowerPath 5.0.2 + Solaris 10 from it.
