Saturday, December 15, 2007

PowerPath 5.0.2 was indeed the problem

After downgrading to PowerPath 4.5, the problems disappeared.
We rebooted the SPs more than 30 times during this maintenance window.
My conclusion after this bad experience is that the EMC E-Lab support matrix
is wrong. We won't trust it again, and we will look to move
away from PowerPath and adopt MPxIO, after rigorous testing of course.

So there were two problems, happening at random, caused by PowerPath 5.0.2 under Solaris 10 update 3 with all EMC-recommended Solaris patches and kernel settings. They happened more often under I/O load; to generate I/O I used a script that runs dd in a loop, reading /dev/random and writing to files on the LUNs. The two problems:
- After the owning SP reboots, causing the LUNs to trespass, and then comes back up, the LUNs should trespass back to the owning SP. They don't, although the OS log messages say the owning SP reappeared on the fabric. At that point it is as if PowerPath were triggering a hang in the ssd driver, because even commands accessing the local drives on the Sun Fire 480 (Fibre Channel drives) hang, and I'm not able to kill them even with kill -9. We have to reboot the OS to fix it, and even the reboot hangs, so we panic the OS at the OBP with the sync command. That also gives us a crash dump, which Sun analyzed, concluding that the hang was coming from the PowerPath driver.
- UFS errors related to UFS logging, when the first trespass occurs from the owning SP to the other one. You have to umount the volume and mount it again to fix this.
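
For anyone who wants to generate a similar I/O load, here is a rough sketch of that kind of loop. It is not the script I used (that one was dd reading /dev/random). This version is Python and takes its random bytes from os.urandom instead. The function name, file names and default sizes are all made up for illustration.

```python
import os


def generate_io(target_dir, passes=1, files_per_pass=4,
                block_size=8192, blocks_per_file=16):
    """Write random data to files under target_dir, return bytes written.

    Hypothetical helper: target_dir stands for a filesystem mounted on
    one of the LUNs. Files are overwritten on every pass.
    """
    total = 0
    for _ in range(passes):
        for i in range(files_per_pass):
            path = os.path.join(target_dir, "ioload_%d.dat" % i)
            with open(path, "wb") as out:
                for _ in range(blocks_per_file):
                    out.write(os.urandom(block_size))
                # Force the data down to the device, not just the page cache.
                out.flush()
                os.fsync(out.fileno())
            total += block_size * blocks_per_file
    return total
```

Run it with a large number of passes against each mounted LUN while the SP reboots are going on, so that there is always I/O in flight when a trespass happens.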

I escalated the case with EMC support; they have a hard time admitting they have a problem.
They haven't yet updated the E-Lab matrix to remove PowerPath 5.0.2 + Solaris 10 from it.

Monday, December 3, 2007

EMC PowerPath SE 5.0.2 flakiness

During yesterday's tests, we were able to reproduce the problem 3 times, and we proved the Leadville driver was not the cause. After reproducing the problem twice, I uninstalled the Leadville Emulex HBA driver from the Solaris 10 OS by removing the SUNWemlx[s,u] and EMLXemlxu Emulex packages. Then I installed lpfc 6.02f and HBAnyware, as on the old machines. After a couple of SPB reboots, then an SPA reboot, then another SPB reboot, the problem happened again.

Next week we will reproduce the problem again, now that we have a good methodology for it:
reboot:
SPB,SPB,SPB,SPA,SPB
if the problem does not happen, continue:
SPB,SPB,SPA,SPB,SPA,SPB

After every reboot, we wait until all LUNs come back to their owning SP,
so each reboot takes about 10 minutes.
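
The reboot sequence above could be driven by something like the following sketch. We did the reboots by hand, so this is only an illustration of the procedure. The three callbacks (reboot_sp, wait_for_luns_home, problem_detected) are hypothetical placeholders that you would have to write for your own array and host; none of them is a real EMC or Solaris API.

```python
FIRST_SEQUENCE = ["SPB", "SPB", "SPB", "SPA", "SPB"]
SECOND_SEQUENCE = ["SPB", "SPB", "SPA", "SPB", "SPA", "SPB"]


def run_reboot_test(reboot_sp, wait_for_luns_home, problem_detected,
                    sequences=(FIRST_SEQUENCE, SECOND_SEQUENCE)):
    """Reboot SPs in order; return the reboot count at which the problem
    showed up, or None if every sequence completed cleanly.

    Hypothetical callbacks:
      reboot_sp(name)       reboots "SPA" or "SPB"
      wait_for_luns_home()  returns True once all LUNs are back on their
                            owning SP, False if it gave up waiting
      problem_detected()    returns True if the host shows the hang
    """
    count = 0
    for sequence in sequences:
        for sp in sequence:
            reboot_sp(sp)
            count += 1
            # LUNs that never trespass back count as a reproduction too.
            if not wait_for_luns_home() or problem_detected():
                return count
    return None
```

With about 10 minutes per reboot, the two sequences together take a bit under two hours when the problem does not show up.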

Once we successfully reproduce it, I will downgrade PowerPath from 5.0.2 to 4.5
and try again. After that,
if the problem is
still there: install the 120011 kernel patch
not there: change lpfc back to Leadville and make sure the problem does not come back

Saturday, December 1, 2007

I WAS WRONG ON THAT : Sun Leadville Emulex driver and EMC Clariion are not a good mix

This week I identified an obscure problem caused by a bug in the Sun Emulex Leadville driver, which Sun won't fix in Solaris 10 according to what support told me. The bug occurs on a host with only one HBA, using Leadville and PowerPath to access volumes on a Clariion array. When the Clariion SP owning the LUNs reboots, the volumes get trespassed, and when it comes back online, the volumes get trespassed back to the default SP. One of the LUNs then becomes inaccessible, causing a hard hang of all processes accessing it; the processes cannot be killed, even with kill -9.
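
Because a process that touches the bad LUN can never be killed, any check for this hang has to do the read in a throwaway child process and must never wait on that child. Below is a minimal sketch of such a probe, under my own assumptions. The function name, the 512-byte read and the timeout values are arbitrary choices, not something from Sun or EMC.

```python
import subprocess
import sys
import time

# The child only opens the path and reads one block, then exits.
_CHILD_CODE = "import sys; f = open(sys.argv[1], 'rb'); f.read(512)"


def probe_read(path, timeout_s=10.0, poll_s=0.1):
    """Return "ok", "error" or "hung" for a small read of path.

    Hypothetical helper. The child is polled rather than waited on,
    since a child stuck in uninterruptible I/O would block wait() forever.
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", _CHILD_CODE, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        rc = proc.poll()
        if rc is not None:
            return "ok" if rc == 0 else "error"
        time.sleep(poll_s)
    # Best effort only: the kill has no effect on a truly hung read.
    proc.kill()
    return "hung"
```

Calling probe_read on one file per LUN after every SP reboot gives a result of "hung" for the LUN that went away, without the checking process itself getting stuck.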

We are rolling back to the Emulex lpfc driver after I prove tomorrow that this is Leadville's fault.