I start by trying to log on to things and checking all the monitors... but what I generally find is that when it all starts going to hell, the monitors are part of what is making it go to hell: the more the system degrades, the more alerts fire, and those alerts drag the system down further.
Then, if it's taken more than 10 minutes (rare), I consider turning off the front-end load balancers so that the backends can recover.
If it's taken more than 15 minutes... I start rebooting boxes. Not the database one, but the web servers, which tend to start firing memkill and nuking processes I care about.
Today involved 3 of the 4 web servers getting kicked in the teeth.
That's pretty much it.
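If you wanted to write that escalation down, a rough sketch might look like the following. The 10 and 15 minute cutoffs come from the steps above; the function names, the action labels, and the host/role mapping are made up for illustration, and nothing here actually touches a load balancer or reboots anything.

```python
from typing import Dict, List

# Cutoffs taken from the steps above (in minutes).
LB_CUTOFF_MIN = 10
REBOOT_CUTOFF_MIN = 15


def next_step(minutes_elapsed: float) -> str:
    """Return a label for the action to consider at this point in the outage.

    The labels are hypothetical; they only name the steps described above.
    """
    if minutes_elapsed > REBOOT_CUTOFF_MIN:
        return "reboot-web-servers"
    if minutes_elapsed > LB_CUTOFF_MIN:
        return "consider-disabling-load-balancers"
    return "log-in-and-check-monitors"


def reboot_candidates(hosts: Dict[str, str]) -> List[str]:
    """Pick hosts that may be rebooted: web servers only, never the database.

    `hosts` maps a host name to a role string ("web", "db", ...); this
    inventory format is an assumption for the sketch.
    """
    return sorted(name for name, role in hosts.items() if role == "web")
```

For example, `next_step(12)` gives `"consider-disabling-load-balancers"`, and `reboot_candidates({"web1": "web", "db1": "db"})` leaves the database box out.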