BVLog: Bryan Voss’ mental synchronization point

9 Mar 09

1.) Is it turned on?

I was recently contacted by EMC Support saying they had not received a health report from our new secondary Centera cluster in a while. They had tried dialing into the cluster via modem, but were not getting a response. They asked me to reset the modem on the cluster to ensure that it was working correctly so they could dial in and check things out.

As soon as I hung up the phone, I brought up my Centera Viewer client and tried to log in to the cluster. No response. No ping response either. As I walked down the hall to the datacenter, I was reviewing network connectivity for the cluster in my mind. If a system isn't working correctly, blame it on the network, right?

Once in the datacenter, I opened the back of the rack and found the modem dead. No lights at all. After checking cables, it occurred to me that I wasn't feeling any breeze from the fans in all the nodes. A quick glance told me that there were no lights on the back of the cluster. I walked around to the front and found no lights there either.

As possible causes for a complete power failure to the rack began whizzing through my head, one tidbit floated to the surface: about two weeks before, we had been coordinating with the Maintenance department on moving our datacenter power feeds to a new powerhouse the hospital recently built. We have big APC UPSes that will power the datacenter for a few minutes until generators kick in. Since Maintenance wasn't sure how long it would take to reroute power through the new powerhouse, and generators were out of the question, we had to prepare for the worst and assume the UPSes would drain and shut down before power was restored. One of the steps we took was powering down all non-critical systems. Since the new Centera was a replication target and replication was not in full swing yet, I decided to power it down for the move.

Of course, I'm sure you've already determined the problem. We forgot to power it back up! Since the Centera was new, I had not yet added it to our Nagios monitoring system and was not paying much attention to it. I powered the cluster up and sheepishly called EMC Support to report my little flub.

Take-aways (don'tcha love biz-speak terms like that?):

  • Even experienced tech guys like me fall victim to noob shenanigans like forgetting to check power on a system before diving into troubleshooting.
  • Add systems to your monitoring solution early, even if they're not in production yet. You can always disable alerting for that particular system until it's in production, and it's a good shakedown to make sure your thresholds are reasonable. It will also tell you if you shut a system down and forget to turn it back on! (Like anybody would ever do something like that...)
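
To make that second take-away a little more concrete, here is a rough Python sketch of the idea: every system gets checked, but alerts only fire for the ones with alerting switched on. This is not how Nagios does it internally, just an illustration. The names `MonitoredHost`, `is_reachable`, and `check_hosts` are made up for this example, and any addresses or ports you feed it are your own placeholders.

```python
import socket
from dataclasses import dataclass


@dataclass
class MonitoredHost:
    """One entry in a (hypothetical) monitoring list."""
    name: str
    address: str
    port: int
    # Leave this False for systems that are not in production yet:
    # they are still checked, they just don't page anyone.
    alerting_enabled: bool = True


def is_reachable(address: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to address:port succeeds."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_hosts(hosts, probe=is_reachable):
    """Probe every host and return (status_by_name, alerts).

    Status is recorded for all hosts, so a powered-off box shows up
    as DOWN on the status page even while its alerting is disabled.
    """
    status = {}
    alerts = []
    for host in hosts:
        up = probe(host.address, host.port)
        status[host.name] = "UP" if up else "DOWN"
        if not up and host.alerting_enabled:
            alerts.append(f"ALERT: {host.name} is DOWN")
    return status, alerts
```

With something like this in place, a new cluster added with `alerting_enabled=False` would never have woken anybody up, but one glance at the status output would have shown it sitting there as DOWN for two weeks.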