BVLog Bryan Voss’ mental synchronization point

29Jun/10

Big Data

I just added another 15 terabytes of disk to one of the SANs I manage at work. Woohoo! Always fun dealing with lots of storage. Now off to provision some new datastores for VMware ESX.

15Jun/10

Clariion hosts showing as unmanaged

I have had several Windows servers connected to an EMC Clariion SAN via both Fibre Channel & iSCSI show up as unmanaged, even though they all have Navisphere Agent installed and running. After some investigation, I found that all of the hosts have multiple NICs, either for cluster heartbeat purposes or for iSCSI connectivity. In Navisphere, right-clicking the host and choosing "Update Now" gave an error which included the IP of one of the private interfaces. In other words, the agent is binding to the wrong adapter.

Solution:

  1. Create a file named "agentid.txt" under the Navisphere Agent directory.
  2. The first line of the file should contain the server's fully-qualified hostname.
  3. The second line should contain the IP address that Navisphere should use to contact the server. This determines which adapter will be used.
  4. Stop the Navisphere Agent service, then start it again as two separate actions. Do not use the single Restart action, as that doesn't seem to work.
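The file described in steps 1 to 3 can be generated with a short script. This is only a minimal sketch, not an EMC tool: the function names `build_agentid` and `write_agentid`, the example hostname and the example IP address are all hypothetical, and the agent directory is left as a parameter because it depends on where the agent was installed.

```python
import ipaddress
from pathlib import Path


def build_agentid(fqdn: str, ip: str) -> str:
    """Return the two-line agentid.txt body: hostname first, then IP."""
    if "." not in fqdn:
        raise ValueError(f"{fqdn!r} does not look fully qualified")
    # ip_address() raises ValueError for anything that is not a valid IP.
    address = ipaddress.ip_address(ip)
    return f"{fqdn}\n{address}\n"


def write_agentid(agent_dir: str, fqdn: str, ip: str) -> Path:
    """Write agentid.txt into the given Navisphere Agent directory."""
    target = Path(agent_dir) / "agentid.txt"
    target.write_text(build_agentid(fqdn, ip))
    return target


if __name__ == "__main__":
    # Hypothetical values; substitute the server's real name and the IP
    # of the adapter that Navisphere can actually reach.
    print(build_agentid("server01.example.com", "192.0.2.10"), end="")
```

Validating the IP before writing catches the easy mistake of pasting in a malformed address. The service still has to be stopped and started by hand afterwards.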

In Navisphere, right-click on the host and click "Update Now". It should show up as managed.

9Mar/09

1.) Is it turned on?

I was recently contacted by EMC Support saying they had not received a health report from our new secondary Centera cluster in a while. They had tried dialing into the cluster via modem, but were not getting a response. They asked me to reset the modem on the cluster to ensure that it was working correctly so they could dial in and check things out.

As soon as I hung up the phone, I brought up my Centera Viewer client and tried to login to the cluster. No response. No ping response either. As I walked down the hall to the datacenter, I was reviewing network connectivity for the cluster in my mind. If a system isn't working correctly, blame it on the network, right?

Once in the datacenter, I opened the back of the rack and found the modem dead. No lights at all. After checking cables, it occurred to me that I wasn't feeling any breeze from the fans in all the nodes. A quick glance told me that there were no lights on the back of the cluster. I walked around to the front and found no lights there either.

As possible causes for a complete power failure to the rack began whizzing through my head, one tidbit floated to the surface: About two weeks before, we had been coordinating with the Maintenance department on moving our datacenter power feeds to a new powerhouse the hospital recently built. We have big APC UPSes that will power the datacenter for a few minutes until generators kick in. Since Maintenance wasn't sure how long it would take to reroute power through the new powerhouse and generators were out of the question, we had to prepare for the worst and assume the UPSes would drain and shut down before power was restored. One of the steps we took was powering down all non-critical systems. Since the new Centera was a replication target and replication was not in full swing yet, I decided to power it down for the move.

Of course, I'm sure you've already determined the problem. We forgot to power it back up! Since the Centera was new, I had not yet added it to our Nagios monitoring system and was not paying much attention to it. I powered the cluster up and sheepishly called EMC Support to report my little flub.

Take-aways (don'tcha love biz-speak terms like that?):

  • Even experienced tech guys like me fall victim to noob shenanigans like forgetting to check power on a system before diving into troubleshooting.
  • Add systems to your monitoring solution early, even if they're not in production yet. You can always disable alerting for that particular system until it's in production, and it's a good shakedown to make sure your thresholds are reasonable. It will also tell you if you maybe shut down the system and forget to turn it back on! (Like anybody would ever do something like that...)
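For what it's worth, even a very crude check would have caught this one. Below is a minimal sketch of a Nagios-style plugin that only tests whether a host answers on a TCP port. The exit codes 0 (OK) and 2 (CRITICAL) are the standard Nagios plugin codes; the function names, the script's arguments and the idea of probing a single TCP port are my own assumptions, not how our Centera is actually monitored.

```python
import socket
import sys

# Standard Nagios plugin exit codes.
OK, CRITICAL = 0, 2


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name lookup failures.
        return False


def nagios_result(host: str, reachable: bool) -> tuple[int, str]:
    """Map a reachability result to a plugin exit code and status line."""
    if reachable:
        return OK, f"OK - {host} is answering"
    return CRITICAL, f"CRITICAL - {host} is not answering (is it turned on?)"


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python check_up.py <host> <port>
    host, port = sys.argv[1], int(sys.argv[2])
    code, message = nagios_result(host, is_reachable(host, port))
    print(message)
    sys.exit(code)
```

Keeping the reachability test separate from the exit-code mapping makes it easy to swap in a ping or an application-level probe later without touching the reporting side.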
18Apr/08

1.27%, baby!

[Graph: average CPU usage]

We had VMware run an analysis of 43 servers in our datacenter (about 1/3 of our total) that we thought would be good candidates for virtualization. They ran a monitoring box that collected performance stats for a couple of weeks and compiled the results to report back to us. The final results showed that we are in an absolutely ridiculous state right now. They told us we can consolidate those 43 servers down to 2 or 3 ESX servers running at around 15-20% CPU utilization. Given the way we have to space servers in the racks now due to power and cooling limitations, we could potentially consolidate 4 racks down to a few servers.
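Some back-of-the-envelope arithmetic shows why the estimate is plausible. This is a rough sketch under two assumptions of mine: that the 1.27% in the title is the average CPU utilization across the 43 analyzed servers, and that each ESX host has the same CPU capacity as one of those servers (real ESX hosts would likely be bigger, which only lowers the numbers). The function name is hypothetical.

```python
def consolidated_utilization(servers: int, avg_util_pct: float, hosts: int) -> float:
    """Spread the combined CPU load of `servers` evenly across `hosts`.

    Assumes every host has the same CPU capacity as one source server
    and ignores hypervisor overhead.
    """
    total_load_pct = servers * avg_util_pct  # in "percent of one server"
    return total_load_pct / hosts


if __name__ == "__main__":
    for hosts in (2, 3):
        pct = consolidated_utilization(43, 1.27, hosts)
        print(f"{hosts} hosts: about {pct:.1f}% CPU each")
```

With those assumptions the combined load is a little over half of one server, which works out to roughly 27% per host on two hosts or 18% per host on three.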

How is it that we have so many servers sitting there practically idle sucking up power and cooling 24x7? It comes down to the specs provided by our vendors and the fact that we generally purchase hardware and software as a package deal from the vendor. The vendors spec out the latest and greatest hardware and we just blindly accept what they suggest. After all, they're the experts, right?

We left a lot of servers out of the analysis, primarily database servers and servers with special hardware, like Brooktrout fax boards, that can't be virtualized. There are also several systems that the vendors specifically said they would not support under virtual environments. That's another hurdle that we have to overcome: virtualization acceptance. There is some movement in that direction from our vendors, but many are still clueless when we ask about it. There are also a couple of cases that I am aware of where vendors say they cannot support a virtual environment due to licensing restrictions on third party code that they include with their products. That should subside over time as virtualization becomes a standard deployment platform.

One day, enterprise applications will be provided as self-contained virtual appliances that we deploy on a virtualization layer. The hypervisor is becoming the OS and the OS is becoming merely a set of APIs between the hypervisor and the application. Sure, there is a lot of friction from companies like Microsoft that have built their monopolies on operating systems, but the times they are a changin'.

14May/07

Server sprawl

Just heard that we will be getting 7 more Linux boxes in as part of a new system we're implementing. We already have 9 in the rack for that particular system and it's not live yet. Our primary software vendor, McKesson, appears to be standardizing on Red Hat Linux and Oracle lately. On one hand, it's nice to see Linux taking a larger percentage of the datacenter, but on the other hand, the server sprawl is ridiculous. The particular system we're implementing has three DB servers and five frontend webservers as well as several ancillary servers. These are dual-core, dual-CPU systems with 8GB RAM. Granted, some of that is for a test system, but still, four webservers with four hyper-threaded cores each should handle several orders of magnitude more traffic than our ~200-bed facility could ever generate.

Why we can't virtualize all this is a question that I've been asking, but the answer is always, "McKesson doesn't support that platform." So what? They support RHEL on Intel. It doesn't matter if it's bare metal or virtual. As long as we manage the virtualization platform properly, performance loss should be negligible. It's ridiculous to see a rack 75% full of servers just to support one system. We could conceivably fill two racks with ESX servers and run the entire datacenter on them. As it stands, our new datacenter is around 75% full already. We're running close to 200 servers. Our UPS and air conditioning capacity are inching closer to max and we're still getting in more servers. This has got to stop at some point.

I suspect part of the problem is that we're buying our hardware and software as one cohesive system. Every new application brings along several new servers. I don't know if our software vendors even support purchasing just the software without the associated hardware.

For one of the systems I maintain, the various applications don't play well together on the same box, so the vendor splits it out into several 1U servers. These servers are sitting at 99% idle all day. That is an ideal candidate for virtualization. I suspect 90% of our systems are similar, in that they run on separate servers just to avoid compatibility issues or performance concerns. The fact that the vendors can't get their own applications to work nicely with each other is ridiculous, and I can't understand why they aren't actively pursuing virtualization as a target platform. Why not deliver your application as several virtual machine images pre-configured to work together and all burned on a single DVD? We can just plop them onto an ESX system and start them up.

Maybe other industries are already getting to this point. Maybe vendors are already looking into that sort of solution. I just need to see some results soon. Since I started in this position 2.5 years ago, we have added about 80 servers and a mass of infrastructure to try to support them. At some point, the daily care and feeding of all these servers is going to overwhelm the operations staff unless we get more people in to help out or automation of tasks gets a lot easier. Just applying OS updates is getting difficult to keep up with. We try to install updates on all internal servers at least every 90 days, but sometimes that slips when we're busy with other projects. That's one issue that virtualization doesn't help with. In fact, the easier it is to throw another VM into the mix, the less consideration is given to the maintenance associated with that server.

And while I'm on a rant, storage requirements are another problem. We can't possibly hope to recover from a disaster using tape. At our current rate, it would take around a year to recover data from tape back to disk. We're moving towards a second datacenter a few blocks away with a secondary SAN and Centera to provide some disaster recovery. Surely other healthcare facilities are experiencing the same issues. Why can't we work together to provide reciprocal redundancy? We could co-locate two racks of equipment at another facility's datacenter across the state and put a couple of racks of their equipment in our datacenter. As long as our connectivity is sufficient, we should be able to save considerable money over building a secondary datacenter ourselves. Are we so competitive with each other that we can't work together to save our patients some money? I realize that sanctity of patient data is a factor to consider, but if we can encrypt data when writing to tape, surely we can encrypt data when writing to a redundant SAN across a fiber link.

Probably a pie-in-the-sky vision. Too many good old boys smoking cigars together and slapping each other on the backs during a round of golf to expect anything to change during my career. I should probably just be glad that I'm getting more servers to support and stop whining about saving money and making things easier. Grumble, grumble...

Filed under: datacenter, linux
18Jan/07

Welcome to the datacenter/sauna

What I did today:

[Graph: datacenter temperature, 2007/01/17]

Whee! Our new high-dollar datacenter got a little steamy after the chiller on the roof decided to quit. Apparently nobody had bothered configuring it to alert when it goes down. I was actually in the middle of performing scheduled maintenance from home at around 4:30 am when I noticed that things were acting funny and decided to go in and take a look.

When I opened the door to the datacenter, I was hit with a balmy wave of 108-degree (Fahrenheit) air. I dragged in a big fan and propped it in the door to help move some of the air out and then scrambled around paging people for a few minutes. Funny how people's pagers mysteriously stop working in the middle of the night and then work fine the next day.

Several systems had already shut down due to the heat. A couple of the guys showed up and we started manually shutting down the remaining systems until Maintenance was able to get the chiller working again and the temperature started dropping.

Nice shakedown run for our IP KVM solution. In our old datacenter, we had a KVM in each rack. It took around 20 minutes to manually shut down 120 servers with 3-4 people working together. When we moved to the new datacenter, we added Avocent IP KVMs and removed the rack KVMs. We've also added about 60 new servers. With three people working on shutting down servers, we were able to get about 20 servers down in 20 minutes. Obviously, the KVM solution needs some reassessment.
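To put numbers on how much worse that is, here is a quick sketch of the arithmetic. It assumes the shutdown rate stays constant and that the room now holds about 180 servers (the original 120 plus the roughly 60 added since the move); the function names are hypothetical.

```python
def shutdown_rate(servers_down: int, minutes: float) -> float:
    """Servers shut down per minute for the whole team."""
    return servers_down / minutes


def minutes_to_shut_down(total_servers: int, rate_per_minute: float) -> float:
    """Time to shut everything down at a constant rate."""
    return total_servers / rate_per_minute


if __name__ == "__main__":
    old_rate = shutdown_rate(120, 20)  # rack KVMs, 3-4 people
    new_rate = shutdown_rate(20, 20)   # IP KVMs, 3 people
    total = 120 + 60                   # assumed current server count

    print(f"Rack KVMs: {old_rate:.0f} servers/min, "
          f"{minutes_to_shut_down(total, old_rate):.0f} min for {total} servers")
    print(f"IP KVMs:   {new_rate:.0f} servers/min, "
          f"{minutes_to_shut_down(total, new_rate):.0f} min for {total} servers")
```

At the old rate of 6 servers per minute the whole room would take about half an hour. At the new rate of 1 server per minute it would take about three hours, far longer than the servers can sit in a 108-degree room.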

Anyway, we're back in business and things are once again running smoothly. Once the chiller came back up, it did an admirable job of cooling the place down.

Filed under: datacenter
29Dec/06

Racks and cables and wireties, oh my!

I spent almost all day at work yesterday recabling an entire rack while it was running. The whole rack is devoted to our medical imaging (PACS) system, so it's very painful to schedule downtime.

We moved the rack to our new datacenter about a month or so ago. In the process, we pulled the rackmount UPSes out and connected directly to the central redundant UPS system. The rack was originally configured with one UPS per server (!), meaning the bottom half of the rack was filled with UPSes and the top with servers. We installed vertical PDUs in the back of the rack on either side and had to swap the power cables out on each server. Since we were already at the end of our scheduled downtime, we had to frantically get everything back up and running without cleaning up the rat's nest of cables in the back.

I was pulling a test server out of the rack yesterday and got into the back to disconnect it. It was such a mess that I decided to spend the time to clean it up. Thankfully, the power and network are redundant on that rack (we're slowly working towards doing this on all racks). The rack was staged by the vendor and shipped to us pre-cabled. I suspect that a large part of the expense involved in purchasing the system went towards all the cables and wireties jammed into the back of that thing. Why use a three-foot cable when you can use a ten-foot one? I was practically wading in snipped wireties by the time I got all the old cabling out.

Early in the process, I was merrily pulling cables when I heard a knock on the datacenter door. The rack I was working on just happened to be near the door, otherwise I never would have heard anything over the blasting fans in the room. I opened the door and there was our PACS admin (the guy who handles the clinical side of the medical imaging system). Apparently, I had disconnected the power cable on the database server's SCSI tray. Surprisingly, the database stopped responding. We spent 30 minutes bringing it back up and calling vendor support to make sure everything was back to normal.

After that mishap, I was much more careful about pulling cables. With the PDUs we're using, the power cables fall out at the slightest wiggle. The PDUs come with a bracket that you can add to wiretie the cables down. I hadn't mounted the brackets yet due to the time constraints we had when originally moving the rack. So, I spent the time yesterday mounting the brackets and tying everything down snugly.

When I finished all the cabling, I closed the rack and cleaned up the mess on the floor. I was about to roll my cart full of old cables out the door and leave for the day when I stopped. I went back to the rack, opened the back doors and just spent a couple of minutes admiring the niceness of all the clean cabling. With a sigh of satisfaction, I closed the doors and went on my way. It's those little moments of relishing a job well done that make the daily hassles worthwhile.

24Oct/06

Sun Blackbox

Just heard about the Sun Blackbox. A complete datacenter in a standard 20-foot shipping container. Pretty sweet stuff. Too bad this wasn’t available before we started our new datacenter construction at work! :^)

I can see some definite disaster recovery possibilities with the Blackbox, although it’s pricey at $2 million. A rental market will probably spring up pretty quickly.

Filed under: datacenter
23Oct/06

FM-200 Fire Suppression System

Here’s the fire suppression system we’re installing in the new datacenter at work: http://www.e1.greatlakes.com/wfp/common/jsp/index.jsp

Check out the video. Pretty impressive compared to water sprinklers.

Filed under: datacenter