ssh login via shared key
Since I only need to set this up the first time I get a new PC/server online, I figured I might as well document it here to make it easier to remember.
Very briefly, we're creating a new key with ssh-keygen. (Don't run this if you have an existing key you want to use.) We're then copying the public portion of the key to [server]'s authorized_keys, which will allow us to log in without a password from now on. Many assumptions are made here, so if you run into problems, google "ssh key login" for more info.
Use wisely.
ssh-keygen
cat ~/.ssh/id_rsa.pub | ssh [server] 'mkdir -p .ssh ; cat >> .ssh/authorized_keys'
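One of the assumptions worth spelling out: with sshd's default StrictModes setting, key login is typically refused if .ssh or authorized_keys is writable by anyone else. Here is a rough sketch of the server-side half as a function. The name add_key is made up for illustration and is not part of OpenSSH. It also skips the append if the key is already in the file, so running it twice doesn't leave duplicates.

```shell
# Sketch only: add_key is a hypothetical helper, not part of OpenSSH.
# Appends a public key to an authorized_keys file if it is not already there,
# and tightens permissions on the directory and the file.
add_key() {
    pubkey_file=$1                      # e.g. ~/.ssh/id_rsa.pub
    auth_file=$2                        # e.g. ~/.ssh/authorized_keys
    auth_dir=$(dirname "$auth_file")

    mkdir -p "$auth_dir" || return 1
    chmod 700 "$auth_dir"
    touch "$auth_file"

    # -x: whole-line match, -F: fixed string. Append only if the key is missing.
    grep -qxF -e "$(cat "$pubkey_file")" "$auth_file" ||
        cat "$pubkey_file" >> "$auth_file"

    chmod 600 "$auth_file"
}
```

Many OpenSSH installs also ship ssh-copy-id, which does roughly the same job in one command (ssh-copy-id [server]), so check for that first.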
More shell scripting
for file in Pyramis*.ps ; do name=$(echo "$file" | cut -d'.' -f 1) ; ps2ascii "$file" > "$name.txt" ; done
Converting a bunch of Postscript files to ASCII text files. Posted here for my future reference. (And anybody else that may happen to be interested.) It's a fairly basic example, but I tend to fumble around on these if I haven't done much shell scripting in a while.
We're iterating through the list of files named Pyramis*.ps . The filename ($file) is passed through cut to chop off the extension (.ps) and the resulting name is assigned to the $name variable. We then run the original filename ($file) through ps2ascii to do the actual conversion and write the output to $name with a .txt extension. We end up with a bunch of files with the same name as the original Postscript file, but with a .txt extension.
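One thing to watch: cut -d'.' -f 1 keeps only the text before the first dot, so a file like Pyramis.v2.ps would come out as Pyramis.txt. A possible alternative, sketched here with made-up function names (txt_name and convert_all are mine, not standard commands), is to let the shell strip just the trailing .ps with parameter expansion.

```shell
# txt_name is a hypothetical helper: swap a trailing .ps for .txt.
# ${1%.ps} removes the shortest match of ".ps" from the end of $1.
txt_name() {
    printf '%s\n' "${1%.ps}.txt"
}

# The same conversion loop, wrapped in a function so nothing runs until called.
convert_all() {
    for file in Pyramis*.ps; do
        [ -e "$file" ] || continue      # the glob matched nothing
        ps2ascii "$file" > "$(txt_name "$file")"
    done
}
```

With this version, txt_name Pyramis.v2.ps gives Pyramis.v2.txt, and the quoting keeps filenames with spaces in one piece.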
Groovy.
Why commandline is important
Some interesting BASHing I did today:
find . -type f -print | cut -d'/' -f 2 | sort -u > content
for files in *; do echo $files; done | sort | grep -v content | grep -v -f content | xargs -n 1 rm -rf
The first line finds all subdirectories of the current directory that contain files and prints the list to the file content. The second line deletes all the subdirectories that are not listed in the file content.
I had to do this to clean up a huge number of orphaned batch directories left by our document imaging system (Windows, but I'm running Cygwin and pointing to a drive mapped to the DI fileserver). The vendor provides a GUI app to do this, but it takes the larger part of eternity to map out all the directories, then you have to click on each directory and click delete. Pretty useless for the thousands of directories I had to deal with. My commandline solution took about 5 minutes to work out and around 15 seconds to run. Several hours of menial mouse clicking saved.
I had to shut down a release process in order to make sure I didn't catch any legitimate directories that had been created but not yet populated. The next iteration will probably include a bit more logic to only include directories that are more than a day old or so. A simple -mtime tweak to the find command should do it. That will enable me to run it while the entire system is live.
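Here's a rough sketch of what that next iteration might look like. It only lists candidates and leaves the deleting to you. The function name list_stale_empty_dirs is made up, and it assumes GNU find (which is what Cygwin provides) for -mindepth, -maxdepth and -quit. Note that -mtime +0 means "last modified at least 24 hours ago", since find throws away the fractional part of the age in days.

```shell
# list_stale_empty_dirs is a hypothetical helper; it only prints, never deletes.
# Assumes GNU find for -mindepth, -maxdepth and -quit.
list_stale_empty_dirs() {
    root=${1:-.}
    # Top-level directories whose own mtime is at least 24 hours old.
    find "$root" -mindepth 1 -maxdepth 1 -type d -mtime +0 |
    while IFS= read -r dir; do
        # Keep the directory only if no regular file exists anywhere below it.
        if [ -z "$(find "$dir" -type f -print -quit)" ]; then
            printf '%s\n' "$dir"
        fi
    done
}
```

Review the list first, then pipe it to xargs rm -rf if it looks right. One caveat: a directory's mtime only changes when entries are added or removed directly inside it, so this is a heuristic and not a guarantee that a batch is finished.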
grep for yesterday’s syslog entries
I was asked by a coworker how to grep syslog files for entries from the past 24 hours. Although it is simple to do manually, I thought it might be nice to put together a simple script to do the work for her. Here's what I came up with:
#!/bin/sh
# bvoss 2008/05/08 grep for yesterday's date in any syslog-formatted file
# (May 7)
yesterday=$(date -d "-24 hours" | cut -b 5-10)
grep "$yesterday" "$1"
Just put it somewhere in the path (/usr/local/bin works nicely) and "chmod +x" to make it executable. Works on any syslog-formatted file with month and day at the beginning of each line. Syntax is grep_yesterday [file]. It returns all entries from midnight until 23:59:59 yesterday.
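A possible variation, sketched below: ask date for the syslog-style stamp directly instead of cutting it out of the default output, and anchor the match to the start of the line so the date can't match somewhere in the message text. The function names syslog_stamp and grep_day are made up, and like the original this assumes GNU date for the -d option.

```shell
# Hypothetical helpers; assume GNU date (-d), as the original script does.
# syslog_stamp [DATE]: print DATE (default: yesterday) as month and day,
# formatted the way syslog writes it.
syslog_stamp() {
    # %b is the abbreviated month name; %e pads single-digit days with a
    # space, which matches the syslog layout. LC_ALL=C keeps the month name
    # in English.
    LC_ALL=C date -d "${1:-yesterday}" '+%b %e'
}

# grep_day FILE [DATE]: print the entries in FILE stamped with DATE.
grep_day() {
    grep "^$(syslog_stamp "$2")" "$1"
}
```

The optional date argument makes it easy to look at other days, e.g. grep_day /var/log/messages "3 days ago".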
1.27%, baby!

We had VMWare run an analysis of 43 servers in our datacenter (about 1/3 of our total) that we thought would be good candidates for virtualization. They ran a monitoring box that collected performance stats for a couple of weeks and compiled the results to report back to us. The final results proved that we are in an absolutely ridiculous state right now. They told us we can consolidate those 43 servers down to 2 or 3 ESX servers running at around 15-20% CPU utilization. Given the way we have to space servers in the racks now due to power and cooling limitations, we could potentially consolidate 4 racks down to a few servers.
How is it that we have so many servers sitting there practically idle sucking up power and cooling 24x7? It comes down to the specs provided by our vendors and the fact that we generally purchase hardware and software as a package deal from the vendor. The vendors spec out the latest and greatest hardware and we just blindly accept what they suggest. After all, they're the experts, right?
We left a lot of servers out of the analysis. Primarily database servers and servers with special hardware like Brooktrout fax boards that can't be virtualized. There are also several systems that the vendors specifically said they would not support under virtual environments. That's another hurdle that we have to overcome: virtualization acceptance. There is some movement in that direction from our vendors, but many are still clueless when we ask about it. There are also a couple of cases that I am aware of where vendors say they cannot support a virtual environment due to licensing restrictions on third party code that they include with their products. That should subside over time as virtualization becomes a standard deployment platform.
One day, enterprise applications will be provided as self-contained virtual appliances that we deploy on a virtualization layer. The hypervisor is becoming the OS and the OS is becoming merely a set of APIs between the hypervisor and the application. Sure, there is a lot of friction from companies like Microsoft that have built their monopolies on operating systems, but the times they are a changin'.
tar over ssh
It's occasionally useful to copy a bunch of files from one server to another via ssh. There are various methods to accomplish this task, but one that I like to use is tar over ssh. Unfortunately, I don't use it often enough to remember all the appropriate switches offhand. I just had to use it this morning and had to search around to find the right info, so I'm posting it here for posterity.
tar cjvf - * | ssh username@remoteserver "(cd /target/dir && tar xjvf -)"
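For reference, here is a slightly more defensive sketch of the same pipe as a function. The name tar_pipe is made up. It uses tar's -C option instead of the * glob, so dotfiles come along too, and it only unpacks if the cd on the far side succeeds. I've used gzip (z) here instead of bzip2 (j) only because gzip is more likely to be installed everywhere. Use whichever you like, as long as both ends match.

```shell
# tar_pipe SRC DEST RUNNER...: stream SRC through a pipe and unpack it in DEST.
# tar_pipe is a hypothetical helper. RUNNER is whatever executes the unpack
# command string: "ssh username@remoteserver" for a remote copy, or "sh -c"
# to try it out locally. Assumes DEST already exists and that its path
# contains no single quotes.
tar_pipe() {
    src=$1
    dest=$2
    shift 2
    # -C changes into SRC before archiving "."; && stops the unpack if the
    # cd fails, so files never land in the wrong directory.
    tar -C "$src" -czf - . | "$@" "cd '$dest' && tar -xzf -"
}
```

Usage would be something like tar_pipe /source/dir /target/dir ssh username@remoteserver. Swapping the runner for sh -c does a local copy, which is a handy way to test the switches without touching a server.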
Remote CD eject
Ok, I'm probably being a n00b, but I think it's just plain cool to sit down at a PC and ssh into a server, type "eject /dev/cdrom", and see the CD tray pop out on the server across the room.
Maybe I'm just easily entertained.
Windows Network Load Balancing: easier than I thought
Several months ago, we had a vendor come in to implement a Windows failover cluster for our document imaging system's MS SQL server. The implementation failed. The vendor tech who was attempting to set up the cluster attributed the failure to an underscore character in our internal domain name. Not sure whether that was the cause of the problem, but we ended up reverting to our single server setup after 30 hours of downtime. The whole experience left me wary of Windows clustering in general.
We are now in the process of implementing a Windows Network Load Balancing cluster for a 3-node terminal server setup on a new app. Both a fellow sysadmin and I came into the situation expecting problems. Since it's a new application that is not yet live, we figured we could weather any problems without having to worry about downtime. As it turns out, NLB clustering is almost dead simple.
The vendor tech who was supposed to be assisting us with the cluster setup joined us on a conference call 30 minutes late, mumbled his way through some email looking for something, then emailed us some documentation and basically said, "Here, read this and call me back in an hour so we can set up the cluster." I looked at my coworker, we both shrugged, and hung up the phone. We naively expected the vendor to be a lot more helpful.
After going to lunch, we came back and skimmed through the documentation a few minutes before calling the tech back. He tried walking us through manually configuring each server's network adapters, but we ran into problems with trying to do the setup with a single adapter on each server connected to the switch. It was obvious that the tech was not familiar with clustering and was just reading through the documentation and telling us what to do. After fumbling around for an hour or so, we told the tech we would call him back after connecting the second network adapter on all three servers.
I had been reading ahead a bit and discovered that Microsoft provides a Network Load Balancing Manager app as part of its Server 2003 admin pack. We removed all the mess and got the servers back to a clean network config, then used NLB Manager to build the cluster from scratch. Once we realized the difference between the primary cluster IP and the dedicated IP (hint: the primary cluster IP is the same on all nodes; the dedicated IP is a second unique IP assigned to each node to allow them to talk to each other), we got the whole thing set up in just a few minutes.
We called the vendor tech back and said, "Ok, it's working now." He assumed he was responsible for getting it going and we just let him bumble happily on with that assumption as we got off the phone. We proceeded to test the cluster by making RDP connections from several PCs to the cluster name. The first server in the cluster accepted around four connections before the second server began picking up new connections. The whole thing worked pretty much flawlessly from that point on.
We had originally built the cluster with two servers while the primary users worked on building the app using the third terminal server. We later added the third server to the cluster without problems. After promoting it to priority 1, we were able to connect via RDP and it immediately started sharing the load. Nice!
We're looking at how easy it was to set up and coming up with all sorts of uses for this new tool in our kit. Now, I'm not so wary about Windows clustering. I may even build a couple of virtual machines and attempt to put together a test failover cluster myself. If all goes well, I'll just implement the failover cluster on our document imaging system myself. With the level of "assistance" we're getting from vendors, we should probably just plan on implementing future changes ourselves.
Redundancy? We don’t need no stinkin’ redundancy!
We recently experienced a hard drive failure on one of our critical Linux servers. The server stopped responding on a weekend, of course. (Why do failures inevitably occur outside the hours that I'm normally in the office?) Since the server has two drives and was staged by the application vendor, I just assumed it was set up with a RAID1 mirror and at worst I would have to remove the failed drive and reboot to get it back up and running in a degraded state. It turns out the drives were set up as a RAID0 volume with no redundancy. When the single drive failed, it took the whole volume down.
I was eventually able to get the drive back online by reseating it and resetting the RAID adapter. I called vendor support to ask why the volume was set up with no redundancy and the answer was, "Our staging group doesn't configure servers that way. We always set them up with redundant RAID volumes."
"Well, thanks for the info, but I have a server here with a RAID0 volume that was provided by your staging group," I said.
"Sorry, there must be some mistake. See, we don't set servers up that way."
"(Sigh) Ok, thanks for your time." Weekend support was obviously not going to be any help.
I went through all the other servers that were a part of that system and were staged at the same time. All but one were configured as RAID0. We had received three additional servers that were staged later. They were configured with RAID1 redundancy, rather than RAID0.
We called our vendor rep Monday morning, and explained that the industry standard is to set your RAID volumes up with redundancy since hard drives tend to fail on occasion. He initiated an investigation into the problem and eventually admitted that there was a period of time that all Linux servers they shipped were configured as RAID0 rather than RAID1, but that the issue had been resolved. (The guy in their staging group who was setting them up that way was probably promoted to a manager or something, and the new guy knew more about industry standards.)
We asked them to provide us with a plan on transitioning to RAID1 on all the affected servers, but have not received a response yet. I suspect we will have to do it ourselves. Sigh. These vendors don't seem to have any contact with reality at times.
Disk space and treemaps
One of the things a sysadmin must occasionally struggle with is disk space. (I just provisioned 500GB to that filesystem a year ago and it's already less than 10% free??) Although just adding more disk space is a brute force method of resolving the immediate issue, it's usually a good idea to find out what is taking up the space and whether it can be reduced by deleting large unnecessary or infrequently used files. I have a couple of tools that are useful for providing the info needed to do some cleanup.
SequoiaView is a handy way to get a quick overview of a filesystem and see if any particular files or directories are taking up the majority of available space. Files are displayed as rectangles sized according to the relative amount of space they consume. This makes it easy to find things like Windows servicepack installers and other temporary files hanging around in temp directories taking up a lot of space.
SequoiaView is a free Windows application.
Another useful tool that I have found is JDiskReport, which provides various charts depicting largest files, oldest files, types of files, distribution of files based on modification time, etc.
JDiskReport is a free cross-platform Java-based app which can be installed or run via Java WebStart.
I generally start with SequoiaView to get a quick overview of large files, then use JDiskReport if I need to get more detailed info.
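When I'm on a box with no GUI (over ssh, for instance), a plain du pipeline gives a rough text-only version of the same overview. A minimal sketch, with a made-up function name (largest_entries):

```shell
# largest_entries [DIR] [N]: list the N biggest files and directories under
# DIR (default: current directory, 20 entries), sizes in 1024-byte units,
# largest first. largest_entries is a hypothetical helper.
largest_entries() {
    dir=${1:-.}
    count=${2:-20}
    # -a: include files as well as directories; -k: report in 1024-byte units.
    # Errors from unreadable directories are discarded.
    du -ak "$dir" 2>/dev/null | sort -rn | head -n "$count"
}
```

The first line of output is always the directory itself with its total, so the interesting entries start on the second line.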
So there you go. Download some utils and get started cleaning up those old crufty files that are taking up all your space.