grep for yesterday’s syslog entries
A coworker asked me how to grep syslog files for entries from the previous day. Although it is easy enough to do by hand, I thought it might be nice to put together a simple script to do the work for her. Here's what I came up with:
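#!/bin/sh
# bvoss 2008/05/08 grep for yesterday's date in any syslog-formatted file
# (e.g. run on May 8, it matches lines beginning "May  7")
# %b = abbreviated month, %e = day padded with a space, as syslog writes it;
# LC_ALL=C keeps the month name in English. -d requires GNU date.
yesterday="$(LC_ALL=C date -d yesterday '+%b %e')"
# Anchor at line start so a date mentioned inside a message doesn't match
grep "^$yesterday" "$1"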
Save it as grep_yesterday somewhere in your path (/usr/local/bin works nicely) and run "chmod +x" on it to make it executable. It works on any syslog-formatted file with the month and day at the beginning of each line. The syntax is grep_yesterday [file]. It returns all entries from midnight until 23:59:59 yesterday.
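If you need days other than yesterday, the same idea can be wrapped in a pair of shell functions. This is a sketch rather than a drop-in tool: the names syslog_prefix and grep_days_ago are made up for illustration, and it assumes GNU date (for -d) and log lines that start with the classic "Mmm dd HH:MM:SS" timestamp.

```shell
#!/bin/sh
# Hypothetical helpers generalizing grep_yesterday to "N days ago".
# Assumes GNU date and syslog lines that begin "Mmm dd HH:MM:SS".

# syslog_prefix [N]: print the syslog date prefix for N days ago (default 1).
# %e pads single-digit days with a space, matching syslog's own format;
# LC_ALL=C keeps the month abbreviation in English.
syslog_prefix() {
    LC_ALL=C date -d "${1:-1} days ago" '+%b %e'
}

# grep_days_ago FILE [N]: print the lines of FILE stamped N days ago (default 1).
grep_days_ago() {
    prefix="$(syslog_prefix "${2:-1}")"
    # Anchor at the start of the line so a date quoted inside a message
    # body doesn't produce a false match.
    grep "^$prefix " "$1"
}
```

For example, grep_days_ago /var/log/syslog 3 would pull the entries from three days ago.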
1.27%, baby!

We had VMWare run an analysis of 43 servers in our datacenter (about 1/3 of our total) that we thought would be good candidates for virtualization. They ran a monitoring box that collected performance stats for a couple of weeks and compiled the results to report back to us. The final results proved that we are in an absolutely ridiculous state right now. They told us we can consolidate those 43 servers down to 2 or 3 ESX servers running at around 15-20% CPU utilization. Given the way we have to space servers in the racks now due to power and cooling limitations, we could potentially consolidate 4 racks down to a few servers.
How is it that we have so many servers sitting there practically idle, sucking up power and cooling 24x7? It comes down to the specs provided by our vendors and the fact that we generally purchase hardware and software as a package deal from the vendor. The vendors spec out the latest and greatest hardware and we just blindly accept what they suggest. After all, they're the experts, right?
We left a lot of servers out of the analysis, primarily database servers and servers with special hardware like Brooktrout fax boards that can't be virtualized. There are also several systems that the vendors specifically said they would not support in virtual environments. That's another hurdle that we have to overcome: virtualization acceptance. There is some movement in that direction from our vendors, but many are still clueless when we ask about it. There are also a couple of cases that I am aware of where vendors say they cannot support a virtual environment due to licensing restrictions on third-party code that they include with their products. That should subside over time as virtualization becomes a standard deployment platform.
One day, enterprise applications will be provided as self-contained virtual appliances that we deploy on a virtualization layer. The hypervisor is becoming the OS and the OS is becoming merely a set of APIs between the hypervisor and the application. Sure, there is a lot of friction from companies like Microsoft that built their monopolies on operating systems, but the times they are a-changin'.
tar over ssh
It's occasionally useful to copy a bunch of files from one server to another via ssh. There are various ways to accomplish this, but one that I like to use is tar over ssh. Unfortunately, I don't use it often enough to remember all the appropriate switches offhand. I just had to use it this morning and had to search around to find the right info, so I'm posting it here for posterity.
# j = bzip2, v = verbose, f - = archive on stdout/stdin; "&&" skips the extract if cd fails
tar cjvf - * | ssh username@remoteserver "cd /target/dir && tar xjvf -"
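The same command can be wrapped in a small function so the source and target directories aren't hard-coded. This is a hypothetical helper (tar_push is not a standard tool), sketched on the assumption that both ends have tar with gzip support. It archives "." with -C instead of "*", so dotfiles come along too.

```shell
#!/bin/sh
# Hypothetical wrapper around the tar-over-ssh trick.
# tar_push SRC_DIR DEST_DIR COMMAND...
#   COMMAND is anything that runs a shell command string on the far end,
#   normally: ssh username@remoteserver
tar_push() {
    src="$1"
    dest="$2"
    shift 2
    # z = gzip (use j for bzip2 as above); "f -" = write to / read from the pipe.
    # "&&" keeps tar from unpacking in the wrong place if the cd fails.
    # DEST_DIR is single-quoted for the far-end shell, so it must not
    # itself contain a single quote.
    tar -czf - -C "$src" . | "$@" "cd '$dest' && tar -xzf -"
}
```

Normally you would call it as tar_push /local/dir /target/dir ssh username@remoteserver. Because the trailing arguments are just a command that runs a shell string, tar_push src dst sh -c exercises the same pipeline locally, which is handy as a dry run.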
Remote CD eject
Ok, I'm probably being a n00b, but I think it's just plain cool to sit down at a PC and ssh into a server, type "eject /dev/cdrom", and see the CD tray pop out on the server across the room.
Maybe I'm just easily entertained.
Windows Network Load Balancing: easier than I thought
Several months ago, we had a vendor come in to implement a Windows failover cluster for the MS SQL server behind our document imaging system. The implementation failed. The vendor tech who was attempting to set up the cluster attributed the failure to an underscore character in our internal domain name. I'm not sure whether that was the cause of the problem, but we ended up reverting to our single-server setup after 30 hours of downtime. The whole experience left me wary of Windows clustering in general.
We are now in the process of implementing a Windows Network Load Balancing (NLB) cluster for a 3-node Terminal Server setup on a new app. A fellow sysadmin and I both came into the situation expecting problems. Since it's a new application that is not yet live, we figured we could weather any problems without having to worry about downtime. As it turns out, NLB clustering is almost dead simple.
The vendor tech who was supposed to be assisting us with the cluster setup joined us on a conference call 30 minutes late, mumbled his way through some email looking for something, then emailed us some documentation and basically said, "Here, read this and call me back in an hour so we can set up the cluster." I looked at my coworker, we both shrugged, and hung up the phone. We naively expected the vendor to be a lot more helpful.
After going to lunch, we came back and skimmed through the documentation a few minutes before calling the tech back. He tried walking us through manually configuring each server's network adapters, but we ran into problems trying to set things up with a single adapter on each server connected to the switch. It was obvious that the tech was not familiar with clustering and was just reading through the documentation and telling us what to do. After fumbling around for an hour or so, we told the tech we would call him back after connecting the second network adapter on all three servers.
I had been reading ahead a bit and discovered that Microsoft provides a Network Load Balancing Manager app as part of its Server 2003 admin pack. We removed all the mess and got the servers back to a clean network config, then used NLB Manager to build the cluster from scratch. Once we realized the difference between the primary cluster IP and the dedicated IP (hint: the primary cluster IP is the same on all nodes; the dedicated IP is a second unique IP assigned to each node to allow them to talk to each other), we got the whole thing set up in just a few minutes.
We called the vendor tech back and said, "Ok, it's working now." He assumed he was responsible for getting it going and we just let him bumble happily on with that assumption as we got off the phone. We proceeded to test the cluster by making RDP connections from several PCs to the cluster name. The first server in the cluster accepted around four connections before the second server began picking up new connections. The whole thing worked pretty much flawlessly from that point on.
We had originally built the cluster with two servers while the primary users worked on building the app on the third Terminal Server. We later added the third server to the cluster without problems. After promoting it to priority 1, we were able to connect to it via RDP, and it immediately started sharing the load. Nice!
Given how easy it was to set up, we're coming up with all sorts of uses for this new tool in our kit. Now, I'm not so wary about Windows clustering. I may even build a couple of virtual machines and attempt to put together a test failover cluster myself. If all goes well, I'll just implement the failover cluster on our document imaging system myself. With the level of "assistance" we're getting from vendors, we should probably just plan on implementing future changes ourselves.
Redundancy? We don’t need no stinkin’ redundancy!
We recently experienced a hard drive failure on one of our critical Linux servers. The server stopped responding on a weekend, of course. (Why do failures inevitably occur outside the hours that I'm normally in the office?) Since the server had two drives and was staged by the application vendor, I just assumed it was set up with a RAID1 mirror and that at worst I would have to remove the failed drive and reboot to get it back up and running in a degraded state. It turns out the drives were set up as a RAID0 volume with no redundancy. When one drive failed, it took the whole volume down.
I was eventually able to get the drive back online by reseating it and resetting the RAID adapter. I called vendor support to ask why the volume was set up with no redundancy and the answer was, "Our staging group doesn't configure servers that way. We always set them up with redundant RAID volumes."
"Well, thanks for the info, but I have a server here with a RAID0 volume that was provided by your staging group," I said.
"Sorry, there must be some mistake. See, we don't set servers up that way."
"(Sigh) Ok, thanks for your time." Weekend support was obviously not going to be any help.
I went through all the other servers that were part of that system and had been staged at the same time. All but one were configured as RAID0. We had received three additional servers that were staged later; they were configured with RAID1 redundancy rather than RAID0.
We called our vendor rep Monday morning, and explained that the industry standard is to set your RAID volumes up with redundancy since hard drives tend to fail on occasion. He initiated an investigation into the problem and eventually admitted that there was a period of time that all Linux servers they shipped were configured as RAID0 rather than RAID1, but that the issue had been resolved. (The guy in their staging group who was setting them up that way was probably promoted to a manager or something, and the new guy knew more about industry standards.)
We asked them to provide us with a plan on transitioning to RAID1 on all the affected servers, but have not received a response yet. I suspect we will have to do it ourselves. Sigh. These vendors don't seem to have any contact with reality at times.
Disk space and treemaps
One of the things a sysadmin must occasionally struggle with is disk space. (I just provisioned 500GB to that filesystem a year ago and it's already less than 10% free??) Although just adding more disk space is a brute force method of resolving the immediate issue, it's usually a good idea to find out what is taking up the space and whether it can be reduced by deleting large unnecessary or infrequently used files. I have a couple of tools that are useful for providing the info needed to do some cleanup.
SequoiaView is a handy way to get a quick overview of a filesystem and see if any particular files or directories are taking up the majority of available space. Files are displayed as rectangles sized according to the relative amount of space they consume. This makes it easy to spot things like Windows service pack installers and other temporary files hanging around in temp directories and taking up a lot of space.
SequoiaView is a free Windows application.
Another useful tool that I have found is JDiskReport, which provides various charts depicting largest files, oldest files, types of files, distribution of files based on modification time, etc.
JDiskReport is a free, cross-platform, Java-based app that can be installed or run via Java Web Start.
I generally start with SequoiaView to get a quick overview of large files, then use JDiskReport if I need to get more detailed info.
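On a Linux server with no GUI, du and sort make a rough text-mode stand-in for a treemap. The largest_entries function below is a hypothetical sketch, not part of either tool mentioned above; it lists the biggest items directly under a directory so you can drill down one level at a time.

```shell
#!/bin/sh
# Hypothetical helper: a crude command-line substitute for a treemap.
# largest_entries [DIR] [N]: list the N (default 10) largest entries directly
# under DIR (default: current directory), biggest first, sizes in KB.
largest_entries() {
    dir="${1:-.}"
    count="${2:-10}"
    # -s: one total per argument, -k: 1024-byte units, -x: don't descend
    # into other mounted filesystems. The glob skips dotfiles; errors
    # (e.g. unreadable directories) are discarded.
    du -skx "$dir"/* 2>/dev/null | sort -rn | head -n "$count"
}
```

For example, run largest_entries /var 5, then repeat on whichever subdirectory tops the list.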
So there you go. Download some utils and get started cleaning up those old crufty files that are taking up all your space.
The case of the disappearing eth0
There have been a couple of occasions in the past week when I have lost an ethernet interface after swapping machines around. Looking back into my murky past, I can recall a couple of other times that I probably hit the same issue. I don't recall how I resolved it before, but I have a definite solution now. I figured I should note it here so I can look it up later and so others can benefit from it.
Scenario 1: I build a Debian virtual machine using VMWare Workstation on my laptop. I later move the VM to a VMWare Server box. On first boot, VMWare Server asks if I want to assign a new UUID and I select yes. It turns out that the MAC address assigned to the virtual ethernet device is derived from the VMWare UUID, so when the UUID changes, the MAC address changes. Debian's udev ties eth device names to MAC addresses, so the NIC with the new MAC gets a new name (eth1) and eth0 is lost. The issue shows up when I try to start networking on the VM and eth0 doesn't come up.
Scenario 2: I install Debian on a PC-class box and tinker with it a while. It breaks (something to do with heat, probably a fan failure). I move the hard drive to an identical box and it boots fine, but eth0 doesn't come up. Same as above. Since Debian assigns the eth devices based on MAC address and the new ethernet device has a different MAC address, I get no eth0.
Solution: A comment on this post pointed me down the path of enlightenment.
/etc/udev/rules.d/z25_persistent-net.rules contains the MAC-address-to-eth-device mappings. Delete lines like the ones below, noting the module name on the "# PCI device" line:
# PCI device xxxxxx:xxxxxx ([module])
SUBSYSTEM=="net", DRIVERS=="?*", ATTRS{address}=="xx:xx:xx:xx:xx:xx", NAME="eth0"
This removes the MAC to eth device mapping info. Now we need to restart udev to allow the change to take effect:
/etc/init.d/udev restart
Next step is to "bounce" the kernel module for the ethernet device. Use the module name from the z25_persistent-net.rules file noted above:
modprobe -r [module]
modprobe [module]
"ifconfig" should now show the eth0 interface as up and running. If not, try "ifup eth0" and check "ifconfig" again. That rascally ethernet interface can't hide for long!
Update 2009/03/11: This post details a method that does not require modifying individual VMs. Probably a better solution for template VMs or virtual appliances.
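For machines where this happens often, the manual udev steps above could be scripted. The sketch below is hypothetical (net_rule_module and drop_net_rule are made-up names) and assumes the rules file follows the layout shown earlier: a "# PCI device ... (module)" comment line directly before each SUBSYSTEM=="net" rule. It keeps a .bak copy of the file before editing.

```shell
#!/bin/sh
# Hypothetical helpers for the udev fix described above. They assume the
# z25_persistent-net.rules layout shown in this post: a
# "# PCI device ... (module)" comment directly before each net rule.

# net_rule_module FILE IFACE: print the module named in the comment above
# the rule that assigns NAME="IFACE".
net_rule_module() {
    awk -v iface="$2" '
        /^# PCI device/ { mod = $NF; gsub(/[()]/, "", mod) }
        $0 ~ ("NAME=\"" iface "\"") { print mod; exit }
    ' "$1"
}

# drop_net_rule FILE IFACE: delete the rule for IFACE plus its comment line,
# leaving a backup in FILE.bak. Other rules and comments are kept.
drop_net_rule() {
    cp "$1" "$1.bak" || return 1
    awk -v iface="$2" '
        /^# PCI device/ { if (held != "") print held; held = $0; next }
        $0 ~ ("NAME=\"" iface "\"") { held = ""; next }
        { if (held != "") { print held; held = "" } print }
        END { if (held != "") print held }
    ' "$1.bak" > "$1"
}

# Example (as root):
#   rules=/etc/udev/rules.d/z25_persistent-net.rules
#   mod=$(net_rule_module "$rules" eth0)
#   drop_net_rule "$rules" eth0
#   /etc/init.d/udev restart
#   modprobe -r "$mod" && modprobe "$mod"
```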
Fixing VMWare Server directory permissions on Debian hosts
I often create a VMWare virtual machine using VMWare Workstation on my laptop, then later move it to a Debian machine running VMWare Server. As a result of the move from Windows to Linux, permissions on the VM directory end up being wrong and I get a blank black screen when connecting the VMWare Server Console to the VM. I normally just fix the permissions manually when I notice the problem.
Today, I was helping my father deal with the same situation via email and started typing out all the steps to create a group and fix permissions on the VM directory when I thought, "Why not be a good little sysadmin and write a script to do this?" So, here's the resulting BASH script:
#!/bin/bash
# Fix VMWare Server permissions on Debian host
# Bryan Voss 2007/07/19
# Must be run as root.
# *** Config items ***
# Group that all VMWare users will belong to
VMGROUP=vmware
# *** /Config items ***
# Find directory where VMs are stored. The value is quoted in
# /etc/vmware/config and may contain spaces, so take everything between the quotes.
VMDIR=$(sed -n 's/^vmdir *= *"\(.*\)"/\1/p' /etc/vmware/config)
if [ ! -d "$VMDIR" ]; then
    echo "VM directory not found: '$VMDIR'" >&2
    exit 1
fi
# Add group that VMs should belong to (skip if it already exists)
getent group "$VMGROUP" >/dev/null || addgroup --system "$VMGROUP"
# Fix permissions on directories under VMDIR
cd "$VMDIR" || exit 1
chgrp -R "$VMGROUP" .
find . -type d -exec chmod g+rwxs {} +
# Fix permissions on vmx files (pattern quoted so the shell doesn't expand it)
find . -type f -name '*.vmx' -exec chmod +x,g+rwx {} +
# Fix permissions on all other files
chmod -R g+rw .
Make sure any users who will be connecting via VMWare Server Console are members of whatever group you set VMGROUP to ("vmware" by default).
I have saved this as /usr/local/sbin/vmware-fixperms on my VMWare Server boxes and will probably use it pretty often. Maybe others will benefit from it.
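To double-check the result afterwards, find can list anything under the VM directory that still has the wrong group or lacks group read/write permission. The check_vm_perms function is a hypothetical sketch; it assumes GNU find (for the symbolic -perm -g+rw test) and that the group already exists.

```shell
#!/bin/sh
# Hypothetical post-fix check for the VM directory.
# check_vm_perms DIR GROUP: print every path under DIR (including DIR itself)
# that is not group-owned by GROUP or is missing group read/write permission.
# No output means everything looks right.
check_vm_perms() {
    find "$1" \( ! -group "$2" -o ! -perm -g+rw \) -print
}
```

For example, check_vm_perms "$VMDIR" vmware printing nothing means the fix took.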
RHEL3 and LVM
So there I was, installing Red Hat Enterprise Linux AS 3 on a new box. "RHEL3," you ask? Yes, the application that will be running on that box requires not version 5, not even version 4, but version 3.
Anyway, I boot into the GUI installer. The mouse doesn't work. Apparently the KVM module on this box is confused about the type of mouse it wants to present to the OS. Ok, no problem. I just reboot into the text-mode installer and go merrily on my way. I get to the partitioning step and spend several minutes trudging through the arcane requirements document provided by the vendor. For some reason, they want two volume groups rather than one that consumes the entire disk. [shrug] Calculate how to size the partitions. Create physical volumes. Done. Ok, time to create a volume group. Wait a minute. Where's the LVM button? The GUI installer has an LVM button on the partition screen that allows you to create volume groups and logical volumes. The text-mode installer is missing the LVM button!
I hit www.redhat.com and navigate my way to the install guide for RHEL3. Try the index first. Nothing about LVM. Ok, check the table of contents. Text mode installer user interface. Nope, nothing there. Maybe something in the Disk Druid buttons section? What's this? "Note, LVM is only available in the graphical installation program." Aargh!
Reboot yet again into the graphical installer with the plan to use the keyboard to navigate through it. Wiggle the mouse just for fun. Hey, it works! I guess the KVM module saw my dilemma and decided to have mercy on me.