BVLog Bryan Voss’ mental synchronization point

6 Mar 2008

Redundancy? We don’t need no stinkin’ redundancy!

We recently experienced a hard drive failure on one of our critical Linux servers. The server stopped responding on a weekend, of course. (Why do failures inevitably occur outside the hours that I'm normally in the office?) Since the server has two drives and was staged by the application vendor, I just assumed it was set up with a RAID1 mirror and at worst I would have to remove the failed drive and reboot to get it back up and running in a degraded state. It turns out the drives were set up as a RAID0 volume with no redundancy. When the single drive failed, it took the whole volume down.

I was eventually able to get the drive back online by reseating it and resetting the RAID adapter. I called vendor support to ask why the volume was set up with no redundancy and the answer was, "Our staging group doesn't configure servers that way. We always set them up with redundant RAID volumes."

"Well, thanks for the info, but I have a server here with a RAID0 volume that was provided by your staging group," I said.

"Sorry, there must be some mistake. See, we don't set servers up that way."

"(Sigh) Ok, thanks for your time." Weekend support was obviously not going to be any help.

I went through all the other servers that were a part of that system and were staged at the same time. All but one were configured as RAID0. We had received three additional servers that were staged later. They were configured with RAID1 redundancy, rather than RAID0.

We called our vendor rep Monday morning, and explained that the industry standard is to set your RAID volumes up with redundancy since hard drives tend to fail on occasion. He initiated an investigation into the problem and eventually admitted that there was a period of time that all Linux servers they shipped were configured as RAID0 rather than RAID1, but that the issue had been resolved. (The guy in their staging group who was setting them up that way was probably promoted to a manager or something, and the new guy knew more about industry standards.)

We asked them to provide us with a plan on transitioning to RAID1 on all the affected servers, but have not received a response yet. I suspect we will have to do it ourselves. Sigh. These vendors don't seem to have any contact with reality at times.

Filed under: linux, sysadmin
27 Sep 2007

Disk space and treemaps

One of the things a sysadmin must occasionally struggle with is disk space. (I just provisioned 500GB to that filesystem a year ago and it's already less than 10% free??) Although just adding more disk space is a brute force method of resolving the immediate issue, it's usually a good idea to find out what is taking up the space and whether it can be reduced by deleting large unnecessary or infrequently used files. I have a couple of tools that are useful for providing the info needed to do some cleanup.

SequoiaView is a handy way to get a quick overview of a filesystem and see if any particular files or directories are taking up the majority of available space. Files are displayed as rectangles sized according to the relative amount of space they consume. This makes it easy to find things like Windows service pack installers and other temporary files hanging around in temp directories taking up a lot of space.

SequoiaView is a free Windows application.

Another useful tool that I have found is JDiskReport, which provides various charts depicting largest files, oldest files, types of files, distribution of files based on modification time, etc.

JDiskReport is a free cross-platform Java-based app which can be installed or run via Java WebStart.

I generally start with SequoiaView to get a quick overview of large files, then use JDiskReport if I need to get more detailed info.
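On a Linux box with no GUI handy, a rough text-only equivalent of that first overview can be had from du. This is only a sketch: largest_dirs is a made-up helper name, and the sizes are whatever du reports in kilobytes.

```shell
#!/bin/bash
# Hypothetical helper: list the N largest directories under a starting point.
# Sizes are cumulative and in kilobytes, as reported by du.
largest_dirs() {
    local start="${1:-.}" count="${2:-10}"
    # -k: report sizes in KB; -x: do not cross onto other filesystems
    du -kx "$start" 2>/dev/null | sort -rn | head -n "$count"
}

# Example (path is only an illustration):
#   largest_dirs /var 20
```

The starting directory itself always shows up first, since its total includes everything below it. The interesting lines are the ones right after it.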

So there you go. Download some utils and get started cleaning up those old crufty files that are taking up all your space.

Filed under: sysadmin
23 Aug 2007

The case of the disappearing eth0

There have been a couple of occasions in the past week that I have lost an ethernet interface when swapping machines around. Looking back into my murky past, I can recall a couple of other times that I probably encountered the same issue. Don't recall how I resolved the issue before, but I have a definite solution now. I figured I should note it here so I can look it up later and so others can benefit from it.

Scenario 1: I build a Debian virtual machine using VMWare Workstation on my laptop. I later move the VM to a VMWare Server box. On first boot, VMWare Server asks if I want to assign a new UUID and I select yes. It turns out that the MAC address assigned to the virtual ethernet device is affiliated with the VMWare UUID. When the UUID changes, the MAC address changes. Debian assigns eth devices based on MAC address and therefore eth0 is lost after the MAC changes. The issue shows up when I try to start networking on the VM and eth0 doesn't come up.

Scenario 2: I install Debian on a PC-class box and tinker with it a while. It breaks (something to do with heat, probably a fan failure). I move the hard drive to an identical box and it boots fine, but eth0 doesn't come up. Same as above. Since Debian assigns the eth devices based on MAC address and the new ethernet device has a different MAC address, I get no eth0.

Solution: A comment on this post pointed me down the path of enlightenment.

/etc/udev/rules.d/z25_persistent-net.rules contains the MAC address to eth device mappings. Delete the lines like below, noting the module name on the "# PCI device" line:

# PCI device xxxxxx:xxxxxx ([module])
SUBSYSTEM=="net", DRIVERS=="?*", ATTRS{address}=="xx:xx:xx:xx:xx:xx", NAME="eth0"

This removes the MAC to eth device mapping info. Now we need to restart udev to allow the change to take effect:

/etc/init.d/udev restart

Next step is to "bounce" the kernel module for the ethernet device. Use the module name from the z25_persistent-net.rules file noted above:

modprobe -r [module]
modprobe [module]

"ifconfig" should now show the eth0 interface as up and running. If not, try "ifup eth0" and check "ifconfig" again. That rascally ethernet interface can't hide for long!

Update 2009/03/11: This post details a method that does not require modifying individual VMs. Probably a better solution for template VMs or virtual appliances.
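For the manual fix described earlier in this post, the rule-removal step can be previewed before touching the real file. The sketch below assumes the rules file has the layout shown above (a "# PCI device" comment line directly followed by its rule). strip_net_rule is a hypothetical helper, not something that ships with udev.

```shell
#!/bin/bash
# Sketch only: print a persistent-net rules file with the mapping for one
# interface removed, so the result can be reviewed before replacing the file.
# strip_net_rule is a hypothetical helper name.
strip_net_rule() {
    local rules_file="$1" iface="$2"
    awk -v iface="$iface" '
        BEGIN { target = "NAME=\"" iface "\"" }
        # Hold each "# PCI device" comment until we see the line after it
        /^# PCI device/ { if (have) print held; held = $0; have = 1; next }
        # Rule for the requested interface: drop it and its held comment
        index($0, target) { have = 0; next }
        # Anything else: flush the held comment, then print the line
        { if (have) { print held; have = 0 } print }
        END { if (have) print held }
    ' "$rules_file"
}

# Example:
#   strip_net_rule /etc/udev/rules.d/z25_persistent-net.rules eth0 > /tmp/net.rules
```

Writing to a scratch file first leaves the original untouched until the output has been checked. After copying the result into place, the udev restart and module bounce proceed as described.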

19 Jul 2007

Fixing VMWare Server directory permissions on Debian hosts

I often create a VMWare virtual machine using VMWare Workstation on my laptop, then later move it to a Debian machine running VMWare Server. As a result of the move from Windows to Linux, permissions on the VM directory end up being wrong and I get a blank black screen when connecting the VMWare Server Console to the VM. I normally just fix the permissions manually when I notice the problem.

Today, I was helping my father deal with the same situation via email and started typing out all the steps to create a group and fix permissions on the VM directory when I thought, "Why not be a good little sysadmin and write a script to do this?" So, here's the resulting BASH script:


#!/bin/bash
# Fix VMWare Server permissions on Debian host
# Bryan Voss 2007/07/19


# *** Config items ***
# Group that all VMWare users will belong to
VMGROUP=vmware
# *** /Config items ***


# Find directory where VMs are stored
VMDIR=`grep vmdir /etc/vmware/config | cut --delimiter=' ' -f 3 | cut --delimiter='"' -f 2`


# Add group that VMs should belong to
addgroup --system $VMGROUP


# Fix permissions on directories under VMDIR
[ -n "$VMDIR" ] && cd "$VMDIR" || exit 1
chgrp -R "$VMGROUP" *
find . -type d -exec chmod g+rwxs \{\} \;


# Fix permissions on vmx files
find . -name '*.vmx' -exec chmod -R +x,g+rwx \{\} \;
# Fix permissions on all other files
chmod -R g+rw *

Make sure any users who will be connecting via VMWare Server Console are members of whatever group you set VMGROUP to ("vmware" by default).
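A quick way to check that membership is sketched below. in_group is a hypothetical helper name, and the user name in the example is made up. On Debian, "adduser USER GROUP" adds an existing user to a group.

```shell
#!/bin/bash
# Hypothetical helper: succeed if a user belongs to a group.
# With no user given, the current user is checked.
in_group() {
    local group="$1" user="${2:-}"
    # id -nG prints group names separated by spaces; put one per line
    # and look for an exact, literal match.
    id -nG ${user:+"$user"} 2>/dev/null | tr ' ' '\n' | grep -qxF -- "$group"
}

# Example ("alice" is a made-up user):
#   in_group vmware alice || adduser alice vmware
```

Group changes only apply to new login sessions, so a user who was just added has to log out and back in before connecting.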

I have saved this as /usr/local/sbin/vmware-fixperms on my VMWare Server boxes and will probably use it pretty often. Maybe others will benefit from it.

29 Jun 2007

RHEL3 and LVM

So there I was, installing Red Hat Enterprise Linux AS 3 on a new box. "RHEL3," you ask? Yes, the application that will be running on that box requires not version 5, not even version 4, but version 3.

Anyway, I boot into the GUI installer. The mouse doesn't work. Apparently the KVM module on this box is confused about the type of mouse it wants to present to the OS. Ok, no problem. I just reboot into the text-mode installer and go merrily on my way. I get to the partitioning step and spend several minutes trudging through the arcane requirements document provided by the vendor. For some reason, they want two volume groups rather than one that consumes the entire disk. [shrug] Calculate how to size the partitions. Create physical volumes. Done. Ok, time to create a volume group. Wait a minute. Where's the LVM button? The GUI installer has an LVM button on the partition screen that allows you to create volume groups and logical volumes. The text-mode installer is missing the LVM button!

I hit www.redhat.com and navigate my way to the install guide for RHEL3. Try the index first. Nothing about LVM. Ok, check the table of contents. Text mode installer user interface. Nope, nothing there. Maybe something in the Disk Druid buttons section? What's this? "Note, LVM is only available in the graphical installation program." Aargh!

Reboot yet again into the graphical installer with the plan to use the keyboard to navigate through it. Wiggle the mouse just for fun. Hey, it works! I guess the KVM module saw my dilemma and decided to have mercy on me.

Filed under: linux, sysadmin
28 Jun 2007

Linux sysadmin tip #1: LVM

There's no excuse for not using logical volumes when setting up a Linux system anymore. All the major distros support LVM. All the major rescue disks support it. The benefits of LVM are too great to ignore. You set up a volume group containing all your disks, then create logical volumes on top of that. The logical volumes can be expanded when necessary until the volume group is filled. When that happens, just add another disk and expand the volume group across it. You can then continue expanding the logical volumes as needed. Logical volumes work on top of RAID arrays and SAN LUNs, as well.
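The "add a disk and expand" routine looks roughly like the sketch below. It only prints the commands rather than running them. Every device, volume group, and logical volume name is hypothetical, and the final step assumes an ext2/ext3 filesystem. Other filesystems have their own grow tools.

```shell
#!/bin/bash
# Sketch: print (not run) the commands for growing a logical volume after
# adding a disk. plan_lv_grow and all names passed to it are hypothetical.
plan_lv_grow() {
    local new_disk="$1" vg="$2" lv="$3" extra="$4"
    echo "pvcreate $new_disk"                 # initialize the new disk for LVM
    echo "vgextend $vg $new_disk"             # add it to the volume group
    echo "lvextend -L +$extra /dev/$vg/$lv"   # grow the logical volume
    echo "resize2fs /dev/$vg/$lv"             # grow the ext2/ext3 filesystem to match
}

plan_lv_grow /dev/sdc1 vg0 data 50G
```

Printing the plan first makes it easy to double-check the device names before running anything as root.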

About the only thing better is Solaris' ZFS, which I've just begun experimenting with recently. But that's a topic for another post...

Filed under: linux, sysadmin
28 Jun 2007

One miiiiillion files

A sysadmin rule of thumb I had occasionally heard in the past was to limit the number of files in a directory to around 2000. Larger numbers of files were historically difficult to deal with on old systems with limited RAM and processing capability. And of course old filesystems had limitations on the number of files per directory as well.

We are migrating data from our old document imaging system to our new one. As part of the process, we shipped off a copy of all the optical platters from the old system to an agency that supposedly specializes in the migration process. They pulled the image data from the platters and dumped it all to USB drives along with scripts that we run through the new system to import all the images with the correct index information. I just received two of the three USB drives, so I hooked one up to a virtual machine to see how things look.

Two directories on the root of the drive, one for images and one for scripts. Took a look at the images directory. Lots of subdirectories with lots more subdirectories under each one. Looks ok.

Changed to the scripts directory. Since I'm browsing in Windows Explorer, I get the animated folder and flashlight indicating that it's reading the directory info, please wait. Ok, no problem. I go check my email. Come back to the window a couple of minutes later and it's still reading. What's going on here? Since I'm running it on a VM, I think maybe that's slowing things down.

I close the window, disconnect the drive, and connect it to my laptop. Open a DOS prompt. cd to the directory and type dir. Text starts scrolling by. Looks like all the scripts are in one directory with no subdirectories. Ok, that's to be expected since the test batches the agency sent us previously were laid out the same way. No problem, how many files could there be in that directory? (Text is still scrolling in the window.) I minimize the window in the hopes that it will speed up the process if it's not having to update the display.

I go do something else for a while. Come back about 5 minutes later and bring the DOS window back up. Still scrolling. WHAT?!? This is crazy. I don't remember offhand whether the dir command under DOS even shows the file count anyway (it does, as it turns out), so I ctrl-C the process and close the window. Gotta break out the real command line utils. I launch a cygwin bash window and cd to the directory in question. Type "ls -1 | wc -l". For the uninitiated, this will generate a single-column listing of the files in the directory and pipe it to the word count util, which will count the number of lines returned. I leave this running and go down the hall to the datacenter to start a Red Hat install on a new box.
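For a directory this large, part of the wait can come from ls reading and sorting the whole listing before it prints anything. A variation that skips the sort is sketched below. count_entries is a made-up name, and this is not what was run that day. Whether it would have helped much on a slow USB drive is an open question.

```shell
#!/bin/bash
# Hypothetical helper: count the entries directly inside a directory.
# find emits names as it reads them, with no sorting step.
# Caveat: a file name containing a newline would be counted more than once.
count_entries() {
    local dir="${1:-.}"
    find "$dir" -mindepth 1 -maxdepth 1 | wc -l | tr -d ' '
}

# Example:
#   count_entries /path/to/scripts
```

Unlike a plain ls, this also counts dotfiles and subdirectories, so the two numbers can differ slightly.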

I come back to my desk about 15 minutes later and the process is still running. This is crazy! I trust bash more than DOS, though, so I let it run. Finally after about 30 minutes, the prompt suddenly appears again. I look at the output of the wc command. Count digits. Double check. Yes, my first glance was correct. One million, fourteen thousand, seven hundred eighty-seven files in a single directory. I'm actually mildly surprised the FAT32 filesystem can handle that many files in a single directory.

My next challenge is copying all those files to a directory on the server that needs to process them. Is the DOS copy command up to the task? We shall see...

Filed under: sysadmin, windows
25 Jun 2007

I’m gonna need you to go ahead and reboot the mainframe, mmmmkay?

I just saw a trouble ticket come in from the Help Desk: "User reports error 12 in application. This indicates that the mainframe needs to be rebooted."

Ummm. Thanks. I'll get right on that.

  1. We don't have a mainframe.
  2. If we did, we probably wouldn't reboot it because one user is getting an error in an application.
Filed under: sysadmin
15 May 2007

Do not be afraid

I am occasionally asked how I learned so much about [insert technical subject here]. My answer is always: "I just started playing with it until I figured it out." The point I try to get across is that I learned it by doing it.

A coworker asked me to sit down with him and teach him some things about Linux. We can spend all day talking, but the only way to figure out the ideas and concepts behind the Linux shell is to live in it for a while. I told him that around 10 years ago, I determined that I needed to learn Linux since it looked like it was going to be an important platform and doggone it, I didn't have the money to keep buying commercial software. I threw out Windows entirely and used Linux for everything at home. I started a sysadmin job at a small manufacturing company and proceeded to migrate everything I could to Linux. I was probably overzealous in my attempts, but I learned a huge amount as part of the process. I spent countless hours reading man pages and howtos. I signed up on the local Linux users group email list and asked questions. Eventually, I got good enough that I didn't have to post many questions anymore. Then I got good enough that I was able to post responses and help other people with their Linux problems.

I never would have gotten where I am today without banging my head against seemingly insurmountable problems, breaking all kinds of systems and rebuilding them, doggedly sticking with Linux even though it would have been trivial to solve a problem with Windows. I have since learned to back down occasionally and recognize the particular situations where Linux is the right choice and to accept the situations where it's not the right choice. But I wouldn't trade my history for anything. It's what got me to where I am today.

Here's the point in all this: do not be afraid to play with a system. Build a test box if the system is mission critical. Just the experience of building the test system is useful. Break it and rebuild it.

Learn the history of whatever project or system you're experimenting with. Feel the mindset and methods of the developers. There's probably a reason they did it that way.

Once the system is in production, it will eventually break in some unexpected and odd way. Don't be afraid to open the hood and fix it, just document what you did (an internal blog makes a great worklog).

Welcome uncertainty, it's a learning opportunity in disguise. The only way to gain certainty is to act decisively. No route is perfect. Pick what looks like the best one and run with it. If it turns out to be the wrong route, at least you learned something along the way.

But above all, do not be afraid.

Filed under: linux, sysadmin
