August 18, 2005

HongPong.com [OK]: Back in Black + Quad RAM; Gentoo Linux still r0x0rs the b0x0r

A mercilessly geeky tale: I am recording this so that I and others may deal with similar problems better in the future. I will soon forget the details of how I fixed it, so it is best to write it down now.

It took a couple days, but the Linux server (Tarfin), a reliable Dell Dimension 4400 running Gentoo Linux, is back from its brush with Hardware Hell. The problems began after I found out about my new mysterious Politics in Minnesota project... The work at this stage would best happen using MediaWiki, I reckoned. MediaWiki has performed well as the HongWiki platform, and has reliably served wiki pages that have done Real Well on Google, although with the service problems it's gone south a bit.

So my new WordPress-powered HongPong website (under development) takes a lot more RAM to serve PHP files than this current MovableType-powered HongPong.com, and as I sat down to get the Politics in Minnesota project going, I noticed that Tarfin was basically maxed out for RAM. It only had 128 MB, which is really way too low for this. Only a few megs of RAM were free, and 80 MB had spilled into the swap partition (which is the same idea as Virtual Memory on a PC or Mac). Gridlock.
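
For anyone wanting to check the same thing, here is a minimal sketch that pulls the RAM and swap numbers out of /proc/meminfo. The mem_summary name is made up for illustration, and it only reads the four standard fields, so treat it as a rough gauge and not a proper monitoring tool.

```shell
#!/bin/sh
# mem_summary [FILE]: hypothetical helper, not a standard command.
# Reads a meminfo-style file (default: /proc/meminfo, values in kB)
# and prints free RAM and used swap in whole megabytes.
mem_summary() {
    file="${1:-/proc/meminfo}"
    awk '
        $1 == "MemTotal:"  { mt = $2 }
        $1 == "MemFree:"   { mf = $2 }
        $1 == "SwapTotal:" { st = $2 }
        $1 == "SwapFree:"  { sf = $2 }
        END {
            printf "RAM free: %d of %d MB\n", mf / 1024, mt / 1024
            printf "Swap used: %d MB\n", (st - sf) / 1024
        }
    ' "$file"
}

# Only run against the live system if /proc/meminfo is actually there.
if [ -r /proc/meminfo ]; then
    mem_summary
fi
```

A few megs free with a big chunk sitting in swap, as on Tarfin, is the sign that the box is thrashing and needs more RAM.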

So in other words the stress of serving had totally maxed out the RAM, which I noticed when the site -- which is usually lickety-split quick over the LAN here -- was going much slower. More RAM, always a good solution. I looked up my usual suspects, namely Tran Micro and General Nanosystems on University, whose prices will pretty much always beat Best Buy type places. Only Nano had the type of RAM for Tarfin, PC2100 DDR SDRAM. So I got two 256 MB chips (I'd have liked a 512, but they didn't have one).

The Dell only has 2 slots, thanks Dell, so I pulled the old 128 and put these in. Turned it on, it booted fine, and I ran 'emerge sync', the nice Gentoo command that refreshes the package tree so I can update all the various Linux software packs I have running. This streamlines one of the bitchiest problems in systems administration - tracking down the damn software packs and keeping up with their security patches.

It ran alright until suddenly it hit a Segmentation Fault, followed shortly by a Kernel Panic, the hardest Crash that Linux can Go Down with - it's real ugly, gibberish and Hex codes spilling all over.

So I had to reboot. The file system checker program, fsck, auto-scanned the main partition and found all sorts of horrible errors. I tried to have it fix them, but then it hit another Segmentation Fault:

A segmentation fault occurs when your program tries to access memory locations that haven't been allocated for the program's use.

Therefore I should have thought that maybe it was the damn new chips. I had a flashback to the death of the first Hongpong.com (the one that got me suspended from MPA) - which was an old PowerPC 6100/60 running a hacked old Linux, whose hard drive abruptly refused to come back from a nasty death right around when I graduated from high school. And I had no backups. In other words, the first HongPong server died almost exactly four years ago, and took with it the great contributions of everyone in that strange season of 2000-2001. It couldn't happen again, could it?

So I started looking around the various forums for a solution to a sudden filesystem corruption, one of the true hells of computing. To compound this, I hadn't backed up all the new HongPong site stuff, nor the MySQL databases that run the sites, in quite a while. Fortunately I had just exported this entire site a few days ago to put it into WordPress (as it is now - mostly purged of the spam), so if it truly crashed, the Bulk would be safe.

After reboots, I could come back to the low-level emergency maintenance fsck (file system check) shell, and from there I could READ the messed up drive, but not write to it without risking more damage. And I could see that most files seemed ok. But I couldn't get the file sharing, or Apache webserver, or MySQL database running again, without risking wrecking it. And I couldn't figure out what was really wrong. The solution?

Install a brand new Gentoo Linux setup on another old hard drive I had sitting around, and then pull the old stuff off the messed-up drive in Read-Only mode. After I put the drive in, the handy BIOS error light told me something was dreadfully wrong and it wouldn't boot at all. I found that on a Dell you have to set only the 'cable select' ATA hard drive jumper pins - the machine automatically takes the last drive on the ATA cable to be the Master drive. So I did that but it was still stuck.

I had pulled out the new RAM earlier, but I'd put it back in by this point. Then I tried taking out one of them. It booted! I pulled that one out, and put the other in. It halted! When I put both in, it would boot, but if I switched them, it would halt. In other words, the Dell could detect the bad RAM when it's by itself, but NOT necessarily when it's with others, BUT this depended on their order.

So I returned the bad RAM to Tran Micro the next day, and they nicely exchanged it for another one and tested it there in the store. It was OK, so I was on my way, and everything went smoothly afterwards. (Other than this incident of random bad RAM, Tran Micro are fine folks - this could happen anywhere - their service was all right.)

I used the memtest86 memory checker on the Gentoo Linux install CD to Make Very Sure they were ok - I wish I'd done it earlier. So it took a few more hours, especially since when I installed Gentoo on this machine a year ago, I hardly took any notes about it. There are some weird things about the Dell machine - in particular, (some/all?) Dells have a strange first boot partition, or /dev/hda1 in Linux parlance, which makes the Dell screen and some BIOS stuff happen. I think I destroyed this partition last time, and it's a huge pain in the ass to repair with floppy disks and stuff.

The problem is that the Gentoo Linux install instructions tell you to put GRUB, the bootloader, on /dev/hda or /dev/hda1, and this time I almost commanded 'grub-install /dev/hda' before I caught myself. That would have taken hours to fix. Instead it must be on /dev/hda2 or /dev/hdb1. hda2 is, I think, automatically loaded up after the Dell thing is done. But I did it right, and so I was able to reboot Linux and finish installing the system.

Downloading & installing the key web programs was easily done with 'emerge apache php mod_php' and the correct USE flags. Other various things were properly updated and recompiled.

I was able to get back into the messed-up drive using read-only mode, which doesn't touch the filesystem. All the elements of the site copied easily to the new drive. Happily, the MySQL database -- which can really be a bitch to put together from a crashed system, if you don't export it cleanly first -- went over VERY easily. All I had to do was 'cp -av * /var/lib/mysql' from the old /var/lib/mysql. Then a reboot, plugging it back where it belongs in my bedroom, and All Systems [ OK ].

So now, in short, I have a TON of Actual Real Professional Work for both Politics in Minnesota and Computer Zone. I don't have time to say much else about the Gaza situation and so forth. sry!

Posted by HongPong at August 18, 2005 05:41 PM
Listed under HongPong-site, Open Source, Politics in Minnesota, Technological Apparatus.