Jonathan's New Toy: GEMINI, the Terabyte Dual Opteron

Being a chronicle of my adventures getting operational on a new machine; I hope this may be of interest to others considering running Linux on the Opteron.

SOTHIS, GEMINI and JUPITER, the JSR Mission Control Center

GEMINI is the cool looking black box

Making the JSR

Some of you may be interested in the hardware used to make JSR a reality. Behind each week's JSR is a set of files containing gigabytes of orbital data and an extensive library of software to analyse it, as well as photographic and document archives. In addition, I run big codes to analyse astronomical datasets, using up lots of disk for both the data and the software, and lots of memory and CPU for the crunching. For several years I have been using JUPITER, a Linux box built by my local supplier, PCs For Everyone, with an AMD Athlon 750 CPU, 0.5 GB of RAM, and 100 GB of disk. (The JSR and public databases are then copied to another Linux box somewhere in Texas that I rent, the box which hosts planet4589.org. Don't bother trying to hack JUPITER and GEMINI, they are not exposed to the net.) My disks are more than full, and I need more memory for the image analysis tasks and more CPU to speed up the JSR number-crunching. Anyway, every self-respecting rocket scientist needs a terabyte at home! Time to buy myself a present...

Introducing GEMINI

For my new machine I went to a company that usually makes Beowulf clusters, PSSC Labs in California. Here's what they built me:

CPU                               Dual Processor AMD Opteron 246
OS                                Red Hat Linux (64-bit)
Memory                            2.0 GB
Internal Hard Disks (3)           720 GB
External USB Hard Disk (1)        192 GB
External SCSI Hard Disks (2) (*)  106 GB
Total Hard Disk (*)               1018 GB
Monitor                           Samsung 21-inch LCD (Syncmaster 210T)
CDRW/DVD                          CDRW/DVD
Sound                             Altec Series 100
Controllers                       Promise IDE, Adaptec SCSI
Printer (*)                       HP Laserjet 4100
Scanner (*)                       HP Scanjet 6100C

(*) Existing external peripherals taken from JUPITER.

Bringing GEMINI To Life

Friday, Oct 24: I pace up and down ready for the truck to arrive. From the Web tracking page, I know it reached Boston at 8am. There's a departure scan at 8:50. It's 10am - how long can it take them to get here? Oh... now I see they actually departed at 10:30, the other scan didn't mean anything. Dum-de-dum... postpone my 10am meeting again... it's noon, I'm getting really hungry. Dah-dah! 12:45, and here they are. We lug it up to JSR Mission Control. It's a huge shapeless mass wrapped in black and then shrinkwrap and strapped to a wooden pallet. I wonder if there's really a computer inside, but I have to actually go into work for a few hours so the fun will have to wait.

Later on Friday: the wrapping cast aside, I open the two big boxes. The case and monitor are a very cool black. Great! Well worth the money. Who cares if it actually works?

Many cables later, it's time to throw the switch. Where is the switch? I had this problem with JUPITER a few years ago and had to call tech support so they could tell me where the on/off switch was. I refuse to be so humiliated this time around. Eventually, I discover the whole front panel has to be hinged open to reveal the on/off switch and the CD drive etc. OK...

Nothing. The monitor is still doing its "No digital signal" thing. The computer's making a few sounds, but I can't see anything. Maybe it's the monitor, I'll switch to an old one.

Aliveness! It finds a couple drives, and goes through the SCSI BIOS, and... oh. Blank screen with blinking cursor. Not so good.

Well, let's try and figure out what's wrong with the monitor. Maybe I'm missing a cable. Aha! There's a cable for the digital video, the monitor manual says I have to connect that too. OK, now I can boot with the new monitor and get the same unfortunate result. It looks like it's seen the hardware but isn't booting. Except that it only mentions two disk drives and there should be three. Hmm. I try disconnecting most of the cables I connected earlier and simplify the problem to just the box and the monitor. No change. I'm stuck. Bloody hell, I hope I don't have to ship this monster back to the West Coast.

"Hi, PSSC Labs?... I have a problem." "Is that Jonathan?" Great, I've been bugging this poor sales guy for so long he immediately recognizes my voice. (He started out trying to sell me a somewhat less ambitious computer but I waffled for a couple of weeks and managed to negotiate the price up by another couple thousand or so as I added more stuff...) They pass me to the tech guy, and he tells me to check all the connectors.

OK, I get out the Phillips screwdriver and the anti-static strap and lug the beast onto the table for an operation. Everything seems pretty solid, maybe one of the disk power supplies feels a bit loose. I shove everything down hard and try again. Same result.

"Hi, PSSC Labs?..."

This time Victor the tech guy gets serious. We puzzle together for a while, getting sidetracked by the missing disk drive (in fact it's connected directly to the motherboard and doesn't show up on the Promise BIOS) until he leads me through the BIOS setup and we discover the boot disk info has gotten messed up somehow. Reset Defaults does the trick and soon we are booting into Linux. Hooray!

While I've got this guy on the phone, let's see how far I can get. We soon run into another problem: X won't start. They haven't configured it for this monitor (it was shipped straight to me), so I was worried about this. I manage to fool Victor for quite a while as we mess with redhat-config-xfree86 without success, until he starts quizzing me about cables. I confess that I have both the analog and digital video cables connected at once. Oops. Turns out this was *not* what the manual should have said. Once I remove the analog cable, things go much better. We're in X!! OK, it's Red Hat's sucky desktop thing, but I can fix things from here. We do some magic that I forgot to take note of, which sets the resolution to 1600x1200 for the LCD, and as a bonus add the USB disk drive (damn, but the output of "df" looks very fine...), and test the CD player and the speakers by putting some Beatles on. Now we're rocking.

So far I'm very happy with PSSC's quality and support. The hardware seems in good shape and my first bump in the road has been solved with instant, polite, patient and expert tech support. I mention this because I've never had that experience before...

Getting Rid of GNOME

Friday, Oct 24, after dinner: Now it's time to make this a machine I can work with. First step, add a couple of users from the root desktop system menu: me and my friend Zaphod Beeblebrox, whose account I use to test things when I can't figure out why mine isn't working. Wow, I hate this window manager. You can only resize windows from the bottom right, while I like to do that from the top right. It may sound petty, and maybe there's even a way to change that, but it's enough to make getting back to fvwm my priority.

Second, copy over my home directory - with config files - from JUPITER. scp seems to work fine.

Third, I put my NanoEmacs executable in the path. OK, I'm a geek, I hacked my own version of emacs because the existing 500 alternative versions weren't good enough for me. What can I say? The good news is that it runs fine without recompilation, so AMD's promises about 32-bit compatibility are good so far.

Now, how can I stop it coming up with Gnome? I hate modern desktops, and particularly ones that try and look like Windows. I much prefer raw X11/fvwm. Ahah! It's enough to have .xinitrc in your home directory which overrides /etc/X11/xinit/xinitrc (the system init which contains the evil instructions "exec /etc/X11/xinit/Xclients" which in turn leads to /usr/bin/gnome-session). My .xinitrc starts a few xterms and then runs fvwm2, massively customized by my $HOME/.fvwm2rc to have the window menus and colors the way I like them.
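For anyone wanting to try the same escape from Gnome, here is a minimal sketch of what such a .xinitrc might look like. It is not my actual file: the number of xterms and the geometry values are made up for illustration, and it assumes xterm and fvwm2 are both on the PATH.

```shell
#!/bin/sh
# ~/.xinitrc (sketch): startx runs this instead of /etc/X11/xinit/xinitrc.
# The geometries below are illustrative assumptions, not my real layout.

# Start a couple of terminals in the background.
xterm -geometry 80x24+0+0 &
xterm -geometry 80x40+0+400 &

# Run the window manager last, in the foreground.
# With exec, the X session ends when fvwm2 exits.
exec fvwm2
```

The window manager goes last because the X session lasts only as long as this script does; anything started before it has to be put in the background with &.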

Next, there seems to be a broken symbolic link: /usr/include/X11 points to the wrong place. I su to root and fix it to point to /usr/X11R6/include/X11 so that fvwm can find the right helper programs. I also edit (with NanoEmacs) my .fvwm2rc to remove lots of unneeded references to bitmaps that the latest X installation doesn't seem to have. There is probably another way to do this now, but my config file has stayed pretty much fixed since I switched from twm in the mid-90s. After this, /usr/X11R6/bin/startx works fine and puts me in fvwm2 with my standard startup windows in roughly the right places and with the right colors. Progress!

The function keys don't seem to be recognized. I find an easy fix: jettison the shiny new black keyboard and use the keyboard from SOTHIS, JUPITER's predecessor. Now I can use F5 to raise and lower windows just like on my old Sun workstation at work. Maybe I'll debug the issues with the new keyboard later.

Mozilla's not working now. It worked when I was root! Maybe fvwm doesn't like it - if I type mozilla it just dies immediately with no error message, and since there doesn't seem to be a simple -debug=n option I don't see how to find out what's wrong. This is a job for Zaphod.

It wasn't fvwm after all, since Zaphod can run both fvwm and Mozilla just fine. Turns out it was my old .mozilla directory; deleting this solved the problem. (I mention stupidities like this in case someone else runs across them.)

Moving To The World of 64-bit Linux - 1

Now I copy over a few small 32-bit astronomy applications. SM works fine but ds9 dies.. maybe this won't be so easy. Oh, but the new beta ds9/3.0b6 works fine - good job Bill.

Now let's try compiling in 32-bit mode. First benchmark: my C utils library. JUPITER compiles it in 26s actual elapsed time, GEMINI compiles it in 7s. Not bad for a real world improvement. I use the -m32 flag on gcc to force 32-bit. Without this flag, it seems to compile equally fast and not complain, and I assume I'm getting a 64-bit result, but testing that hypothesis will have to wait.
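One way to test that hypothesis without running anything is to look at the ELF header of the compiled file: the fifth byte (EI_CLASS) is 1 for 32-bit and 2 for 64-bit. The sketch below is not something I ran at the time; the function name elf_class is made up, and it assumes an od that supports the POSIX -A, -t, -j and -N options. The file command reports the same information in its own words.

```shell
# elf_class FILE: print "32-bit" or "64-bit" from the ELF header of FILE.
# (elf_class is a hypothetical helper name, not a standard tool.)
elf_class() {
    # Every ELF file starts with the four bytes 0x7f 'E' 'L' 'F'.
    magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    if [ "$magic" != "7f454c46" ]; then
        echo "not ELF"
        return 1
    fi
    # The byte at offset 4 is EI_CLASS: 1 = 32-bit, 2 = 64-bit.
    class=$(od -An -tu1 -j4 -N1 "$1" | tr -d ' \n')
    case "$class" in
        1) echo "32-bit" ;;
        2) echo "64-bit" ;;
        *) echo "unknown"; return 1 ;;
    esac
}
```

Object files are ELF too, so this can be pointed at a .o straight out of gcc, with or without -m32, before any linking is attempted.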

Hmm - but compiling an executable in 32-bit mode complains there is no /usr/lib/crt1.o.

More Setting Up

So it's early Saturday morning, I started unwrapping the computer about 8 hours ago and I'm pretty much operational (bar the printer, the scanner and the external SCSI disks, which can all stay on JUPITER until I'm ready to make GEMINI the primary machine).

Saturday Oct 25 - a bright sunny morning. Time to start sftp on its way and load up the disks while I go deal with real life for a while.

Hmm.. an odd thing.

ln -s /data1/foo foo 
makes a soft link foo pointing to the directory /data1/foo. But ls foo (actually ls -F foo thanks to an alias) doesn't work; the soft link does not resolve. What's up with that? Hmm - more specifically:
ls -F foo 
doesn't show the contents of foo but
ls -F foo/ 
or
cd foo
does work. Very odd. Ahh.. ls -FH appears to restore the old behaviour.
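For the curious, the behaviour can be reproduced safely in a scratch directory. This is a sketch, assuming an ls that follows the POSIX rules for -F and -H; the directory and file names are invented stand-ins for /data1/foo.

```shell
# Reproduce the symlink listing behaviour in a throwaway directory.
demo=$(mktemp -d)                      # scratch area standing in for the real disks
mkdir -p "$demo/data1/foo" "$demo/work"
touch "$demo/data1/foo/orbit.dat"      # hypothetical file inside the target directory
cd "$demo/work" || exit 1
ln -s "$demo/data1/foo" foo            # same shape as: ln -s /data1/foo foo

plain=$(ls -F foo)     # -F: a symlink named on the command line is not followed
deref=$(ls -FH foo)    # -H: follow symlinks named on the command line
slash=$(ls -F foo/)    # the trailing slash forces the link to be resolved

echo "ls -F foo   -> $plain"
echo "ls -FH foo  -> $deref"
echo "ls -F foo/  -> $slash"

cd / && rm -rf "$demo"                 # clean up the scratch area
```

The first command should print the link itself, marked with @, while the other two list the contents of the target directory. So changing the alias from ls -F to ls -FH gets the old behaviour back.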

Oops. No wonder that transfer took so long. I accidentally created a 13 GB tar file and sftp'd it, thinking it was only 1.3 GB. Well, I guess large file support works just fine...

Installed xv (and yes, I actually have a registered copy) and acroread in /usr/local/bin and verified operation.

The following is a bad idea:

df /data1  
 (5 GB free out of 70 GB)
du -s foo
  6 GB
cd /data1
tar cvf /data1/a.tar foo
[hang]
df /data1
 (0 bytes free)
rm /data1/a.tar
df /data1
 (still 0 bytes free!)
This required a power-off, reboot, and e2fsck to repair. tar doesn't cope well with writing to a disk that fills up.
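A cheap guard against repeating this is to compare the du and df numbers before starting tar. The sketch below is only a rough check: the function name fits_on is made up, it assumes du -sk and df -Pk behave as POSIX describes, and it ignores the small overhead of tar headers and padding.

```shell
# fits_on DEST SRC: succeed only if SRC (as measured by du) is smaller
# than the free space on the filesystem holding DEST (as reported by df).
# (fits_on is a hypothetical helper name.)
fits_on() {
    need_kb=$(du -sk "$2" | awk '{print $1}')          # size of the source tree in KB
    free_kb=$(df -Pk "$1" | awk 'NR==2 {print $4}')    # "Available" column in KB
    [ "$need_kb" -lt "$free_kb" ]
}

# Usage, with the paths from the transcript above:
#   cd /data1 && fits_on /data1 foo && tar cvf /data1/a.tar foo
```

With 6 GB to archive and 5 GB free, the check fails and tar never starts.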

Saturday night. Honest, I do have a life, but not this weekend. Now I've added the external SCSI disks and the printer and scanner. I've changed the hostname to gemini and changed the name of the USB external drive to /export. Looks like I have to go

mount /export
after each reboot; I got some USB timeout messages at the end of the reboot process.

Now that's a nice set of disks:

gemini> df -H
Filesystem             Size   Used  Avail Use% Mounted on
/dev/hda3              31GB  5.5GB   24GB  19% /
/dev/hda1             104MB   25MB   75MB  25% /boot
/dev/hda2             104GB   21GB   78GB  21% /data1
/dev/hda7             109GB  9.0GB   95GB   9% /data2
/dev/hde1             124GB   19GB   99GB  16% /data3
/dev/hde2             124GB   17GB  101GB  14% /data4
/dev/hdg1             124GB   34MB  118GB   1% /data5
/dev/hdg2             124GB   34MB  117GB   1% /data6
none                  1.1GB      0  1.1GB   0% /dev/shm
/dev/sdc1             197GB   34MB  187GB   1% /export
/dev/sda1              73GB   61GB  7.7GB  89% /data7
/dev/sdb1              37GB  5.5GB   30GB  16% /data8
OK, I lied, it's only 0.996E12 bytes after formatting and file system. But there's also 4 GB of swap on /dev/hda5 and /dev/hda6, making 1.000E12 bytes = 1 TB in use. Of course, this is only 0.909 TiB = 931 GiB. Unformatted, it's a lot more, so it's a terabyte as far as I'm concerned. Time to look at the printer.
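The arithmetic behind those numbers, for anyone who wants to check it. This is just a sketch using awk as a calculator; the variable names are mine. df -H reports sizes in powers of 1000, which is why the total comes out in decimal terabytes.

```shell
# Formatted disk space plus swap, in bytes (figures from the text above).
total=$(awk 'BEGIN { printf "%.0f", 0.996e12 + 4e9 }')

# The same total in binary units: 1 TiB = 2^40 bytes, 1 GiB = 2^30 bytes.
tib=$(awk -v b="$total" 'BEGIN { printf "%.3f", b / 2^40 }')
gib=$(awk -v b="$total" 'BEGIN { printf "%.0f", b / 2^30 }')

echo "$total bytes = $tib TiB = $gib GiB"
```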

I log in as root, and run the printer config tool, setting up a print queue "lp" on /dev/lp0 for an HP Laserjet 4100. The test page doesn't work:

There was a problem sending CUPS test page to 'lp' queue. 
lpr: unable to print file: server-error-service-unavailable.
Great. I try looking at the help. It shows you actually selecting the test page in the printer config window. I try this, selecting 'US Letter Postscript Test Page' and it works. Must be a bug in the conf tool then - phew. The ASCII test page works too, with none of the misalignment that one used to risk before print filters.

Now let's try the scanner. I find a scanning tool in the menu; of course in this stupid desktop there's no easy way to find out what the command line is, but the thing is called XSANE so I look in /usr/bin and find something plausible. It gives me big warnings not to do this as root. I'll go back to being me (and therefore in fvwm, let's hope that's not a problem).

Hmm, that's interesting. I told it to log out of gnome but it hasn't done so and now I can't get any menus at all. I'll log in remotely and kill X I guess.... wait. Just as I was doing that, it finally shut down, complaining that redhat-config-printer lost its connection. Guess I should have exited from it somehow.

Log in as me, print a test page - ok, that's still working, good! Typing 'xsane' succeeds (it turns out it's /usr/bin/xsane; I *wish* the Gnome menus had an option to print that info in the menu - maybe they do of course, how would I know?). USB has worked wonderfully: it has detected my scanner and when I hit acquire preview it burbles slightly. Eventually a scan appears! Cool, the hardware works.

Even more of a miracle, Mozilla actually prints properly (for some reason on my old machine it took a couple of minutes to sync to the print system, so I've been using netscape4.7 still - which I still much prefer anyway in terms of look and feel).

Next is the CD burner. I've been told to run 'nautilus-cd-burner', but this gives errors - 'libmapping.so returns a null handle' and 'you need to copy the files to the burn:/// location'. Big help. ...OK, a little browsing indicates this is something intimately linked with Gnome and which will require (shudder) using the file manager. Screw that: I'll do it the old way, directly with mkisofs and cdrecord... OK, I wrote a CD and read it on another machine, so that's good.

Moving To The World of 64-bit Linux - 2

Sunday Oct 26: I try out our CIAO X-ray astronomy software. A trivial test seems to work ok. An adaptive smoothing test takes 1h 22m compared to 3.0h on JUPITER and 4.0h on the SunBlade URANIA.

Now let's test the dual processor aspect of things. I run the same smoothing code, but two copies of it at once. 'top' shows they're running one on each processor. One completes in 1hr 18min, the other a few minutes later at 1hr 30min - almost no overhead for running the second copy (the input file was the same so there may be some unfair I/O overhead). Looks like I really have a working dual processor machine.

Recompiled JSR software in 64-bit. Crash. Oh, I made the new include file with UT_SZ_PTR = 8 but am still pointing to the old one with UT_SZ_PTR = 4. Well, that's a good test that I'm really getting 64-bit. Recompile again, and it works. Simple tests show no problems. Factor of 4 speed improvement in run time on some test applications relative to JUPITER. That's it? I've migrated my applications to 64-bit and it took me 30 minutes? I must have been more careful coding this stuff than I thought...:-)

Well, guess I have to wait until Monday to see if PSSC can fix my 32-bit libraries.

Thursday Oct 30: PSSC found the fix for me on the 32-bit stuff (we needed to install the right 32-bit glibc-devel and then reinstall the 64-bit glibc-devel.) This was the final glitch (I hope!) and GEMINI is running great.

Friday Oct 31: I hate the move of Red Hat towards all-Gnome stuff. The PostScript viewer 'gv' is not installed - instead they have 'ggv', a Gnome version of gv, whose look and feel is much inferior in my opinion. [Nov 18: Finally just copied /usr/X11R6/bin/gv and /usr/X11R6/lib/libXaw3d.so.7.0 from JUPITER.]

Thursday Nov 6: One software portability problem found so far (my fault, not the Opteron's). In testing for NaNs, our C code has a header file which maps bit patterns by (unsigned long*) when using POSIX-1 code that doesn't support the SUS isnan() function. A quick fix is to change this to (unsigned int*) if UT_SZ_I (== sizeof(long)) is 8. I think this is the one place in the code where there was an #ifdef _alpha_ (instead of explicit handling of sizeof(long)).

Nov 15: On my big screen, Mozilla uses a tiny font. I partly fixed this by setting a big minimum font size in Edit/Preferences/Appearance/Fonts. But then printed pages came out tiny. So I also had to do File/PageSetup and set Scale to 60 percent, switching off ShrinkToPageWidth. This doesn't work perfectly.

Nov 17: JAVA ON THE OPTERON: it turns out Sun won't have Java ported to the Opteron until next summer. So Java is 32-bit for now. There was no /usr/bin/java as delivered. I installed the 32-bit Java runtime using the i586 RPM from Sun at http://java.sun.com/j2se/1.4.2/download.html. This created /usr/java/j2re1.4.2_02/bin/java. I made a soft link, /usr/bin/java, pointing to it.