The end is nearly in sight

Here are some pictures of dexter, the new compute cluster we’ve been in the process of purchasing, installing, and configuring for the last few months. It’s been a long process but we’re now onto final acceptance testing. It belongs to the Wales, Frenkel, and Miller groups. Chris Whittleston took the pictures.

We have run out of space to put new machines in Chemistry so the machine is being hosted in Engineering. We’re renting rack space in their research server room, which is a modern design with very efficient cooling. This is the first time we’ve managed machines on a remote site so we paid a lot of attention to getting the emergency access features right. We can connect to every node in the cluster as if we were sitting in front of it with a monitor and keyboard plugged in. We can even turn the power supply off and on over the network.
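
For the curious, here is a minimal sketch of how this sort of out-of-band access typically works, using IPMI. The hostname and credentials below are placeholders, and this isn’t necessarily the exact mechanism on dexter:

    # Turn a node's power off and on again from afar
    ipmitool -I lanplus -H node01-ipmi -U admin -P 'secret' chassis power cycle

    # Attach to the node's console as if a monitor and keyboard were plugged in
    ipmitool -I lanplus -H node01-ipmi -U admin -P 'secret' sol activate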

dexter from the back

One half of the front. We’ve spread the machine out over two server racks. It would physically fit into a single rack with room left over, but there is a limit to the electrical power supplied per rack so we have to use two.

dexter from the front


(lack of) Documentation for simple things

We don’t usually post technical stuff here – this post has more technical content than most – but I’ve been bothered this week by a complete absence of documentation for doing what ought to be a simple task.
When reinstalling a machine from scratch, we can automatically install Windows and Linux quite easily: we boot the machine from the network, the network boot server asks the database which operating system should be supplied, and it goes ahead from there. I’ve been looking at doing the same thing for OS X machines too, and it’s turned out to be tricky.
It’s not been particularly difficult technically – the underlying principles are the same. Boot the machine from the network and provide an appropriate boot-loader: a small piece of software which the machine runs in order to download the installer across the network and then install the real operating system to the hard disk. The difficulties have been in finding the documentation for doing this with OS X.
For OS X, it turns out that we need just a few files from the install DVD.
boot.efi is the boot-loader, and kernelcache and mach_kernel are loaded by it. One subtlety is that the kernelcache and mach_kernel files supplied on the DVD each contain two binaries, one for 32-bit processors and one for 64-bit processors.
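
If you want to see those two architectures for yourself, the standard OS X tools will show them. A quick sketch, assuming the install DVD is mounted at /Volumes/Mac OS X Install DVD (the path is an assumption and will vary):

    # List the architectures inside the fat (universal) kernel binary
    lipo -info "/Volumes/Mac OS X Install DVD/mach_kernel"
    file "/Volumes/Mac OS X Install DVD/mach_kernel"

    # Extract just the 64-bit half if only one architecture is needed
    lipo "/Volumes/Mac OS X Install DVD/mach_kernel" -thin x86_64 -output mach_kernel.x86_64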


Machine management in academia

Looking around the web, I’ve been struck by the range of different techniques people use to manage machines. Most folks have a machine they look after – their own laptop perhaps – and there’s a lot of information out there to help: how to reinstall software, how to un-break the operating system after something bad has happened, which configuration settings are needed to make things work. For one machine, that’s fine. The problem comes with scale.


Dots in front of the eyes – Xymon custom views

We use Xymon to monitor every non-personal machine connected to the department network. As a minimum we check for network connectivity. More complex, highly managed, or important machines can have many more tests configured. The data is available through webpages which present each test as a coloured dot. Red is bad, green is good, and there are a few other colours in between.
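
For a flavour of how tests get attached to a machine, a Xymon hosts.cfg entry looks something like the following. The address, hostname, and tests here are made up for illustration, not taken from our configuration:

    # IP address, hostname, then a '#' followed by the tests to run:
    # conn is the basic ping test, plus ssh and an http check in this example
    10.0.0.42   wwwtest.example.org   # conn ssh http://wwwtest.example.org/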

Xymon currently monitors about 15,000 things. That’s a lot of dots to look at so the data also feeds into the ticket system; tickets are automatically created for important services with red dots.

The main Xymon web pages are available to all chemists at http://hobbit.ch.cam.ac.uk/hobbit/. However, this view of the data is designed to help computer officers look for high-priority problems and doesn’t give a good overview of each research group; it can be hard to find what you want in the sea of Xymon dots.

We now have a custom view which presents the data in a format that’s easier to read. Each research group has a dot. The colour of the dot is the colour of the ‘worst’ test within that group at the time. Click on the dot to see all the tests for that group on one page.

http://hobbit.ch.cam.ac.uk/hobbit/prettyrgs/
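
Behind the scenes, views like this can be built by querying the Xymon daemon directly. A sketch of the sort of query involved, assuming a recent Xymon with the xymon client tools installed and a server that accepts queries from your machine (it may well not):

    # Ask for every test that is currently red or yellow, returning the
    # hostname, test name, and colour for each one
    xymon hobbit.ch.cam.ac.uk "xymondboard color=red,yellow fields=hostname,testname,color"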


SPRI library

A recent challenge facing us at SPRI (the Scott Polar Research Institute) was the transfer of the entire library IT infrastructure from one network to another. While a number of aspects of the move (once the processes were worked out) were relatively straightforward, there were a few tricky aspects as well. Most of these centred around the library’s cataloguing system, Muscat, a DOS (yes, that’s right) program dating back to the early 1990s. The program lived on one networked share and the database itself on another. The plan was to leave well enough alone and move the whole thing over as it stood…
 
The library has 5 staff members and 8 bibliographers who use Muscat on a day-to-day basis, so the move required careful organisation: two days were booked three weeks in advance so as to cause as little inconvenience as possible to the Muscat users both during and after the move. In addition to the Muscat shares, all user accounts were moved to the new network, as were two other file shares and two networked printers. All 9 staff PCs were rinsed and replaced with 64-bit Windows 7.
What made this whole project doable was two tools we use on a routine basis: pxe and wpkg. Pxe runs unattended system installs, while wpkg installs applications – again unattended, but also remotely if need be. Very handy! It was still a very busy two days, but on the next day all staff were able to log in to the new system and view/edit Muscat records.
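
As an illustration of the pxe half, a pxelinux menu entry for an unattended Windows 7 install via a WinPE image might look like the sketch below. The file names and layout are placeholders rather than the real configuration:

    # Boot a WinPE image with memdisk, which then runs the unattended installer
    DEFAULT win7-unattended
    LABEL win7-unattended
      KERNEL memdisk
      INITRD winpe.iso
      APPEND iso raw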
 
The final phase is to rinse and replace 5 catalogue kiosk computers in the library and it is anticipated that this will be completed around the end of next week.


You are in a maze of twisty dependencies, all alike

We were recently asked to install a new piece of scientific software on one of the compute clusters. This software relies on a number of other pieces of software which we didn’t already have installed. And those in turn have dependencies. You can see where this is going. By the end of the process we’d got a whole new compiler, MPI library, and maths libraries to go with the new software. All of these things will be useful in the future but took quite some time to build and hook together.
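
For a flavour of what that chain looks like, each layer follows the familiar configure, make, make install pattern, with the previous layer’s tools put on the PATH before the next one is built. The package names, versions, and prefixes below are illustrative rather than the exact ones installed on the cluster:

    # New compiler into its own prefix
    ./configure --prefix=/opt/gcc-4.7 && make && make install

    # MPI library, built with that compiler
    export PATH=/opt/gcc-4.7/bin:$PATH
    ./configure --prefix=/opt/openmpi-1.6 CC=gcc CXX=g++ FC=gfortran && make && make install

    # Only now can the scientific code itself be built against the new stack
    export PATH=/opt/openmpi-1.6/bin:$PATH
    ./configure --prefix=/opt/newcode CC=mpicc && make && make install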

Which brings me onto a quick plug for the University Computing Service’s excellent course on building software on Linux. The course notes are at http://www.ucs.cam.ac.uk/docs/course-notes/unix-courses/Building and the course itself is next running in March 2013. Highly recommended for anyone who needs to build scientific software on Linux!


ngrep

This week, I’ve spent some time analysing traffic on our network to try to get to the bottom of some slightly odd behaviour we’ve been seeing. Ordinarily I’d use tcpdump and/or Wireshark. They’re great for capturing traffic and filtering by, for example, the IP address concerned or the network protocol being used, but sometimes that’s not enough: if you want to filter on the contents of a packet you need a different tool. A little bit of searching led me to ngrep, which is exactly what the name sounds like: grep for networks. I can now pick out all the traffic on our network requesting one particular website by matching on the HTTP request being sent, and quickly and easily get at the traffic I’m interested in.
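
A hedged example of the sort of invocation involved, with the interface name and website address as placeholders:

    # Print, in readable per-line form, any packets on port 80 whose payload
    # contains the given Host: header
    ngrep -q -d eth0 -W byline 'Host: www.example.com' tcp port 80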


Invisible new servers

We have just installed a new pair of servers to help take the strain of running the network services that the department relies upon. This includes Admitto, the VPN services, and many others that are less visible but make Internet access work.
We run all of these services on virtual machines which can be seamlessly migrated between physical servers. This means we can add new hardware without any downtime. VPN and laptop users may be interested to know that their connections moved from the main building to the UCC and back again this morning without missing a beat.
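
For illustration only: on a libvirt/KVM setup (one possible way of doing this sort of thing, with placeholder names, not necessarily what runs here), a live migration of a guest between physical servers is a single command, and the guest keeps running throughout:

    # Move the running guest 'vpn-gateway' to another physical server
    virsh migrate --live --persistent vpn-gateway qemu+ssh://newserver.example.org/system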


Sleep talking

Do you talk in your sleep?

It’s been a busy week, and at about 5pm this evening we were finishing off bouncing some ideas around and contemplating going to the welcome party for new students. Then the new fileserver dropped offline. Completely.

These new fileservers are whizzy, expensive boxes with lots of storage. They’ve been rock solid through all the testing we’ve been able to throw at them, so to discover that the live server had become uncontactable was disturbing, to say the least. We investigated. And investigated. We would restart one, walk to the other server room to restart the other, and find that the first had gone offline before we got to the second. We googled. We read manual pages. Everything pointed to something on the network causing these machines to lose track of what was happening.

We monitored the network and saw lots of data which didn’t conform to any known standard. It came in bursts: fifteen or twenty seconds of junk, then a pause for a while. We started disconnecting parts of the network to track down the source, one of us frantically typing commands while the other monitored the flow of junk.
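
A rough sketch of the kind of capture used for a hunt like this (the interface name is a placeholder, and this is illustrative rather than the exact command we ran):

    # Show frames that aren't ordinary IP, IPv6, or ARP traffic; -e prints the
    # link-level (MAC) addresses, which is what eventually points at the
    # offending machine, and -n avoids slow DNS lookups
    tcpdump -n -e -i eth0 'not ip and not ip6 and not arp'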

After a couple of false starts, we found the right area of the building and went in search of this errant machine. Perhaps something had been taken over by a particularly nasty virus? Or some experiment control system had been infected by a trojan even now attempting to disrupt nuclear centrifuges. Or an old machine was emitting rubbish as the dying gasp of an aged network card.

What we weren’t expecting was a nice, new, shiny PC in sleep mode. Sending a stream of utter junk across the network. It’s unplugged now and the network is much quieter. Our fileservers are happy once more.

So if you were at the welcome party, sorry for not joining you. We’ll catch up soon, I hope. And in the meantime, don’t encourage sleep-talking.


Air flow

One of the department’s compute clusters has recently had an increase in usage. The air conditioning in its room is not keeping up with the load and some hot spots have developed. We are going to rearrange the room to improve the air flow, but that involves dismantling and moving the best part of a ton of hardware. In the meantime a low-tech solution has eliminated the hot spots.
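
One low-effort way to keep an eye on hot spots like this is to read each server’s own temperature sensors over IPMI. Sensor names and output vary by vendor, so treat this as illustrative:

    # List all temperature readings known to the machine's management controller
    ipmitool sdr type Temperature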
