Minutes of the IT strategy steering committee, 15/12/10

Present: Martin Hardcastle (chair: MJH), John Atkinson (systems manager: JA), Mark Thompson (staff rep: MAT), Mark Galloway (student rep: MG), John Barnes (PD/fellow rep: JB).

Apology received from: Tim Gledhill (staff rep: TMG).

0. Introduction

The meeting opened at 14:00.

1. Minutes of previous meeting

The minutes had previously been circulated and were approved.

Actions from the previous minutes:

2. System status report

JA gave a report on central IT facilities.

He reported that LDAP (the protocol enabling single sign-on) is now served by the new server, car-server. This did not solve the problem of user desktops periodically freezing, but it does mean that all main services are now provided by car-server.

The old Solaris 10 server (uhssfv1) now has Linux FC14 installed on it, marking the end of the Solaris era at CAR. uhssfv1 is now a backup server and the new tape library (an Overland NEO 4100) has been installed and is now fully functional. This greatly increases the capacity of data that can physically be backed up and as a result, JA has increased all user /data quotas to a minimum of 100GB. The next stage of development will be to implement Amanda as the default backup mechanism.

At the time of the meeting, JA and MJH thought that the "freezing" problems might be attributed to the fibre channel storage firmware not being correctly loaded when car-server was booted. This does not happen on the cluster head node (an identical machine, running the same operating system), so the reason is difficult to understand. To alleviate the problems, JA sent out an email explaining that I/O-intensive work should not be done using the data discs; data should be copied to local discs, worked on and then copied back. Since the email was sent, the freezes seem to have become less frequent and less severe. Investigation of these problems is ongoing.

Installation of two additional AC units has begun. These units are floor-standing and are rated at 12.5 kW each. This will probably convert to about 9 kW of sensible cooling per unit, yielding an increase of 18 kW and thus a total of approximately 37 kW for the whole room. Combined with an air circulator, we hope this will be sufficient cooling to allow full use of all IT equipment.
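
As a rough check of the cooling figures above, the arithmetic can be written out in Python. This is a sketch only: the 9 kW sensible figure is the estimate quoted above, and the existing room capacity is not stated in these minutes but is inferred here from the quoted 37 kW total.

```python
# Figures taken from the minutes; the sensible-cooling value is itself an estimate.
NEW_UNITS = 2
RATED_KW_PER_UNIT = 12.5      # nameplate rating of each new unit
SENSIBLE_KW_PER_UNIT = 9.0    # "probably", per the minutes
TOTAL_AFTER_KW = 37.0         # approximate room total quoted in the minutes

# Extra sensible cooling provided by the two new units.
added_sensible_kw = NEW_UNITS * SENSIBLE_KW_PER_UNIT

# Inferred, not stated in the minutes: capacity already installed in the room.
implied_existing_kw = TOTAL_AFTER_KW - added_sensible_kw

# Fraction of the rated capacity assumed to be available as sensible cooling.
sensible_fraction = SENSIBLE_KW_PER_UNIT / RATED_KW_PER_UNIT

print(f"Added sensible cooling: {added_sensible_kw} kW")
print(f"Implied existing capacity: {implied_existing_kw} kW")
print(f"Sensible fraction of rating: {sensible_fraction:.2f}")
```

On these assumptions the new units add 18 kW to an implied existing capacity of about 19 kW.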

3. Cluster report

MJH reported on the status of the cluster. The cluster had generally been stable despite quite heavy usage: 210,000 jobs had been run since switch-on at the time of the meeting. A period of instability in November had apparently been completely cured by a kernel upgrade. The main hardware news was the installation of the GPU computing unit, which is now attached to smp1. Since this equipment was installed it seems to have been very little used, either by the computer scientists who requested it or by anyone else.

More astronomers are using the cluster but overall CAIR and computer science dominate the usage profile.

MJH reported that there was some possibility of CAIR funding an extension of the cluster, perhaps up to one additional enclosure (16 compute nodes). Discussions were ongoing at the time of the meeting.

4. Future system developments

JA and MJH reported on developments expected over the next few months. Most of the changes discussed in the previous minutes (Section 4) had now been implemented or were in progress. During planned downtime on 23rd December, JA and MJH were planning to re-organize the racks in the server room so as to use the space more efficiently and add new items such as the new tape drive. The major disruptive item on the list would be the installation of the UPS unit purchased earlier in the year, which had been awaiting the A/C upgrade. This would potentially require downtime on the main system and cluster.

JA reported that FC14 was being tested and would start to be rolled out to users soon.

MJH reported that the cluster nodes should be upgraded at some point: there is a long-standing kernel problem that means nodes have to be rebooted periodically. The plan was to integrate the old cluster machines as a 'sandbox', using these to test a new FC14 install, and then to roll this out across the main system. This process would begin some time in the new year.

5. AOB

MAT raised the issue of input from the committee to the consolidated grant process in 2011. MJH agreed that we should be thinking about this and that the date of the next meeting should be chosen to allow us to do so.

6. Date of next meeting

A meeting would be held in March (to be re-assessed when the timescale for the consolidated grant process is clearer).

Action: MJH to circulate a Doodle nearer the time.

The meeting closed at 14:30.