Minutes of the IT strategy steering committee, 12/05/10

Present: Martin Hardcastle (chair: MJH), John Atkinson (systems manager: JA), Tim Gledhill (staff rep: TMG), Mark Thompson (staff rep: MAT), John Barnes (PD/fellow rep: JB), Mark Galloway (student rep: MG).

0. Introduction

The meeting opened at 15:00. MJH welcomed members to the first meeting of the committee.

1. System status report

JA gave a report on central IT facilities. The process of moving from the old Sun file servers to the new system purchased from Dell had been accelerated as a result of instability on the old machines and was now almost complete. The new server would have access to 28 Tb of fibre-channel storage, part of a 140-Tb array shared via a SAN with the new cluster, which would serve /home and /data discs; the final stages of moving over to this system were now in progress. The new server was working well on the whole; some NFS instability was probably the result of a regression in recent versions of the Linux kernel and had, it was hoped, been cured by reverting to an older kernel version. At present the old Solaris server was still running LDAP and DHCP services, but these would soon be moved to the new server, allowing us to remove the last machine running Solaris from the system. The old server would then likely become a backup machine, running AMANDA. The other of the two old servers (uhssfv2, aka star) was still acting as a file and web server, and would remain in this role, probably also becoming the main ssh gateway machine.

As none of the old RAID arrays had any crucial role in the new arrangement, JA proposed that we cease paying annual maintenance for these, and the committee agreed.

TMG asked about maintenance for the old servers; the committee agreed that this was probably no longer necessary either, since either of them could be replaced by ordinary desktop machines if they failed.

JB asked about the continuing limits on ssh access from outside. JA explained that these were necessary because a Linux kernel vulnerability still existed on a few of the machines on the network, notably the machines in the old cluster. These needed to be upgraded before ssh access (via a gateway) could be re-enabled; otherwise we risked another hacking incident. Work on the new system and cluster had delayed these upgrades, but doing them was now a priority.

MJH asked whether the committee had a view on the eventual fate of the old cluster; it could be left as it was, merged with the new cluster for testing purposes, or turned off. It was agreed that users of the old cluster should be encouraged to migrate to the new system.

Actions: JA + MJH to work towards restoring ssh access to the system.

MJH to talk to users of the old cluster about migrating to the new system.

2. New cluster

MJH reported on the construction of the new cluster. The cluster consisted of a head node and 80 compute nodes, all 8-core Xeon machines; 32 of these were reserved mostly for CAIR and the remaining 48, each with 24 Gb of RAM, were shared between CAR, Computer Science and some smaller groups. ~65 Tb of storage was available for the main cluster (the CAIR cluster had ~40 Tb) and it was intended that the cluster would be used, among other things, for high-volume data reduction, as the Infiniband links between the compute nodes and the head node allowed fast access to the storage. MJH and others had run initial tests showing that the system performed as expected; the cluster was now open to users. Some documentation was available at http://stri-cluster.herts.ac.uk/.

TMG asked about job submission systems; MJH stated that these were set up and working.

MJH reported that some money remained in the budget for further expansion of the cluster. At least part of this would be spent on at least one large SMP machine (hopefully with 48 cores and 256-512 Gb RAM) which had originally been specified by Computer Science but which had obvious applications for data crunching (MAT agreed that this would be useful). Possibly two of these, or one plus one additional compute node enclosure (16 nodes), could be afforded. Nobody had any suggestions for alternatives.

JA suggested that time on the cluster could be sold, internally or externally. The committee agreed that this could be looked at once usage profiles became clear.

3. Future system developments

JA reported that the next step would be the purchase of a tape library system capable of backing up some or all of the storage attached to the main server (at least 10 Tb, maybe up to 28 Tb). This would be done using the AMANDA software. The committee did not foresee a need to back up more than 28 Tb. AMANDA should replace the current arrangements of level 0 and incremental backups, driven by home-written scripts.

JA explained that the only off-site backup currently was an ad-hoc arrangement in which he took a tape containing a dump of /home off-site on a regular basis. MJH and MG agreed that this was reasonable and consistent with what was done elsewhere. It was agreed that JA should continue to do this and that this was as much as could reasonably be expected, though there might be some advantage in using an external HD rather than a tape. However, it was thought to be important that users should be aware of the limits of backups, and should be encouraged 1) to use /home for important, hard-to-replace files such as papers/theses and 2) to make their own off-site backups where required using external hard drives.

Action: JA to send regular e-mails summarizing the backup situation and pointing users to the more detailed description of the policy on the web.

4. New student desktop/laptop purchases

Prompted by a recent e-mail from Jim Hough, the committee discussed the appropriate types of system to purchase for new students. JA pointed out that laptops were very much harder to set up and maintain than desktops, and could not be securely integrated into our local network. MAT argued that laptops were sometimes necessary for students, e.g. for observing runs; MJH suggested that this need could be met by having a small number of laptops for loan. After discussion, the committee agreed to recommend that desktops be the default for all new students, unless supervisors could make a very strong academic case for the purchase of a laptop. Laptop support by JA (after initial install) would be best-efforts only.

Action: MJH to pass this on to Jim.

5. AOB

There was no other business.

6. Date of next meeting

A meeting would be held in about two months, i.e. around mid-July.

Action: MJH to circulate a Doodle nearer the time.

The meeting closed at 15:55.