Minutes of the IT strategy steering committee, 24/08/10

Present: Martin Hardcastle (chair: MJH), John Atkinson (systems manager: JA), Tim Gledhill (staff rep: TMG), John Barnes (PD/fellow rep: JB)

Absent: Mark Thompson (staff rep: MAT), Mark Galloway (student rep: MG).

0. Introduction

The meeting opened at 14:00.

1. Minutes of previous meeting

The minutes had previously been circulated and were approved.

Actions from the previous minutes:

2. System status report

JA gave a report on central IT facilities.

As noted above, ssh access had been restored. All access to the system is now through the gateway machine, star.herts.ac.uk, from which users can access their desktop machines if necessary. This arrangement allows us to respond effectively to security threats. Most users appeared happy with the arrangement. The committee supported JA in making it clear that there will be no direct ssh access to desktop machines in future. DenyHosts, which blocks many brute-force password attacks, has been installed on the gateway.

The server transition was now almost complete -- all file system server tasks and most of the other tasks of the old main server were now being handled by the new machine (car-server.herts.ac.uk). The exception was the LDAP directory service, which provides a consistent set of users across all machines. JA was currently testing a private LDAP system before rolling it out system-wide. This should be done in the next few weeks, and the old main server (uhssfv1) would then become a backup server. Quotas on /home had been increased to take account of the much larger disc space available.

JA reported that the new backup tape drive had recently arrived -- this has a capacity of 60 TB and will cope with our backup needs for the foreseeable future. A removable magazine will allow backups to be taken off-site or kept in the local fire safe, increasing the security of our data, and the aim will eventually be to back up the whole new /data area as well as /home. Work on integrating the new drive into the system will start once the LDAP transition has taken place.

There had been some system instability over the past few weeks. Some of this had been traced by JA and MJH to one user's work over the network on data stored on car-server, although it was not clear why this was causing problems. A server kernel upgrade might help to solve this particular problem. Other problems seemed to be the fault of the UH network.

JA reported that security updates for FC11 have now ceased to be made available; consequently, FC11 machines will need to be upgraded to the latest version, FC13, relatively soon. This should not be a particularly disruptive process.

Air conditioning in the server room had also been problematic. JA reported that the fundamental problem was that we had not been supplied with the cooling capacity we had specified at the time of the server room extension. Discussions with Estates and with Kingswood, the contractors, are in progress. Earlier in the summer, when the new cluster was being very heavily used, there had been serious problems with overheating, particularly of the storage array. As a result most of the old cluster, and some nodes of the new cluster, had been shut down to reduce the heat output.

Fourteen new PCs had been purchased for incoming students and staff. These would be set up over the next month. TMG asked if that meant that we were fully covered for the new cohort of students: JA responded that he believed that we were. The intention was, in addition, to buy some cheap loaner laptops for student use at conferences etc. These would be Windows/Linux dual boot machines.

JA reported that the local computing web pages were very out of date. He asked for input on what should be documented, and also whether a Wiki for user-supported documentation would be useful. TMG stressed that the basic pages on system architecture and access were the most important.

Action: JA to update the key pages.
Action: JB and MG (as postdoc/student reps) to consult on what documentation is needed and whether people would be likely to use a Wiki.

TMG asked whether anyone had used the new videoconferencing system. JA and JB were able to report that it had been used successfully.

3. Cluster report

MJH reported on the status of the new cluster. Since the last meeting, two 48-core, 256 GB SMP machines with 12 TB of local hard disc had been purchased and were now fully integrated into the cluster. Several astronomers had been using these for data crunching tasks. The final component of the cluster, a GPU computing system, should be available for purchase in the UK soon.

Cluster use had been very heavy in the early part of the summer, but was lower now, and computer scientists appeared to be dominating the usage profile, though astronomers had had a significant share. Over 55,000 jobs had been executed so far on the job control system.

JA pointed out that MATLAB was available on the cluster, with distributed computing modules that would make it particularly effective on the new SMP machines. So far it had not been heavily used.

4. Future system developments

MJH and JA explained that no further large purchases are expected, but there will be some changes to the infrastructure and the contents of the server room, some of which may be disruptive. The rough sequence is as follows:

  1. LDAP transition from old to new server (see above). Some disruption possible if it doesn't work: may be done by JA at a weekend to minimize this.
  2. Backup system upgrade. Requires (1): no disruption likely.
  3. Air conditioning capacity upgrade. Hardware may need to be turned off; not yet clear when or how this will happen.
  4. Shutdown of old cluster and integration into new cluster as testing/sandbox system. Requires (3) to allow old cluster machines to be switched back on permanently.
  5. Installation of new UPS to provide backup for new servers and storage. Requires (3) to ensure safe operation of UPS. Potentially disruptive if equipment needs to be shut down during transition.
  6. Rack re-organization and consolidation to optimize space usage in the server room. Possibly some disruption to secondary services. Needs (4).
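
As an illustrative sketch only, and not part of the plan agreed by the committee, the dependencies stated in the list above ("Requires (1)", "Requires (3)", "Needs (4)") can be checked mechanically. The example assumes Python 3.9 or later for the standard-library graphlib module; the names REQUIRES, valid_order and respects are hypothetical and chosen for illustration.

```python
from graphlib import TopologicalSorter

# Each step number maps to the set of steps it requires, as stated in the list.
REQUIRES = {
    1: set(),   # LDAP transition
    2: {1},     # backup system upgrade
    3: set(),   # air conditioning capacity upgrade
    4: {3},     # old cluster shutdown and integration as sandbox
    5: {3},     # new UPS installation
    6: {4},     # rack re-organization and consolidation
}


def valid_order(requires):
    """Return one ordering of the steps in which every prerequisite comes first."""
    return list(TopologicalSorter(requires).static_order())


def respects(order, requires):
    """Check that each step appears after all of the steps it requires."""
    position = {step: i for i, step in enumerate(order)}
    return all(
        position[dep] < position[step]
        for step, deps in requires.items()
        for dep in deps
    )


order = valid_order(REQUIRES)
print(order)
```

The numbered sequence 1 to 6 given in the list is itself one valid ordering; respects([1, 2, 3, 4, 5, 6], REQUIRES) returns True, while any order that places step 6 before step 4 does not.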

5. AOB

There was no other business.

6. Date of next meeting

A meeting would be held in about 3 months, i.e. in November.

Action: MJH to circulate a Doodle nearer the time.

The meeting closed at 14:50.