RAL Tier 1/A Status
Martin Bly
RAL CSF Tier 1/A
HEPiX-HEPNT
NIKHEF, May 2003
CPU Farm – Existing Hardware
• 108 dual processor systems (450MHz, 600MHz and 1GHz)
– Up to 1GB RAM
– Desktop towers on warehouse shelves
• 156 dual processor 1400MHz PIII
– 133MHz FSB, 1GB RAM each
– 1U rackmount, remote power switching
– RedHat 7.2
New Hardware – Spring 2003 +
• 80 dual processor 1U rackmount units
– 2 x 2.66GHz P4 Xeons @ 533MHz FSB
– Hyper-threading
– 2048MB memory
– 2 x 1Gb/s NICs (on-board)
– RedHat 7.3
– 3 racks, remote power switching
• Next delivery expected Summer 2003
Operating Systems
• Operating Systems:
– RedHat 6.2 service will close at the end of May
– RedHat 7.2 service has been in production for BaBar for 6 months
– New RedHat 7.3 service now available for LHC/other experiments
– Testing/benchmarking on new Xeon systems
• Increasing demand for security updates is becoming problematic.
Disk Farm – Existing Hardware
• 2002: 26 servers, each with 2 external RAID arrays; 1.7TB disk per server, RAID 5:
– Excellent performance, well balanced system
– Problems with a bad batch of Maxtor drives: many failures and high error rate. All 620 drives now replaced by Maxtor.
– Still outstanding problems with Accusys controller failing to eject bad drives from RAID set.
Disk Farm – Spring 2003 +
• Recent upgrade to disk farm:
– 11 dual P4 Xeon servers (2.4GHz, 1024MB RAM, PCI-X), each with 2 Infortrend IFT-6300 arrays via Ultra160 SCSI
– 12 Maxtor 200GB DiamondMax Plus 9 drives per array, RAID 5.
• Not yet in production, and a few snags:
– Originally tendered Maxtor Maxline Plus II drive was found not to exist!
– Infortrend array has 2TB limit per RAID set; pushing for a firmware update.
– 11+1 spare better than 2 x 6 – 5Gb over 11 systems.
• Nick White ([email protected]) for more info.
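The trade-off between one 11-drive RAID 5 set plus a hot spare and two 6-drive sets can be checked with simple arithmetic. The sketch below assumes only the standard RAID 5 rule that one drive's worth of capacity per set goes to parity, and uses nominal 200GB drives; the function names are illustrative and not taken from any tool used at RAL.

```python
def raid5_usable_gb(drives_in_set, drive_gb):
    """Usable capacity of one RAID 5 set: one drive's worth is lost to parity."""
    if drives_in_set < 3:
        raise ValueError("RAID 5 needs at least 3 drives")
    return (drives_in_set - 1) * drive_gb


def layout_usable_gb(sets, drive_gb):
    """Usable capacity of each RAID 5 set in a layout, e.g. [11] or [6, 6]."""
    return [raid5_usable_gb(n, drive_gb) for n in sets]


DRIVE_GB = 200  # nominal drive size quoted on the slide

# One 11-drive set, with the 12th drive kept as a hot spare.
one_big_set = layout_usable_gb([11], DRIVE_GB)
# Two 6-drive sets, using all 12 drives and leaving no spare.
two_small_sets = layout_usable_gb([6, 6], DRIVE_GB)

print("11+1 spare:", one_big_set, "total", sum(one_big_set), "GB")
print("2 x 6     :", two_small_sets, "total", sum(two_small_sets), "GB")
```

Both layouts give 2000GB nominal per array, but the 11+1 layout keeps a hot spare and puts all 2000GB in a single set, which sits close to the 2TB per-set limit. Whether it actually exceeds that limit depends on how the firmware counts a terabyte, which the slides do not say.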
New Projects
• Basic fabric performance monitoring (Ganglia)
• Resource CPU accounting (based on PBS accounts/MySQL)
• New CA in production
• New batch scheduler (MAUI)
• Deploy new helpdesk (May)
Ganglia
• Urgently needed live performance and utilisation monitoring:
– RAL Ganglia Monitoring: http://ganglia.gridpp.rl.ac.uk/
• Scalable solution based on multicast
• Very rapidly deployable, with reasonable support on all Tier1A hardware
• See: http://ganglia.sourceforge.net/
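As an illustration of how Ganglia data can be turned into a farm-wide utilisation figure, the sketch below parses the kind of XML report a gmond daemon produces (by default served on TCP port 8649) and divides total one-minute load by total CPU count. The sample XML, host names and values are invented for the example, and this is not the RAL configuration.

```python
import xml.etree.ElementTree as ET

# Hypothetical gmond-style report; cluster, hosts and values are made up.
SAMPLE_XML = """<GANGLIA_XML VERSION="2.5.3" SOURCE="gmond">
  <CLUSTER NAME="example-farm" LOCALTIME="1052000000">
    <HOST NAME="node001" IP="10.0.0.1" REPORTED="1052000000">
      <METRIC NAME="load_one" VAL="1.98" TYPE="float" UNITS=""/>
      <METRIC NAME="cpu_num" VAL="2" TYPE="uint16" UNITS="CPUs"/>
    </HOST>
    <HOST NAME="node002" IP="10.0.0.2" REPORTED="1052000000">
      <METRIC NAME="load_one" VAL="0.10" TYPE="float" UNITS=""/>
      <METRIC NAME="cpu_num" VAL="2" TYPE="uint16" UNITS="CPUs"/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>"""


def host_metrics(xml_text, metric_names):
    """Return {host: {metric: value}} for the requested metric names."""
    root = ET.fromstring(xml_text)
    result = {}
    for host in root.iter("HOST"):
        values = {}
        for metric in host.iter("METRIC"):
            if metric.get("NAME") in metric_names:
                values[metric.get("NAME")] = float(metric.get("VAL"))
        result[host.get("NAME")] = values
    return result


def utilisation(metrics):
    """Rough farm utilisation: total one-minute load over total CPUs."""
    total_load = sum(m["load_one"] for m in metrics.values())
    total_cpus = sum(m["cpu_num"] for m in metrics.values())
    return total_load / total_cpus


metrics = host_metrics(SAMPLE_XML, {"load_one", "cpu_num"})
print(f"farm utilisation: {utilisation(metrics):.0%}")
```

Against a live system the same functions could be fed the text read from a gmond socket instead of `SAMPLE_XML`.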
PBS Accounting Software
• Need to keep track of system CPU and disk usage.
• Home grown PBS accounting package (Derek Ross):
– Upload PBS and disk stats into MySQL
– Process with Perl DBI script
– Serve via Apache
• http://www.gridpp.rl.ac.uk/stats
• Contact Derek ([email protected]) for more info.
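The upload-and-summarise pipeline can be sketched roughly as follows. This is not Derek Ross's package: it is written in Python rather than Perl DBI, uses the standard-library sqlite3 module as a stand-in for MySQL, and the table layout, function names and sample records are all assumptions. The record layout follows the usual PBS accounting log form of `date time;record type;job id;key=value ...`.

```python
import sqlite3

# Hypothetical accounting records; job ids, users and groups are made up.
SAMPLE_LOG = [
    "05/01/2003 10:00:00;S;101.batch;user=alice group=babar queue=prod",
    "05/01/2003 12:00:00;E;101.batch;user=alice group=babar queue=prod "
    "resources_used.cput=01:30:00 resources_used.walltime=02:00:00",
    "05/01/2003 13:00:00;E;102.batch;user=bob group=lhcb queue=prod "
    "resources_used.cput=00:30:00 resources_used.walltime=00:45:00",
]


def hms_to_seconds(text):
    """Convert 'HH:MM:SS' to seconds."""
    h, m, s = (int(x) for x in text.split(":"))
    return h * 3600 + m * 60 + s


def parse_end_record(line):
    """Parse one 'E' (job end) record; return None for other record types."""
    _stamp, rtype, job_id, message = line.rstrip("\n").split(";", 3)
    if rtype != "E":
        return None
    fields = dict(item.split("=", 1) for item in message.split() if "=" in item)
    return (job_id, fields["user"], fields["group"],
            hms_to_seconds(fields["resources_used.cput"]),
            hms_to_seconds(fields["resources_used.walltime"]))


def load(lines, db):
    """Store job-end records in a (hypothetical) jobs table."""
    db.execute("CREATE TABLE IF NOT EXISTS jobs (job_id TEXT PRIMARY KEY, "
               "owner TEXT, grp TEXT, cput INTEGER, walltime INTEGER)")
    rows = [r for r in map(parse_end_record, lines) if r is not None]
    db.executemany("INSERT OR REPLACE INTO jobs VALUES (?, ?, ?, ?, ?)", rows)
    db.commit()


def cpu_hours_by_group(db):
    """Total CPU hours per group, the kind of figure a stats page would show."""
    cur = db.execute("SELECT grp, SUM(cput) / 3600.0 FROM jobs "
                     "GROUP BY grp ORDER BY grp")
    return dict(cur.fetchall())


db = sqlite3.connect(":memory:")
load(SAMPLE_LOG, db)
print(cpu_hours_by_group(db))
```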
MAUI / PBS
• Maui scheduler has been in production for the last 4 months.
• Allows extremely flexible scheduling with many features. But ...
– Not all of it works: we have done much work with developers for fixes.
– Major problem: MAUI schedules on wall clock time, not CPU time. Had to bodge it!!
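The slides do not describe the workaround itself, so the sketch below only illustrates why the distinction matters: a job that waits on I/O accumulates wall clock time faster than CPU time, so a limit expressed in one unit has to be scaled to be meaningful in the other. The function names and the efficiency figure are hypothetical.

```python
def cpu_efficiency(cput_s, walltime_s, ncpus=1):
    """Fraction of the allocated wall clock time actually spent on CPU."""
    if walltime_s <= 0 or ncpus <= 0:
        raise ValueError("walltime and ncpus must be positive")
    return cput_s / (walltime_s * ncpus)


def walltime_limit_for(cput_limit_s, expected_efficiency):
    """Wall clock limit that lets a job of the given efficiency use its
    whole CPU-time allowance before a wall-clock scheduler cuts it off."""
    if not 0 < expected_efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return cput_limit_s / expected_efficiency


# A job that used 30 minutes of CPU during a 60 minute run.
print(cpu_efficiency(1800, 3600))
# A 1 hour CPU-time limit for jobs expected to be 50% efficient.
print(walltime_limit_for(3600, 0.5))
```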
New Helpdesk Software
• Old helpdesk is email based and unfriendly.
• With additional staff, urgently need to deploy new solution.
• Expect new system to be based on free software, probably Request Tracker.
• Hope that deployed system will also meet needs of Testbed and may also satisfy Tier 2 sites.
• Expect deployment by end of May.
• http://requestracker.gridpp.rl.ac.uk
Outstanding issues / worries
• We have to run many distinct services.
– Fermi Linux
– RH 6.2/7.2/7.3…
– EDG testbeds, LCG …
• Farm management is getting very complex. We need better tools and automation.
• Security is becoming a big concern again.