1. Grid Computing
at the Large Hadron Collider:
Massive Computing at the
Limit of Scale, Space, Power
and Budget
Dr Helge Meinhard
CERN, IT Department
SNW Frankfurt, 27 October 2010
2. CERN (1)
§ Conseil européen
pour la recherche
nucléaire – aka
European Laboratory
for Particle Physics
§ Facilities for
fundamental research
§ Between Geneva and
the Jura mountains,
straddling the Swiss-
French border
§ Founded in 1954
3. CERN (2)
§ 20 member states
§ ~3300 staff members,
fellows, students,
apprentices
§ 10’000 users registered
(~7’000 on site)
§ from more than 550
institutes in more than 80
countries
§ 1026 MCHF (~790 MEUR)
annual budget
§ http://cern.ch/
4. Physics at the LHC (1)
Matter particles: fundamental
building blocks
Force particles:
bind matter particles
5. Physics at the LHC (2)
§ Four known forces:
strong force,
weak force,
electromagnetism,
gravitation
§ Standard model unifies three of them
§ Verified to the 0.1 percent level
§ Too many free parameters, e.g. particle masses
§ Higgs particle: Higgs condensate fills the vacuum
§ Acts like ‘molasses’: slows other particles down, gives them mass
6. Physics at the LHC (3)
§ Open questions in particle physics:
§ Why do the parameters have the sizes we observe?
§ What gives the particles their masses?
§ How can gravity be integrated into a unified theory?
§ Why is there only matter and no anti-matter in the
universe?
§ Are there more space-time dimensions than the 4 we
know of?
§ What are dark energy and dark matter, which make up
98% of the universe?
§ Finding the Higgs and possibly new physics with the
LHC will give the answers!
7. The Large Hadron Collider (1)
§ Accelerator colliding protons with protons – 14 TeV design collision energy
§ By far the world’s
most powerful
accelerator
§ Tunnel of 27 km
circumference, 4 m
diameter, 50…150 m
below ground
§ Detectors at four
collision points
8. The Large Hadron Collider (2)
§ Approved 1994, first
circulating beams on
10 September 2008
§ Protons are bent by
superconducting magnets
(8 Tesla, operating at 2K
= –271°C) all around the
tunnel
§ Each beam: 3000
bunches of 100 billion
protons each
§ Up to 40 million bunch
collisions per second at
the centre of each of the
four detectors
9. LHC Status and Future Plans
Date Event
10-Sep-2008 First beam in LHC
19-Sep-2008 Leak when magnets ramped to full field for 7 TeV/beam
20-Nov-2009 First circulating beams since Sep-2008
30-Nov-2009 World record: 2 * 1.18 TeV, collisions soon after
19-Mar-2010 Another world record: 2 * 3.5 TeV
30-Mar-2010 First collisions at 2 * 3.5 TeV, special day for the press
26-Jul-2010 Experiments present first results at ICHEP conference
14-Oct-2010 Target luminosity for 2010 reached (10^32 cm^-2 s^-1)
Until end 2011 Run at 2 * 3.5 TeV to collect 1 fb-1
2012 Shutdown to prepare machine for 2 * 7 TeV
2013 - …(?) Run at 2 * 7 TeV
11. LHC Detectors (2)
3’000 physicists (including 1’000 students)
from 173 institutes in 37 countries
12. LHC Data (1)
The accelerator generates
40 million bunch collisions
(“events”) every second at
the centre of each of the
four experiments’
detectors
§ Per bunch collision, typically
~20 proton-proton
interactions
§ Particles from previous bunch
collision only 7.5 cm away
from detector centre
13. LHC Data (2)
Reduced by online computers that filter out a few hundred “good” events per second …
… which are recorded on disk and magnetic tape at 100…1’000 Megabytes/sec
1 event = a few Megabytes
15 Petabytes per year for the four experiments
15’000 Terabytes = 3 million DVDs
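
A quick back-of-the-envelope check of these rates, as a minimal Python sketch – the event size, event rate and live time here are assumed illustration values within the ranges quoted above, not official figures:

# Rough consistency check of the LHC data rates quoted above.
event_size_mb = 1.5          # assumed: "1 event = a few Megabytes"
events_per_second = 250      # assumed: "a few hundred good events per second"
live_seconds_per_year = 1e7  # assumed: typical accelerator live time per year

rate_mb_s = event_size_mb * events_per_second                # 375 MB/s, within 100...1'000 MB/s
pb_per_experiment = rate_mb_s * live_seconds_per_year / 1e9  # MB -> PB
print(f"{rate_mb_s:.0f} MB/s, {pb_per_experiment:.2f} PB/year per experiment")
print(f"about {4 * pb_per_experiment:.0f} PB/year for the four experiments")  # ~15 PB/year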
15. Summary of Computing Resource Requirements
All experiments – 2008
From LCG TDR – June 2005
                       CERN   All Tier-1s   All Tier-2s   Total
CPU (MSPECint2000s)      25            56            61     142
Disk (Petabytes)          7            31            19      57
Tape (Petabytes)         18            35             –      53
30’000 CPU servers,
110’000 disks:
Too much for CERN!
Resource shares – CPU: CERN 18%, All Tier-1s 39%, All Tier-2s 43%; Disk: CERN 12%, All Tier-1s 55%, All Tier-2s 33%; Tape: CERN 34%, All Tier-1s 66%
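
The shares follow directly from the table; a minimal Python sketch recomputing them (the inputs are exactly the table numbers, nothing else is assumed):

# Recompute the resource shares from the table above.
resources = {
    "CPU": {"CERN": 25, "All Tier-1s": 56, "All Tier-2s": 61},
    "Disk": {"CERN": 7, "All Tier-1s": 31, "All Tier-2s": 19},
    "Tape": {"CERN": 18, "All Tier-1s": 35},  # Tier-2s hold no tape
}
for resource, shares in resources.items():
    total = sum(shares.values())
    breakdown = ", ".join(f"{site} {100 * amount / total:.0f}%"
                          for site, amount in shares.items())
    print(f"{resource}: {breakdown}")
# CPU: CERN 18%, All Tier-1s 39%, All Tier-2s 43%
# Disk: CERN 12%, All Tier-1s 54%, All Tier-2s 33%  (the slide rounds Tier-1s to 55%)
# Tape: CERN 34%, All Tier-1s 66%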
16. Worldwide LHC Computing Grid (1)
§ Tier 0: CERN
§ Data acquisition and initial processing
§ Data distribution
§ Long-term curation
§ Tier 1: 11 major centres
§ Managed mass storage
§ Data-heavy analysis
§ Dedicated 10 Gbps lines to CERN
§ Tier 2: More than 200 centres in more than 30 countries
§ Simulation
§ End-user analysis
§ Tier 3: from physicists’ desktops to small workgroup clusters
§ Not covered by MoU
(Diagram: Tier 0 at CERN; Tier-1 centres in the Nordic Countries, UK, USA, France, Spain, the Netherlands, Italy, Germany and Taiwan; Tier-2 grids serving regional groups of universities and labs; Tier-3 desktops and physics-department clusters)
17. Worldwide LHC Computing Grid (2)
§ Grid middleware for “seamless”
integration of services
§ Aim: Looks like single huge compute facility
§ Projects: EDG/EGEE/EGI, OSG
§ Big step from proof of concept to stable,
large-scale production
§ Centres are autonomous, but lots of
commonalities
§ Commodity hardware (e.g. x86 processors)
§ Linux (RedHat Enterprise Linux variant)
18. CERN
Computer Centre
Functions:
§ WLCG: Tier 0,
some T1/T2
§ Support for smaller
experiments at
CERN
§ Infrastructure for
the laboratory
§ …
19. Requirements and Boundaries (1)
§ High Energy Physics applications require mostly
integer processor performance
§ Large amount of processing power and storage
needed for aggregate performance
§ No need for parallelism / low-latency high-speed
interconnects
§ Can use large numbers of components with
performance below the optimum level (“coarse-grain
parallelism” – see the sketch after this slide)
§ Infrastructure (building, electricity,
cooling) is a concern
§ Refurbished two machine rooms (1500 + 1200 m2)
for a total air-cooled power consumption of 2.5 MW
§ Will run out of power in about 2014…
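
A minimal sketch of what “coarse-grain parallelism” means in practice: events are independent, so they can simply be spread over plain worker processes with no communication between them. Here process_event() is a hypothetical stand-in for a real reconstruction program:

# Coarse-grain parallelism: independent events, independent workers,
# no low-latency interconnect needed (process_event is a stand-in).
from multiprocessing import Pool

def process_event(event):
    # Placeholder for the CPU-heavy, mostly integer reconstruction work.
    return sum(hit * hit for hit in event) % 2**32

if __name__ == "__main__":
    events = [[i, i + 1, i + 2] for i in range(100_000)]  # toy "events"
    with Pool() as pool:  # one worker per core is enough
        results = pool.map(process_event, events, chunksize=1_000)
    print(len(results), "events processed")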
20. Requirements and Boundaries (2)
§ Major boundary condition: cost
§ Getting maximum resources with a fixed budget…
§ … then dealing with cuts to the “fixed” budget
§ Only choice: commodity equipment as far as possible, minimising TCO / performance
§ This is not always the solution with the cheapest investment cost!
(Photo: equipment purchased in 2004, now retired)
21. The Bulk Resources – Event Data (simplified network topology)
§ Permanent storage on tape
§ Disk as temporary buffer
§ Data paths: tape ↔ disk, disk ↔ CPU
(Diagram: tape servers, disk servers and CPU servers connected through 10GigE routers to an Ethernet backbone of multiple 10GigE links)
22. CERN CC currently (September 2010)
§ 8’500 systems, 54’000 processing cores
§ CPU servers, disk servers, infrastructure
servers
§ 49’900 TB raw on 58’500 disk drives
§ 25’000 TB used, 50’000 tape cartridges
total (70’000 slots), 160 tape drives
§ Tenders in progress or planned
(estimates)
§ 800 systems, 11’000 processing cores
§ 16’000 TB raw on 8’500 disk drives
23. Disk Servers for Bulk Storage (1)
§ Target: temporary event data storage
§ More than 95% of disk storage capacity
§ Best TCO / performance: Integrated PC server
§ One or two x86 processors, 8…16 GB, PCI RAID card(s)
§ 16…24 hot-swappable 7’200 rpm SATA disks in server chassis
§ Gigabit or 10Gig Ethernet
§ Linux (of course)
§ Adjudication based on total usable capacity with
constraints
§ Power consumption taken into account
§ Systems procured recently: depending on specs,
5…20 TB usable
§ Looking at software RAID, external iSCSI disk enclosures
§ Home-made optimised transfer protocol (rfcp) and HSM
software (CASTOR)
26. Other Disk-based Storage
§ For dedicated applications (not physics
bulk data):
§ SAN/FC storage
§ NAS storage
§ iSCSI storage
§ Total represents well below 5% of disk
capacity
§ Consolidation project ongoing
27. Procurement Guidelines
§ Qualify companies to participate in calls for
tender
§ A-brands and their resellers
§ Highly qualified assemblers/integrators
§ Specify performance rather than box counts
§ Some constraints on choices for solution
§ Leave detailed system design to bidder
§ Decide based on TCO
§ Purchase price
§ Box count, network connections
§ Total power consumption
28. The Power Challenge – why bother?
§ Infrastructure limitations
§ E.g. CERN: 2.5 MW for IT equipment
§ Need to fit maximum capacity into given power envelope
§ Electricity costs money
§ Costs likely to rise (steeply) over the next few years
§ IT responsible for a significant fraction of world energy
consumption
§ Server farms in 2008: 1…2% of the world’s energy
consumption (annual growth rate: 16…23%)
§ CERN’s data centre is 0.1 per mille of this…
§ Responsibility towards mankind demands using the energy
as efficiently as possible
§ Saving a few percent of energy consumption makes a
big difference
29. CERN’s Approach
§ Don’t look in detail at PSU, fans, CPUs, chipset,
RAM, disk drives, VRMs, RAID controllers, …
§ Rather: Measure apparent (VA) power consumption
in primary AC circuit
§ CPU servers: 80% full load, 20% idle
§ Storage and infrastructure servers: 50% full load, 50% idle
§ Add element reflecting power consumption to
purchase price
§ Adjudicate on the sum of purchase price and power
adjudication element
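
A minimal Python sketch of this adjudication rule – the 80/20 and 50/50 weights are the ones quoted above, while the measured VA figures and the CHF-per-VA rate in the example are made-up illustration values, not CERN’s actual tariff:

# Adjudicate on purchase price plus a power element derived from measured VA.
WEIGHTS = {
    "cpu": (0.80, 0.20),      # CPU servers: 80% full load, 20% idle
    "storage": (0.50, 0.50),  # storage and infrastructure servers: 50%/50%
}

def weighted_va(kind, va_full_load, va_idle):
    w_full, w_idle = WEIGHTS[kind]
    return w_full * va_full_load + w_idle * va_idle

def adjudicated_price(price_chf, kind, va_full_load, va_idle, chf_per_va=2.0):
    # chf_per_va is a hypothetical cost element, not CERN's real rate
    return price_chf + chf_per_va * weighted_va(kind, va_full_load, va_idle)

# Two hypothetical CPU-server bids: the cheaper box loses once power counts.
print(adjudicated_price(2000, "cpu", va_full_load=400, va_idle=250))  # 2740.0
print(adjudicated_price(2150, "cpu", va_full_load=300, va_idle=180))  # 2702.0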
30. Power Efficiency: Lessons Learned
§ CPU servers: power efficiency increased
by a factor of 12 in a little over four years
§ Need to benchmark concrete servers
§ Generic statements about a platform are meaningless
§ Fostering energy-efficient solutions
makes a difference
§ Power supplies feeding more than one
system usually more power-efficient
§ Redundant power supplies are inefficient
31. Future (1)
§ Is IT growth sustainable?
§ Demands continue to rise exponentially
§ Even if Moore’s law continues to apply, data
centres will need to grow in number and size
§ IT already consuming 2% of world’s energy –
where do we go?
§ How to handle growing demands within a
given data centre?
§ Demands evolve very rapidly, technologies less
quickly, infrastructure at an even slower pace – how
best to match these three?
32. Future (2)
§ IT: Ecosystem of
§ Hardware
§ OS software and tools
§ Applications
§ Evolving at different paces: hardware
fastest, applications slowest
§ How to make sure at any given time that
they match reasonably well?
33. Future (3)
§ Example: single-core to multi-core to
many-core
§ Most HEP applications currently single-
threaded
§ Consider server with two quad-core CPUs as
eight independent execution units
§ Model does not scale much further
§ Need to adapt applications to many-core
machines
§ Large, long effort
34. Summary
§ The Large Hadron Collider (LHC) and its experiments are a
very data- (and compute-)intensive project
§ LHC has triggered or pushed new technologies
§ E.g. Grid middleware, WANs
§ High-end or bleeding-edge technology is not needed
everywhere
§ That’s why we can benefit from the cost advantages of
commodity hardware
§ Scaling computing to the requirements of LHC is hard
work
§ IT power consumption/efficiency is a paramount concern
§ We are steadily taking collision data at 2 * 3.5 TeV, and
have the capacity in place for dealing with this
§ We are on track for further ramp-ups of the computing
capacity for future requirements
35. Thank you
37. CPU Servers (1)
§ Simple, stripped down, “HPC like” boxes
§ No fast low-latency interconnects
§ EM64T or AMD64 processors (usually 2),
2 or 3 GB/core, 1 disk/processor
§ Open to multiple systems per enclosure
§ Adjudication based on total performance
(SPECcpu2006 – all_cpp subset)
§ Power consumption taken into account
§ Linux (of course)
39. Tape Infrastructure (1)
§ 15 Petabytes per year
§ … and in 10 or 15 years’ time physicists will
want to go back to 2010 data!
§ Requirements for permanent storage:
§ Large capacity
§ Sufficient bandwidth
§ Proven for long-term data curation
§ Cost-effective
§ Solution: High-end tape infrastructure
41. Mass Storage System (1)
§ Interoperation challenge locally at CERN
§ 100+ tape drives
§ 1’000+ RAID volumes on disk servers
§ 10’000+ processing slots on worker nodes
§ HSM required
§ Commercial options carefully considered
and rejected: OSM, HPSS
§ CERN development: CASTOR (CERN
Advanced Storage Manager)
http://cern.ch/castor
42. Mass Storage System (2)
§ Key CASTOR features
§ Database-centric layered architecture
§ Stateless agents; can restart easily on error
§ No direct connection from users to critical services
§ Scheduled access to I/O
§ No overloading of disk servers
§ Per-server limit set according to type of transfer
§ Servers can support many random-access-style
requests, but only a few sustained data transfers
§ I/O requests can be scheduled according to priority
§ Fair shares access to I/O just as for CPU
§ Prioritise requests from privileged users
§ Performance and stability proven at the
level required for Tier 0 operation
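
A minimal sketch (not CASTOR’s actual code) of the per-server scheduling idea above: each disk server gets separate slot limits per transfer type, so many random-access requests can coexist with only a few sustained streams. The limits are assumed illustration values:

# Illustrative per-server I/O slot limits; the numbers are assumptions.
from collections import defaultdict

SLOT_LIMITS = {"random": 20, "stream": 2}

class DiskServerScheduler:
    def __init__(self):
        self.active = defaultdict(lambda: {"random": 0, "stream": 0})

    def try_start(self, server, kind):
        # Admit a transfer only if the server has a free slot of that kind;
        # otherwise the request stays queued and is retried by priority.
        if self.active[server][kind] < SLOT_LIMITS[kind]:
            self.active[server][kind] += 1
            return True
        return False

    def finish(self, server, kind):
        self.active[server][kind] -= 1

sched = DiskServerScheduler()
print(sched.try_start("disksrv042", "stream"))  # True
print(sched.try_start("disksrv042", "stream"))  # True
print(sched.try_start("disksrv042", "stream"))  # False: only 2 streams per server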
43. Box Management (1)
§ Many thousand boxes
§ Hardware management (install, repair, move,
retire)
§ Software installation
§ Configuration
§ Monitoring and exception handling
§ State management
§ 2001…2002: Review of available packages
§ Commercial: Full Linux support rare, insufficient
reduction in staff effort to justify licence fees
§ Open Source: Lack of features considered
essential, didn’t scale to required level
44. Box Management (2)
§ ELFms (http://cern.ch/ELFms)
§ CERN development in collaboration with
many HEP sites and in the context of the
European DataGrid (EDG) project
§ Components:
§ Quattor: installation and configuration
§ Lemon: monitoring and corrective actions
§ Leaf: workflow and state management
48. Box Management (6): Lemon
§ Apart from node parameters, non-node
parameters are monitored as well
§ Power, temperatures, …
§ Higher-level views of CASTOR, batch queues
on worker nodes, etc.
§ Complemented by user view of service
availability: Service Level Status
49. Box Management (7): Leaf
§ HMS (Hardware management system)
§ Track systems through lifecycle
§ Automatic ticket creation
§ GUI to physically find systems by host name
§ SMS (State management system)
§ Automatic handling and tracking of high-level
configuration steps
§ E.g. reconfigure, drain and reboot all cluster
nodes for a new kernel version
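
A minimal sketch of such an SMS-driven workflow – all function names here are hypothetical placeholders, not the actual Leaf/SMS API:

# Rolling kernel upgrade: drain, reboot and re-enable nodes in small batches
# so the cluster keeps serving jobs throughout (all names are illustrative).
def drain(node):
    print(f"{node}: stop accepting new jobs, wait for running jobs to finish")

def reboot_with_kernel(node, kernel):
    print(f"{node}: install kernel {kernel}, reconfigure and reboot")

def verify_and_reenable(node):
    print(f"{node}: health checks passed, re-enabled in the batch system")

def rolling_kernel_upgrade(nodes, kernel, batch_size=10):
    for i in range(0, len(nodes), batch_size):
        for node in nodes[i:i + batch_size]:
            drain(node)
            reboot_with_kernel(node, kernel)
            verify_and_reenable(node)

rolling_kernel_upgrade([f"node{n:04d}" for n in range(30)], "new-kernel-version")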
50. Box Management (8): Status
§ Many thousands of boxes managed
successfully by ELFms, both at CERN and
elsewhere, despite decreasing staff levels
§ No indication of problems scaling up further
§ Changes being applied wherever necessary
§ E.g. support for virtual machines
§ Large-scale farm operation remains a
challenge
§ Purchasing, hardware failures, …