Rechen- und Kommunikationszentrum (RZ)
Brainware as a Factor for Energy-Efficient HPC
Christian Bischof, Dieter an Mey, Christian Terboven
[email protected] - HRZ, TU Darmstadt
{anmey, terboven}@rz.rwth-aachen.de - RZ, RWTH Aachen
20.09.2012, ZKI AK SC, Universität Düsseldorf
Motivation
Definition of "Green HPC" from insidehpc.com:
Design and management techniques that contribute to the responsible,
effective use of energy in the operation of high performance computing
centers and equipment.
But: the current situation hardly allows for an economic optimization of the total budget
Separate budgets for
Staff
Hardware (mainly through funding applications – every X years), maintenance
Power
Building (mainly through funding applications – once per decade?!)
Users in general do not pay for compute resources
Agenda
Cost of Brainware versus Hardware
Tuning Opportunities
Success Stories
Summary
Cost of Brainware versus Hardware
Understanding the Total Cost of Ownership
Assumptions
2 Mio € HW investment per year
5 years lifetime with 4 years maintenance through vendor
Power: 850 kW, PUE = 1.5, 0.14 € per kWh => 1.5 Mio € per year
ISV software provided by users
Commercial batch system
Free Linux distribution
Item                          Costs per year   Percentage
Building (5 Mio € / 25 y)          200,000 €        3.72%
HPC software                        50,000 €        0.93%
ISV software                             0 €        0.00%
Batch system                       100,000 €        1.86%
Linux                                    0 €        0.00%
Power                            1,500,000 €       27.93%
Office space                             0 €        0.00%
Staff (12 FTE)                     720,000 €       13.41%
Hardware maintenance               800,000 €       14.90%
Investment compute servers       2,000,000 €       37.24%
Sum                              5,370,000 €      100.00%
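The 1.5 Mio € power line follows from simple arithmetic. A minimal sketch of that calculation, assuming continuous full-load operation (all other numbers are taken from the slide):

```c
/* Yearly power cost from the stated assumptions: 850 kW IT load,
 * PUE of 1.5, 0.14 EUR per kWh, continuous operation assumed. */
#include <stdio.h>

int main(void) {
    double it_load_kw = 850.0;        /* average IT power draw [kW]  */
    double pue        = 1.5;          /* power usage effectiveness   */
    double eur_kwh    = 0.14;         /* electricity price [EUR/kWh] */
    double hours      = 24.0 * 365.0; /* hours per year              */

    double cost = it_load_kw * pue * hours * eur_kwh;
    printf("Yearly power cost: %.0f EUR\n", cost); /* ~1,563,660 EUR */
    return 0;
}
```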
Does it pay off to hire more HPC Experts?
Start tuning top user projects first
15 projects account for 50% of the load
64 projects account for 80% of the load
Assumptions
It takes 2 months to tune one project
One analyst can handle 5 projects per year
A project profits for 2 years
As a consequence, one HPC expert can take care of 10 projects at a time
One FTE costs 60,000€
Tuning can improve the code by 5, 10 or 20 percent
[Chart: accumulated usage of top accounts (excl. JARA-HPC) and accumulated usage of top accounts, for the roughly 190 largest accounts; accumulated usage rises from 0% towards 100%]
Does it pay off? Yes!
[Chart: ROI in € (-600,000 to 600,000) versus number of tuned projects (10 to 160), with curves for savings at 5%, 10%, and 20% improvement; 10 projects handled by one FTE (60,000 €/y)]
For example, the break-even point: 7.5 HPC analysts improving the top 75 projects by 10% (TCO is 5.3 Mio €/y)
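A minimal sketch of the ROI model behind this chart, using the assumptions from the previous slide. The load share carried by the tuned projects is an assumption of the sketch; the slides only state that 15 projects carry 50% of the load and 64 carry 80%:

```c
/* ROI of tuning: yearly savings minus analyst cost.
 * TCO, FTE cost, and projects-per-FTE are from the slides; the load
 * share of the tuned projects is an assumption of this sketch. */
#include <stdio.h>

static double roi(int n_projects, double improvement, double load_share) {
    const double tco_per_year     = 5300000.0; /* EUR per year         */
    const double fte_cost         = 60000.0;   /* EUR per analyst/year */
    const double projects_per_fte = 10.0;      /* handled at a time    */

    double savings    = tco_per_year * load_share * improvement;
    double staff_cost = (double)n_projects / projects_per_fte * fte_cost;
    return savings - staff_cost;
}

int main(void) {
    /* Near the quoted break-even point: 7.5 analysts tune the top 75
     * projects by 10%, assuming those carry ~85% of the machine load. */
    printf("ROI: %.0f EUR\n", roi(75, 0.10, 0.85));
    return 0;
}
```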
Tuning Opportunities
Opportunities for Tuning w/o Code Access
Sanity Check
Use HW Counters to Measure Performance
To check for Performance Anomalies
IO behavior
System call statistics
Hardware
Choose the optimal hardware platform
File system, IO parameters
Parameterization
Choose optimal number of threads / MPI processes
Thread / Process Placement (NUMA)
Mapping MPI topology to hardware topology
MPI parameterization (buffers, protocols)
Optimal libraries (MKL …)
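On the library point: swapping a hand-written kernel for a tuned BLAS often needs no more than relinking, and where source is available the change is typically a one-line substitution. A minimal sketch of the standard CBLAS call (provided by MKL among others); the wrapper name matmul is illustrative:

```c
/* Delegating C = A * B to a tuned BLAS via the standard CBLAS
 * interface instead of a naive triple loop. */
#include <cblas.h>

void matmul(int n, const double *a, const double *b, double *c) {
    /* C = 1.0 * A * B + 0.0 * C; row-major n x n matrices */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);
}
```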
Opportunities for Tuning w/ Code Changes
Cache Tuning
padding, blocking, loop-based optimization techniques
Inlining/outlining
Help the compiler to perform optimizations …
MPI optimization
avoid global synchronization
Hide / reduce communication overhead, non-blocking communication (see the first sketch after this list)
Coalesce communications …
OpenMP optimization
Extend parallel regions
Check for false sharing
NUMA optimization: first touch (see the second sketch after this list), migration
In vogue: Add OpenMP to an MPI code to improve scalability
Of course: choosing the optimal algorithm is crucial
To be handled by or with the domain expert
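Two of the items above lend themselves to short, generic sketches; neither is taken from any of the codes mentioned here. First, hiding communication overhead with non-blocking MPI in a one-directional halo exchange:

```c
/* Overlap communication and computation: start the halo exchange,
 * update the interior (which needs no halo data), then wait before
 * updating the boundary. One direction shown; a real halo exchange
 * is symmetric. */
#include <mpi.h>

void sweep(double *u, double *unew, double *halo_recv, int n,
           int left, int right, MPI_Comm comm) {
    MPI_Request reqs[2];
    /* Receive the left neighbor's boundary value, send ours right. */
    MPI_Irecv(halo_recv, 1, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Isend(&u[n - 1], 1, MPI_DOUBLE, right, 0, comm, &reqs[1]);

    for (int i = 1; i < n - 1; i++)        /* interior: no halo needed */
        unew[i] = 0.5 * (u[i - 1] + u[i + 1]);

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    unew[0] = 0.5 * (halo_recv[0] + u[1]); /* boundary: needs the halo */
}
```

Second, the first-touch NUMA optimization: initialize data with the same parallel loop structure that later computes on it, so each memory page ends up on the NUMA node of the thread that uses it:

```c
/* First touch: pages are placed on the NUMA node of the thread that
 * first writes them, so initialize with the same static schedule as
 * the compute loop. */
void init(int n, double *x, double *y) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {
        x[i] = 0.0;
        y[i] = 1.0;
    }
}

void daxpy(int n, double a, double *x, const double *y) {
    /* Same schedule: each thread works on the data it placed above. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        x[i] += a * y[i];
}
```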
Opportunities for Tuning w/o Code Changes
Development Environment
Choose the optimal compiler
Choose optimal compiler options
Autoparallelization
Compiler Profile / Feedback
Adapt dataset: Partitioning / Blocking – Load Balancing
This list is not intended to be exhaustive, but rather to illustrate that the
skill set of an HPC tuning expert is very different from that of an
application scientist who develops a program, but both skill sets are
needed.
SimLabs
Interdisciplinary Collaboration: HPC, Domain Expert, Numerical Expert, …
Teaching via Courses and Workshops
Selected courses and workshops in Aachen, since 2011:
Date Event
03/2011 Parallel Programming in CES (en), 1 week, 75 part.
05/2011 Visual Studio 2010 + Windows HPC Server workshop, 25 selected part.
10/2011 AIXcelerate Tuning Workshop (en) (with Intel), 25 selected part.
12/2011 Parallel Programming with MATLAB, 35 part.
03/2012 Parallel Programming in CES (en), 1 week, 55 part.
08-09/2012 Parallel Programming Summer Courses (en): MPI, OpenMP, Tools, …
10/2012 Planned: Tuning for bigSMP HPC Workshop (en)
10/2012 Planned: OpenACC Workshop (en)
11/2012 Planned: Technical Cloud Computing with Microsoft Azure (en)
Success Stories
HPC Consulting may save real money: Combustion of “Biofuels”
Primary Breakup for Diesel Sprays: Hybrid CDPLit
Adding a single OpenMP-parallelized kernel improves efficiency by approx. 10%
This turns into a cost reduction equivalent to one FTE/yr.
Human effort ~7 weeks
[Chart: runtime for small test data set (16 s to 512 s) versus nodes (1 to 64) for 8 PPN / 1 TPP, 4 PPN / 1 TPP, and 4 PPN / 2 TPP]
Cluster of Excellence „Tailor-Made Fuels from Biomass“, Inst. F. Combustion Technology, RWTH Aachen University
Green IT = Using Resources more efficiently
Code – Impact – Partner
FLOWer: 1.8x higher efficiency of the hybrid Navier-Stokes solver simulating the landing of a space launch vehicle, by adding autoparallelization to the MPI code and carefully assigning threads to processes to adjust load imbalances. (RWTH Laboratory of Mechanics)
Matlab: 150x speed-up of the numerical solution of the diffusion equation by extracting the compute-intense kernel, transforming it into Fortran code, plus careful cache tuning. (RWTH Institute of Physical Chemistry)
Gene-hunter: 14x speed-up through cache optimization plus scalable MPI parallelization of linkage analysis to identify genes which may cause diseases. (Inst. f. Medical Biometry, Computer Science and Epidemiology (IMBIE), Bonn)
Dynmatt: 33x speed-up through I/O optimization by implementing appropriate buffering and reducing metadata operations. (RWTH Institute of Steel and Light Alloy Building)
FIRE: ~100x speed-up of image recognition software on a large SMP by nested parallelization, which saves a lot of I/O. (RWTH Chair of Computer Science 6)
NestedCP: 10-50x speed-up for critical point extraction in flow simulation output data through nested parallelization with OpenMP, even with highly imbalanced work chunks. (Virtual Reality Center Aachen)
TFS: 20x speed-up for simulation of human nasal flow for computer-aided surgery through nested parallelization with OpenMP. (RWTH Aerodynamic Institute, Parallel Software Products)
Higher sophistication of parallelization leads to higher scalability, but does not save resources…
HECToR: Distributed CSE Success Stories
Code           Domain                         Effect                          Effort   Saving
CASTEP         Key materials science          4x speed and 4x scalability     8 PMs    320k-480k £ (p.a.)
NEMO           Oceanography                   Speed and I/O performance       6 PMs    95k £ (p.a.)
CASINO         Quantum Monte Carlo            4x performance, 4x scalability  12 PMs   760k £ (p.a.)
CP2K           Materials science              12% speed and scalability       12 PMs   1,500k £ (in total)
GLOMAP/TOMCAT  Atmospheric chemistry          15% performance                 ?        -
CITCOM         Geodynamic thermal convection  30% performance                 ?        significant
EBL            Fluid turbulence               4x scalability                  12 PMs   -
ChemShell      Catalytic chemistry            8x performance                  9 PMs    -
Fluidity-ICOM  Ocean modelling                Scalability                     ?        -
DL_POLY_3      Molecular dynamics             20x performance                 6 PMs    -
CARP           Heart modelling                20x performance                 8 PMs    -
HPC Consulting may save real money: Hydro-Dynamics with XNS
XNS (M. Behr, CATS, RWTH)
Simulation of hydro-dynamic forces at the Ohio Dam
Parallelized with MPI
Scales very well for larger cases
Additional OpenMP parallelization: 9 parallel regions
Human effort: ~6 weeks
20-40% improvement
[Chart: improvement of execution time in percent (-20% to 50%), best-effort MPI-only versus best-effort hybrid, on 1 to 64 compute nodes; higher is better]
Nehalem EP cluster (2 processor chips, 4 cores each) with InfiniBand QDR
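The slides do not show XNS source; as a generic illustration of the hybrid pattern used here (MPI between nodes, OpenMP parallel regions inside each rank), a minimal self-contained sketch:

```c
/* Generic hybrid MPI + OpenMP pattern (not XNS code): MPI ranks
 * partition the work across nodes, OpenMP threads share the work
 * within each rank. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank, size;
    /* Request threaded MPI; only the master thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20;
    double local = 0.0;

    /* Each rank takes a cyclic slice; OpenMP threads split the slice. */
    #pragma omp parallel for reduction(+:local)
    for (int i = rank; i < n; i += size)
        local += 1.0 / (double)(i + 1);

    double global;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum = %f\n", global);
    MPI_Finalize();
    return 0;
}
```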
XNS: How much efficiency do we want to sacrifice? – Execution Time
PPN = processes per node, TPP = threads per process
[Chart: execution time in seconds (20 to 1280) versus number of compute nodes (1 to 64) for PPN x TPP combinations from 1x1 up to 8x1; lower is better]
XNS: How much efficiency do we want to sacrifice? - Efficiency
PPN = processes per node, TPP = threads per process
[Chart: parallel efficiency (0% to 100%) versus number of compute nodes (1 to 64) for the same PPN x TPP combinations; higher is better]
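The efficiency plotted here is presumably the standard definition relative to the single-node run, with T(N) the execution time on N nodes:

```latex
E(N) = \frac{T(1)}{N \cdot T(N)}
```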
XNS: How much efficiency do we want to sacrifice? – Improvements versus Efficiency
# Compute Nodes                                    1        2        4        6        8       16       32       48       64
Parallelization efficiency (best effort)     100.00%   93.40%   74.10%   59.35%   49.53%   32.95%   17.46%   12.67%    8.57%
Relative improvement of hybrid (best effort) -16.19%  -14.01%   15.52%   33.21%   38.93%   19.79%   20.98%   20.53%   15.95%
Summary
Summary
Higher investment in brainware pays off, even if it is only for tuning
A university can save real money by investing in brainware rather than in electricity for inefficiently used hardware
HPC performance analysts are a rare species
The job requires HW knowledge, tools, programming languages, compiler technologies and paradigms, (algorithms), OS effects
It takes some time to hire someone and get him/her up to speed
Teamwork: more and different brains create more synergy
And now there are GPUs …
They have much more headroom for tuning
The End – and an Invitation …
German Heterogeneous Computing Group (GHCG)
Independent interest group around high-performance computing with accelerators in the German-speaking region
Goal: intensify the technical and scientific exchange on projects, hardware, and algorithms
User group meeting
Date: October 1-2, 2012
Location: Braunschweig (Haus der Wissenschaft)
Register and participate (free of charge)!
www.ghc-group.org (registration & further info)
Everyone is welcome!
Topics (among others)
News in hardware and software
CFD on heterogeneous architectures