Rechen- und Kommunikationszentrum (RZ)
Brainware as a Factor for Energy-Efficient HPC
Christian Bischof, Dieter an Mey, Christian Terboven
[email protected] - HRZ, TU Darmstadt
{anmey, terboven}@rz.rwth-aachen.de - RZ, RWTH Aachen
20.09.2012, ZKI AK SC, Universität Düsseldorf
Motivation
Definition of "Green HPC" from insidehpc.com:
Design and management techniques that contribute to the responsible,
effective use of energy in the operation of high performance computing
centers and equipment.
But: the current situation hardly allows for an economic optimization of the total budget
Separate budgets for
Staff
Hardware (mainly through funding applications – every X years), maintenance
Power
Building (mainly through funding applications – once per decade?!)
Users in general do not pay for compute resources
Agenda
Cost of Brainware versus Hardware
Tuning Opportunities
Success Stories
Summary
Cost of Brainware versus Hardware
Understanding the Total Cost of Ownership
Assumptions
2 Mio € HW investment per year
5 years lifetime with 4 years maintenance through vendor
Power: 850 kW, PUE = 1.5, 0.14 € per kWh => 1.5 Mio € per year
ISV software provided by users
Commercial batch system
Free Linux distribution
Item                          Costs per year   Percentage
Building (5 Mio € / 25 y)          200,000 €        3.72%
HPC software                        50,000 €        0.93%
ISV software                             0 €        0.00%
Batch system                       100,000 €        1.86%
Linux                                    0 €        0.00%
Power                            1,500,000 €       27.93%
Office space                             0 €        0.00%
Staff (12 FTE)                     720,000 €       13.41%
Hardware maintenance               800,000 €       14.90%
Investment compute servers       2,000,000 €       37.24%
Sum                              5,370,000 €      100.00%
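The 1.5 Mio € power line follows from simple arithmetic. A minimal sketch of that calculation, assuming continuous full-load operation (all other numbers are taken from the slide):

```c
/* Yearly power cost from the stated assumptions: 850 kW IT load,
 * PUE of 1.5, 0.14 EUR per kWh, continuous operation assumed. */
#include <stdio.h>

int main(void) {
    double it_load_kw = 850.0;        /* average IT power draw [kW]  */
    double pue        = 1.5;          /* power usage effectiveness   */
    double eur_kwh    = 0.14;         /* electricity price [EUR/kWh] */
    double hours      = 24.0 * 365.0; /* hours per year              */

    double cost = it_load_kw * pue * hours * eur_kwh;
    printf("Yearly power cost: %.0f EUR\n", cost); /* ~1,563,660 EUR */
    return 0;
}
```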
Does it pay off to hire more HPC Experts?
Start tuning top user projects first
15 projects account for 50% of the load
64 projects account for 80% of the load
Assumptions
It takes 2 months to tune one project
One analyst can handle 5 projects per year
A project profits for 2 years
As a consequence, one HPC expert can take care of 10 projects at a time
One FTE costs 60,000€
Tuning can improve the code by 5, 10 or 20 percent
[Chart: accumulated usage of top accounts (excl. JARA-HPC) and accumulated usage of top accounts, for the roughly 190 largest accounts; accumulated usage rises from 0% towards 100%]
Does it pay off? Yes!
[Chart: ROI in € (-600,000 to 600,000) versus number of tuned projects (10 to 160), with curves for savings at 5%, 10%, and 20% improvement; 10 projects handled by one FTE (60,000 €/y)]
For example, the break-even point: 7.5 HPC analysts improving the top 75 projects by 10% (TCO is 5.3 Mio €/y)
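A minimal sketch of the ROI model behind this chart, using the assumptions from the previous slide. The load share carried by the tuned projects is an assumption of the sketch; the slides only state that 15 projects carry 50% of the load and 64 carry 80%:

```c
/* ROI of tuning: yearly savings minus analyst cost.
 * TCO, FTE cost, and projects-per-FTE are from the slides; the load
 * share of the tuned projects is an assumption of this sketch. */
#include <stdio.h>

static double roi(int n_projects, double improvement, double load_share) {
    const double tco_per_year     = 5300000.0; /* EUR per year         */
    const double fte_cost         = 60000.0;   /* EUR per analyst/year */
    const double projects_per_fte = 10.0;      /* handled at a time    */

    double savings    = tco_per_year * load_share * improvement;
    double staff_cost = (double)n_projects / projects_per_fte * fte_cost;
    return savings - staff_cost;
}

int main(void) {
    /* Near the quoted break-even point: 7.5 analysts tune the top 75
     * projects by 10%, assuming those carry ~85% of the machine load. */
    printf("ROI: %.0f EUR\n", roi(75, 0.10, 0.85));
    return 0;
}
```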
Tuning Opportunities
Opportunities for Tuning w/o Code Access
Sanity Check
Use HW Counters to Measure Performance
To check for Performance Anomalies
IO behavior
System call statistics
Hardware
Choose the optimal hardware platform
File system, IO parameters
Parameterization
Choose optimal number of threads / MPI processes
Thread / Process Placement (NUMA)
Mapping MPI topology to hardware topology
MPI parameterization (buffers, protocols)
Optimal libraries (MKL …)
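On the library point: swapping a hand-written kernel for a tuned BLAS often needs no more than relinking, and where source is available the change is typically a one-line substitution. A minimal sketch of the standard CBLAS call (provided by MKL among others); the wrapper name matmul is illustrative:

```c
/* Delegating C = A * B to a tuned BLAS via the standard CBLAS
 * interface instead of a naive triple loop. */
#include <cblas.h>

void matmul(int n, const double *a, const double *b, double *c) {
    /* C = 1.0 * A * B + 0.0 * C; row-major n x n matrices */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);
}
```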
Opportunities for Tuning w/ Code Changes
Cache Tuning
padding, blocking, loop-based optimization techniques
Inlining/outlining
Help the compiler to perform optimizations …
MPI optimization
avoid global synchronization
Hide / reduce communication overhead, non-blocking communication (see the first sketch after this list)
Coalesce communications …
OpenMP optimization
Extend parallel regions
Check for false sharing
NUMA optimization: first touch (see the second sketch after this list), migration
In vogue: Add OpenMP to an MPI code to improve scalability
Of course: choosing the optimal algorithm is crucial
To be handled by or with the domain expert
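Two of the items above lend themselves to short, generic sketches; neither is taken from any of the codes mentioned here. First, hiding communication overhead with non-blocking MPI in a one-directional halo exchange:

```c
/* Overlap communication and computation: start the halo exchange,
 * update the interior (which needs no halo data), then wait before
 * updating the boundary. One direction shown; a real halo exchange
 * is symmetric. */
#include <mpi.h>

void sweep(double *u, double *unew, double *halo_recv, int n,
           int left, int right, MPI_Comm comm) {
    MPI_Request reqs[2];
    /* Receive the left neighbor's boundary value, send ours right. */
    MPI_Irecv(halo_recv, 1, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Isend(&u[n - 1], 1, MPI_DOUBLE, right, 0, comm, &reqs[1]);

    for (int i = 1; i < n - 1; i++)        /* interior: no halo needed */
        unew[i] = 0.5 * (u[i - 1] + u[i + 1]);

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    unew[0] = 0.5 * (halo_recv[0] + u[1]); /* boundary: needs the halo */
}
```

Second, the first-touch NUMA optimization: initialize data with the same parallel loop structure that later computes on it, so each memory page ends up on the NUMA node of the thread that uses it:

```c
/* First touch: pages are placed on the NUMA node of the thread that
 * first writes them, so initialize with the same static schedule as
 * the compute loop. */
void init(int n, double *x, double *y) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {
        x[i] = 0.0;
        y[i] = 1.0;
    }
}

void daxpy(int n, double a, double *x, const double *y) {
    /* Same schedule: each thread works on the data it placed above. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        x[i] += a * y[i];
}
```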
Opportunities for Tuning w/o Code Changes
Development Environment
Choose the optimal compiler
Choose optimal compiler options
Autoparallelization
Compiler Profile / Feedback
Adapt dataset: Partitioning / Blocking – Load Balancing
This list is not intended to be exhaustive, but rather to illustrate that the
skill set of an HPC tuning expert is very different from that of an
application scientist who develops a program, but both skill sets are
needed.
SimLabs
Interdisciplinary Collaboration: HPC, Domain Expert, Numerical Expert, …
Teaching via Courses and Workshops
Selected courses and workshops in Aachen, since 2011:
Date Event
03/2011 Parallel Programming in CES (en), 1 week, 75 part.
05/2011 Visual Studio 2010 + Windows HPC Server workshop, 25 selected part.
10/2011 AIXcelerate Tuning Workshop (en) (with Intel), 25 selected part.
12/2011 Parallel Programming with MATLAB, 35 part.
03/2012 Parallel Programming in CES (en), 1 week, 55 part.
08-09/2012 Parallel Programming Summer Courses (en): MPI, OpenMP, Tools, …
10/2012 Planned: Tuning for bigSMP HPC Workshop (en)
10/2012 Planned: OpenACC Workshop (en)
11/2012 Planned: Technical Cloud Computing with Microsoft Azure (en)
Success Stories
HPC Consulting may save real money: Combustion of “Biofuels”
Primary Breakup for Diesel Sprays: Hybrid CDPLit
Adding a single OpenMP-parallelized kernel improves efficiency by approx. 10%
This turns into a cost reduction equivalent to one FTE/yr.
Human effort ~7 weeks
[Chart: runtime for small test data set (16 s to 512 s) versus nodes (1 to 64) for 8 PPN / 1 TPP, 4 PPN / 1 TPP, and 4 PPN / 2 TPP]
Cluster of Excellence „Tailor-Made Fuels from Biomass“, Inst. F. Combustion Technology, RWTH Aachen University
Green IT = Using Resources more efficiently
Code – Impact – Partner
FLOWer: 1.8x higher efficiency of the hybrid Navier-Stokes solver simulating the landing of a space launch vehicle, by adding autoparallelization to the MPI code and carefully assigning threads to processes to adjust load imbalances. (RWTH Laboratory of Mechanics)
Matlab: 150x speed-up of the numerical solution of the diffusion equation by extracting the compute-intense kernel, transforming it into Fortran code, plus careful cache tuning. (RWTH Institute of Physical Chemistry)
Gene-hunter: 14x speed-up through cache optimization plus scalable MPI parallelization of linkage analysis to identify genes which may cause diseases. (Inst. f. Medical Biometry, Computer Science and Epidemiology (IMBIE), Bonn)
Dynmatt: 33x speed-up through I/O optimization by implementing appropriate buffering and reducing metadata operations. (RWTH Institute of Steel and Light Alloy Building)
FIRE: ~100x speed-up of image recognition software on a large SMP by nested parallelization, which saves a lot of I/O. (RWTH Chair of Computer Science 6)
NestedCP: 10-50x speed-up for critical point extraction in flow simulation output data through nested parallelization with OpenMP, even with highly imbalanced work chunks. (Virtual Reality Center Aachen)
TFS: 20x speed-up for simulation of human nasal flow for computer-aided surgery through nested parallelization with OpenMP. (RWTH Aerodynamic Institute, Parallel Software Products)
Higher sophistication of parallelization leads to higher scalability, but does not save resources…
HECToR: Distributed CSE Success Stories
Code           Domain                         Effect                          Effort   Saving
CASTEP         Key materials science          4x speed and 4x scalability     8 PMs    320k-480k £ (p.a.)
NEMO           Oceanography                   Speed and I/O performance       6 PMs    95k £ (p.a.)
CASINO         Quantum Monte Carlo            4x performance, 4x scalability  12 PMs   760k £ (p.a.)
CP2K           Materials science              12% speed and scalability       12 PMs   1,500k £ (in total)
GLOMAP/TOMCAT  Atmospheric chemistry          15% performance                 ?        -
CITCOM         Geodynamic thermal convection  30% performance                 ?        significant
EBL            Fluid turbulence               4x scalability                  12 PMs   -
ChemShell      Catalytic chemistry            8x performance                  9 PMs    -
Fluidity-ICOM  Ocean modelling                Scalability                     ?        -
DL_POLY_3      Molecular dynamics             20x performance                 6 PMs    -
CARP           Heart modelling                20x performance                 8 PMs    -
HPC Consulting may save real money: Hydro-Dynamics with XNS
XNS (M. Behr, CATS, RWTH)
Simulation of hydro-dynamic forces at the Ohio Dam
Parallelized with MPI
Scales very well for larger cases
Additional OpenMP parallelization: 9 parallel regions
Human effort: ~6 weeks
20-40% improvement
[Chart: improvement of execution time in percent (-20% to 50%), best-effort MPI-only versus best-effort hybrid, on 1 to 64 compute nodes; higher is better]
Nehalem EP cluster (2 processor chips, 4 cores each) with InfiniBand QDR
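The slides do not show XNS source; as a generic illustration of the hybrid pattern used here (MPI between nodes, OpenMP parallel regions inside each rank), a minimal self-contained sketch:

```c
/* Generic hybrid MPI + OpenMP pattern (not XNS code): MPI ranks
 * partition the work across nodes, OpenMP threads share the work
 * within each rank. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank, size;
    /* Request threaded MPI; only the master thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20;
    double local = 0.0;

    /* Each rank takes a cyclic slice; OpenMP threads split the slice. */
    #pragma omp parallel for reduction(+:local)
    for (int i = rank; i < n; i += size)
        local += 1.0 / (double)(i + 1);

    double global;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum = %f\n", global);
    MPI_Finalize();
    return 0;
}
```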
XNS: How much efficiency do we want to sacrifice? – Execution Time
PPN = processes per node, TPP = threads per process
[Chart: execution time in seconds (20 to 1280) versus number of compute nodes (1 to 64) for PPN x TPP combinations from 1x1 up to 8x1; lower is better]
XNS: How much efficiency do we want to sacrifice? - Efficiency
PPN = processes per node, TPP = threads per process
[Chart: parallel efficiency (0% to 100%) versus number of compute nodes (1 to 64) for the same PPN x TPP combinations; higher is better]
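The efficiency plotted here is presumably the standard definition relative to the single-node run, with T(N) the execution time on N nodes:

```latex
E(N) = \frac{T(1)}{N \cdot T(N)}
```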
XNS: How much efficiency do we want to sacrifice? – Improvements versus Efficiency
# Compute Nodes                                    1        2        4        6        8       16       32       48       64
Parallelization efficiency (best effort)     100.00%   93.40%   74.10%   59.35%   49.53%   32.95%   17.46%   12.67%    8.57%
Relative improvement of hybrid (best effort) -16.19%  -14.01%   15.52%   33.21%   38.93%   19.79%   20.98%   20.53%   15.95%
Summary
Summary
Higher investment in brainware pays off, even if it is only for tuning
A university can save real money by investing in brainware rather than in electricity for inefficiently used hardware
HPC performance analysts are a rare species
The job requires HW knowledge, tools, programming languages, compiler technologies and paradigms, (algorithms), OS effects
It takes some time to hire someone and get him/her up to speed
Teamwork: more and different brains create more synergy
And now there are GPUs …
They have much more headroom for tuning
The End – and an Invitation …
German Heterogeneous Computing Group (GHCG)
Independent interest group around high-performance computing with accelerators in the German-speaking region
Goal: intensify the technical and scientific exchange on projects, hardware, and algorithms
User group meeting
Date: October 1-2, 2012
Location: Braunschweig (Haus der Wissenschaft)
Register and participate (free of charge)!
www.ghc-group.org (registration & further info)
Everyone is welcome!
Topics (among others)
News in hardware and software
CFD on heterogeneous architectures