
Rechen- und Kommunikationszentrum (RZ)

Performance Characteristics of Large SMP Machines

Dirk Schmidl, Dieter an Mey, Matthias S. Müller

[email protected]

Agenda

Investigated Hardware

Kernel Benchmark Results

Memory Bandwidth

NUMA Distances

Synchronizations

Applications

NestedCP

TrajSearch

Conclusion


Hardware

HP ProLiant DL980 G7

8 x Intel Xeon X6550 @ 2 GHz

256 GB main memory

internally built from several boards

SGI Altix Ultraviolet

104 x Intel Xeon E7-4870 @ 2.4 GHz

about 2 TB main memory

2-socket boards connected with a NUMAlink network

Bull Coherence Switch System

16 x Intel Xeon X7550 @ 2 GHz

256 GB main memory

4-socket boards externally connected with the Bull Coherence Switch (BCS)


ScaleMP System

64 x Intel Xeon X7550 @ 2 GHz

about 4 TB main memory

4-socket boards connected with InfiniBand

vSMP Foundation software used to create a cache-coherent single system

Intel Xeon Phi

1 Intel Xeon Phi coprocessor @ 1.05 GHz

plugged into a PCIe slot

8 GB main memory


Serial Bandwidth

[Figure: Five panels of write bandwidth in GB/s over data sizes from 1 B to 4 GB, one per system (HP, Altix UV, BCS, ScaleMP, Xeon Phi). The host-system panels compare local, remote 1st-level, and remote 2nd-level accesses; the Phi panel compares standard execution with software prefetching.]
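As a rough illustration of what such a measurement looks like, the following is a minimal sketch of a serial write-bandwidth kernel in C with OpenMP timing. Buffer size, repetition count, and output format are illustrative choices, not the benchmark actually used for the plots above.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>

int main(void) {
    const size_t size = 256UL * 1024 * 1024;   /* 256 MB working set (example) */
    const int reps = 10;
    double *buf = malloc(size);
    if (buf == NULL) return 1;
    memset(buf, 0, size);                      /* touch the pages before timing */

    const size_t n = size / sizeof(double);
    double start = omp_get_wtime();
    for (int r = 0; r < reps; r++)
        for (size_t i = 0; i < n; i++)
            buf[i] = (double)r;                /* pure write stream */
    double elapsed = omp_get_wtime() - start;

    /* read one element so the stores cannot be optimized away */
    printf("checksum %.1f, write bandwidth: %.2f GB/s\n",
           buf[n - 1], (double)size * reps / elapsed / 1e9);
    free(buf);
    return 0;
}
```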


Distance Matrix

Measured bandwidth between sockets

memory and threads placed with numactl

normalized so that local access on socket 0 corresponds to 10 (larger values mean lower bandwidth, i.e. greater NUMA distance)

BCS (16 sockets):

Socket   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
  0     10 13 13 13 57 57 57 57 59 59 59 59 59 57 57 57
  1     13 10 13 13 56 55 56 56 56 56 56 55 55 56 56 55
  2     14 13 10 13 58 58 58 58 56 56 56 56 58 58 58 58
  3     13 13 13 10 56 55 56 55 56 56 56 55 56 55 56 55
  4     56 56 56 56 10 13 13 13 56 56 56 57 58 58 58 58
  5     55 55 55 55 13 10 13 13 55 55 55 55 56 56 56 55
  6     58 58 58 59 13 13 10 13 58 58 58 58 56 56 56 57
  7     56 55 56 55 13 13 13 10 56 56 56 56 56 56 56 56
  8     58 58 58 58 56 57 56 56 10 13 13 13 56 56 56 56
  9     56 56 55 55 55 55 55 55 13 10 13 13 55 55 56 55
 10     56 56 56 56 58 58 58 58 13 13 10 13 58 58 58 58
 11     56 56 56 55 56 56 56 55 13 13 13 10 56 56 56 56
 12     56 56 56 56 58 58 58 58 56 57 56 56 10 13 13 13
 13     55 55 55 55 56 56 55 55 56 55 55 55 13 10 13 13
 14     58 58 58 58 56 56 56 56 58 58 58 58 13 13 10 13
 15     56 56 56 56 56 56 56 56 56 56 56 56 13 13 13 10

HP (8 sockets):

Socket   0  1  2  3  4  5  6  7
  0     10 10 17 13 18 18 18 18
  1     10 10 17 13 18 18 18 18
  2     17 17 10 11 18 18 18 18
  3     17 17 10 11 19 19 18 18
  4     18 18 18 18 10 10 17 17
  5     18 18 18 18 10 10 17 17
  6     18 18 18 18 17 17 10 10
  7     18 19 18 18 17 17 10  9

• Remote accesses are much more expensive on the BCS machine.
• The HP machine internally also has several NUMA levels.
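A single cell of such a matrix is obtained by binding execution to one socket and memory to another; the slides did this with numactl (e.g. numactl --cpunodebind=0 --membind=4 ./bench). The sketch below shows the same idea with the libnuma API; node numbers, buffer size, and the memset-based write kernel are illustrative assumptions, not the original benchmark.

```c
#include <stdio.h>
#include <string.h>
#include <numa.h>      /* link with -lnuma */
#include <omp.h>

/* Measure write bandwidth from a thread on cpu_node to memory on mem_node. */
static double measure(int cpu_node, int mem_node, size_t size) {
    numa_run_on_node(cpu_node);                    /* bind execution */
    char *buf = numa_alloc_onnode(size, mem_node); /* bind allocation */
    memset(buf, 0, size);                          /* commit the pages */

    double start = omp_get_wtime();
    memset(buf, 1, size);                          /* timed write stream */
    double bw = (double)size / (omp_get_wtime() - start) / 1e9;

    numa_free(buf, size);
    return bw;
}

int main(void) {
    if (numa_available() < 0) return 1;
    /* e.g. one local and one remote cell of the matrix */
    printf("0 -> 0: %.2f GB/s\n", measure(0, 0, 1UL << 28));
    printf("0 -> 4: %.2f GB/s\n", measure(0, 4, 1UL << 28));
    return 0;
}
```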


Parallel Bandwidth

Read and Write Bandwidth on local data

16 MB memory footprint per thread

[Figure: Read and write bandwidth in GB/s (0-250) over the number of threads (0-240) for HP, Altix, BCS, ScaleMP, and Phi.]


mem_go_around

Investigates the slow-down when remote accesses occur

Every thread initializes local memory and measures the bandwidth

In step n, thread t uses the memory of thread (t+n) % nthreads

this increases the number of remote accesses in every step (a sketch of the access pattern follows below)
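A minimal sketch of this access pattern, reconstructed from the description above; buffer size and timing details are assumptions, not the original benchmark source.

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1UL << 21)                  /* 2M doubles = 16 MB per thread */

int main(void) {
    double **mem = malloc(omp_get_max_threads() * sizeof(double *));

    #pragma omp parallel
    {
        int t  = omp_get_thread_num();
        int nt = omp_get_num_threads();

        /* first touch: the pages of mem[t] become local to thread t */
        mem[t] = malloc(N * sizeof(double));
        for (size_t i = 0; i < N; i++) mem[t][i] = 0.0;
        #pragma omp barrier

        /* step n: thread t writes the memory of thread (t+n) % nt */
        for (int n = 0; n < nt; n++) {
            double *buf = mem[(t + n) % nt];
            double start = omp_get_wtime();
            for (size_t i = 0; i < N; i++) buf[i] = (double)n;
            double bw = N * sizeof(double) / (omp_get_wtime() - start) / 1e9;
            #pragma omp barrier        /* keep all threads in the same step */
            #pragma omp single
            printf("step %3d: ~%.2f GB/s per thread\n", n, bw);
        }
    }
    return 0;
}
```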


[Figure: Memory bandwidth in GB/s (0-120) over turns 1-512 for HP, Altix, BCS, ScaleMP, and Phi.]


Synchronization

Overhead in microseconds to acquire a lock

Synchronization overhead rises with the number of threads

ScaleMP introduces considerably more overhead for large thread counts

#threads    BCS    SCALEMP   PHI    ALTIX   HP
1           0.06   0.07      0.40   0.05    0.93
8           0.27   0.29      1.89   0.21    0.26
32/30       0.62   0.99      1.77   3.29    0.97
64/60       1.04   24.36     1.94   3.72    1.07
128/120     1.64   35.78     2.01   2.99    -
240         -      -         2.26   -       -
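Numbers like these can be obtained with a simple contended-lock microbenchmark. The following is a minimal sketch using an OpenMP lock; the iteration count and the reported metric (average cost per acquisition as experienced by each thread) are assumptions, not the exact benchmark behind the table.

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    const int iters = 100000;
    omp_lock_t lock;
    omp_init_lock(&lock);

    double start = omp_get_wtime();
    #pragma omp parallel
    for (int i = 0; i < iters; i++) {
        omp_set_lock(&lock);       /* contended acquire */
        omp_unset_lock(&lock);
    }
    double elapsed = omp_get_wtime() - start;

    /* each thread performed iters acquisitions in 'elapsed' seconds */
    printf("%d threads: %.2f us per lock acquisition\n",
           omp_get_max_threads(), elapsed / iters * 1e6);
    omp_destroy_lock(&lock);
    return 0;
}
```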


NestedCP: Parallel Critical Point Extraction

Virtual Reality Group of RWTH Aachen University:

Analysis of large-scale flow simulations

Feature extraction from raw data

Interactive analysis in a virtual environment (e.g., a CAVE)

Critical point: a point in the vector field where the velocity is zero

Andreas Gerndt, Virtual Reality Center, RWTH Aachen


parallelization done with OpenMP tasks

many independent tasks, synchronized only at the end (see the skeleton below)
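The structure described above maps onto OpenMP tasks roughly as follows. This is a skeleton only; search_block is a placeholder for the actual per-region critical-point search, which is not shown in the slides.

```c
#include <omp.h>

extern void search_block(int b);     /* placeholder for the real search */

void extract_critical_points(int nblocks) {
    #pragma omp parallel
    #pragma omp single
    {
        /* spawn many independent tasks ... */
        for (int b = 0; b < nblocks; b++) {
            #pragma omp task firstprivate(b)
            search_block(b);
        }
        #pragma omp taskwait         /* ... synchronized only at the end */
    }
}
```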

[Figure: NestedCP runtime in seconds (0-300) and speedup (0-100) over the number of threads (1 to 240) for BCS, ScaleMP, Phi, Altix, and HP.]


TrajSearch

Direct numerical simulation of a three-dimensional turbulent flow field produces large output arrays.

16384 processors on a BlueGene computed for about half a year to produce a 2048³ output grid (320 GB).

The trajectory analysis (TrajSearch), implemented with OpenMP, was optimized for large NUMA machines.

Here, the 1024³-cell data set (~40 GB) was used.

Institute for Combustion Technology


Optimizations:

reduced number of locks

NUMA-aware data initialization

data blocked into 8x8x8 blocks to load the nearest data on ScaleMP

self-written NUMA-aware scheduler (a sketch of the blocked initialization follows below)
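The blocking and NUMA-aware initialization could look roughly like this: a sketch under the assumption of a flat array in an 8x8x8-blocked layout with parallel first-touch placement. Names, the grid size, and the accessor are illustrative, not the TrajSearch source.

```c
#include <stddef.h>
#include <stdlib.h>

#define N 1024   /* grid points per dimension (1024^3 case, ~4 GB of floats) */
#define B 8      /* block edge length: 8x8x8 blocks */

/* Blocked layout: each 8x8x8 block is contiguous in memory, so the data a
   thread works on lives on few, mostly local pages. Accessor for the
   compute phase (not shown). */
static inline size_t blocked_index(int x, int y, int z) {
    size_t nb  = N / B;                                 /* blocks per dim */
    size_t blk = ((size_t)(z / B) * nb + y / B) * nb + x / B;
    size_t off = ((size_t)(z % B) * B + y % B) * B + x % B;
    return blk * (B * B * B) + off;
}

float *alloc_grid(void) {
    size_t nblocks = (size_t)(N / B) * (N / B) * (N / B);
    float *grid = malloc(nblocks * B * B * B * sizeof(float));

    /* NUMA-aware first touch: the thread that later processes a block
       initializes it, so its pages are allocated in local memory. */
    #pragma omp parallel for schedule(static)
    for (size_t blk = 0; blk < nblocks; blk++)
        for (size_t i = 0; i < (size_t)(B * B * B); i++)
            grid[blk * (B * B * B) + i] = 0.0f;
    return grid;
}
```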

[Figure: TrajSearch runtime in hours and speedup over the number of threads (8 to 128) for Altix, BCS, and ScaleMP.]


Conclusion

larger systems provide a larger total memory bandwidth

the overhead of many remote accesses is also higher on larger systems, as seen in the mem_go_around test

the caching in the vSMP software can hide the remote latency, even when larger arrays are read or written remotely

synchronization is a problem on all systems, and its cost increases with the number of cores

the Xeon Phi delivers good bandwidth and low synchronization overhead for a large number of threads

applications can run well on large NUMA machines

Remark: a revised version with newer performance measurements will soon be available on our website under publications: https://sharepoint.campus.rwth-aachen.de/units/rz/HPC/public/default.aspx


Thank you for your attention! Questions?