Porting Quantum ESPRESSO to GPU Accelerated Systems

Pietro Bonfà, Fabio Affinito, Carlo Cavazzoni
CINECA, Casalecchio di Reno, Italy

GTC2018
GA 676598 EUROPEAN CENTER OF EXCELLENCE - A H2020 E-INFRASTRUCTURE

https://www.nvidia.com/en-us/data-center/tesla-k80/

What is Quantum ESPRESSO

Porting strategy

Benchmarks

Conclusions

Outlook

What is Quantum ESPRESSO

● QUANTUM ESPRESSO is an initiative coordinated by the QUANTUM ESPRESSO Foundation, with the participation of SISSA, CINECA, ICTP, and EPFL, and with many partners in Europe and worldwide.

● QUANTUM ESPRESSO is not a single application for quantum simulations; rather, it is a distribution of packages that perform different tasks and are designed to be interoperable.

● Free as in freedom (GPLv2) and open development.

● Runs on anything from a standalone workstation to massively parallel systems.

● Large scientific user base; a vehicle for new methods and new algorithms.
  ○ v6.2.1 → 70400 downloads
  ○ >50 contributors
  ○ 1600+ registered users
  ○ ~500k lines, Fortran (& C)

● Simplifies the transition of new science to HPC systems.

$ ./configure && make all

[Figure: posts/month in the mailing list (ML)]

QE Libraries

Some of the time-consuming workloads of many packages are already encapsulated in a number of libraries, namely:

LAXLib, FFTXlib, KS_Solvers

FFTW, MKL, ESSL, ...

Clues from profiling

PWscf (CPU version) running on a single KNL node with 64 MPI processes (best time to solution).

Porting strategy

Past and present QE GPU ports

[Timeline: 2012 to 2018]

● Porting effort carried out by MaX and supported by NVIDIA.

● CUDA C based plugin for QE 5.x (pw.x), developed by F. Spiga and I. Girotto.

● Independent CUDA Fortran based port of QE 6.1 (pw.x), developed by F. Spiga and NVIDIA. Provides the best performance; the most used features are implemented.

QE v5.4: CUDA C Plugin

✓✓ Self contained
  ● BLAS → PHIGEMM
  ● LAPACK → MAGMA
  ● 3 CUDA C kernels + cuFFT

✓ Good performance
  F. Spiga: http://www.tcm.phy.cam.ac.uk/~mdt26/esdg_slides/spiga_may13.pdf

✗ Boilerplate code (interface, kernel)

QE v6.1: CUDA Fortran

✓ Single programming language: Fortran + CUDA Fortran
  ● BLAS → cuBLAS
  ● LAPACK → Custom GPU Eigensolver (outperforms MAGMA)
  ● CUF Kernel directives and CUDA Fortran kernels

✓✓ Very good performance
  For a detailed description of the code and the benchmarks see: http://www.dcs.warwick.ac.uk/pmbs/pmbs17/PMBS17/

✗ Diverged from master branch

✗ Only selected features implemented

New Porting Strategy

Language: CUDA Fortran, leveraging the existing v6.1 code.

Programming model: explicit and directive based.

Plan:

1. Preserve modularity.
2. Maintain alignment with master branch. Maintain “hackability”.
3. Leave user experience intact.
4. General GPU architecture solutions.
5. Performance, of course.

New Porting Strategy

Application: pw.x

Legend: A = Accelerated, W = Working, U = Unavailable, B = Broken

Feature (by GPU version)     v5.4     v6.1     v6.3
Total Energy (K points)      A        A        A
Forces                       W        A        W
Stress                       W        A        W
Collinear Magnetism          B (?)    A        A
Non-collinear magnetism      U        U        A
Gamma trick                  A        W (*)    A
US PP                        A        A        A
PAW PP                       ?        A (*)    A (*)
DFT+U                        W (?)    U        W
All other functionalities    W (?)    U (*)    W

New Porting Strategy

● Libraries
● Global Variables
● Memory Allocation

Libraries

● Full API support:

● Unit testing:

● Target best performance: CUDA Fortran, explicit CUDA API (concurrency, hardware specific options).

Libraries - FFTXlib

● Many small 3D FFTs (10^1 → 10^3)
● Overlap of communication and computation
● Batched work

[Figure: FFT pipeline repeated "# bands" times. In the batched example, 8 bands are processed as two batches of 4 bands; the stages shown are 1D FFT, Scatter, Alltoall and 2D FFT.]
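The batching and overlap idea on the FFTXlib slides can be sketched in a few lines of Python. This is only an illustration, not the FFTXlib implementation: the helper names `make_batches` and `pipeline_schedule` are hypothetical, and the stage order (1D FFT, Scatter, Alltoall, 2D FFT) is read off the figure labels and is an assumption. The sketch shows how, once bands are grouped into batches, one batch can be in a communication stage while another is in a compute stage.

```python
def make_batches(n_bands, batch_size):
    """Split band indices 0..n_bands-1 into consecutive batches."""
    return [list(range(start, min(start + batch_size, n_bands)))
            for start in range(0, n_bands, batch_size)]


def pipeline_schedule(n_batches, stages):
    """Return, for each time step, the (batch, stage) pairs that are active.

    Batch b enters stage s at step b + s, so while one batch is in a
    communication stage another batch can be in a compute stage.
    """
    n_steps = n_batches + len(stages) - 1
    schedule = []
    for step in range(n_steps):
        active = [(batch, stages[step - batch])
                  for batch in range(n_batches)
                  if 0 <= step - batch < len(stages)]
        schedule.append(active)
    return schedule


# Stage names taken from the figure labels; their order is an assumption.
STAGES = ["1D FFT", "Scatter", "Alltoall", "2D FFT"]

batches = make_batches(n_bands=8, batch_size=4)   # two batches of 4 bands
schedule = pipeline_schedule(len(batches), STAGES)

for step, active in enumerate(schedule):
    print(step, active)
```

In this toy schedule, step 3 has batch 0 in the 2D FFT (computation) while batch 1 is in Alltoall (communication), which is the kind of overlap the slide refers to.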

Global Variables

Home-brewed managed memory:

1. Prioritize data encapsulation efforts.
2. Enforce a simple and effective update scheme for global variables.
3. Can provide asynchronous updates (not implemented yet).
4. General data duplication scheme.
5. Preserves performance on old hardware.

    USE us,      ONLY : nqx, dq, spline_ps
    USE us_gpum, ONLY : tab_d, tab_d2y_d
    !
    implicit none
    !
    if (lmaxkb.lt.0) return
    call start_clock ('init_us_2')

    call using_tab_d(READ)                          ! <- sync. here
    if (spline_ps) call using_tab_d2y_d(READWRITE)  ! <- and here
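One way to read the `using_*(intent)` calls above is as a pair of "out of date" flags per duplicated variable: declaring an intent before touching one copy triggers a synchronization only when that copy is stale, and a write intent marks the other copy as stale. The Python sketch below illustrates that reading. It is not the QE code: the class `MirroredVariable`, its methods and the transfer counter are hypothetical, and plain lists stand in for host and device arrays.

```python
READ, READWRITE = 0, 1  # intent flags, mirroring the names used on the slide


class MirroredVariable:
    """Host array with a device duplicate and one 'stale' flag per side."""

    def __init__(self, name, host_data):
        self.name = name
        self.host = list(host_data)
        self.device = list(host_data)   # stand-in for a GPU allocation
        self.host_stale = False
        self.device_stale = False
        self.transfers = 0              # counts simulated host<->device copies

    def using_host(self, intent):
        """Declare the intent to access the host copy; sync first if stale."""
        if self.host_stale:
            self.host = list(self.device)
            self.host_stale = False
            self.transfers += 1
        if intent == READWRITE:
            self.device_stale = True    # device copy will be out of date
        return self.host

    def using_device(self, intent):
        """Declare the intent to access the device copy; sync first if stale."""
        if self.device_stale:
            self.device = list(self.host)
            self.device_stale = False
            self.transfers += 1
        if intent == READWRITE:
            self.host_stale = True      # host copy will be out of date
        return self.device


tab = MirroredVariable("tab", [0.0, 0.0, 0.0])
tab.using_host(READWRITE)[0] = 1.5      # CPU code updates the table
on_device = tab.using_device(READ)      # the copy happens here, once
tab.using_device(READ)                  # already up to date: no copy
print(on_device[0], tab.transfers)
```

The design choice illustrated here is that copies are driven by declared intent at the point of use, so code that never touches the device copy never pays for a transfer.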

Memory allocation

● pw.x allocates many scratch variables. This substantially impacts the performance of the accelerated version of the subroutines.
● At the same time, GPU memory is limited.

QE GPU v6.1:

    USE some_module, ONLY : work
    !
    implicit none
    !
    IF( ALLOCATED( work ) .and. SIZE( work ) < lwork ) DEALLOCATE( work )
    IF( .not. ALLOCATED( work ) ) ALLOCATE( work( max_lwork ) )
    [...]

QE GPU v6.3:

    USE buffer_module, ONLY : gpu_buffer
    !
    implicit none
    !
    REAL, POINTER :: work(:)
    CALL gpu_buffer%lock_buffer(work, 10, ierr)
    [...]
    CALL gpu_buffer%release_buffer(work, ierr)
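The `lock_buffer` / `release_buffer` pair in the v6.3 snippet suggests a pool of reusable scratch buffers. A minimal Python sketch of such a pool follows, under stated assumptions: the class `BufferPool`, its reuse policy (smallest free buffer that is large enough) and the allocation counter are all hypothetical and only illustrate the idea; Python lists stand in for device memory.

```python
class BufferPool:
    """Reuse scratch buffers instead of allocating a new one on every call."""

    def __init__(self):
        self._free = []        # released buffers available for reuse
        self._locked = {}      # id(buffer) -> buffer currently handed out
        self.allocations = 0   # number of real allocations performed

    def lock_buffer(self, size):
        """Return a buffer holding at least `size` elements."""
        # Prefer the smallest free buffer that is large enough.
        candidates = [b for b in self._free if len(b) >= size]
        if candidates:
            buf = min(candidates, key=len)
            self._free = [b for b in self._free if b is not buf]
        else:
            buf = [0.0] * size          # stand-in for a device allocation
            self.allocations += 1
        self._locked[id(buf)] = buf
        return buf

    def release_buffer(self, buf):
        """Hand a buffer back to the pool; its memory is kept for reuse."""
        if id(buf) not in self._locked:
            raise ValueError("buffer was not locked by this pool")
        del self._locked[id(buf)]
        self._free.append(buf)


pool = BufferPool()
for _ in range(3):                      # e.g. three calls to the same routine
    work = pool.lock_buffer(10)
    work[0] = 1.0
    pool.release_buffer(work)
print(pool.allocations)                 # only the first call allocates
```

Compared with a module-level `work` array, a pool of this kind lets several routines share scratch space while keeping the total allocated memory bounded by the peak number of simultaneously locked buffers.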

Recap

● Libraries
● Global Variables
● Memory Allocation

✓ Self contained
✓ Single programming language: Fortran + CUDA Fortran
✓ Aligned with official develop branch
❓ Performance...

Benchmarks

Benchmark systems

Compute units

Piz Daint XC50 @ CSCS (Q4 2016, 0.5 + 4.7 TFLOPs):
  Model: Xeon E5-2690 v3 (HSW) @ 2.60 GHz
  Cores: 1x12 = 12
  Accelerators: 1 x P100
  RAM: 64 GB/node

Galileo @ CINECA (Q1 2015, 0.6 + 2x2.9 TFLOPs):
  Model: Xeon E5-2630 v3 (HSW) @ 2.40 GHz
  Cores: 2x8 = 16
  Accelerators: 2 x K80
  RAM: 128 GB/node

Marconi @ CINECA (Q3 2016, 1.3 TFLOPs):
  Model: Xeon E5-2697 v4 (BDW) @ 2.30 GHz
  Cores: 2x18 = 36
  RAM: 128 GB/node
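The CPU part of the TFLOPs figures quoted for the three systems is consistent with a simple theoretical-peak estimate, cores × clock × FLOPs per cycle. The Python sketch below reproduces that arithmetic. It assumes 16 double-precision FLOPs per cycle per core (two AVX2 FMA units on Haswell and Broadwell) and the nominal clock listed on the slide; whether the slide's numbers were derived exactly this way is an assumption. The function name `cpu_peak_tflops` is hypothetical.

```python
# Assumption: 16 double-precision FLOPs per cycle per core
# (2 FMA units x 4 doubles per AVX2 register x 2 operations per FMA).
FLOPS_PER_CYCLE = 16


def cpu_peak_tflops(cores, clock_ghz, flops_per_cycle=FLOPS_PER_CYCLE):
    """Theoretical double-precision peak of the CPU part of a node, in TFLOPs."""
    return cores * clock_ghz * flops_per_cycle / 1000.0


# (cores per node, nominal clock in GHz), as listed on the slide.
nodes = {
    "Piz Daint": (12, 2.60),
    "Galileo": (16, 2.40),
    "Marconi": (36, 2.30),
}

peaks = {name: cpu_peak_tflops(cores, clock)
         for name, (cores, clock) in nodes.items()}

for name, peak in peaks.items():
    print(f"{name}: {peak:.2f} TFLOPs")
```

Rounded to one decimal, these estimates give 0.5, 0.6 and 1.3 TFLOPs, matching the CPU terms quoted for Piz Daint, Galileo and Marconi respectively; the accelerator terms are vendor peak figures and are not recomputed here.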

Benchmark systems

Compute units

Piz Daint XC50 @ CSCS (Q4 2016, 0.5 + 4.7 TFLOPs):
  Aries routing and communications ASIC, and Dragonfly network topology.

Galileo @ CINECA (Q1 2015, 0.6 + 2x2.7 TFLOPs):
  Infiniband network, with OFED v1.5.3, capable of a maximum bandwidth of 40 Gbit/s between each pair of nodes.

Marconi @ CINECA (Q3 2016, 1.3 TFLOPs):
  Intel Omnipath, 100 Gb/s. Fat Tree OPA (2:1 oversubscription tapering at the level of the core switches only).

[Figure: node diagrams with GPU, CPU and NIC labels]

Benchmark details

● Total time for the iterative solution of the KS equation is compared for the CPU and the GPU versions of pw.x.

● Best time to solution per compute unit(s) is reported.

● Optimal execution parameters for v6.1 and v6.3 may differ.

[Figure: pw.x workflow with the stages Initialization, Iterations for electronic ground state, Forces and Stress, Structural optimization]

C70

Very small test case, gamma trick.

number of atoms/cell = 280
number of atomic types = 1
number of electrons = 1120
number of Kohn-Sham states = 672
kinetic-energy cutoff = 45 Ry
charge density cutoff = 450 Ry
convergence threshold = 1.0E-08

Dense grid: 1685364 G-vectors, FFT dimensions: (225, 128, 240)
Smooth grid: 426442 G-vectors, FFT dimensions: (144, 81, 150)

Iterations to reach convergence: 16

1. Speedup GPU vs CPU ~ 1.5x
2. v6.1 is missing gamma trick ([icon] vs [icon])
3. CPU version scales better
4. At saturation GPU still faster
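For a sense of scale, the FFT dimensions listed for C70 can be turned into a per-array memory estimate. The Python sketch below assumes one double-precision complex number (16 bytes) per grid point; the function name `fft_grid_mib` is hypothetical, and the estimate covers a single array on the full grid only, not the total memory footprint of a run.

```python
BYTES_PER_COMPLEX_DOUBLE = 16  # assumption: two 8-byte reals per grid point


def fft_grid_mib(dims, bytes_per_point=BYTES_PER_COMPLEX_DOUBLE):
    """Memory of one array defined on a full 3D FFT grid, in MiB."""
    n1, n2, n3 = dims
    return n1 * n2 * n3 * bytes_per_point / 2**20


dense = fft_grid_mib((225, 128, 240))    # C70 dense grid
smooth = fft_grid_mib((144, 81, 150))    # C70 smooth grid

print(f"dense grid:  {dense:.1f} MiB per array")
print(f"smooth grid: {smooth:.1f} MiB per array")
```

Under these assumptions one array takes roughly 105 MiB on the dense grid and 27 MiB on the smooth grid, so the number of bands transformed together in a batch directly multiplies the scratch memory needed on the GPU.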

AuSurf

Small test case, 2 k-points.

Iterations to reach convergence: 21±1

number of atoms/cell = 112
number of atomic types = 1
number of electrons = 1232
number of Kohn-Sham states = 800
kinetic-energy cutoff = 25 Ry
charge density cutoff = 200 Ry
convergence threshold = 1.0E-06

Dense grid: 2158381 G-vectors, FFT dimensions: (180, 90, 288)
Smooth grid: 763307 G-vectors, FFT dimensions: (125, 64, 200)

1. Speedup GPU vs CPU > 2x
2. v6.1 allocates more memory (but [icon] vs [icon] in this case)
3. CPU and GPU versions both scaling well.
4. v6.3 on GPUs is significantly slower than v6.1.

Ta2O5

Large test case, 26 k-points.

Iterations to reach convergence: [45, 49, 50, 51, 52]

number of atoms/cell = 96
number of atomic types = 2
number of electrons = 544
number of Kohn-Sham states = 326
kinetic-energy cutoff = 130 Ry
charge density cutoff = 520 Ry
convergence threshold = 1.0E-08

Dense grid: 3645397 G-vectors, FFT dimensions: (200, 180, 216)

1. Speedup GPU vs CPU > 2x
2. v6.1 allocates more memory (but [icon] vs [icon] in this case)
3. CPU and GPU versions both scaling well.
4. v6.3 on GPUs is significantly slower than v6.1.

Porting status

QE 6.3 GPU:

✓ is aligned with the community develop branch,
✓ passes all 186 tests of the feature testing suite,
✓ is undergoing integration with the main project,
✓ provides good performance, generally better than 2x (far from saturation),
✓ is ready for alpha release.

Collaboration and support from: J. Romero, M. Marić, M. Fatica, E. Phillips (NVIDIA), F. Spiga (ARM), A. Chandran (FZJ), I. Girotto (ICTP), P. Giannozzi (Univ. Udine), P. Delugas, S. De Gironcoli (SISSA).

Conclusions

● Preserved modularity
  ○ For code maintainability
  ○ For simpler development and debugging

● Preserved all functionalities
  ○ Same user experience
  ○ Various levels of acceleration for the various functionalities

● Preserved (promoted?) data encapsulation

(Image from www.nvidia.com/en-us/data-center/tesla-k80)

(Image modified from www.nvidia.com/en-us/data-center/tesla-k80)

Outlook and perspectives

● Investigate performance degradation from v6.1 to v6.3
  ○ How much is coming from missing components?
  ○ Impact of directive based programming model?

● More benchmarking on different HW combinations.

● More code validation; initialization and forces ported to CUDA Fortran.

● Prepare first alpha release.

THANK YOU FOR YOUR ATTENTION!

Credits: icons made by freepik from flaticon