GA 676598 · EUROPEAN CENTER OF EXCELLENCE - A H2020 E-INFRASTRUCTURE · GTC2018

Porting Quantum ESPRESSO to GPU Accelerated Systems

Pietro Bonfà, Fabio Affinito, Carlo Cavazzoni

CINECA, Casalecchio di Reno, Italy

https://www.nvidia.com/en-us/data-center/tesla-k80/


What is QuantumESPRESSO

Porting strategy

Benchmarks

Conclusions

Outlook


What is QuantumESPRESSO

● QUANTUM ESPRESSO is an initiative coordinated by the QUANTUM ESPRESSO Foundation, with the participation of SISSA, CINECA, ICTP and EPFL, and with many partners in Europe and worldwide.

● QUANTUM ESPRESSO is not a single application for quantum simulations; it is rather a distribution of packages that perform different tasks and are designed to be interoperable.

● Free as in freedom (GPLv2) and open development.


What is QuantumESPRESSO

● Runs from standalone workstations to massively parallel systems.

● Large scientific user base, vehicle for new methods and new algorithms.

○ v6.2.1 → 70400 downloads
○ >50 contributors
○ 1600+ registered users
○ ~500k lines, Fortran (& C)

● Simplify transition of new science to HPC systems.

$ ./configure && make all

[Figure: posts per month on the mailing list]


QE Libraries

Some of the time-consuming workloads of many packages are already encapsulated in a number of libraries, namely LAXLib, FFTXlib and KS_Solvers, which sit on top of external libraries (FFTW, MKL, ESSL, ...).


Clues from profiling

PWscf (CPU version) running on a single KNL node with 64 MPI processes (best time to solution).

Porting strategy


Past and present QE GPU ports

[Figure: timeline, 2012 to 2018]

● Porting effort carried out by MaX and supported by NVIDIA.

● CUDA C based plugin for QE 5.x (pw.x), developed by F. Spiga and I. Girotto.

● Independent CUDA Fortran based port of QE 6.1 (pw.x), developed by F. Spiga and NVIDIA. Provides the best performance; the most used features are implemented.


QE v5.4: CUDA C Plugin

✓✓ Self contained

● BLAS → PHIGEMM
● LAPACK → MAGMA
● 3 CUDA C kernels + cuFFT

✓ Good performance

✗ Boilerplate code (interface, kernel)

F. Spiga: http://www.tcm.phy.cam.ac.uk/~mdt26/esdg_slides/spiga_may13.pdf


QE v6.1: CUDA Fortran

✓ Single programming language: Fortran + CUDA Fortran

● BLAS → cuBLAS
● LAPACK → Custom GPU Eigensolver (outperforms MAGMA)
● CUF Kernel directives and CUDA Fortran kernels

✓✓ Very good performance

✗ Diverged from master branch

✗ Only selected features implemented

For a detailed description of the code and the benchmarks see: http://www.dcs.warwick.ac.uk/pmbs/pmbs17/PMBS17/


New Porting Strategy

Language: CUDA Fortran, building on the existing v6.1 code.

Programming model: explicit and directive based.

Plan:

1. Preserve modularity.
2. Maintain alignment with master branch. Maintain “hackability”.
3. Leave user experience intact.
4. General GPU architecture solutions.
5. Performance, of course.




New Porting Strategy

Application: pw.x

Legend: A = Accelerated, W = Working, U = Unavailable, B = Broken

GPU version                  v5.4     v6.1     v6.3
Total Energy (K points)      A        A        A
Forces                       W        A        W
Stress                       W        A        W
Collinear Magnetism          B (?)    A        A
Non-collinear magnetism      U        U        A
Gamma trick                  A        W (*)    A
US PP                        A        A        A
PAW PP                       ?        A (*)    A (*)
DFT+U                        W (?)    U        W
All other functionalities    W (?)    U (*)    W


New Porting Strategy

● Libraries
● Global Variables
● Memory Allocation


Libraries

● Full API support

● Unit testing

● Target best performance: CUDA Fortran, explicit CUDA API (concurrency, hardware specific options).


Libraries - FFTXlib

● Many small 3D FFTs (10¹ → 10³)
● Overlap of communication and computation
● Batched work

[Figure: batched FFT pipeline. Labels: "# bands times"; "8 bands"; "4 bands 1D FFT"; "Scatter"; "Alltoall"; "4 bands 2D FFT"]


Global Variables

Home-brewed managed memory:

1. Prioritize data encapsulation efforts.
2. Enforce a simple and effective update scheme for global variables.
3. Can provide asynchronous updates (not implemented yet).
4. General data duplication scheme.
5. Preserves performance on older hardware.

    USE us,      ONLY : nqx, dq, spline_ps
    USE us_gpum, ONLY : tab_d, tab_d2y_d
    !
    implicit none
    !
    if (lmaxkb.lt.0) return
    call start_clock ('init_us_2')

    call using_tab_d(READ)                           ! <- sync. here
    if (spline_ps) call using_tab_d2y_d(READWRITE)   ! <- and here
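One way such an update scheme could work is sketched below. This is only an illustration under stated assumptions, not the actual QE implementation: it is plain Fortran with no GPU involved, `tab_d` is ordinary host memory standing in for a device array, and the module name, the two out-of-date flags and the `n_copies` counter are all hypothetical. Only the `using_tab_d(READ)` / `READWRITE` calling style is taken from the slide.

```fortran
! Sketch only: plain Fortran, no GPU. tab_d is ordinary host memory standing
! in for a device array, and every name below is hypothetical.
module tab_sync_sketch
  implicit none
  integer, parameter :: READ = 0, READWRITE = 1
  real, allocatable :: tab(:)       ! host copy
  real, allocatable :: tab_d(:)     ! stand-in for the device copy
  logical :: tab_ood   = .false.    ! host copy is out of date
  logical :: tab_d_ood = .true.     ! device copy is out of date
  integer :: n_copies  = 0          ! transfers actually performed
contains

  subroutine using_tab(mode)
    ! Call before touching the host copy.
    integer, intent(in) :: mode
    if (tab_ood) then
      tab = tab_d                   ! "device" -> host
      n_copies = n_copies + 1
      tab_ood = .false.
    end if
    if (mode == READWRITE) tab_d_ood = .true.
  end subroutine using_tab

  subroutine using_tab_d(mode)
    ! Call before touching the device copy.
    integer, intent(in) :: mode
    if (tab_d_ood) then
      tab_d = tab                   ! host -> "device" (allocates on first use)
      n_copies = n_copies + 1
      tab_d_ood = .false.
    end if
    if (mode == READWRITE) tab_ood = .true.
  end subroutine using_tab_d

end module tab_sync_sketch

program demo
  use tab_sync_sketch
  implicit none
  allocate(tab(4))
  call using_tab(READWRITE)         ! host write: device copy becomes stale
  tab = 1.0
  call using_tab_d(READ)            ! first device use: one transfer
  call using_tab_d(READ)            ! already in sync: no transfer
  call using_tab_d(READWRITE)       ! device write: host copy becomes stale
  tab_d = 2.0 * tab_d
  call using_tab(READ)              ! host read: one transfer back
  if (n_copies /= 2) error stop "expected exactly two transfers"
  if (any(tab /= 2.0)) error stop "host copy was not refreshed"
  print *, "transfers performed:", n_copies
end program demo
```

The point of the design is that a transfer happens only when a copy is both needed and stale, so repeated `READ` accesses on the same side cost nothing.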


Memory allocation

● pw.x allocates many scratch variables. This substantially impacts the performance of the accelerated version of the subroutines.
● At the same time GPU memory is limited.

QE GPU v6.1:

    USE some_module, ONLY : work
    !
    implicit none
    !
    ! Nested test: SIZE must not be evaluated on an unallocated array
    IF ( ALLOCATED( work ) ) THEN
       IF ( SIZE( work ) < lwork ) DEALLOCATE( work )
    END IF
    IF ( .not. ALLOCATED( work ) ) ALLOCATE( work( max_lwork ) )
    [...]

QE GPU v6.3:

    USE buffer_module, ONLY : gpu_buffer
    !
    implicit none
    !
    REAL, POINTER :: work(:)
    CALL gpu_buffer%lock_buffer(work, 10, ierr)
    [...]
    CALL gpu_buffer%release_buffer(work, ierr)
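A buffer pool with this kind of lock/release interface could look roughly like the sketch below. It is a minimal illustration under assumptions, not the QE `buffer_module`: the memory is ordinary host memory (in CUDA Fortran the arrays would presumably carry the `device` attribute), and the pool layout, the `slot_t`/`pool_t` types, `max_slots` and `n_alloc` are hypothetical. Only the `lock_buffer(work, n, ierr)` and `release_buffer(work, ierr)` call shapes come from the slide.

```fortran
! Sketch only: plain Fortran, host memory standing in for device memory.
! The pool layout and all names other than lock_buffer/release_buffer are
! hypothetical.
module buffer_sketch
  implicit none
  integer, parameter :: max_slots = 8

  type :: slot_t
    real, pointer :: mem(:) => null()
    logical       :: locked = .false.
  end type slot_t

  type :: pool_t
    type(slot_t) :: slots(max_slots)
    integer      :: n_alloc = 0          ! real allocations performed
  contains
    procedure :: lock_buffer
    procedure :: release_buffer
  end type pool_t

contains

  subroutine lock_buffer(this, p, n, ierr)
    class(pool_t), intent(inout) :: this
    real, pointer, intent(out)   :: p(:)
    integer, intent(in)          :: n
    integer, intent(out)         :: ierr
    integer :: i
    p => null()
    ierr = 0
    ! First choice: reuse a free slot that is already large enough.
    do i = 1, max_slots
      if (this%slots(i)%locked) cycle
      if (.not. associated(this%slots(i)%mem)) cycle
      if (size(this%slots(i)%mem) < n) cycle
      this%slots(i)%locked = .true.
      p => this%slots(i)%mem(1:n)
      return
    end do
    ! Otherwise allocate into the first empty slot.
    do i = 1, max_slots
      if (associated(this%slots(i)%mem)) cycle
      allocate(this%slots(i)%mem(n), stat=ierr)
      if (ierr /= 0) return
      this%n_alloc = this%n_alloc + 1
      this%slots(i)%locked = .true.
      p => this%slots(i)%mem
      return
    end do
    ierr = -1                            ! pool exhausted
  end subroutine lock_buffer

  subroutine release_buffer(this, p, ierr)
    class(pool_t), intent(inout) :: this
    real, pointer, intent(inout) :: p(:)
    integer, intent(out)         :: ierr
    integer :: i
    ierr = -1                            ! stays -1 if p is not from this pool
    if (.not. associated(p)) return
    do i = 1, max_slots
      if (.not. this%slots(i)%locked) cycle
      if (size(p) > size(this%slots(i)%mem)) cycle
      if (associated(p, this%slots(i)%mem(1:size(p)))) then
        this%slots(i)%locked = .false.   ! memory is kept for the next lock
        p => null()
        ierr = 0
        return
      end if
    end do
  end subroutine release_buffer

end module buffer_sketch

program demo
  use buffer_sketch
  implicit none
  type(pool_t)  :: gpu_buffer
  real, pointer :: work(:)
  integer       :: ierr, step
  do step = 1, 3
    call gpu_buffer%lock_buffer(work, 10, ierr)
    if (ierr /= 0) error stop "lock failed"
    work = real(step)
    call gpu_buffer%release_buffer(work, ierr)
    if (ierr /= 0) error stop "release failed"
  end do
  if (gpu_buffer%n_alloc /= 1) error stop "buffer was not reused"
  print *, "allocations performed:", gpu_buffer%n_alloc
end program demo
```

Compared with keeping a module-level `work` array alive, a pool like this lets unrelated subroutines share the same scratch memory, which matters when GPU memory is limited.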


Recap

Libraries, Global Variables, Memory Allocation

✓ Self contained
✓ Single programming language: Fortran + CUDA Fortran
✓ Aligned with official develop branch
❓ Performance...

Benchmarks


Benchmark systems

Compute units

Piz Daint XC50 @ CSCS (Q4 2016, 0.5 + 4.7 TFLOPs)
    Model: Xeon E5-2690 v3 (HSW) @ 2.60 GHz
    Cores: 1x12 = 12
    Accelerators: 1 x P100
    RAM: 64 GB/node

Galileo @ CINECA (Q1 2015, 0.6 + 2x2.9 TFLOPs)
    Model: Xeon E5-2630 v3 (HSW) @ 2.40 GHz
    Cores: 2x8 = 16
    Accelerators: 2 x K80
    RAM: 128 GB/node

Marconi @ CINECA (Q3 2016, 1.3 TFLOPs)
    Model: Xeon E5-2697 v4 (BDW) @ 2.30 GHz
    Cores: 2x18 = 36
    RAM: 128 GB/node
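The CPU figures quoted for the three systems (0.5, 0.6 and 1.3 TFLOPs) appear consistent with a theoretical double-precision peak computed from core count and nominal clock. The check below is a sketch under one assumption that the slides do not state: 16 double-precision FLOP per cycle per core, the usual figure for Haswell and Broadwell cores with two AVX2 FMA units.

```latex
\begin{align*}
P_{\text{CPU}} &= N_{\text{cores}} \times f_{\text{clock}} \times 16\ \text{FLOP/cycle} \\
\text{Piz Daint:}\quad & 12 \times 2.60\ \text{GHz} \times 16 \approx 0.50\ \text{TFLOPs} \\
\text{Galileo:}\quad   & 16 \times 2.40\ \text{GHz} \times 16 \approx 0.61\ \text{TFLOPs} \\
\text{Marconi:}\quad   & 36 \times 2.30\ \text{GHz} \times 16 \approx 1.32\ \text{TFLOPs}
\end{align*}
```

Under the same reading, the second term in each sum would be the accelerator contribution per node.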


Benchmark systems

Compute units

Piz Daint XC50 @ CSCS: Aries routing and communications ASIC, and Dragonfly network topology.

Galileo @ CINECA: Infiniband network, with OFED v1.5.3, capable of a maximum bandwidth of 40 Gbit/s between each pair of nodes.

Marconi @ CINECA: Intel Omnipath, 100 Gb/s. Fat Tree OPA (2:1 oversubscription tapering at the level of the core switches only).

[Figure: node diagrams with CPU, GPU and NIC]

Benchmark details

● Total time for the iterative solution of the KS equation is compared for the CPU and the GPU versions of pw.x.

● Best time to solution per compute unit(s) is reported.

● Optimal execution parameters for v6.1 and v6.3 may differ.

[Figure: stages of a pw.x run: Initialization; Iterations for electronic ground state; Forces and Stress; Structural optimization]


C70

Very small test case, gamma trick.

number of atoms/cell = 280
number of atomic types = 1
number of electrons = 1120
number of Kohn-Sham states = 672
kinetic-energy cutoff = 45 Ry
charge density cutoff = 450 Ry
convergence threshold = 1.0E-08

Dense grid: 1685364 G-vectors FFT dimensions: ( 225, 128, 240)
Smooth grid: 426442 G-vectors FFT dimensions: ( 144, 81, 150)

Iterations to reach convergence: 16

1. Speedup GPU vs CPU ~ 1.5x
2. v6.1 is missing gamma trick
3. CPU version scales better
4. At saturation GPU still faster


AuSurf

Small test case, 2 k-points.

Iterations to reach convergence: 21±1

Dense grid: 2158381 G-vectors FFT dimensions: ( 180, 90, 288)
Smooth grid: 763307 G-vectors FFT dimensions: ( 125, 64, 200)

1. Speedup GPU vs CPU > 2x
2. v6.1 allocates more memory
3. CPU and GPU versions both scaling well.
4. v6.3 on GPUs is significantly slower than v6.1.


Ta2O5

Large test case, 26 k-points.

Iterations to reach convergence: [45, 49, 50, 51, 52]

Dense grid: 3645397 G-vectors FFT dimensions: ( 200, 180, 216)

1. Speedup GPU vs CPU ≳ 2x
2. v6.1 allocates more memory
3. [...] scaling well.
4. v6.3 on GPUs is significantly slower than v6.1.


Porting status

QE 6.3 GPU:

✓ is aligned with the develop branch of the community,
✓ passes all 186 tests of the feature testing suite,
✓ is undergoing integration with the main project,
✓ provides good performance, generally better than 2x (far from saturation),
✓ is ready for alpha release.

Collaboration and support from: J. Romero, M. Marić, M. Fatica, E. Phillips (NVIDIA), F. Spiga (ARM), A. Chandran (FZJ), I. Girotto (ICTP), P. Giannozzi (Univ. Udine), P. Delugas, S. De Gironcoli (SISSA).


Conclusions

● Preserved modularity
○ For code maintainability
○ For simpler development and debugging

● Preserved all functionalities
○ Same user experience
○ Various levels of acceleration for the different functionalities

● Preserved (promoted?) data encapsulation

(from www.nvidia.com/en-us/data-center/tesla-k80 )


(modified from www.nvidia.com/en-us/data-center/tesla-k80 )



Outlook and perspectives

● Investigate performance degradation from v6.1 to v6.3
○ How much is coming from missing components?
○ Impact of directive based programming model?

● More benchmarking on different HW combinations.

● More code validation; initialization and forces to be ported to CUDA Fortran.

● Prepare first alpha release.

THANK YOU FOR YOUR ATTENTION!

Credits: icons made by freepik from flaticon

https://www.flaticon.com/authors/freepik

http://www.flaticon.com
