HPC for the PT-Code · The PT - Code Basic Structure Solve a linear System Au = f Wherein A is...

25
HPC for the PT-Code Stefan Rosenberger Supervisor: Univ.-Prof. Dipl.-Ing. Dr. Gundolf Haase Institut f¨ ur Mathematik und wissenschaftliches Rechnen Universit¨ at Graz January 29, 2016 Stefan Rosenberger HPC for the PT-Code January 29, 2016 1 / 25

Transcript of HPC for the PT-Code · The PT - Code Basic Structure Solve a linear System Au = f Wherein A is...

Page 1: HPC for the PT-Code · The PT - Code Basic Structure Solve a linear System Au = f Wherein A is sparse, positive de nite and a 2 2 block system. Apply CG - method to solve the system

HPC for the PT-Code

Stefan Rosenberger

Supervisor: Univ.-Prof. Dipl.-Ing. Dr. Gundolf HaaseInstitut fur Mathematik und wissenschaftliches Rechnen

Universitat Graz

January 29, 2016

Stefan Rosenberger HPC for the PT-Code January 29, 2016 1 / 25

Page 2: HPC for the PT-Code · The PT - Code Basic Structure Solve a linear System Au = f Wherein A is sparse, positive de nite and a 2 2 block system. Apply CG - method to solve the system

The PT - Code

1 The PT - Code

2 OpenMP Parallelization

3 OpenACC Parallelization

Stefan Rosenberger HPC for the PT-Code January 29, 2016 2 / 25

Page 3: HPC for the PT-Code · The PT - Code Basic Structure Solve a linear System Au = f Wherein A is sparse, positive de nite and a 2 2 block system. Apply CG - method to solve the system

The PT - Code

Basic Structure

Solve a linear System Au = fWherein A is sparse, positive definite and a 2× 2 block system.

Apply CG - method to solve the system (with an inverse operator C ).

Use AMG - method to apply C efficient.

Stefan Rosenberger HPC for the PT-Code January 29, 2016 3 / 25

Page 4: HPC for the PT-Code · The PT - Code Basic Structure Solve a linear System Au = f Wherein A is sparse, positive de nite and a 2 2 block system. Apply CG - method to solve the system

The PT - Code

CG - Structure

We want to minimize the error r .

Start:

r0 = f − Au0

h0 = Cr0

d0 = h0

Calculate: αk =rTk hkdTk z

, and the new values:

uk+1 = uk + αkdk

rk+1 = rk − αkz

hk+1 = Crk+1 (≈ A−1rk+1)

βk =rTk+1hk+1

rTk hk

dk+1 = hk+1 + βkdkStefan Rosenberger HPC for the PT-Code January 29, 2016 4 / 25

Page 5: HPC for the PT-Code · The PT - Code Basic Structure Solve a linear System Au = f Wherein A is sparse, positive de nite and a 2 2 block system. Apply CG - method to solve the system

The PT - Code

Basic Structure

Use AMG - method to apply C efficient.

Apply multigrid for hk+1 = Crk+1, for l levels.

Necessary calculations: prolongation, restriction and smoothing(Jacobi method) between the levels l .

Those calculations are parallelizable!

Goal:

This final calculation should be done on a GPU.

Stefan Rosenberger HPC for the PT-Code January 29, 2016 5 / 25

Page 6: HPC for the PT-Code · The PT - Code Basic Structure Solve a linear System Au = f Wherein A is sparse, positive de nite and a 2 2 block system. Apply CG - method to solve the system

OpenMP Parallelization

1 The PT - Code

2 OpenMP Parallelization

3 OpenACC Parallelization

Stefan Rosenberger HPC for the PT-Code January 29, 2016 6 / 25

Page 7: HPC for the PT-Code · The PT - Code Basic Structure Solve a linear System Au = f Wherein A is sparse, positive de nite and a 2 2 block system. Apply CG - method to solve the system

OpenMP Parallelization

OpenMP - General Concept

OpenMP is parallelization on the CPU.

Stefan Rosenberger HPC for the PT-Code January 29, 2016 7 / 25

Page 8: HPC for the PT-Code · The PT - Code Basic Structure Solve a linear System Au = f Wherein A is sparse, positive de nite and a 2 2 block system. Apply CG - method to solve the system

OpenMP Parallelization

OpenMP - Application

We can apply simple pragma directives to create parallel loops.

e.g.:

#pragma omp p a r a l l e l f o r sha r ed ( cnt , . . . ) p r i v a t e ( s , . . . )for ( int i = 0 ; i < s i z e ; i++) {

.

.

.// shared values: usually read values

// private values: usually run index , pointer

.

.

.}

Remark: As mentioned in slide 1, we use a initial matrix A, whereinall entries are 2× 2 matrices. (c.f. function next slide)

Stefan Rosenberger HPC for the PT-Code January 29, 2016 8 / 25

Page 9: HPC for the PT-Code · The PT - Code Basic Structure Solve a linear System Au = f Wherein A is sparse, positive de nite and a 2 2 block system. Apply CG - method to solve the system

OpenMP Parallelization

OpenMP - Example Jacobi

void operator ( ) ( const t o o l b o x v e c t o r<S>& f , t o o l b o x v e c t o r<S>& u ) {// define constants , pointers

// ...

#pragma omp p a r a l l e l f o r d e f a u l t ( none ) sha r ed ( cnt , co l , e l e , dsp , uu , v , f ) p r i v a t e ( s , c , t ) s c h edu l e ( gu ided , 2)for ( int i i = 0 ; i i < n s i z e ; i i ++) {

s = f [ i i ] ;const T* pco l = c o l + dsp [ i i ] ;const S* p e l e = e l e + dsp [ i i ] ;

// First part of the matrix

const T c s i z e = cnt [ i i ] ;for ( int j j = 0 ; j j < c s i z e ; j j ++) {

c = pco l [ j j ] ;t = p e l e [ j j ] ;

s −= t * uu [ c ] ;}v [ i i ] = s ;

}const int d s i z e = d i a . s i z e ( ) / 4 ;

#pragma omp p a r a l l e l f o r d e f a u l t ( none ) sha r ed (u , v , d i a ) s c h edu l e ( gu ided , 2)for ( int i i = 0 ; i i < d s i z e ; i i ++) {

u [ 2 * i i ] += omega * ( v [ 2 * i i ] * d i a [4* i i +3] − v [ 2 * i i + 1 ] * d i a [4* i i +1 ] ) ;u [ 2 * i i + 1 ] += omega * (−v [ 2 * i i ] * d i a [4* i i +2] + v [ 2 * i i + 1 ] * d i a [4* i i ] ) ;

}}

Stefan Rosenberger HPC for the PT-Code January 29, 2016 9 / 25

Page 10: HPC for the PT-Code · The PT - Code Basic Structure Solve a linear System Au = f Wherein A is sparse, positive de nite and a 2 2 block system. Apply CG - method to solve the system

OpenMP Parallelization

OpenMP - Notes

One can apply the parallelization on all occurring loops (and we did

that ).

OpenMP calculate on the CPU −→ only in the parallelization regions,we have to take care of data access (shared, private).

Compare speed up:

Without parallelization: ≈ 1, 33614 seconds total time.

OpenMP with 8 threads: ≈ 0, 967979 seconds total time.

Without parallelization: ≈ 0.0576582 seconds per loop.

OpenMP with 8 threads: ≈ 0.0445387 seconds total time.

Stefan Rosenberger HPC for the PT-Code January 29, 2016 10 / 25

Page 11: HPC for the PT-Code · The PT - Code Basic Structure Solve a linear System Au = f Wherein A is sparse, positive de nite and a 2 2 block system. Apply CG - method to solve the system

OpenACC Parallelization

1 The PT - Code

2 OpenMP Parallelization

3 OpenACC Parallelization

Stefan Rosenberger HPC for the PT-Code January 29, 2016 11 / 25

Page 12: HPC for the PT-Code · The PT - Code Basic Structure Solve a linear System Au = f Wherein A is sparse, positive de nite and a 2 2 block system. Apply CG - method to solve the system

OpenACC Parallelization

OpenACC - General Concept

OpenACC is paralellization on the GPU.

Destiguish between code on the host, and code on the device.

Advantage host: One can use classes and corresponding functions(e.g. constructor).

Advantage device: much faster calculation possible.Stefan Rosenberger HPC for the PT-Code January 29, 2016 12 / 25

Page 13: HPC for the PT-Code · The PT - Code Basic Structure Solve a linear System Au = f Wherein A is sparse, positive de nite and a 2 2 block system. Apply CG - method to solve the system

OpenACC Parallelization

OpenACC - Application

We can apply again simple pragma directives to create parallel loops.

e.g.:

#pragma acc k e r n e l s l oop p r i v a t e ( s , c , t ) i ndependen tpcopyout ( out [ 0 : n out ] ) , pcopy in ( i n [ 0 : n i n ] )

for ( int i = 0 ; i < s i z e ; i++) {...

// shared values: usually read values

// private values: usually run index , pointer

.

.

.}

Now we have to care about data - copy. We have to provide the dataon the device to execute the code correct.

Stefan Rosenberger HPC for the PT-Code January 29, 2016 13 / 25

Page 14: HPC for the PT-Code · The PT - Code Basic Structure Solve a linear System Au = f Wherein A is sparse, positive de nite and a 2 2 block system. Apply CG - method to solve the system

OpenACC Parallelization

OpenACC - Example Jacobi

void operator ( ) ( const t o o l b o x v e c t o r<S>& f , t o o l b o x v e c t o r<S>& u ) {// define constants , pointers

// ...

#pragma acc k e r n e l s l oop p r i v a t e ( s , c , t ) i ndependen t pcopyout ( v [ 0 : n v ] ) , pcopy in ( uu [ 0 : n u ] ,dsp [ 0 : n dsp ] , cnt [ 0 : n cn t ] , f [ 0 : n f ] , c o l [ 0 : n c o l ] , e l e [ 0 : n e l e ] )for ( int i i = 0 ; i i < n s i z e ; i i ++) {

s = f [ i i ] ;const T* pco l = c o l + dsp [ i i ] ;const S* p e l e = e l e + dsp [ i i ] ;

// First part of the matrix

const T c s i z e = cnt [ i i ] ;for ( int j j = 0 ; j j < c s i z e ; j j ++) {

c = pco l [ j j ] ;t = p e l e [ j j ] ;

s −= t * uu [ c ] ;}v [ i i ] = s ;

}

const int d s i z e = d i a . s i z e ( ) / 4 ;

#pragma acc k e r n e l s l oop independen t pcopy ( u [ 0 : n u ] ) , pcopy in ( v [ 0 : n v ] , d i a [ 0 : n d i a ] )for ( int i i = 0 ; i i < d s i z e ; i i ++) {

u [2* i i + 0 ] += omega * ( v [2* i i ] * d i a [4* i i + 3 ] − v [2* i i + 1 ] * d i a [4* i i + 1 ] ) ;u [2* i i + 1 ] += omega * (−v [2* i i ] * d i a [4* i i + 2 ] + v [2* i i + 1 ] * d i a [4* i i + 0 ] ) ;

}}

Stefan Rosenberger HPC for the PT-Code January 29, 2016 14 / 25

Page 15: HPC for the PT-Code · The PT - Code Basic Structure Solve a linear System Au = f Wherein A is sparse, positive de nite and a 2 2 block system. Apply CG - method to solve the system

OpenACC Parallelization

OpenACC - Notes

One can apply the parallelization on all occurring loops (and we did

that ).

But we have to take care of the data on the device and the host.

We need data - regions:

#pragma acc data copy i n ( u [ 0 : s i z e ] , . . . ) copyout ( u [ 0 : s i z e ] , . . . ){

.// do stuff with data

.}

The pragma acc data create the data only for the following scope!

For a complex code it is better to create only once a copy of thedata, and delete it only in the end of the code.

We need enter and exit commands.

Stefan Rosenberger HPC for the PT-Code January 29, 2016 15 / 25

Page 16: HPC for the PT-Code · The PT - Code Basic Structure Solve a linear System Au = f Wherein A is sparse, positive de nite and a 2 2 block system. Apply CG - method to solve the system

OpenACC Parallelization

OpenACC - Data handling

OpenACC provides commands to create a hard copy of data on thedevice:

#pragma acc e n t e r data copy i n ( data [ 0 : s i z e ] )

This will create a data array with length size on the device and copythe data to the device.

Create data on the device, without knowing of the values:

#pragma acc e n t e r data c r e a t e ( tmp [ 0 : s i z e ] )

This will create a data array with length size on the device, but it willnot copy any data to the device.Useful for e.g. temporary values.

To delete data on the device one can use the pragma:

#pragma acc e x i t data d e l e t e ( tmp [ 0 : s i z e ] )#pragma acc e x i t data copyout ( data [ 0 : s i z e ] )

This will remove the data from the device.

Stefan Rosenberger HPC for the PT-Code January 29, 2016 16 / 25

Page 17: HPC for the PT-Code · The PT - Code Basic Structure Solve a linear System Au = f Wherein A is sparse, positive de nite and a 2 2 block system. Apply CG - method to solve the system

OpenACC Parallelization

OpenACC - Data handling during Calculation

If data is on the device, one can use:

#pragma acc e n t e r data pcopy in ( data [ 0 : s i z e ] )

This will create a data array with length size on the device and copythe data to the device if the data is not on the device or the pointeris not created. Otherwise the command will do nothing!

And similar, one can use

#pragma acc e n t e r data p c r e a t e ( tmp [ 0 : s i z e ] )

to create data if the array did not exist on the device.

Stefan Rosenberger HPC for the PT-Code January 29, 2016 17 / 25

Page 18: HPC for the PT-Code · The PT - Code Basic Structure Solve a linear System Au = f Wherein A is sparse, positive de nite and a 2 2 block system. Apply CG - method to solve the system

OpenACC Parallelization

OpenACC - toolbox vector

Most data is stored in toolbox vectors.

OpenACC need class informations on the device.

Create copyin-Functions for classes:

void todev (){#pragma acc e n t e r data pcopy in ( t h i s [ 0 : 1 ] , da t a [ 0 : s i z e ] )

}void todev ( ) const{

#pragma acc e n t e r data pcopy in ( t h i s [ 0 : 1 ] , da t a [ 0 : s i z e ] )}

the pointer this copies all class - member informations (not the data)to the device.

One should copy the class pointer for all required classes to the device.

Stefan Rosenberger HPC for the PT-Code January 29, 2016 18 / 25

Page 19: HPC for the PT-Code · The PT - Code Basic Structure Solve a linear System Au = f Wherein A is sparse, positive de nite and a 2 2 block system. Apply CG - method to solve the system

OpenACC Parallelization

OpenACC - Compiling Checks

We can use pgc++ -Minfo=acc -acc .... to get compiling informationof OpenACC.

Information for jacobi data copyin:

de f au l t amg : : j a c ob i 2 x 2<int , double> : : todev ( ) :68 , i n c l u d e "toolbox.h"

116 , i n c l u d e "amg_2x2.h"

58 , Gene r a t i ng e n t e r data copy i n ( this [ : 1 ] )

Stefan Rosenberger HPC for the PT-Code January 29, 2016 19 / 25

Page 20: HPC for the PT-Code · The PT - Code Basic Structure Solve a linear System Au = f Wherein A is sparse, positive de nite and a 2 2 block system. Apply CG - method to solve the system

OpenACC Parallelization

OpenACC - Compiling Checks

Information for jacobi parallelization of loops:

de f au l t amg : : j a c ob i 2 x 2<int , double> : : operator ( ) ( const t o o l b o x v e c t o r<double> &, t o o l b o x v e c t o r<double> &):68 , i n c l u d e "toolbox.h"

116 , i n c l u d e "amg_2x2.h"

89 , Gene r a t i ng copy i n ( this [ : ] )115 , Loop i s p a r a l l e l i z a b l e

A c c e l e r a t o r k e r n e l g ene r a t edGene r a t i ng Tes l a code

115 , #pragma acc l oop gang /* b l o c k I d x . x */123 , #pragma acc l oop v e c t o r (128) /* t h r e a d I d x . x */127 , Sum r e d u c t i o n gene r a t ed for s

123 , Loop i s p a r a l l e l i z a b l e152 , Gene r a t i ng copy i n ( this [ : ] )167 , Loop i s p a r a l l e l i z a b l e

A c c e l e r a t o r k e r n e l g ene r a t edGene r a t i ng Tes l a code

167 , #pragma acc l oop gang , v e c t o r (128) /* b l o c k I d x . x t h r e a d I d x . x */89 , Gene r a t i ng copyout ( v [ : v−> s i z e ] )Gene r a t i ng copy i n ( uu [ : u−> s i z e ] , dsp [ : dsp−> s i z e ] , cnt [ : cnt−> s i z e ] , f [ : f−> s i z e ] , c o l [ : c o l−> s i z e ] , e l e [ : e l e−> s i z e ] )

NOTE: the output 89, Generating copyout(v[: v-> size]) did notimply, that the system actually copies data from or to the device!

Stefan Rosenberger HPC for the PT-Code January 29, 2016 20 / 25

Page 21: HPC for the PT-Code · The PT - Code Basic Structure Solve a linear System Au = f Wherein A is sparse, positive de nite and a 2 2 block system. Apply CG - method to solve the system

OpenACC Parallelization

OpenACC - Compiling Checks

Stefan Rosenberger HPC for the PT-Code January 29, 2016 21 / 25

Page 22: HPC for the PT-Code · The PT - Code Basic Structure Solve a linear System Au = f Wherein A is sparse, positive de nite and a 2 2 block system. Apply CG - method to solve the system

OpenACC Parallelization

OpenACC - Run Checks

One can use the command

export PGI\ ACC\ NOTIFY=3

to get run information for real enter data.

Information for jacobi enter data:

up load CUDA data f i l e =/home/ ro s enb s /PT Code/SR PT/ s r c /example amg openACC/example−amg−openacc . cpp f u n c t i o n= ZN11de fau l t amg10 jacob i 2x2 I i dEc lERK14

t oo l b o x v e c t o r I dERS3 l i n e =152 d e v i c e=0 t h r e a d i d=1 v a r i a b l e=v by t e s=352

We get: line, name of the variable and size!

We get this information for all operations on the device!

→ A lot of information!

Now, one can go step by step through the code, and reduce the copyoperation to optimize the calculation.

Stefan Rosenberger HPC for the PT-Code January 29, 2016 22 / 25

Page 23: HPC for the PT-Code · The PT - Code Basic Structure Solve a linear System Au = f Wherein A is sparse, positive de nite and a 2 2 block system. Apply CG - method to solve the system

OpenACC Parallelization

OpenACC - Run Checks

Stefan Rosenberger HPC for the PT-Code January 29, 2016 23 / 25

Page 24: HPC for the PT-Code · The PT - Code Basic Structure Solve a linear System Au = f Wherein A is sparse, positive de nite and a 2 2 block system. Apply CG - method to solve the system

OpenACC Parallelization

OpenMP - Notes

I applied the copyin data regions (almost everywhere )!

Compare speed up total time:

Without parallelization: ≈ 1, 336 seconds total time.

OpenMP with 8 threads: ≈ 0, 967 seconds total time (×1, 38 faster).

OpenACC on the device: ≈ 0, 559 seconds total time (×2, 40 faster).

Compare speed up per loop:

Without parallelization: ≈ 0.0576 seconds per loop.

OpenMP with 8 threads: ≈ 0.0445 seconds total time (×1, 29 faster).

OpenACC on the device: ≈ 0, 0241 seconds total time (×2, 39 faster).

Stefan Rosenberger HPC for the PT-Code January 29, 2016 24 / 25

Page 25: HPC for the PT-Code · The PT - Code Basic Structure Solve a linear System Au = f Wherein A is sparse, positive de nite and a 2 2 block system. Apply CG - method to solve the system

OpenACC Parallelization

Thank you for your attention!

Stefan Rosenberger HPC for the PT-Code January 29, 2016 25 / 25