Page 1

Zellescher Weg 16

Trefftz-Bau (HRSK-Anbau) - HRSK 151

Tel. +49 351 - 463 - 39871

Guido Juckeland ([email protected])

Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH)

Introduction to Graphics Processor Programming

PGI Accelerate, NVIDIA CUDA, and OpenCL

Page 2 (Slide 2)

Outline

1. Motivation, contents, and goals of the course

2. Fundamentals

3. PGI Accelerate - the easy way in

4. Introduction to NVIDIA CUDA

5. CUDA in more depth

6. Introduction to OpenCL

7. Where to go from here?

Page 3 (Slide 3)

Motivation

• Everyone has a graphics processor in their computer

• 2 TFLOP/s of compute performance (single precision)

• Easy-to-use programming interfaces

• Similar data types in graphics and in scientific computing

• Upcoming high-performance computer procurement by the ZIH

• Opportunity to keep the system size (and also the power consumption) smaller

Page 4 (Slide 4)

What is covered in this course?

An introduction (!) to the topic

Getting to know the environments from PGI, AMD, and NVIDIA

Trying out your first own programs

Page 5 (Slide 5)

What is not covered in this course?

Low-level programming of graphics processors

Hybrid-parallel applications

Page 6 (Slide 6)

2. Fundamentals

Page 7 (Slide 7)

Graphics card: system layout

[Diagram: CPU connected to main memory; GPU connected to graphics memory]

Page 8 (Slide 8)

Important terms

• Application: program that is executed on the host processor

• Transfer bandwidth: amount of data that can be transferred between main memory and graphics memory per unit of time

• Thread: a piece of "work" that is processed by one program instance

• SIMD (Single Instruction Multiple Data): computing more than one data element with a single processor instruction

• FLOP/s: number of floating-point operations executed per second

• Host processor: general-purpose processor that can execute any algorithm that can be described by a Turing machine

• Graphics processor: special-purpose compute unit that computes 2- or 3-dimensional images for display output

Page 9 (Slide 9)

Why not earlier?

• You now have to program multi-threaded anyway to use all CPU cores - so why not use 100 GPU cores right away?

• Software infrastructure is available (good development environments with C-like languages)

• Freely programmable GPU compute units

• Support for multiple data types on graphics processors

• DirectX 9 graphics cards are supported by STREAM/CUDA

Page 10 (Slide 10)

Comparing GPU and CPU

CPU

• 5% of the die area for ALUs

• Main memory with low access latency (1/10 of the GPU's)

• Large cache (10x the GPU's)

• Linear memory

GPU

• 40% of the die area for ALUs

• Memory with high bandwidth (10x the CPU's)

• Small cache

• Sophisticated memory techniques

Page 11 (Slide 11)

3. PGI Accelerate - the easy way in

http://www.pgroup.com/resources/accel.htm

Page 12 (Slide 12)

Components

• PGI CUDA Fortran: a CUDA Fortran interface

• PGI Accelerate: the "OpenMP" of CUDA programming

Works only with NVIDIA GPUs!

Page 13 (Slide 13)

How does it work?

/* alloc */
a = (float*)malloc(n*sizeof(float));
r = (float*)malloc(n*sizeof(float));

/* initialize */
for( i = 0; i < n; ++i )
    a[i] = (float)(i+1);

/* compute */
for( i = 0; i < n; ++i )
    r[i] = a[i]*2.0f;

Page 14 (Slide 14)

How does it work (2)?

/* alloc */
a = (float*)malloc(n*sizeof(float));
r = (float*)malloc(n*sizeof(float));

/* initialize */
for( i = 0; i < n; ++i )
    a[i] = (float)(i+1);

/* compute */
#pragma acc region
{
    for( i = 0; i < n; ++i )
        r[i] = a[i]*2.0f;
}

Page 15 (Slide 15)

First exercise

• Log in (SSH) with username pXX (password: CUDApXX!) on joker.zih.tu-dresden.de

• If you need a graphical editor, log in with X forwarding and use "gedit" or "emacs"

• Load the PGI environment:
  module load software
  module load pgi

• Copy the examples to your home directory and unpack:
  cp /tmp/pgi_accelerator_examples.tar .
  tar xf pgi_accelerator_examples.tar

• Build and run the first example:
  make c1.exe   # (f1.exe for Fortran)
  ./c1.exe

• Are you really using the GPU?
  export ACC_NOTIFY=1

• (Careful: everyone here is competing for ONE graphics card)

Page 16 (Slide 16)

Cool - that was easy, but how fast was it?

Example 2

• Computes:
  for( i = 0; i < n; ++i ){
      s = sinf(a[i]);
      c = cosf(a[i]);
      r[i] = s*s + c*c;
  }

• Correctness check:
  for( i = 0; i < n; ++i )
      assert( fabsf(r[i] - e[i]) < 0.000001f );

• Watch out! GPUs are not (yet) IEEE 754 compliant!

• Hint: When you have a choice, use trigonometric functions on the GPU instead of square roots.

Page 17 (Slide 17)

Let's run it

• Compile and run:
  make c2.exe
  ./c2.exe
  100000 iterations completed
  2064439 microseconds on GPU
  1692 microseconds on host

• What's the deal? Compile again and run again:
  make c2.time
  ./c2.time
  /home/h1/juckel/joker/pgi1/c2.c
    main
      32: region entered 1 time
          time(us): total=2037112 init=2035604 region=1508
                    kernels=83 data=806
          w/o init: total=1508 max=1508 min=1508 avg=1508
      34: kernel launched 1 times
          grid: [391] block: [256]
          time(us): total=83 max=83 min=83 avg=83

Page 18 (Slide 18)

Oh - one more thing (actually two)

• How to fix the initialization issue: uncomment
  acc_init( acc_device_nvidia );

• What if I don't know whether I (will) have a GPU?

• Build universal binaries:
  make c2.uni

• Run on host only:
  export ACC_DEVICE=host

• Now go and program a matrix-matrix multiply.
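One possible solution sketch for the matrix-matrix multiply exercise (not from the slides; the directive clauses follow the pattern of the earlier examples and may need adjustment for a given PGI release). A compiler without accelerator support simply ignores the pragmas, so the function also runs serially on the host:

```c
/* Naive n x n matrix multiply, offloaded with PGI Accelerate-style
 * directives. Matrices are stored row-major in flat arrays. */
void matmul(int n, const float *a, const float *b, float *c)
{
    int i, j, k;
#pragma acc region copyin(a[0:n*n-1]) copyin(b[0:n*n-1]) copyout(c[0:n*n-1])
    {
        for (i = 0; i < n; ++i) {
            for (j = 0; j < n; ++j) {
                float sum = 0.0f;
                for (k = 0; k < n; ++k)
                    sum += a[i*n + k] * b[k*n + j]; /* row i of a, column j of b */
                c[i*n + j] = sum;
            }
        }
    }
}
```

The triple loop has no loop-carried dependences across i and j, which is exactly the kind of region the compiler can map onto the GPU.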

Page 19 (Slide 19)

Tuning accelerated code (1)

• In general, use the "time" option of the PGI compiler!

• Initialization

  • Happens once per accelerated application

  • Can only be avoided by running pgcudainit & (CAUTION!)

• Host-accelerator traffic

  • Eats up a lot of time for typical kernels

  • Mark temporary arrays as private:
    #pragma acc region private(a[0:n-1])

  • Mark data that is not needed back on the CPU as local:
    #pragma acc region local(b[0:n-1])

  • Use contiguous memory and mark it as such:
    #pragma acc region copy(a[0:n-1][0:m-1])

Page 20 (Slide 20)

Tuning accelerated code (2)

• Kernel performance

  • Stride-1 accesses!! (superword = 16 words)

  • Scheduling
    • Specify the number of concurrently running threads
    • Usually done automatically, and done pretty well
    • If there is not enough parallelism in the loop, do it manually:
      #pragma acc region vector(x)

  • Use performance tools (e.g. cudaprof)

• Data regions

  • Keep data on the GPU across multiple loops:
    #pragma acc data region copy(a(1:n,1:m)) local(b(2:n-1,2:m-1)) copyin(w(2:n-1))
    {
        for ...
        #pragma acc region
        ...
    }
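An illustrative C sketch of the data-region idea above (not from the slides; clause spellings follow the deck's C examples and are assumptions). Two loop nests share the intermediate array b inside one data region, so b never makes a round trip to the host between them; with a non-accelerator compiler the pragmas are ignored and both loops run on the host:

```c
/* Two kernels that share array b inside one data region:
 * b is marked local, i.e. it lives only on the device. */
void scale_then_shift(int n, float *a, float *b)
{
    int i;
#pragma acc data region copy(a[0:n-1]) local(b[0:n-1])
    {
#pragma acc region
        {
            for (i = 0; i < n; ++i)
                b[i] = 2.0f * a[i];   /* first kernel: b stays on the GPU */
        }
#pragma acc region
        {
            for (i = 0; i < n; ++i)
                a[i] = b[i] + 1.0f;   /* second kernel reuses b directly */
        }
    }
}
```

Without the enclosing data region, each accelerated region would copy its arrays to the device and back on its own, doubling the host-accelerator traffic.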

Page 21 (Slide 21)

4. Introduction to NVIDIA CUDA

http://www.nvidia.com/cuda

Page 22

© 2008 NVIDIA Corporation.

What is CUDA?

CUDA is a scalable parallel programming model and a software environment for parallel computing

Minimal extensions to a familiar C/C++ environment

Heterogeneous serial-parallel programming model

NVIDIA's TESLA architecture accelerates CUDA
Expose the computational horsepower of NVIDIA GPUs
Enable GPU computing

CUDA also maps well to multicore CPUs

Page 23

Outline

CUDA programming model

Basics of CUDA programming
  Software stack
  Data management
  Executing code on the GPU

CUDA libraries
  BLAS
  FFT

Page 24

Some Design Goals

Scale to 100's of cores, 1000's of parallel threads

Let programmers focus on parallel algorithms
Not on the mechanics of a parallel programming language

Enable heterogeneous systems (i.e. CPU + GPU)
CPU and GPU are separate devices with separate DRAMs

Page 25

CUDA Kernels and Threads

Parallel portions of an application are executed on the device as kernels
One kernel is executed at a time
Many threads execute each kernel

Differences between CUDA and CPU threads:
CUDA threads are extremely lightweight
Very little creation overhead
Instant switching

CUDA uses 1000s of threads to achieve efficiency
Multi-core CPUs can use only a few

Definitions:
Device = GPU
Host = CPU
Kernel = function that runs on the device

Page 26

Arrays of Parallel Threads

A CUDA kernel is executed by an array of threads
All threads run the same code
Each thread has an ID that it uses to compute memory addresses and make control decisions

[Diagram: threads 0..7, each with its own threadID, all executing]
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
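Written out as a complete kernel, the fragment above might look as follows (a sketch, not from the slides; func is assumed here to be squaring, and the thread ID is computed from built-in index variables covered later in the deck):

```cuda
// Sketch: the thread-array fragment as an actual CUDA kernel.
__global__ void apply(const float *input, float *output)
{
    int threadID = blockIdx.x * blockDim.x + threadIdx.x; // unique per thread
    float x = input[threadID];
    float y = x * x;                                      // stand-in for func(x)
    output[threadID] = y;
}
```

Launched with, e.g., apply<<<1, 8>>>(d_in, d_out); each of the 8 threads handles exactly one array element.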

Page 27

Thread Cooperation

The Missing Piece: threads may need to cooperate

Thread cooperation is valuable:
Share results to avoid redundant computation
Share memory accesses
Drastic bandwidth reduction

Thread cooperation is a powerful feature of CUDA

Cooperation between a monolithic array of threads is not scalable

Cooperation within smaller batches of threads is scalable

Page 28

Thread Batching

Kernel launches a grid of thread blocks
Threads within a block cooperate via shared memory
Threads within a block can synchronize
Threads in different blocks cannot cooperate

Allows programs to transparently scale to different GPUs

[Diagram: a grid of thread blocks 0 .. N-1, each with its own shared memory]
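As an illustration of block-level cooperation (a sketch, not from the slides): threads in one block can stage data in __shared__ memory and synchronize with __syncthreads(), for example to compute one partial sum per block:

```cuda
// Sketch: per-block sum via shared memory. Assumes blockDim.x == 256
// (a power of two) and an input length that is a multiple of 256.
__global__ void blockSum(const float *in, float *partialSums)
{
    __shared__ float buf[256];          // visible only within this block
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                    // all loads complete before reading

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];   // tree reduction
        __syncthreads();
    }
    if (tid == 0)
        partialSums[blockIdx.x] = buf[0];    // one result per block
}
```

Threads in different blocks cannot synchronize this way; combining the per-block results needs a second kernel launch or a host-side sum.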

Page 29

Transparent Scalability

[Diagram: the same 8-block kernel grid (blocks 0..7) scheduled on a device with 2 multiprocessors (four rows of two blocks) and on a device with 4 multiprocessors (two rows of four blocks)]

Hardware is free to schedule thread blocks on any processor

A kernel scales across parallel multiprocessors

Page 30

8-Series Architecture (G80)

128 thread processors execute kernel threads
16 multiprocessors, each contains
8 thread processors
Shared memory enables thread cooperation

[Diagram: 16 multiprocessors, each with 8 thread processors and a shared memory]

Page 31

10-Series Architecture

240 thread processors execute kernel threads
30 multiprocessors, each contains
8 thread processors
One double-precision unit
Shared memory enables thread cooperation

[Diagram: one multiprocessor with 8 thread processors, a double-precision unit, and shared memory]

Page 32

Kernel Memory Access

Per-thread:
  Registers (on-chip)
  Local memory (off-chip, uncached)

Per-block:
  Shared memory (on-chip, small, fast)

Per-device:
  Global memory (off-chip, large, uncached; persistent across kernel launches; used for kernel I/O by successive kernels 0, 1, ... over time)

Page 33

Physical Memory Layout

"Local" memory resides in device DRAM
Use registers and shared memory to minimize local memory use

Host can read and write global memory but not shared memory

[Diagram: host (CPU, chipset, DRAM) connected to the device (GPU with multiprocessors, registers, and shared memory; device DRAM holding local and global memory)]

Page 34

Execution Model: Software vs. Hardware

Thread -> Thread Processor: threads are executed by thread processors

Thread Block -> Multiprocessor: thread blocks are executed on multiprocessors
Thread blocks do not migrate
Several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources (shared memory and register file)

Grid -> Device: a kernel is launched as a grid of thread blocks
Only one kernel can execute on a device at one time

Page 35

Key Parallel Abstractions in CUDA

Trillions of lightweight threads
-> Simple decomposition model

Hierarchy of concurrent threads
-> Simple execution model

Lightweight synchronization primitives
-> Simple synchronization model

Shared memory model for thread cooperation
-> Simple communication model

Page 36

Outline

CUDA programming model

Basics of CUDA programming
  Software stack
  Data management
  Executing code on the GPU

CUDA libraries
  BLAS
  FFT

Page 37

CUDA Installation

CUDA installation consists of:
Driver
CUDA Toolkit (compiler, libraries)
CUDA SDK (example codes)

Page 38

CUDA Software Development

[Diagram: integrated CPU + GPU C source code passes through the NVIDIA C compiler, producing NVIDIA assembly for computing (PTX) for the GPU (executed via the CUDA driver, with profiler) and CPU host code for a standard C compiler; CUDA optimized libraries (math.h, FFT, BLAS, ...) sit alongside]

Page 39

Compiling CUDA Code

[Diagram: a C/C++ CUDA application is compiled by NVCC into PTX code (virtual ISA) plus CPU code; a PTX-to-target compiler translates the PTX into physical target code for G80 and later GPUs]

Page 40

Build Configurations

nvcc <filename>.cu [-o <executable>]
Builds release mode

nvcc -g <filename>.cu
Builds debug mode
Can debug host code but not device code

nvcc -deviceemu <filename>.cu
Builds device emulation mode
All code runs on CPU, no debug symbols

nvcc -deviceemu -g <filename>.cu
Builds debug device emulation mode
All code runs on CPU, with debug symbols

Page 41

Outline

CUDA programming model

Basics of CUDA programming
  Software stack
  Data management
  Executing code on the GPU

CUDA libraries
  BLAS
  FFT

Page 42

Managing Memory

CPU and GPU have separate memory spaces
Host (CPU) code manages device (GPU) memory:
Allocate / free
Copy data to and from device
Applies to global device memory (DRAM)

[Diagram: host/device memory layout, same as the figure on Page 33]

Page 43

GPU Memory Allocation / Release

cudaMalloc(void **pointer, size_t nbytes)
cudaMemset(void *pointer, int value, size_t count)
cudaFree(void *pointer)

int n = 1024;
int nbytes = 1024*sizeof(int);
int *a_d = 0;
cudaMalloc( (void**)&a_d, nbytes );
cudaMemset( a_d, 0, nbytes );
cudaFree(a_d);

Page 44

Data Copies

cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction);

direction specifies locations (host or device) of src and dst
Blocks CPU thread: returns after the copy is complete
Doesn't start copying until previous CUDA calls complete

enum cudaMemcpyKind:
cudaMemcpyHostToDevice
cudaMemcpyDeviceToHost
cudaMemcpyDeviceToDevice

Page 45

Data Movement Example

int main(void)
{
    float *a_h, *b_h; // host data
    float *a_d, *b_d; // device data
    int N = 14, nBytes, i;

    nBytes = N*sizeof(float);
    a_h = (float *)malloc(nBytes);
    b_h = (float *)malloc(nBytes);
    cudaMalloc((void **) &a_d, nBytes);
    cudaMalloc((void **) &b_d, nBytes);

    for (i=0; i<N; i++)
        a_h[i] = 100.f + i;

    cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice);
    cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);

    for (i=0; i<N; i++)
        assert( a_h[i] == b_h[i] );

    free(a_h); free(b_h);
    cudaFree(a_d); cudaFree(b_d);
    return 0;
}
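The skeleton above moves data but computes nothing; a sketch of how a kernel launch slots in between the copies (not from the slides; the <<< >>> launch syntax and the __global__ qualifier are explained on the following slides):

```cuda
// Sketch: add one to every element on the device, then copy back.
__global__ void addOne(float *a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    a[i] += 1.0f;
}

// In main(), after the host-to-device copy:
//   addOne<<<1, N>>>(a_d);    // one block of N = 14 threads
//   cudaMemcpy(b_h, a_d, nBytes, cudaMemcpyDeviceToHost);
//   for (i = 0; i < N; i++) assert( b_h[i] == a_h[i] + 1.0f );
```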

Pages 46-53: the same code, repeated with diagrams that highlight each step in turn - host allocation (a_h, b_h), device allocation (a_d, b_d), the host-to-device copy, the device-to-device copy, the device-to-host copy, and freeing all memory.

Page 54 (Slide 23)

First exercise

• Log in (SSH) with username pXX (password: CUDApXX!) on joker.zih.tu-dresden.de

• If you need a graphical editor, log in with X forwarding and use "gedit" or "emacs"

• Load the CUDA environment:
  module load software
  module load cuda

• Create a CUDA program for the last example

• Compile it into an executable with "nvcc"

• Run the program with "./a.out"

• (Careful: everyone here is competing for ONE graphics card)

Page 55 (Slide 24)

5. CUDA in more depth

Page 56

Outline

CUDA programming model

Basics of CUDA programming
  Software stack
  Data management
  Executing code on the GPU

CUDA libraries
  FFT
  BLAS

Page 57

Executing Code on the GPU

Kernels are C functions with some restrictions

Cannot access host memory
Must have void return type
No variable number of arguments ("varargs")
Not recursive
No static variables

Function arguments automatically copied from host to device

Page 58

Function Qualifiers

Kernels designated by function qualifier: __global__

Function called from host and executed on device
Must return void

Other CUDA function qualifiers:

__device__
  Function called from device and run on device
  Cannot be called from host code

__host__
  Function called from host and executed on host (default)
  __host__ and __device__ qualifiers can be combined to generate both CPU and GPU code

Page 59

Launching Kernels

Modified C function call syntax:

kernel<<<dim3 dG, dim3 dB>>>(…)

Execution Configuration (“<<< >>>”):

dG - dimension and size of grid in blocks
  Two-dimensional: x and y
  Blocks launched in the grid: dG.x * dG.y

dB - dimension and size of blocks in threads
  Three-dimensional: x, y, and z
  Threads per block: dB.x * dB.y * dB.z

Unspecified dim3 fields initialize to 1

Page 60

Execution Configuration Examples

kernel<<<32,512>>>(...);

dim3 grid, block;
grid.x = 2; grid.y = 4;
block.x = 8; block.y = 16;

kernel<<<grid, block>>>(...);

dim3 grid(2, 4), block(8,16);

kernel<<<grid, block>>>(...);

Equivalent assignment using constructor functions

Page 61

CUDA Built-in Device Variables

All __global__ and __device__ functions have access to these automatically defined variables

dim3 gridDim;
  Dimensions of the grid in blocks (at most 2D)

dim3 blockDim;
  Dimensions of the block in threads

dim3 blockIdx;
  Block index within the grid

dim3 threadIdx;
  Thread index within the block

Page 62

Unique Thread IDs

Built-in variables are used to determine unique thread IDs

Map from the local thread ID (threadIdx) to a global ID that can be used as an array index

Grid (blockDim.x = 5):

  blockIdx.x:    0              1              2
  threadIdx.x:   0  1  2  3  4  0  1  2  3  4  0  1  2  3  4
  global ID:     0  1  2  3  4  5  6  7  8  9  10 11 12 13 14

  global ID = blockIdx.x * blockDim.x + threadIdx.x

Page 63

Minimal Kernels

__global__ void minimal(int *a_d, int value)
{
    *a_d = value;
}

__global__ void assign(int *a_d, int value)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    a_d[idx] = value;
}

Page 64

Increment Array Example

CPU program:

void inc_cpu(int *a, int N)
{
    int idx;
    for (idx = 0; idx < N; idx++)
        a[idx] = a[idx] + 1;
}

int main()
{
    ...
    inc_cpu(a, N);
}

CUDA program:

__global__ void inc_gpu(int *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] = a[idx] + 1;
}

int main()
{
    ...
    dim3 dimBlock(blocksize);
    dim3 dimGrid(ceil(N / (float)blocksize));
    inc_gpu<<<dimGrid, dimBlock>>>(a, N);
}

Page 65

Host Synchronization

All kernel launches are asynchronous
  control returns to CPU immediately
  kernel executes after all previous CUDA calls have completed

cudaMemcpy() is synchronous
  control returns to CPU after copy completes
  copy starts after all previous CUDA calls have completed

cudaThreadSynchronize()
  blocks until all previous CUDA calls complete

Page 66

Host Synchronization Example

// copy data from host to device
cudaMemcpy(a_d, a_h, numBytes, cudaMemcpyHostToDevice);

// execute the kernel
inc_gpu<<<ceil(N/(float)blocksize), blocksize>>>(a_d, N);

// run independent CPU code
run_cpu_stuff();

// copy data from device back to host
cudaMemcpy(a_h, a_d, numBytes, cudaMemcpyDeviceToHost);

Page 67

Variable Qualifiers (GPU code)

__device__
  Stored in global memory (large, high latency, no cache)
  Allocated with cudaMalloc (__device__ qualifier implied)
  Accessible by all threads
  Lifetime: application

__shared__
  Stored in on-chip shared memory (very low latency)
  Specified by execution configuration or at compile time
  Accessible by all threads in the same thread block
  Lifetime: thread block

Unqualified variables:
  Scalars and built-in vector types are stored in registers
  What doesn’t fit in registers spills to “local” memory

Page 68

Using shared memory

Size known at compile time:

__global__ void kernel(…)
{
    …
    __shared__ float sData[256];
    …
}

int main(void)
{
    …
    kernel<<<nBlocks, blockSize>>>(…);
    …
}

Size known at kernel launch:

__global__ void kernel(…)
{
    …
    extern __shared__ float sData[];
    …
}

int main(void)
{
    …
    smBytes = blockSize * sizeof(float);
    kernel<<<nBlocks, blockSize, smBytes>>>(…);
    …
}

Page 69

GPU Thread Synchronization

void __syncthreads();

Synchronizes all threads in a block
  Generates barrier synchronization instruction
  No thread can pass this barrier until all threads in the block reach it
  Used to avoid RAW / WAR / WAW hazards when accessing shared memory

Allowed in conditional code only if the conditional is uniform across the entire thread block

Page 70

GPU Atomic Operations

Associative operations:
  add, sub, increment, decrement, min, max, ...
  and, or, xor
  exchange, compare, swap

Atomic operations on 32-bit words in global memory

Requires compute capability 1.1 or higher (G84/G86/G92)

Atomic operations on 32-bit words in shared memory and 64-bit words in global memory

Requires compute capability 1.2 or higher

Page 71

Built-in Vector Types

Can be used in GPU and CPU code

[u]char[1..4], [u]short[1..4], [u]int[1..4], [u]long[1..4], float[1..4], double[1..2]

Structures accessed with x, y, z, w fields:
  uint4 param;
  int y = param.y;

dim3
  Based on uint3
  Used to specify dimensions
  Default value (1,1,1)

Page 72

CUDA Error Reporting to CPU

All CUDA calls return an error code:
  except for kernel launches
  cudaError_t type

cudaError_t cudaGetLastError(void)
  Returns the code for the last error (“no error” also has a code)
  Can be used to get errors from kernel execution

char* cudaGetErrorString(cudaError_t code)
  Returns a null-terminated character string describing the error

printf(“%s\n”, cudaGetErrorString( cudaGetLastError() ) );

Page 73

CUDA Programming Resources

We only covered basic features
  See the Programming Guide for more of the API
  Additional features covered in the Optimization session

CUDA SDK examples

CUDA Zone - http://www.nvidia.com/cuda
CUDA Forums

Page 74

Outline

CUDA programming model
Basics of CUDA programming
  Software stack
  Data management
  Executing code on the GPU
CUDA libraries
  BLAS
  FFT

Page 75

CUBLAS

Implementation of BLAS (Basic Linear Algebra Subprograms) on top of CUDA driver

Self contained at the API level, no direct interaction with CUDA driver

Basic model for use:
  Create matrix and vector objects in GPU memory space
  Fill objects with data
  Call CUBLAS functions
  Retrieve data

CUBLAS library helper functions:
  Creating and destroying data in GPU space
  Writing data to and retrieving data from objects

Page 76

Supported Features

            | Single Precision |         | Double Precision*        |
            | Real | Complex   |         | Real                     | Complex
  Level 1   |  ✔   |  ✔        |         |  ✔                       |
  Level 2   |  ✔   |           |         | dgemv, dger, dsyr, dtrsv |
  Level 3   |  ✔   | cgemm     |         |  ✔                       | zgemm

*Double-precision functions only supported on GPUs with double-precision hardware

Page 77

Using CUBLAS

Interface to the CUBLAS library is in cublas.h

Function naming convention:
  cublas + BLAS name, e.g. cublasSgemm

Following BLAS convention, CUBLAS uses column-major storage

Error handling:
  CUBLAS core functions do not return an error
    CUBLAS provides a function to retrieve the last error recorded
  CUBLAS helper functions do return an error

Implemented using the C-based CUDA tool chain
  Interfacing to C/C++ applications is trivial

Page 78

CUBLAS Helper Functions

cublasInit()
  Initializes the CUBLAS library

cublasShutdown()
  Releases resources used by the CUBLAS library

cublasGetError()
  Returns the last error from a CUBLAS core function (and resets it)

cublasAlloc()
  Wrapper around cudaMalloc() to allocate space for an array

cublasFree()
  Destroys an object in GPU memory

cublas[Set|Get][Vector|Matrix]()
  Copies array elements between CPU and GPU memory
  Accommodates non-unit strides

Page 79

sgemmExample.c:

#include <stdio.h>
#include <stdlib.h>
#include "cublas.h"

int main(void)
{
    float *a_h, *b_h, *c_h;
    float *a_d, *b_d, *c_d;
    float alpha = 1.0f, beta = 0.0f;
    int N = 2048, n2 = N*N;
    int nBytes, i;

    nBytes = n2*sizeof(float);

    a_h = (float *)malloc(nBytes);
    b_h = (float *)malloc(nBytes);
    c_h = (float *)malloc(nBytes);

    for (i = 0; i < n2; i++) {
        a_h[i] = rand() / (float)RAND_MAX;
        b_h[i] = rand() / (float)RAND_MAX;
    }

    cublasInit();

    cublasAlloc(n2, sizeof(float), (void **)&a_d);
    cublasAlloc(n2, sizeof(float), (void **)&b_d);
    cublasAlloc(n2, sizeof(float), (void **)&c_d);

    cublasSetVector(n2, sizeof(float), a_h, 1, a_d, 1);
    cublasSetVector(n2, sizeof(float), b_h, 1, b_d, 1);

    cublasSgemm('n', 'n', N, N, N, alpha, a_d, N, b_d, N, beta, c_d, N);

    cublasGetVector(n2, sizeof(float), c_d, 1, c_h, 1);

    free(a_h); free(b_h); free(c_h);
    cublasFree(a_d); cublasFree(b_d); cublasFree(c_d);

    cublasShutdown();
    return 0;
}

Page 80

Additional Resources

CUDA SDK example:
  simpleCUBLAS

CUBLAS Library documentation
  in the doc folder of the CUDA Toolkit or download from CUDA Zone

Page 81

Outline

CUDA programming model
Basics of CUDA programming
  Software stack
  Data management
  Executing code on the GPU
CUDA libraries
  BLAS
  FFT

Page 82

CUFFT

CUFFT is the CUDA FFT library
  1D, 2D, and 3D transforms of complex and real single-precision data
  Batched execution for multiple 1D transforms in parallel
  1D transforms up to 8 million elements
  2D and 3D transforms in the range [2, 16384]
  In-place and out-of-place transforms

Page 83

More on CUFFT

For 2D and 3D transforms, CUFFT uses row-major order

CUFFT performs un-normalized transforms:
  IFFT(FFT(A)) = length(A)*A

CUFFT is modeled after FFTW
  Based on plans used to specify the optimal configuration for a particular FFT size
  Once a plan is created it can be reused (to avoid recomputing the optimal configuration)

Page 84

CUFFT Types and Definitions

cufftHandle
  Type used to store and access CUFFT plans

cufftResult
  Enumeration of API function return values

cufftReal
  single-precision, real datatype

cufftComplex
  single-precision, complex datatype

Real and complex transforms:
  CUFFT_C2C, CUFFT_C2R, CUFFT_R2C

Directions:
  CUFFT_FORWARD, CUFFT_INVERSE

Page 85

CUFFT Example:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "cufft.h"

int main(int argc, char *argv[])
{
    cufftComplex *a_h, *a_d;
    cufftHandle plan;
    int N = 1024, batchSize = 10;
    int i, nBytes;
    double maxError;

    nBytes = sizeof(cufftComplex)*N*batchSize;
    a_h = (cufftComplex *)malloc(nBytes);

    for (i = 0; i < N*batchSize; i++) {
        a_h[i].x = sinf(i);
        a_h[i].y = cosf(i);
    }

    cudaMalloc((void **)&a_d, nBytes);
    cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice);

    cufftPlan1d(&plan, N, CUFFT_C2C, batchSize);

    cufftExecC2C(plan, a_d, a_d, CUFFT_FORWARD);
    cufftExecC2C(plan, a_d, a_d, CUFFT_INVERSE);

    cudaMemcpy(a_h, a_d, nBytes, cudaMemcpyDeviceToHost);

    // check error - transforms are un-normalized, so divide by N
    for (maxError = 0.0, i = 0; i < N*batchSize; i++) {
        maxError = fmax(fabs(a_h[i].x/N - sinf(i)), maxError);
        maxError = fmax(fabs(a_h[i].y/N - cosf(i)), maxError);
    }

    printf("Max fft error = %g\n", maxError);

    cufftDestroy(plan);
    free(a_h);
    cudaFree(a_d);
    return 0;
}

Page 86

Additional CUFFT Resources

CUDA SDK examples:
  simpleCUFFT
  convolutionFFT2D
  oceanFFT

CUFFT Library documentation
  in the doc folder of the CUDA Toolkit or download from CUDA Zone

Page 87

Getting Started with CUDA (Greg Ruetsch, Brent Oster)

Page 88

Slide 26

7. Where to Go from Here?

Page 89

Slide 27

Known Issues

Two separate (main) memory spaces for GPU and CPU

No error detection/correction in the GPU memory system(!!)

Even with an existing algorithm, a fast implementation takes roughly 3-6 person-months

What if one GPU is not enough?


Page 90

Slide 28

Questions & Feedback