OpenCL - The Open Standard for Heterogeneous Parallel Programming
Transcript of OpenCL - The Open Standard for Heterogeneous Parallel Programming
Folie 1, 2009-07-14, TechTalk OpenCL, SC-VK, Basermann
OpenCL – The Open Standard for Heterogeneous Parallel Programming
Dr.-Ing. Achim Basermann, Deutsches Zentrum für Luft- und Raumfahrt e.V. (DLR)
Abteilung Verteilte Systeme und Komponentensoftware, Köln-Porz
Folie 2
Survey
- Motivation: trends in parallel processing, role of OpenCL
- Khronos' OpenCL: platform model, execution model, memory model, programming model
- Conclusions
Material used: OpenCL survey and specification from http://www.khronos.org/opencl/
Folie 3
Motivation: Trends in Parallel Processing, HW
Trend #1: Multicore processor chips
- Maintain (or even reduce) frequency while replicating cores
Trend #2: Accelerators (e.g. GPGPUs)
- Previously, processors would "catch up" with accelerator functions in the next generation, so accelerator design expense was not amortized well
- New accelerator designs are more likely to maintain their performance advantage, and will maintain an enormous power advantage for target workloads
Trend #2b: Heterogeneous multicore in general (e.g. IBM Cell)
- Mixes of powerful cores, smaller cores, and accelerators potentially offer the most efficient nodes
- The challenge is harnessing them efficiently
Folie 4
Motivation: Programming Issues
Many cores per node, plus accelerators and heterogeneity
Future performance gains will come via massive parallelism, not clock speed: an unwelcome situation for HPC apps!
New programming models are needed to exploit this
At the system/cluster level:
- Message passing to connect node-level languages, or
- Global addressing to make communication implicit?
Folie 5
Motivation: Classic Programming Models
Folie 6
Motivation: OpenCL
New open standard that specifically addresses parallel compute accelerators
Extension to C
Provides data parallel and task parallel models
Facilitates a natural transition from the growing number of CUDA (Compute Unified Device Architecture, NVIDIA) programs
Allows porting of Cell applications to a standard model
Plays well with MPI
Can interoperate with Fortran and OpenMP
Folie 7
Motivation: Roles of OpenCL
Folie 8
Motivation: Before OpenCL
Folie 9
Motivation: The promise of OpenCL
Folie 10
OpenCL
OpenCL sits at the emerging intersection of several trends:
- CPUs: multiple cores driving performance increases
- GPUs: increasingly general-purpose data-parallel computing, with improving numerical precision
- Graphics APIs and shading languages
- Multi-processor programming, e.g. OpenMP
OpenCL (Open Computing Language) is an open, royalty-free standard for portable, parallel programming of heterogeneous parallel computing systems with CPUs, GPUs, and other processors.
Folie 11
OpenCL Working Group
Diverse industry participation:
- Processor vendors, system OEMs, middleware vendors, application developers
- Many industry-leading experts involved in OpenCL's design, a healthy diversity of industry perspectives
- Apple initially proposed OpenCL and is very active in the working group, serving as specification editor
(The slide shows logos of some of the other companies in the OpenCL working group.)
Folie 12
OpenCL Timeline
Six months from proposal to released specification, due to a strong initial proposal and a shared commercial incentive to work quickly
Apple's Mac OS X Snow Leopard will include OpenCL, improving speed and responsiveness for a wide spectrum of applications
Multiple OpenCL implementations expected in the next 12 months, on diverse platforms
Timeline:
- Jun 08: Apple works with AMD, Intel, NVIDIA and others on a draft proposal; Apple proposes the OpenCL working group and contributes the draft specification to Khronos
- Jun-Oct 08: The OpenCL working group develops the draft into a cross-vendor specification
- Oct 08: The working group sends the completed draft to the Khronos Board for ratification
- Dec 08: Khronos publicly releases OpenCL as a royalty-free specification
- May 09: Khronos to release conformance tests to ensure high-quality implementations
Folie 13
OpenCL: Part of the Khronos API Ecosystem
OpenCL (heterogeneous parallel computing) joins Khronos APIs for, among others: embedded 3D, cross-platform desktop 3D, a 3D asset interchange format, enhanced audio, vector 2D, surface and synchronization abstraction, streaming media and image processing, mobile OS abstraction, and an integrated mixed-media stack.
- The silicon and software communities cooperate on these APIs
- The desktop 3D ecosystem serves parallel computing and visualization in scientific and consumer applications
- Umbrella specifications define coherent acceleration stacks for mobile application portability
- Streamlined APIs target mobile and embedded graphics, media and compute acceleration
OpenCL is at the center of an emerging visual computing ecosystem that includes 3D graphics, video and image processing on desktop, embedded and mobile systems.
Folie 14
OpenCL: Platform Model
One Host + one or more Compute Devices
Each Compute Device is composed of one or more Compute Units
Each Compute Unit is further divided into one or more Processing Elements
Folie 15
OpenCL: Execution Model
OpenCL program:
- Kernels: the basic unit of executable code, similar to C functions or CUDA kernels; data-parallel or task-parallel
- Host program: a collection of compute kernels and internal functions, analogous to a dynamic library
Kernel execution:
- The host program invokes a kernel over an index space called an NDRange
- An NDRange ("N-dimensional range") can be a 1D, 2D, or 3D space
- A single kernel instance at a point in the index space is called a work-item; work-items have unique global IDs from the index space (CUDA: thread IDs)
- Work-items are further grouped into work-groups; work-groups have a unique work-group ID, and work-items have a unique local ID within a work-group (CUDA: block IDs)
Folie 16
OpenCL: Execution Model, example 2D NDRange
- Total number of work-items = Gx * Gy
- Size of each work-group = Sx * Sy
- The global ID can be computed from the work-group ID and the local ID
Folie 17
OpenCL: Execution Model
Contexts are used to contain and manage the state of the "world". Kernels are executed in contexts defined and manipulated by the host:
- Devices
- Kernels (OpenCL functions)
- Program objects (kernel source and executables)
- Memory objects
A command-queue coordinates execution of kernels:
- Kernel execution commands
- Memory commands: transfer or map memory object data
- Synchronization commands: constrain the order of commands
Execution order of commands: commands are launched in-order and executed in-order or out-of-order. Events are used to implement appropriate synchronization of execution instances.
Folie 18
OpenCL: Memory Model
Shared memory with relaxed consistency (similar to CUDA); multiple distinct address spaces (which can be collapsed).
(The slide shows a Compute Device: Processing Elements with Private Memory, Local Memory per Compute Unit, a Global/Constant Memory Data Cache, and Global Memory in the Compute Device Memory.)
- Global memory: qualifier __global (e.g. __global float4 *p); global memory in CUDA
- Constant memory: qualifier __constant; constant memory in CUDA
- Local memory (shared within a work-group): qualifier __local; shared memory in CUDA
- Private memory (private to a work-item): qualifier __private (e.g. __private char *px); local memory in CUDA
Folie 19
OpenCL: Memory Consistency
A relaxed consistency memory model:
- Across work-items (CUDA: threads): no consistency
- Within a work-item (CUDA: thread): load/store consistency
- Consistency of memory shared between commands is enforced through synchronization
Folie 20
OpenCL: Programming Model
Data-Parallel Model
- Must be implemented by all OpenCL compute devices
- Define an N-dimensional computation domain; each independent element of execution in the N-dimensional domain is called a work-item
- The N-dimensional domain defines the total number of work-items that execute in parallel (= the global work size)
- Work-items can be grouped together into a work-group; work-items in a group can communicate with each other and can synchronize execution to coordinate memory access
- Multiple work-groups execute in parallel; the mapping of the global work size to work-groups can be implicit or explicit
Folie 21
OpenCL: Programming Model
Task-Parallel Model
- Some compute devices can also execute task-parallel compute kernels, executed as a single work-item
- Users express parallelism by:
  - using vector data types implemented by the device,
  - enqueuing multiple tasks (compute kernels written in OpenCL), and/or
  - enqueuing native kernels developed using a programming model orthogonal to OpenCL (e.g. native C/C++ functions)
Folie 22
OpenCL: Programming Model, Synchronization
- Between work-items in a single work-group (work-group barrier); similar to __syncthreads() in CUDA
- There is no mechanism for synchronization between work-groups
- Synchronization points between commands in command-queues; similar to using multiple kernels in CUDA, but more generalized: command-queue barrier, waiting on an event
Folie 23
OpenCL C for Compute Kernels
Derived from ISO C99, with a few restrictions: no recursion, no function pointers, not all functions from the C99 standard headers, etc. Preprocessing directives defined by C99 are supported.
Built-in data types:
- Scalar and vector data types, pointers
- Data-type conversion functions: convert_type<_sat><_roundingmode>
- Image types: image2d_t, image3d_t and sampler_t
Built-in functions (required):
- Work-item functions, math.h, image reads and writes
- Relational, geometric, and synchronization functions
Built-in functions (optional):
- Double precision (the latest CUDA supports this), atomics to global and local memory
- Selection of rounding mode, writes to image3d_t surfaces
Folie 24
OpenCL C Language Highlights
Function qualifiers:
- The __kernel qualifier declares a function as a kernel
- Kernels can call other kernel functions
Address space qualifiers:
- __global, __local, __constant, __private
- Pointer kernel arguments must be declared with an address space qualifier
Work-item functions (query work-item identifiers):
- get_work_dim()
- get_global_id(), get_local_id(), get_group_id()
Image functions:
- Images must be accessed through built-in functions
- Reads/writes are performed through sampler objects from the host or defined in source
Synchronization functions:
- Barriers: all work-items within a work-group must execute the barrier function before any work-item can continue
- Memory fences: provide ordering between memory operations
Folie 25
OpenCL C Language Restrictions
- Pointers to functions are not allowed
- Pointers to pointers are allowed within a kernel, but not as kernel arguments
- Bit-fields are not supported
- Variable-length arrays and structures are not supported
- Recursion is not supported
- Writes to a pointer to types less than 32 bits are not supported
- Double types are not supported, but reserved
- 3D image writes are not supported
Some restrictions are addressed through extensions.
Folie 26
Basic OpenCL Program Structure
Folie 27
OpenCL: Kernel Code Example
Simple element by element vector addition
For all i,
C(i) = A(i) + B(i)
__kernel void VectorAdd(__global const float* a,
                        __global const float* b,
                        __global float* c)
{
    int iGID = get_global_id(0);
    c[iGID] = a[iGID] + b[iGID];
}
Folie 28
OpenCL VectorAdd: Contexts and Queues
cl_context cxMainContext;       // OpenCL context
cl_command_queue cqCommandQue;  // OpenCL command queue
cl_device_id* cdDevices;        // OpenCL device list
size_t szParmDataBytes;         // byte length of parameter storage

// create the OpenCL context on a GPU device
cxMainContext = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);

// get the list of GPU devices associated with the context
clGetContextInfo(cxMainContext, CL_CONTEXT_DEVICES, 0, NULL, &szParmDataBytes);
cdDevices = (cl_device_id*)malloc(szParmDataBytes);
clGetContextInfo(cxMainContext, CL_CONTEXT_DEVICES, szParmDataBytes, cdDevices, NULL);

// create a command-queue
cqCommandQue = clCreateCommandQueue(cxMainContext, cdDevices[0], 0, NULL);
Folie 29
OpenCL VectorAdd: Create Memory Objects, Program and Kernel
Create memory objects:
// allocate the first source buffer memory object  ... source data, so read only ...
// allocate the second source buffer memory object ... source data, so read only ...
// allocate the destination buffer memory object   ... result data, so write only ...
Create program and kernel:
// create the program ...
// build the program ...
// create the kernel ...
// set the kernel argument values ...
Folie 30
OpenCL VectorAdd: Launch Kernel
cl_command_queue cqCommandQue;  // OpenCL command queue
cl_kernel ckKernel;             // OpenCL kernel "VectorAdd"
size_t szGlobalWorkSize[1];     // global # of work-items
size_t szLocalWorkSize[1];      // # of work-items per work-group
int iTestN = 10000;             // length of demo test vectors

// set work-item dimensions
szGlobalWorkSize[0] = iTestN;
szLocalWorkSize[0] = 1;

// execute kernel
ciErrNum = clEnqueueNDRangeKernel(cqCommandQue, ckKernel, 1, NULL,
                                  szGlobalWorkSize, szLocalWorkSize, 0, NULL, NULL);

// cleanup: release kernel, program, and memory objects
...
Folie 31
Summary: OpenCL versus CUDA
- Execution model: work-groups/work-items (OpenCL) vs. blocks/threads (CUDA)
- Memory model: global/constant/local/private (OpenCL) vs. global/constant/shared/local + texture (CUDA)
- Memory consistency: weak consistency in both
- Synchronization: work-group barrier between work-items (OpenCL) vs. __syncthreads() between threads (CUDA)
Folie 32
Conclusions
OpenCL meets the trends in HPC:
- Supports multicore SMPs
- Supports accelerators (in particular GPGPUs)
- Allows programming of heterogeneous processors
- Supports vectorization (SIMD operations)
- Explicitly supports parallel image processing
- Suitable for massive parallelism: OpenCL + MPI
OpenCL is a low-level parallel language and complicated. If you master it, you master (heterogeneous) parallel programming. It might be the basis for new high-level parallel languages.
Folie 33
Questions?
Folie 34
Next TechTalk on CUDA: Wednesday, July 22, 2009(15:00-16:00, Raum-KP-2b-06, Funktional)
CUDA - The Compute Unified Device Architecture from NVIDIA
Jens Rühmkorf
Folie 35
Additional material: OpenCL Demos
NVIDIA: http://www.youtube.com/watch?v=PJ1jydg8mLg
AMD (1): http://www.youtube.com/watch?v=MCaGb40Bz58&feature=related
AMD (2): http://www.youtube.com/watch?v=mcU89Td53Gg