Kommunikationsmethoden - pc2.uni-paderborn.de · 1 J. Simon -Architecture of Parallel Computer...


J. Simon - Architecture of Parallel Computer Systems SoSe 2019 < 1 >

Communication Methods

• buffered / non-buffered
  – The message is buffered (at the sender).

• synchronous / non-synchronous
  – Control is not returned to the program until the message being sent has arrived.

• blocking / non-blocking
  – Control is not returned to the program until the message being sent has been secured (either at the receiver or in a system buffer).
  – After the call returns, the user memory can be reused freely.


Send Communication Modes (1)

• Standard Mode Send
  – The matching Recv need not be executed before the Send.
  – The size of the buffer is not defined in the MPI standard.
  – If a buffer is used, the Send can complete before the matching Recv is reached.

• Buffered Mode
  – The Send can start and complete before the matching Recv.
  – Memory for the buffer must be allocated explicitly with MPI_Buffer_attach().

• Synchronous Mode
  – Send and Recv can start in any order, but complete together.

• Ready Mode
  – The Send may only start if the matching Recv has already been reached. Otherwise an error is reported.


Send Communication Modes (2)

• All four modes can be combined with a blocking or a non-blocking Send.

• Only the standard mode is available for the blocking and non-blocking Recv.

• Every Send type can match every Recv type.


Example: Blocking Send

Blocking send of a value x from MPI process 0 to MPI process 1

    int myrank;
    int msgtag = 4711;
    int x;
    …
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);  /* get process rank */
    if (myrank == 0) {
        MPI_Send(&x, 1, MPI_INT, 1, msgtag, MPI_COMM_WORLD);
    } else if (myrank == 1) {
        MPI_Status status;  /* MPI_Status, passed by address */
        MPI_Recv(&x, 1, MPI_INT, 0, msgtag, MPI_COMM_WORLD, &status);
    }
    ...


Non-Blocking Routines

• Non-blocking Send:
  – MPI_Isend(buf, count, datatype, dest, tag, comm, request)
  – Returns immediately, although the data being sent must not yet be modified.

• Non-blocking Receive:
  – MPI_Irecv(buf, count, datatype, source, tag, comm, request)
  – Returns immediately, although data may not have arrived yet.

• Detect completion with MPI_Wait() and MPI_Test()
  – MPI_Wait() waits until the operation has completed.
  – MPI_Test() returns immediately with the state of the send / receive routine (completed or not completed).
  – The request parameter must be used for this.


Example: Non-Blocking Send

Non-blocking send of a value x from MPI process 0 to MPI process 1, where MPI process 0 can continue executing immediately.

    int myrank;
    int msgtag = 4711;
    …
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);  /* find process rank */
    if (myrank == 0) {
        MPI_Request req1;
        MPI_Status status;
        int x;
        MPI_Isend(&x, 1, MPI_INT, 1, msgtag, MPI_COMM_WORLD, &req1);
        compute();                 /* must not modify x */
        MPI_Wait(&req1, &status);  /* x may be reused after this call */
    } else if (myrank == 1) {
        MPI_Status status;
        int x;
        MPI_Recv(&x, 1, MPI_INT, 0, msgtag, MPI_COMM_WORLD, &status);
    }
    ...


Collective Communication

• Communication within a group of processes
• Message tags cannot be used

• Broadcast and scatter routines
  – MPI_Bcast() - Broadcast from root to all other processes
  – MPI_Gather() - Gather values for group of processes
  – MPI_Scatter() - Scatters buffer in parts to group of processes
  – MPI_Alltoall() - Sends data from all processes to all processes
  – MPI_Reduce() - Combine values on all processes to single value
  – MPI_Reduce_scatter() - Combine values and scatter results
  – MPI_Scan() - Compute prefix reductions of data on processes
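The difference between MPI_Reduce() and MPI_Scan() can be illustrated without an MPI library. The following plain C sketch treats array index r as the rank of a process and computes what the ranks would receive for a sum reduction. The function names simulate_reduce_sum and simulate_scan_sum are hypothetical and not part of MPI; no communication takes place here.

```c
#include <stddef.h>

/* Simulates MPI_Reduce with a sum: every rank contributes one value
 * and only the root receives the combined result. */
int simulate_reduce_sum(const int *contrib, size_t nprocs)
{
    int total = 0;
    for (size_t r = 0; r < nprocs; r++)
        total += contrib[r];
    return total;
}

/* Simulates MPI_Scan with a sum: rank r receives the inclusive prefix
 * reduction contrib[0] + ... + contrib[r]. */
void simulate_scan_sum(const int *contrib, int *result, size_t nprocs)
{
    int running = 0;
    for (size_t r = 0; r < nprocs; r++) {
        running += contrib[r];
        result[r] = running;
    }
}
```

With the contributions 1, 2, 3, 4 on ranks 0 to 3, the reduction delivers 10 to the root, while the scan delivers 1, 3, 6, 10 to ranks 0 to 3.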


Collective Communication

(Figure: buffer contents of ranks 0 to 3 before and after MPI_BCAST, MPI_SCATTER, MPI_GATHER, MPI_ALLGATHER and MPI_ALLTOALL. For MPI_ALLTOALL, the ranks hold A B C D, E F G H, I J K L and M N O P before, and A E I M, B F J N, C G K O and D H L P after.)
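The MPI_ALLTOALL row of the figure corresponds to a transposition of the data across ranks. The sketch below reproduces this data movement in plain C for four ranks with one element per peer. It is a simulation under that assumption, not an MPI call; the name simulate_alltoall and the constant NPROCS are hypothetical.

```c
#include <stddef.h>

#define NPROCS 4

/* Simulates MPI_Alltoall for NPROCS ranks with one element per peer:
 * element j of rank i's send buffer becomes element i of rank j's
 * receive buffer, i.e. the send matrix is transposed. */
void simulate_alltoall(char send[NPROCS][NPROCS],
                       char recv[NPROCS][NPROCS])
{
    for (size_t i = 0; i < NPROCS; i++)
        for (size_t j = 0; j < NPROCS; j++)
            recv[j][i] = send[i][j];
}
```

Starting from the rows A B C D, E F G H, I J K L and M N O P, rank 0 ends up with A E I M, matching the figure.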



Example: MPI_Gather

Note that MPI_Gather must be called by all processes, including the root!

    int data[10];      /* data to be gathered from processes */
    int myrank, grp_size;
    int *buf = NULL;   /* receive buffer, only significant at the root */
    …
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);  /* find process rank */
    if (myrank == 0) {
        MPI_Comm_size(MPI_COMM_WORLD, &grp_size);
        buf = (int *)malloc(grp_size * 10 * sizeof(int));  /* allocate memory */
    }
    /* recvcount is the number of items received from each process (10), not the total */
    MPI_Gather(data, 10, MPI_INT, buf, 10, MPI_INT, 0, MPI_COMM_WORLD);
    ...


One-Sided Communication

• Available since MPI version 2
• Remote memory access, put + get operations
• Initialization
  – MPI_Alloc_mem(), MPI_Free_mem()
  – MPI_Win_create(), MPI_Win_free()
• Communication routines
  – MPI_Put()
  – MPI_Get()
  – MPI_Accumulate()
• Synchronization
  – MPI_Win_fence()
  – MPI_Win_post(), MPI_Win_start(), MPI_Win_complete(), MPI_Win_wait()
  – MPI_Win_lock(), MPI_Win_unlock()


Message-Passing Model

• Pros:
  – Programmer controls data and work distribution
  – Locality of data usage is patently obvious
• Cons:
  – Quite high overhead in communication of small messages
  – Correctness of a program is hard to verify
• Example:
  – Message Passing Interface (MPI)

(Figure: processes, each consisting of a CPU with its own memory and address space, connected by a network)


Shared-Memory Model

• Pros:
  – Well known instructions: read / write of remote memory within an arithmetic operation / value assignment
  – Lots of programming tools are available
• Cons:
  – Manipulation of shared variables often requires explicit synchronization operations
  – Locality of variable usage is often not that clear
• Examples:
  – OpenMP, POSIX Threads

(Figure: threads running on several cores access a shared variable x in a shared address space)


Hybrid Model: Shared Mem. + Message Passing

• Examples:
  – POSIX Threads within a node and MPI between nodes, or
  – OpenMP for intra- and MPI for inter-node communication

(Figure: processes, each with several threads in one address space, connected by a network)


Hybrid Programming: MPI + X

X = OpenMP, Pthreads, specific libraries

• Pros
  – reduction in the number of MPI processes
    • better performance with high numbers of cores
  – reduced overhead in buffering/copying of data
  – fast synchronization in smaller (sub-)groups
• Cons
  – usage of two different programming paradigms
  – portability can become an issue


OpenCL

• Open Computing Language
  – parallel execution on single or multiple processors
  – for heterogeneous platforms of CPUs, GPUs, and other processors
  – desktop and handheld profiles
  – works with graphics APIs such as OpenGL
  – based on a proposal by Apple Inc.
  – supported processors: Intel, AMD, Nvidia, and ARM
  – work in progress: FPGA
• Design Goals
  – Data- and task-parallel compute model based on the C99 language
  – Implements a relaxed consistency shared memory model with multiple distinct address spaces
• OpenCL 2.0
  – Device partitioning, separate compilation and linking, enhanced image support, built-in kernels, DirectX functionality


OpenCL

• Implements a relaxed consistency shared memory model
  – Multiple distinct address spaces
  – Address spaces can be collapsed depending on the device's memory subsystem
  – Address qualifiers
    • __private
    • __local
    • __constant / __global
    • Example: __global float4 *p;
• Built-in Data Types
  – Scalar and vector data types
  – Pointers
  – Data-type conversion functions
  – Image type (2D/3D)


OpenACC

• Open Accelerator
  – "High level gateway to lower level CUDA GPU programming language"
  – accelerates C and FORTRAN code
  – directives identify parallel regions
  – initially developed by PGI, Cray, Nvidia
  – supported processors: AMD, NVIDIA, Intel?
• Design Goals
  – No modification or adaptation of programs to use accelerators
• OpenACC compiler
  – First compilers from Cray, PGI, and CAPS
  – GCC since version 5 (offloading to Nvidia targets)


Heterogeneous and Many-Core Programming (1)



Heterogeneous and Many-Core Programming (2)


Hybrid Programming

(Figure: decomposition of an application across message passing processes (MPI), shared memory threading (Pthreads, OpenMP), SIMD, and accelerated operations on GPU, FPGA, MIC, … (CUDA, OpenCL, OpenACC, domain libraries))


Partitioned Global Address Space

• PGAS parallel programming model
• Global memory address space that is logically partitioned
  – A portion of the memory is local to each process or thread
  – Combines advantages of SPMD programming for distributed memory systems with data referencing semantics of shared memory systems

(Figure: nodes node0, node1, …, nodem, each with a local address space holding loc[..] and a partition of the global address space holding portions s[..] of shared arrays)

• Variables and arrays can be shared or local
• A thread can use references to its local variables and all shared variables
• A thread has fast access to its local variables and its portion of shared variables
• Higher latencies to remote portions of shared variables
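The notion of a thread's "portion of shared variables" can be made concrete with a small plain C sketch. It assumes a shared array distributed in contiguous blocks over the threads. This layout is an assumption for illustration (PGAS languages offer several distributions), and the names owner_of and is_local are hypothetical.

```c
#include <stddef.h>

/* Thread that has affinity to element i when n elements are distributed
 * in contiguous blocks over nthreads threads (assumed layout). */
size_t owner_of(size_t i, size_t n, size_t nthreads)
{
    size_t block = (n + nthreads - 1) / nthreads;  /* ceil(n / nthreads) */
    return i / block;
}

/* Returns 1 if thread `me` accesses its own partition (fast access),
 * 0 if the element lies in a remote partition (higher latency). */
int is_local(size_t i, size_t n, size_t nthreads, size_t me)
{
    return owner_of(i, n, nthreads) == me;
}
```

For 12 elements on 3 threads, element 5 belongs to thread 1, so an access from thread 1 is local while an access from thread 0 would be remote.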


Implementations of PGAS

• PGAS fits well to the communication hierarchy of hybrid architectures
  – Communication network, shared memory nodes
• Affinity of threads to parts of the global memory affects the efficiency of program execution
• Languages
  – Unified Parallel C http://upc.lbl.gov
  – Co-array Fortran http://www.co-array.org/
  – Titanium http://titanium.cs.berkeley.edu/
  – Fortress http://labs.oracle.com/projects/plrg/
  – Chapel http://chapel.cray.com/
  – X10 http://x10-lang.org/


PGAS Example: X10

• Place = collection of resident activities and objects
• Activity = sequential computation that runs at a place
• Storage classes: activity-local, place-local, partitioned global, immutable
• PGAS: replicated data, local heap, remote heap
• Locality rule: any access to a mutable datum must be performed by a local activity; remote data accesses can be performed by creating remote activities
• Ordering constraints (memory model)
  – Locally synchronous: guaranteed coherence for the local heap, sequential consistency
  – Globally asynchronous: no ordering of inter-place activities; use explicit synchronization for coherence

Source: David Grove (IBM)


What are users doing?

Source: Indeed.com



Fundamentals of Computer Architecture


Von Neumann Architecture

1947: John von Neumann developed a program-controlled universal computer engine

(Figure: CPU with program control unit and arithmetic-logic unit, RAM holding program + data, and I/O, connected by address bus, data bus and control bus)


The von Neumann Computer

Simple operation: c = a + b

1. Get first instruction
2. Decode: fetch a
3. Fetch a to internal register
4. Get next instruction
5. Decode: fetch b
6. Fetch b to internal register
7. Get next instruction
8. Decode: add a and b (c in register)
9. Do the addition in ALU
10. Get next instruction
11. Decode: store c in main memory
12. Move c from internal register to main memory

Execution is strictly sequential
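The strictly sequential get/decode/execute cycle above can be sketched as a tiny interpreter in plain C. The instruction set (OP_LOAD_R0, OP_LOAD_R1, OP_ADD, OP_STORE_R0, OP_HALT), the two-cell instruction format and the function name run_program are invented for this illustration and do not model any real machine. The essential point is that one memory array holds both the program and the data.

```c
#include <stdint.h>

enum { OP_HALT, OP_LOAD_R0, OP_LOAD_R1, OP_ADD, OP_STORE_R0 };

/* One memory array holds program and data (von Neumann principle).
 * Each instruction occupies two cells: opcode and operand address. */
void run_program(int32_t *mem)
{
    int32_t pc = 0, r0 = 0, r1 = 0;
    for (;;) {
        int32_t opcode = mem[pc];      /* get instruction */
        int32_t addr = mem[pc + 1];
        pc += 2;
        switch (opcode) {              /* decode, then execute */
        case OP_LOAD_R0:  r0 = mem[addr]; break;
        case OP_LOAD_R1:  r1 = mem[addr]; break;
        case OP_ADD:      r0 = r0 + r1;   break;  /* addition in the ALU */
        case OP_STORE_R0: mem[addr] = r0; break;
        default:          return;      /* OP_HALT or unknown opcode */
        }
    }
}
```

A program for c = a + b places the four instructions and a halt in cells 0 to 9 and the values a, b and c in cells 10 to 12; after run_program returns, cell 12 holds the sum.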


Architecture Still in Use Today

(Figure: processor (CPU), main memory and controllers with attached devices, connected by data bus, address bus and control bus)


Processor

• The basic elements of a processor are
  – Arithmetic unit: executes the arithmetic and logical operations
  – Control unit: provides data to the arithmetic unit, i.e. fetches the instructions from memory and coordinates the internal flow
  – Registers: storage holding information about the current program execution, e.g.
    • arithmetic registers, index registers
    • stack pointer
    • base pointer
    • program counter (PC)
    • interrupt registers, …
• Modern processors consist of processor cores and caches
• Each processor core has several arithmetic units (functional units)

(Figure: a processor with processor cores (CPU), each with its own L1 cache ($), and a shared L2 cache. Each processor core contains a control unit (instruction decoding and sequencing; PC, instruction register, status register) and an arithmetic unit (arithmetic/logic units, floating-point units, registers R1 - Rn).)


Main Memory

• Main memory (working memory):
  – volatile storage of the active programs and their associated data
• Typical size currently in PCs
  – 4 GByte to 64 GByte
• Organization of memory
  – sequence of bytes that can be read or written individually (Random Access Memory, RAM)
  – the theoretical size of the memory is determined by the width of the address bus
• Virtualization of memory
  – uniform logical address spaces
  – efficient use of the physical memory
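The relation between address bus width and theoretical memory size is a power of two: with w address bits and byte addressing, 2^w bytes can be addressed. A minimal C sketch, with the hypothetical function name addressable_bytes:

```c
#include <stdint.h>

/* Number of bytes addressable over a byte-addressed bus of the given
 * width. Valid for widths 0..63 so that the result fits into 64 bits. */
uint64_t addressable_bytes(unsigned width_bits)
{
    return (uint64_t)1 << width_bits;
}
```

A 32-bit address bus therefore limits the memory to 4 GiB (4294967296 bytes), which is why the typical PC sizes listed above require wider addresses.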



Data Cache

The speed of main memory is too low for the computing power of the processors.

The use of cache memories is necessary.

(Figure: processor, cache and main memory exchange addresses and data. Each cache line holds an address tag and a data block of bytes. The address 1001000000011 is split into the address tag 10010, an index (000 to 111) and a block offset (00000 to 11111).)
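How the cache splits an address into tag, index and block offset can be sketched in C. The field widths below (5 offset bits, 3 index bits) are read off the figure and should be treated as an assumption; the struct cache_addr and the function split_address are hypothetical names.

```c
#include <stdint.h>

#define OFFSET_BITS 5u  /* block offset 00000..11111 (assumed from the figure) */
#define INDEX_BITS  3u  /* index 000..111 (assumed from the figure) */

struct cache_addr {
    uint32_t tag;
    uint32_t index;
    uint32_t offset;
};

/* Splits an address into tag | index | block offset, from the most
 * significant to the least significant bits. */
struct cache_addr split_address(uint32_t addr)
{
    struct cache_addr a;
    a.offset = addr & ((1u << OFFSET_BITS) - 1u);
    a.index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1u);
    a.tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    return a;
}
```

For the address 1001000000011 (0x1203) this yields tag 10010, index 000 and block offset 00011.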


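Associativity of a Cache

(Figure: direct-mapped cache and 2-way set-associative cache. The address is split into tag (t bits), index (k bits) and block offset (b bits). The index selects one of 2^k lines, each holding a valid bit V, a tag and a data block; a comparator (=) checks the stored tag against the address tag to signal HIT and deliver the data word or byte. The 2-way set-associative cache compares the tags of two lines per index.)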


Cache Structure and Characteristics

• Cache lines (block size e.g. 64 bytes)
• Caches are much smaller than main memory, so a mapping of main memory blocks is necessary
  – direct mapped: each block is mapped to a fixed location
  – fully associative: a block can be placed anywhere in the cache
  – m-way set associative:
    • a block can be placed anywhere within a set of m cache lines
    • replacement: random or LRU
    • generalizes the other two principles
• Further characteristics:
  – latency
  – bandwidth
  – capacity of the cache


Cache Miss Types

• Compulsory miss:
  – also called cold-start miss
  – the first access to a block causes a cache miss
• Capacity miss:
  – cache not large enough for all required blocks
  – a cache miss occurs although the block was in the cache before and has been evicted
• Conflict miss:
  – address collision
  – also called collision miss or interference miss
  – a "capacity miss" caused by too low an associativity of the cache or by blocks that are too large
  – "ping-pong" effect possible
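The "ping-pong" effect can be reproduced with a small cache model in C. The sketch counts the misses of an m-way set-associative cache with LRU replacement for a trace of block numbers. It is a simplified model under stated assumptions (at most 64 sets and 8 ways, set chosen as block number modulo number of sets), and the name count_misses is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_SETS 64
#define MAX_WAYS 8

/* Counts the misses of a cache with nsets sets of `ways` lines each and
 * LRU replacement; ways == 1 models a direct-mapped cache.
 * Returns 0 for unsupported geometries. */
size_t count_misses(const uint32_t *blocks, size_t n,
                    size_t nsets, size_t ways)
{
    uint32_t tag[MAX_SETS][MAX_WAYS];
    size_t used[MAX_SETS] = {0};   /* valid lines per set, MRU in slot 0 */
    size_t misses = 0;

    if (nsets == 0 || nsets > MAX_SETS || ways == 0 || ways > MAX_WAYS)
        return 0;
    for (size_t i = 0; i < n; i++) {
        size_t set = blocks[i] % nsets;
        size_t pos = 0;
        while (pos < used[set] && tag[set][pos] != blocks[i])
            pos++;
        if (pos == used[set]) {            /* miss */
            misses++;
            if (used[set] < ways)
                used[set]++;               /* fill a free line */
            pos = used[set] - 1;           /* otherwise evict the LRU line */
        }
        for (; pos > 0; pos--)             /* move block to MRU position */
            tag[set][pos] = tag[set][pos - 1];
        tag[set][0] = blocks[i];
    }
    return misses;
}
```

For the trace 0, 8, 0, 8, 0, 8, a direct-mapped cache with 8 lines misses on every access because both blocks map to the same line, whereas a 2-way set-associative cache of the same capacity (4 sets) only incurs the two compulsory misses.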



Compulsory Misses

• Problem:
  – Data that has not been used yet is typically not in the cache
  – Consequence: the first access to data is potentially very expensive
• Optimization:
  – Larger cache lines (exploit spatial locality)
    • increases the latency of a cache miss
    • at constant capacity, this can increase the number of conflict misses
  – Prefetching
    • load a data word before it is actually needed
    • overlap of computation and load operations
    • only useful with non-blocking caches
    • variants
      – load the next data word (unit stride access)
      – detect recurring access patterns and load accordingly (stream prefetch)
      – load pointer structures


Design Strategies

• Reduce latency on hit and miss
  – Overlap memory accesses (pipelining)
  – Multi-level cache
  – Various other hardware techniques (see Hennessy, "Computer Architecture")
• Increase the cache hit rate
  – Data access transformations
  – Data layout transformations
  – Special hardware techniques
• Hide latency
  – Out-of-order execution
  – Extensive multithreading
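One common data access transformation is loop interchange. The C sketch below sums the same matrix twice: once column by column, which strides through memory, and once row by row, which matches the row-major layout of C arrays. The matrix size N and the function names are hypothetical choices for this illustration; actual speedups depend on the cache and are not claimed here.

```c
#include <stddef.h>

#define N 256

/* Column-wise traversal: consecutive accesses are N doubles apart, so
 * each access may touch a different cache line. */
double sum_column_order(double a[N][N])
{
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

/* After loop interchange: unit-stride accesses that use every element
 * of a fetched cache line before moving to the next line. */
double sum_row_order(double a[N][N])
{
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i][j];
    return s;
}
```

Both functions visit the same elements; only the order of the memory accesses changes, and with it the spatial locality.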



Memory Hierarchies

• Memory wall
  – The gap between processor speed and memory speed keeps growing
• More powerful memory hierarchies are necessary
  – Larger memory capacities
    • smaller fabrication structures can be used
  – Higher frequencies
    • increase limited by "physics"
  – Wider data paths
    • limited by the parallelism in the application
  – More memory levels
    • potentially successive misses on several levels


Example: Processor Architecture - IBM POWER

Power ISA is an instruction set architecture (ISA) developed by the OpenPOWER Foundation.

(Figure: timeline with the labels POWER, POWER 2 - 6, PowerPC, Power ISA, 1990 - 1993, 1991 - 2006, 2004, IBM POWER Systems, IBM, Motorola, Apple Systems, and OpenPOWER Systems)

Source: IBM


Example: High-End Server IBM Power E870

• Model: IBM Power E870
• 4 to 8 Power8 processor modules (8-10 cores each)
  – max. 64 processor cores at 4.0 GHz
  – max. 80 processor cores at 4.2 GHz
• Shared-memory, NUMA architecture
  – max. 4 TByte DDR3-1600


Example: High-End Server IBM Power E880

• Model: IBM Power E880
• 4 to 16 Power8 processor modules (8-12 cores each)
  – max. 128 processor cores at 4.3 GHz
  – max. 192 processor cores at 4.0 GHz
• Shared-memory, NUMA architecture
  – max. 4 TByte DDR3-1600


IBM Power8

• Chip with 12 cores
• Core clock rate 2.5 – 5.0 GHz
• Massively multithreaded chip with 96 hardware threads
• Each core with
  – 64 kB L1 D-cache
  – 32 kB L1 I-cache
  – 512 kB SRAM L2-cache
• 96 MB eDRAM shared L3-cache
  – 8 MB per core
• Up to 128 MB eDRAM off-chip L4-cache
• On-chip memory controller
  – Ca. 1 TByte memory
  – ~ 230 GByte/s
• 4.2 billion transistors, 22nm
• Available 5/2014

(Figures: Power8 chip, Power8 core)


Power8 Caches

• L2: 512 kB 8 way per core
• L3: 96 MB (12 x 8 MByte 8 way bank)
• "NUCA" cache policy (Non-Uniform Cache Architecture)
  – Scalable bandwidth and latency
  – Migrate "hot" lines to local L2, then local L3 (replicate L2 contained footprint)
• Chip interconnect: 150 GB/s x 12 segments per direction = 3.6 TB/s


POWER8 Memory Buffer Chip


POWER8 Memory Organization

• Up to 8 high speed channels, each running up to 9.6 GB/s (230 GB/s sustained)
• Up to 32 total DDR ports (410 GB/s peak at the DRAM)
• Up to 1 TByte memory capacity


IBM POWER8


IBM Power Supercomputer

Source: IBM



Interconnection Network

• HUB/Switch (one per SMP node)
  – 192 GB/s to host node
  – 336 GB/s to 7 other nodes in same drawer
  – 240 GB/s to 24 nodes in other 3 drawers in same SuperNode
  – 320 GB/s to hubs in other SuperNodes

Source: IBM


Memory Hierarchy: Example IBM Power E870 (Power8)

Level                     Capacity      Bandwidth         Latency
Register                  256 Byte      120 GByte/s       0.2 ns
1st Level Cache           64 kByte      75 GByte/s        1 ns
2nd Level Cache           512 kByte     150 GByte/s       4 ns
3rd Level Cache (1)       80 MByte      150 GByte/s       < 30 ns
Buffer Cache (2)          128 MByte     ?                 ?
Local Main Memory (2)     1024 GByte    230 GByte/s       < 90 ns
Remote Main Memory (3)    8192 GByte    230 GByte/s       < 1 µs
Swap Space on SSD         >             X * 500 GByte/s   < 1 ms
Swap Space on Harddisk    >>            X * 200 MByte/s   ~5 ms

(1) 8 MB per core x 10
(2) shared by 10 cores
(3) shared by 80 cores


Intel Xeon Scalable Processor - Overview

Source: Intel


Intel Xeon SP – Core Microarchitecture

Source: Intel



Intel Xeon SP – Cache Hierarchy

Source: Intel


Intel Xeon SP – On-chip Mesh Interconnect

Source: Intel



AMD – Zen 2 Architecture

• Technology
  – 8 dies in 7nm and one I/O die in 14nm
• Up to 64 cores per processor
  – up to AVX2
• 8 memory channels per processor
  – DDR4-2666
  – up to 4 TiB
• 128 lanes PCIe 4.0
• AMD "Rome"
  – Designed for servers with 2 processors
  – Up to 8 TiB DRAM
  – 64 PCIe lanes used for communication between processors (Infinity Fabric protocol)

Source: AMD


AMD Epyc 7000 Series

• ZEN Microarchitecture
  – L1 D-cache with 32 KiB, 8 way
  – L1 I-cache with 64 KiB, 4 way
  – L2 cache with 512 KiB, 8 way
• CPU Complex (CCX)
  – Four cores connected to an L3 cache
  – L3 cache with 8 MiB, 16 way associative
• Multi chip processors
  – Four CCX per processor
• Infinity Fabric
  – 42 GiB/s bi-directional bandwidth per link
  – Fully connected coherent Infinity Fabric within socket
  – Dual socket systems with two processors connected with 4 x 38 GiB/s links

Source: AMD


AMD Epyc 7000 Series

• AMD EPYC 7601
  – 32 cores, 2.2 GHz (max boost clock 3.2 GHz, all cores max boost 2.7 GHz)
  – 64 MiB L3-cache
  – TDP 180 Watt
  – 1 or 2 sockets
• AMD EPYC 7451
  – 24 cores, 2.3 GHz (max boost clock 3.2 GHz)
  – TDP 180 Watt
  – 1 or 2 sockets


NVIDIA GPU

• Graphics Processing Unit (GPU)
  – used for accelerator cards
  – requires a host processor
• 3D "stacked" memory, up to 16 GiB
• PCIe Gen3
  – limited bandwidth of 16 GiB/s
• NVLink optional
  – ~80 GiB/s bandwidth

(Figure labels: PCIe switch, mem)

Source: NVIDIA


NVIDIA Tesla Products

• GPU with several Streaming Multiprocessors (SMs)
• thousands of "CUDA" cores
• high bandwidth memory
• fixed amount of main memory
• moderate frequencies


GP100 Streaming Multiprocessor

GP100
• 60 SMs (56 enabled)
• 64 CUDA cores per SM
• Total 3,584 cores
• 16 GiB HBM2
• 720 GiB/s bandwidth to HBM2
• 4 NVLinks (optional)
• half precision floating point ~ 21 TFLOPS

Source: NVIDIA


GP100 Streaming Multiprocessor - SM

• 64 CUDA cores
  – usable in two groups
• 64 KiB SM local SRAM with 24 KiB L1 cache
• 4 MiB L2 cache for data sharing across the GPU

Source: NVIDIA


Accelerators become part of the Processor

• Floating-Point Unit
  – 1978: Intel 8086 + Intel 8087 math coprocessor (16 bit)
  – 1989: Intel i486 with integrated floating-point units (32 bit)
• Vector Unit
  – 1993: CM5 with Sparc processor + Vector Unit Accelerators (MBUS)
  – 1995: Intel Pentium P55C with MMX instructions
  – 1996: Motorola PowerPC with AltiVec
• Stream Processing
  – 2006: Workstation + GPU graphics card (PCI)
  – 2011: Intel HD Graphics 3000 with integrated GPU (OpenCL)


Performance Development of a Processor Core

• From 1986 to 2002, approx. 50% performance increase per year
• Currently, single-processor performance is increasing more slowly
• Higher performance increases are only possible by increasing the number of processors (cores)

(Figure: chart annotated with 52% / year and 25% / year)


Main Memory Speed

Source: Rambus Inc.

Speed per memory module.

The performance gap between CPU and RAM will continue to grow (52% p.a. vs. 25% p.a.)
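The effect of the two growth rates can be estimated with a short C sketch. It assumes that both rates stay constant over the whole period, which is a simplification; the function name gap_after_years is hypothetical.

```c
/* Factor by which the CPU/memory performance gap widens after the given
 * number of years, assuming constant annual growth rates. */
double gap_after_years(double cpu_growth, double mem_growth, unsigned years)
{
    double gap = 1.0;
    for (unsigned y = 0; y < years; y++)
        gap *= (1.0 + cpu_growth) / (1.0 + mem_growth);
    return gap;
}
```

With 52% versus 25% per year, the gap grows by a factor of about 1.22 each year, which compounds to roughly a factor of 7 after ten years.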



Memory Bandwidth/Latency

Generation      Type        Peak Bandwidth   Latency (1st word)
SDRAM (1990s)   PC-100      0.8 GByte/s      20 ns
DDR (2000)      DDR-200     1.6 GByte/s      20 ns
DDR             DDR-400     3.2 GByte/s      15 ns
DDR2 (2003)     DDR2-667    5.3 GByte/s      15 ns
DDR2            DDR2-800    6.4 GByte/s      15 ns
DDR3 (2007)     DDR3-1066   8.5 GByte/s      13 ns
DDR3            DDR3-1600   12.8 GByte/s     11.25 ns
DDR4 (2014)     DDR4-2133   17 GByte/s       13 ns
DDR4 (2017)     DDR4-2666   21 GByte/s       9 ns
DDR4            DDR4-4000   32 GByte/s       9.5 ns
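The peak bandwidth column follows from the transfer rate in the type name, assuming the usual 64-bit (8-byte) data bus of a memory module. This bus width is an assumption not stated in the table; the function name peak_bandwidth_gbyte is hypothetical.

```c
/* Peak bandwidth in GByte/s of one memory module, assuming a 64-bit
 * (8-byte) data bus and a transfer rate in mega-transfers per second. */
double peak_bandwidth_gbyte(double mega_transfers_per_s)
{
    const double bus_bytes = 8.0;
    return mega_transfers_per_s * bus_bytes / 1000.0;
}
```

For DDR3-1600 this gives 1600 x 8 / 1000 = 12.8 GByte/s, and for PC-100 it gives 0.8 GByte/s, in line with the table.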


Trends

• "Power Wall"
  – Power consumption / cooling
  – Solutions
    • lower clock frequencies
    • more execution units
• "Memory Wall"
  – Memory bandwidth and latency
  – Solutions
    • better memory hierarchies and connection to CPUs
    • latency hiding
• "ILP Wall"
  – Limited parallelism in the sequential instruction stream
  – Solutions
    • detect more parallelism in programs (compilers)
    • more explicit parallelism in programs (programming languages)


Parallel Processing vs. Levels of Parallelism

Levels of parallelism (table columns): suboperation level, statement level, block level, process level, program level

Parallel processing technique

SIMD techniques
  Vector computer principle                      X X
  Array computer principle                       X X

Processor architecture
  Instruction pipelining                         X
  Superscalar                                    X
  VLIW                                           X
  Overlapping of I/O with CPU operations         X
  Fine-grained dataflow principle                X

Processor coupling
  Memory coupling (SMP)                          X X X
  Memory coupling (DSM)                          X X X
  Coarse-grained dataflow principle              X X
  Message coupling                               X X

Computer coupling
  Workstation cluster                            X X
  Grid and cloud computers                       X X


Part 3: Architectures of Parallel Computer Systems


Simple Definition of a Parallel Computer

"A parallel computer is a collection of processing elements that communicate and cooperate to solve large problems fast."

George S. Almasi, IBM Thomas J. Watson Research Center
Allan Gottlieb, New York University, 1989


Computer Architecture

A computer architecture is determined by an operational principle for the hardware and by the structure of its composition from the individual hardware resources.

Giloi 1993

Operational principle
The operational principle defines the functional behavior of the architecture by specifying an information structure and a control structure.

Hardware structure
The structure of a computer architecture is given by the type and number of the hardware resources and the communication facilities connecting them.


... in other words

Operational principle
• Rule for the interplay of the components

Structure
• Individual components
• Structure of the interconnection of the components

• The basic structural building blocks are
  – Processor (CPU), as the active component for executing programs,
  – Main memory (possibly hierarchically structured, …),
  – Transmission medium for connecting the individual architecture components,
  – Control units for attaching and controlling peripheral devices, and
  – Devices, as additional components for input and output of data and for data storage.


Parallel Computer

• Operational principle:
  – simultaneous execution of instructions
  – sequential processing in determinable regions
• Kinds of parallelism:
  – Explicit: The possibility of parallel processing is specified a priori. This requires suitable data types or data structures, e.g. vectors (linear arrays) together with vector operations.
  – Implicit: The possibility of parallel processing is not known a priori. A data dependence analysis determines the parallel and sequential steps of the algorithm at run time.
