Rechnerarchitektur,(RA) - TU Dortmund


Transcript of Rechnerarchitektur,(RA) - TU Dortmund

Page 1: Rechnerarchitektur,(RA) - TU Dortmund

fakultät für informatik, informatik 12

technische universität dortmund

Rechnerarchitektur (RA), summer semester 2020

Architecture-Aware Optimizations
- Hardware-Software Co-Optimizations -

Jian-Jia Chen, Informatik 12
Jian-jia.chen@tu-..
http://ls12-www.cs.tu-dortmund.de/daes/
Tel.: 0231 755 6078

Page 2: Rechnerarchitektur,(RA) - TU Dortmund

technische universität dortmund, fakultät für informatik

© j. chen, informatik 12, 2020

Outline

High-Level Optimization
§ Loop transformation
§ Loop tiling/blocking
§ Loop (nest) splitting

Heterogeneous System Architecture (HSA)
§ Integrated CPU/GPU platforms
§ Recent movement in chip designs

Architecture-aware software designs
§ Energy-efficiency issues
§ Darkroom
§ Halide

Multicore revolutions
§ Impact on the "safety-critical" industry sector

Page 3: Rechnerarchitektur,(RA) - TU Dortmund

Impact of memory allocation on efficiency

Array p[j][k]

Row major order (C): the elements of row j=0 are stored first, followed by the rows j=1, j=2, ...

Column major order (FORTRAN): the elements of column k=0 are stored first (j=0, j=1, ...), followed by the columns k=1, k=2, ...

[Figure: memory layout of p[j][k] in row major and in column major order]

Page 4: Rechnerarchitektur,(RA) - TU Dortmund

Best performance if the innermost loop corresponds to the rightmost array index

Two loops, assuming row major order (C):

    for (k=0; k<=m; k++)
        for (j=0; j<=n; j++)
            p[j][k] = ...

    for (j=0; j<=n; j++)
        for (k=0; k<=m; k++)
            p[j][k] = ...

For row major order: the first version (innermost loop over j) shows poor cache behavior, the second version (innermost loop over k) shows good cache behavior.

[Figure: rows j=0, j=1, j=2 of p in row major order]

Same behavior for homogeneous memory access, but:

→ memory-architecture-dependent optimization
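
The row major rule behind the two loop orders can be checked with a few lines of C. This is a minimal sketch, not taken from the slides: the sizes ROWS and COLS and all function names are hypothetical and only serve the illustration. It computes the element offset of p[j][k] and the distance between two consecutive accesses of the innermost loop for both loop orders.

```c
#include <assert.h>
#include <stddef.h>

enum { ROWS = 4, COLS = 6 };   /* hypothetical array sizes for illustration */

/* Row major rule used by C: p[j][k] lies j*COLS + k elements after p[0][0]. */
size_t row_major_offset(size_t j, size_t k)
{
    assert(j < ROWS && k < COLS);
    return j * COLS + k;
}

/* Innermost loop over k (rightmost index): consecutive accesses are
   one element apart. */
size_t stride_k_inner(void)
{
    return row_major_offset(0, 1) - row_major_offset(0, 0);
}

/* Innermost loop over j (leftmost index): consecutive accesses are
   one full row apart. */
size_t stride_j_inner(void)
{
    return row_major_offset(1, 0) - row_major_offset(0, 0);
}

/* Cross-check the formula against the layout the compiler really uses. */
int layout_matches(void)
{
    static int p[ROWS][COLS];
    for (size_t j = 0; j < ROWS; j++)
        for (size_t k = 0; k < COLS; k++) {
            size_t bytes = (size_t)((char *)&p[j][k] - (char *)p);
            if (bytes != row_major_offset(j, k) * sizeof(int))
                return 0;
        }
    return 1;
}
```

With the innermost loop over k, successive accesses touch neighboring elements and usually stay within one cache line; with the innermost loop over j, every access jumps by COLS elements.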

Page 5: Rechnerarchitektur,(RA) - TU Dortmund

→ Program transformation "Loop interchange"

→ Improved locality

Example:

    …
    #define iter 400000
    int a[20][20][20];

    /* innermost loop k runs over the rightmost index: neighboring elements */
    void computeijk() {
        int i, j, k;
        for (i = 0; i < 20; i++)
            for (j = 0; j < 20; j++)
                for (k = 0; k < 20; k++)
                    a[i][j][k] += a[i][j][k];
    }

    /* innermost loop k runs over the middle index: accesses are 20 ints apart */
    void computeikj() {
        int i, j, k;
        for (i = 0; i < 20; i++)
            for (j = 0; j < 20; j++)
                for (k = 0; k < 20; k++)
                    a[i][k][j] += a[i][k][j];
    }
    …
    start = time(&start);
    for (z = 0; z < iter; z++)
        computeijk();
    end = time(&end);
    printf("ijk=%16.9f\n", 1.0 * difftime(end, start));

Page 6: Rechnerarchitektur,(RA) - TU Dortmund

Results: strong influence of the memory architecture

Loop structure: i j k

[Figure: bar chart, time [s]]

    Processor        Reduction to [%]
    Ti C6xx          ~57
    Intel Pentium    3.2
    Sun SPARC        35

Dramatic impact of locality

Not always the same impact ...

[Till Buchwald, Diploma thesis, Univ. Dortmund, Informatik 12, 12/2004]

Page 7: Rechnerarchitektur,(RA) - TU Dortmund

Transformations "Loop fusion" (merging), "loop fission"

Separate loops:

    for (j=0; j<=n; j++)
        p[j] = ... ;
    for (j=0; j<=n; j++)
        p[j] = p[j] + ...

Loops small enough to allow zero overhead loops.

Merged loop:

    for (j=0; j<=n; j++) {
        p[j] = ... ;
        p[j] = p[j] + ...
    }

Better locality for access to p.
Better chances for parallel execution.

Which of the two versions is best?
Architecture-aware compiler should select best version.
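
A minimal sketch of the two forms in C, under the assumption that the elided right-hand sides are the placeholder expressions 2*j and + 1 (they are not given on the slide). The function names are hypothetical. The point is only that both forms compute the same result, so a compiler is free to pick whichever suits the target architecture.

```c
#include <assert.h>
#include <stddef.h>

/* Fission form: two separate loops over p[0..n]. */
void update_split(int *p, size_t n)
{
    for (size_t j = 0; j <= n; j++)
        p[j] = 2 * (int)j;          /* placeholder for "p[j] = ..." */
    for (size_t j = 0; j <= n; j++)
        p[j] = p[j] + 1;            /* placeholder for "p[j] = p[j] + ..." */
}

/* Fusion form: one loop, p[j] is reused while it is still close at hand. */
void update_fused(int *p, size_t n)
{
    for (size_t j = 0; j <= n; j++) {
        p[j] = 2 * (int)j;
        p[j] = p[j] + 1;
    }
}

/* Returns 1 if both forms leave the same contents in the array. */
int forms_agree(void)
{
    enum { N = 15 };
    int a[N + 1], b[N + 1];
    update_split(a, N);
    update_fused(b, N);
    for (size_t j = 0; j <= N; j++)
        if (a[j] != b[j])
            return 0;
    return 1;
}
```

Fusion is only legal when the second statement does not depend on elements that the first loop writes in later iterations; the placeholder bodies here satisfy that trivially.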

Page 8: Rechnerarchitektur,(RA) - TU Dortmund

Example: simple loops

    #define size 30
    #define iter 40000
    int a[size][size];
    float b[size][size];

    /* ss1: two separate loop nests */
    void ss1() {
        int i, j;
        for (i = 0; i < size; i++)
            for (j = 0; j < size; j++)
                a[i][j] += 17;
        for (i = 0; i < size; i++)
            for (j = 0; j < size; j++)
                b[i][j] -= 13;
    }

    /* ms1: outer loops merged, inner loops separate */
    void ms1() {
        int i, j;
        for (i = 0; i < size; i++) {
            for (j = 0; j < size; j++)
                a[i][j] += 17;
            for (j = 0; j < size; j++)
                b[i][j] -= 13;
        }
    }

    /* mm1: outer and inner loops merged */
    void mm1() {
        int i, j;
        for (i = 0; i < size; i++)
            for (j = 0; j < size; j++) {
                a[i][j] += 17;
                b[i][j] -= 13;
            }
    }

Page 9: Rechnerarchitektur,(RA) - TU Dortmund

Results: simple loops

[Figure: bar chart of the runtime in % (100% ≙ max) of ss1, ms1 and mm1 on the platforms x86 gcc 3.2 -O3, x86 gcc 2.95 -O3, Sparc gcc 3.x -O1 and Sparc gcc 3.x -O3]

→ Merged loops superior, except Sparc with -O3

Page 10: Rechnerarchitektur,(RA) - TU Dortmund

Loop unrolling

    for (j=0; j<=n; j++)
        p[j] = ... ;

Unrolled with factor = 2 (assuming an even number of iterations):

    for (j=0; j<=n; j+=2) {
        p[j] = ... ;
        p[j+1] = ...
    }

Better locality for access to p.
Fewer branches per execution of the loop.
More opportunities for optimizations.
Tradeoff between code size and improvement.
Extreme case: completely unrolled loop (no branch).
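
When the number of iterations is odd, a loop unrolled by factor 2 would write one element too far, so practical code adds a remainder step. The following is a minimal sketch, assuming the placeholder body p[j] = 3*j (the slide leaves the right-hand side open); the function names and MAX_N are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_N = 32 };   /* hypothetical upper bound used by the self-check */

/* Reference: one element per iteration, j = 0..n inclusive. */
void fill_rolled(int *p, size_t n)
{
    for (size_t j = 0; j <= n; j++)
        p[j] = 3 * (int)j;
}

/* Unrolled by factor 2: half as many loop branches, plus a remainder
   step for the case that n+1 is odd. */
void fill_unrolled2(int *p, size_t n)
{
    size_t j = 0;
    for (; j + 1 <= n; j += 2) {
        p[j]     = 3 * (int)j;
        p[j + 1] = 3 * (int)(j + 1);
    }
    if (j <= n)                      /* leftover element */
        p[j] = 3 * (int)j;
}

/* Returns 1 if both versions write exactly the same elements.
   Untouched elements keep the sentinel -1, so an out-of-range write
   of the unrolled version would be detected. */
int unrolled_matches(size_t n)
{
    int a[MAX_N + 2], b[MAX_N + 2];
    assert(n <= MAX_N);
    for (size_t j = 0; j < MAX_N + 2; j++)
        a[j] = b[j] = -1;
    fill_rolled(a, n);
    fill_unrolled2(b, n);
    for (size_t j = 0; j < MAX_N + 2; j++)
        if (a[j] != b[j])
            return 0;
    return 1;
}
```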

Page 11: Rechnerarchitektur,(RA) - TU Dortmund

Program transformation Loop tiling/loop blocking: - Original version -

    for (i=1; i<=N; i++)
        for (k=1; k<=N; k++) {
            r = X[i][k];  /* to be allocated to a register */
            for (j=1; j<=N; j++)
                Z[i][j] += r * Y[k][j]
        }

→ Never reusing information in the cache for Y and Z if N is large or cache is small (O(N³) references for Z).

[Figure: access patterns on the three arrays, with arrows labeled i++, k++ and j++]

Page 12: Rechnerarchitektur,(RA) - TU Dortmund

Loop tiling/loop blocking - tiled version -

    for (kk=1; kk<=N; kk+=B)
        for (jj=1; jj<=N; jj+=B)
            for (i=1; i<=N; i++)
                for (k=kk; k<=min(kk+B-1,N); k++) {
                    r = X[i][k];  /* to be allocated to a register */
                    for (j=jj; j<=min(jj+B-1,N); j++)
                        Z[i][j] += r * Y[k][j]
                }

Reuse factor of B for Z, N for Y

O(N³/B) accesses to main memory

[Figure: tiles of size B starting at kk and jj in the arrays, with arrows labeled i++, k++ and j++; the same elements are accessed in the next iteration of i]

Compiler should select best option

Monica Lam: The Cache Performance and Optimizations of Blocked Algorithms, ASPLOS, 1991
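
The tiled loop nest can be written as compilable C and compared against the untiled version. This is a minimal sketch with 0-based indices; the sizes N = 10 and B = 4, the helper min_sz and the test data are assumptions chosen for illustration (B deliberately does not divide N, so the min() bounds are exercised).

```c
#include <assert.h>
#include <stddef.h>

enum { N = 10, B = 4 };   /* hypothetical matrix and tile sizes */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Untiled i-k-j loop nest: Z += X * Y. */
void matmul_plain(int X[N][N], int Y[N][N], int Z[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t k = 0; k < N; k++) {
            int r = X[i][k];
            for (size_t j = 0; j < N; j++)
                Z[i][j] += r * Y[k][j];
        }
}

/* Tiled version: for each B x B tile of Y (rows kk.., columns jj..),
   all N rows i are processed before moving on, so the tile of Y is
   reused N times while it is still in the cache. */
void matmul_tiled(int X[N][N], int Y[N][N], int Z[N][N])
{
    for (size_t kk = 0; kk < N; kk += B)
        for (size_t jj = 0; jj < N; jj += B)
            for (size_t i = 0; i < N; i++)
                for (size_t k = kk; k < min_sz(kk + B, N); k++) {
                    int r = X[i][k];
                    for (size_t j = jj; j < min_sz(jj + B, N); j++)
                        Z[i][j] += r * Y[k][j];
                }
}

/* Returns 1 if both versions produce the same product. */
int tiled_matches_plain(void)
{
    static int X[N][N], Y[N][N], Zp[N][N], Zt[N][N];
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            X[i][j] = (int)(i + 2 * j);
            Y[i][j] = (int)(3 * i) - (int)j;
            Zp[i][j] = 0;
            Zt[i][j] = 0;
        }
    matmul_plain(X, Y, Zp);
    matmul_tiled(X, Y, Zt);
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            if (Zp[i][j] != Zt[i][j])
                return 0;
    return 1;
}
```

The tiling only reorders the additions into each Z[i][j]; with integer data the results are identical, while with floating-point data the different summation order can change the rounding slightly.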

Page 13: Rechnerarchitektur,(RA) - TU Dortmund

Transformation "Loop nest splitting"

Example: Separation of margin handling

Before splitting: many if-statements for margin-checking

After splitting: no checking, efficient (inner region) + only few margin elements to be processed (margin)

[Figure: image region with margin, before and after loop nest splitting]
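
The separation of margin handling can be sketched on a one-dimensional 3-point sum. The slide shows a two-dimensional image; this simplified example, the border rule (the edge value is replicated) and all names are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Clamp an index to 0..n-1, which replicates the border element. */
static size_t clamp_index(long i, size_t n)
{
    if (i < 0)
        return 0;
    if ((size_t)i >= n)
        return n - 1;
    return (size_t)i;
}

/* Before splitting: margin checks are executed for every element. */
void sum3_checked(const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[clamp_index((long)i - 1, n)] + in[i]
               + in[clamp_index((long)i + 1, n)];
}

/* After splitting: the two margin elements are handled separately,
   the interior loop runs without any checks. */
void sum3_split(const int *in, int *out, size_t n)
{
    if (n == 0)
        return;
    if (n == 1) {
        out[0] = 3 * in[0];
        return;
    }
    out[0] = in[0] + in[0] + in[1];                  /* left margin */
    for (size_t i = 1; i + 1 < n; i++)               /* interior */
        out[i] = in[i - 1] + in[i] + in[i + 1];
    out[n - 1] = in[n - 2] + in[n - 1] + in[n - 1];  /* right margin */
}

/* Returns 1 if both versions agree on the first n elements (n <= 7). */
int split_matches_checked(size_t n)
{
    const int in[7] = { 4, -1, 7, 0, 3, 9, -5 };
    int a[7], b[7];
    assert(n <= 7);
    sum3_checked(in, a, n);
    sum3_split(in, b, n);
    for (size_t i = 0; i < n; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```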

Page 14: Rechnerarchitektur,(RA) - TU Dortmund

Outline

High-Level Optimization
§ Loop transformation
§ Loop tiling/blocking
§ Loop (nest) splitting

Heterogeneous System Architecture (HSA)
§ Integrated CPU/GPU platforms
§ Recent movement in chip designs

Architecture-aware software designs
§ Energy-efficiency issues
§ Darkroom
§ Halide

Multicore revolutions
§ Impact on the "safety-critical" industry sector

Page 15: Rechnerarchitektur,(RA) - TU Dortmund

What is Heterogeneous Computing?

Use processor cores of different types and computing power to achieve better performance and power efficiency

http://www.browndeertechnology.com/docs/BDT_OpenCL_Tutorial_NBody-rev3.html

[Figure: tasks distributed to CPU and GPU]

Page 16: Rechnerarchitektur,(RA) - TU Dortmund

Advantage of Heterogeneous Computing

CPU is ideal for scalar processing
§ Out-of-order x86 cores with low-latency memory access
§ Optimized for sequential and branching algorithms
§ Runs existing applications very well

Serial/Task-parallel workloads → CPU

GPU is ideal for parallel processing
§ GPU shaders optimized for throughput computing
§ Ready for emerging workloads
§ Media processing, simulation, natural UI, etc.

Graphics/Data-parallel workloads → GPU

Heterogeneous Computing -> Fusion, Norm Rubin, SAAHPC 2010

Page 17: Rechnerarchitektur,(RA) - TU Dortmund

CPU/GPU Integration: CPU's Advancement Meets GPU's

[Figure: throughput performance versus programmability (Unacceptable, Experts Only, Mainstream)]

Microprocessor Advancement (High Performance Task Parallel Execution)
§ Single-Thread Era
§ Multi-Core Era
§ Heterogeneous Systems Era

GPU Advancement (Power-efficient Data Parallel Execution)
§ Graphics Driver-based programs
§ Vertex/Pixel Shader
§ System-Level Programmable

CPU/GPU Integration → Heterogeneous Computing

Page 18: Rechnerarchitektur,(RA) - TU Dortmund

Evolution of Heterogeneous Computing

Dedicated GPU
§ GPU kernel is launched through the device driver
§ Separate CPU/GPU address space
§ Separate system/GPU memory
§ Data copy between CPU/GPU via PCIe

[Figure: CPU (Core1 ... CoreN with L1/L2 caches, LLC, coherent system memory, address space managed by OS) connected via PCIe to the GPU (CU1 ... CUN with L1 caches, L2, non-coherent GPU memory, address space managed by driver). Kernel launch process: OpenCL application and OpenCL runtime library in user space, GPU device driver in kernel space; phases: data preparation, computation]

Page 19: Rechnerarchitektur,(RA) - TU Dortmund

Evolution of Heterogeneous Computing

Integrated GPU architecture
§ GPU kernel is launched through the device driver
§ Separate CPU/GPU address space
§ Separate system/GPU memory
§ Data copy between CPU/GPU via memory bus

[Figure: CPU (Core1 ... CoreN with L1/L2 caches, LLC) and GPU (CU1 ... CUN with L1 caches, L2) with separate memories: coherent system memory (address space managed by OS) and non-coherent GPU memory (address space managed by driver). Kernel launch process: OpenCL application and OpenCL runtime library in user space, GPU device driver in kernel space; phases: data preparation, computation]

Page 20: Rechnerarchitektur,(RA) - TU Dortmund

Evolution of Heterogeneous Computing

Integrated CPU/GPU architecture
§ GPU kernel is launched through the device driver
§ Unified CPU/GPU address space (managed by OS)
§ Unified system/GPU memory
§ No data copy - data can be retrieved by pointer passing

[Figure: CPU (Core1 ... CoreN with L1/L2 caches, LLC) and GPU (CU1 ... CUN with L1 caches, L2) sharing one coherent system memory whose address space is managed by the OS. Kernel launch process: OpenCL application and OpenCL runtime library in user space, GPU device driver in kernel space; phases: data preparation, computation]
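
The difference between "data copy" and "pointer passing" can be mimicked in plain C. The following is only a host-side analogy under stated assumptions: no real GPU, driver or OpenCL call is involved, device_buf stands in for separate GPU memory, kernel_double stands in for a GPU kernel, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { LEN = 8 };              /* hypothetical buffer length */

static size_t bytes_copied;    /* bookkeeping for the comparison */

/* Stand-in for a GPU kernel: doubles every element in place. */
static void kernel_double(int *data, size_t n)
{
    for (size_t i = 0; i < n; i++)
        data[i] *= 2;
}

/* Separate memories: copy in, compute, copy back. */
void run_with_copies(int *host, size_t n)
{
    static int device_buf[LEN];          /* stand-in for GPU memory */
    assert(n <= LEN);
    memcpy(device_buf, host, n * sizeof *host);
    bytes_copied += n * sizeof *host;
    kernel_double(device_buf, n);
    memcpy(host, device_buf, n * sizeof *host);
    bytes_copied += n * sizeof *host;
}

/* Unified memory: the kernel works directly on the shared data. */
void run_with_pointer(int *shared, size_t n)
{
    kernel_double(shared, n);
}

/* Number of bytes moved between the two "memories" so far. */
size_t copied_bytes(void)
{
    return bytes_copied;
}
```

Both variants compute the same result; the first one additionally moves every byte twice, which is the overhead that a unified address space avoids.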

Page 21: Rechnerarchitektur,(RA) - TU Dortmund

Utopia World of Heterogeneous Computing

Processors are architected to operate cooperatively
§ Tasks in an application are executed on different types of cores
§ Unified coherent memory enables data sharing across all processors

Designed to enable the applications to run on different processors at different times
§ Capability to translate from high-level language to target binary at run-time
§ User-level task dispatch
§ Decision making module

[Figure: an application with Task 1, Task 2 and Task 3, dispatched in different combinations to CPU cores (Core1 ... CoreN with L1/L2 caches, LLC) and GPU compute units (CU1 ... CUN with L1 caches, L2) on top of coherent system memory with HW coherence]

Page 22: Rechnerarchitektur,(RA) - TU Dortmund

HSA Foundation

§ Founded in June 2012
§ Developing a new platform for heterogeneous systems
§ www.hsafoundation.com
§ Specifications under development in working groups to define the platform

Page 23: Rechnerarchitektur,(RA) - TU Dortmund

Diverse Partners Driving Future of Heterogeneous Computing

[Figure: logos of the HSA Foundation members, grouped into Founders, Promoters, Supporters and Contributors]

Page 24: Rechnerarchitektur,(RA) - TU Dortmund

HSA (Heterogeneous System Architecture) Hardware-Software Stack

[Figure: HSA stack. User-level SW: application (Task 1, Task 2, Task 3), OpenCL Compiler, HSA Intermediate Language (HSAIL), HSA Run-time with Agent Scheduler, CPU Finalizer and GPU Finalizer, native machine codes. HW: Core 1, Core 2, ..., CU 1, CU 2, ..., AC 1, AC 2, ... on a shared virtual address space, executing Task 1, Task 2 and Task 3]

Page 25: Rechnerarchitektur,(RA) - TU Dortmund

Intel Haswell

Page 26: Rechnerarchitektur,(RA) - TU Dortmund

AMD Kaveri APU

Page 27: Rechnerarchitektur,(RA) - TU Dortmund

NVIDIA Tegra K1

Page 28: Rechnerarchitektur,(RA) - TU Dortmund

Qualcomm Snapdragon

Page 29: Rechnerarchitektur,(RA) - TU Dortmund

Outline

High-Level Optimization
§ Loop transformation
§ Loop tiling/blocking
§ Loop (nest) splitting

Heterogeneous System Architecture (HSA)
§ Integrated CPU/GPU platforms
§ Recent movement in chip designs

Architecture-aware software designs
§ Energy-efficiency issues
§ Darkroom
§ Halide

Multicore revolutions
§ Impact on the "safety-critical" industry sector

Page 30: Rechnerarchitektur,(RA) - TU Dortmund

Energy Efficiency of different target platforms

© Hugo De Man, IMEC, Philips, 2007

Page 31: Rechnerarchitektur,(RA) - TU Dortmund

Signal Processing ASICs

Page 32: Rechnerarchitektur,(RA) - TU Dortmund

How about Memory?

© Horowitz, DAC 2016

Page 33: Rechnerarchitektur,(RA) - TU Dortmund

Processor Energy with Corrected Cache Sizes

© Horowitz, DAC 2016

Page 34: Rechnerarchitektur,(RA) - TU Dortmund

Processor Energy Breakdown

© Horowitz, DAC 2016

Page 35: Rechnerarchitektur,(RA) - TU Dortmund

Data Center Energy Specs

© Malladi, ISCA 2012

Page 36: Rechnerarchitektur,(RA) - TU Dortmund

ITRS Power Consumption Projection - Station Systems -

© ITRS, 2010

Page 37: Rechnerarchitektur,(RA) - TU Dortmund

What Is Going On Here?

Page 38: Rechnerarchitektur,(RA) - TU Dortmund

Energy Consumption (Approximate, 45nm)

© Horowitz, DAC 2016

Page 39: Rechnerarchitektur,(RA) - TU Dortmund

True Story

It's more about the algorithm than the hardware.
The efficiency cannot be achieved unless the algorithm is right!!

[Figure: the set of (all) algorithms, with the GPU algorithms as a small subset]

Page 40: Rechnerarchitektur,(RA) - TU Dortmund

Locality, Locality, and Locality!!!

© Hegarty et al., SIGGRAPH 2014

Page 41: Rechnerarchitektur,(RA) - TU Dortmund

Darkroom (Stanford/MIT)

© Hegarty et al., SIGGRAPH 2014

Page 42: Rechnerarchitektur,(RA) - TU Dortmund

Locality versus Parallelism

Halide Programming Language:
§ http://halide-lang.org/

Performance needs a lot of tradeoffs
§ Locality
§ Parallelism
§ Redundant recomputation
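
The tradeoff between locality and redundant recomputation can be made concrete with a two-stage pipeline. The sketch below is plain C, not Halide code, and everything in it (the 3-point sums, the sizes, the names) is an assumption chosen for illustration. Schedule A computes the whole first stage into a buffer before the second stage starts; schedule B recomputes the first stage wherever it is needed.

```c
#include <assert.h>
#include <stddef.h>

enum { IN_LEN = 16, MID_LEN = IN_LEN - 2, OUT_LEN = IN_LEN - 4 };

static size_t stage1_evals;   /* counts evaluations of stage 1 */

/* Stage 1: 3-point sum of the input. */
static int stage1(const int *in, size_t i)
{
    stage1_evals++;
    return in[i] + in[i + 1] + in[i + 2];
}

/* Schedule A: all of stage 1 is stored in a buffer, then stage 2 runs.
   No recomputation, but the intermediate values travel through memory. */
void pipeline_buffered(const int *in, int *out)
{
    int mid[MID_LEN];
    for (size_t i = 0; i < MID_LEN; i++)
        mid[i] = stage1(in, i);
    for (size_t i = 0; i < OUT_LEN; i++)
        out[i] = mid[i] + mid[i + 1] + mid[i + 2];
}

/* Schedule B: stage 1 is recomputed for every use. No intermediate
   buffer and independent iterations, but three times the stage-1 work
   per output element. */
void pipeline_inlined(const int *in, int *out)
{
    for (size_t i = 0; i < OUT_LEN; i++)
        out[i] = stage1(in, i) + stage1(in, i + 1) + stage1(in, i + 2);
}

/* Returns 1 if both schedules compute the same output. */
int schedules_agree(void)
{
    int in[IN_LEN], a[OUT_LEN], b[OUT_LEN];
    for (size_t i = 0; i < IN_LEN; i++)
        in[i] = (int)(i * i % 7);
    pipeline_buffered(in, a);
    pipeline_inlined(in, b);
    for (size_t i = 0; i < OUT_LEN; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}

/* Number of stage-1 evaluations of one run of the chosen schedule. */
size_t stage1_evals_for(int inlined)
{
    int in[IN_LEN] = { 0 }, out[OUT_LEN];
    stage1_evals = 0;
    if (inlined)
        pipeline_inlined(in, out);
    else
        pipeline_buffered(in, out);
    return stage1_evals;
}
```

With these sizes, schedule A evaluates stage 1 MID_LEN = 14 times and schedule B 3 * OUT_LEN = 36 times. Which one is faster depends on how expensive stage 1 is compared with the memory traffic for the intermediate buffer.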

Page 43: Rechnerarchitektur,(RA) - TU Dortmund

Outline

High-Level Optimization
§ Loop transformation
§ Loop tiling/blocking
§ Loop (nest) splitting

Heterogeneous System Architecture (HSA)
§ Integrated CPU/GPU platforms
§ Recent movement in chip designs

Architecture-aware software designs
§ Energy-efficiency issues
§ Darkroom
§ Halide

Multicore revolutions
§ Impact on the "safety-critical" industry sector

Page 44: Rechnerarchitektur,(RA) - TU Dortmund

Automotive Software

© Hamann, Kramer, Ziegenbein, Lukasiewycz (Bosch), 2016

Page 45: Rechnerarchitektur,(RA) - TU Dortmund

Assessment of Multi-Core Worst-Case Execution Behavior

© Hamann, Kramer, Ziegenbein, Lukasiewycz (Bosch), 2016

Page 46: Rechnerarchitektur,(RA) - TU Dortmund

Multi-Core Memory Access Models

© Hamann, Kramer, Ziegenbein, Lukasiewycz (Bosch), 2016

Page 47: Rechnerarchitektur,(RA) - TU Dortmund

An Industrial Challenge (FMTV 2017)

Precise analysis of worst-case end-to-end latencies
§ mainly due to different involved periods and time domains
§ What is the effect of memory layout and interconnect on the execution times?
§ Automatic optimized application and data mapping
§ Evaluation of digital (multi-core) execution platforms
§ Evaluation of software growth scenarios

© Hamann, Kramer, Ziegenbein, Lukasiewycz (Bosch), 2016

Page 48: Rechnerarchitektur,(RA) - TU Dortmund

MPPA-256 Processor Architecture (Kalray)

Kalray, 2016

Page 49: Rechnerarchitektur,(RA) - TU Dortmund

MPPA-256 NoC

Page 50: Rechnerarchitektur,(RA) - TU Dortmund

Safety-Critical Systems with Multicore Platforms

Goal: deploy multi-core processors for safety-critical real-time applications (avionics, automotive, ...)

Problem: concurrent use of shared resources (e.g. interconnect, main memory)
§ unknown access latency for a concrete resource access
§ complicated timing analysis
§ hardware platforms may not be predictable
§ many features are designed by computer architects for average cases only
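
Why a shared resource inflates worst-case timing can be illustrated with a deliberately simple model. The assumptions below are not from the slides: a round-robin arbiter, a fixed service time per access, at most one outstanding request per core, and no caches. Under these assumptions a request waits for at most one access of every other core before it is served. The function names are hypothetical.

```c
#include <assert.h>

/* Worst-case latency of one access under round-robin arbitration
   (assumed model): up to cores-1 foreign accesses, then the own one. */
unsigned long rr_worst_case_access(unsigned cores, unsigned long service_cycles)
{
    assert(cores >= 1);
    return (unsigned long)cores * service_cycles;
}

/* Worst-case execution time of a task that performs `accesses` accesses
   to the shared resource and `compute_cycles` cycles of pure computation. */
unsigned long rr_worst_case_task(unsigned cores,
                                 unsigned long service_cycles,
                                 unsigned long accesses,
                                 unsigned long compute_cycles)
{
    return compute_cycles
         + accesses * rr_worst_case_access(cores, service_cycles);
}
```

For example, with 5000 compute cycles, 100 accesses and 10 cycles per access, the bound grows from 6000 cycles on one core to 9000 cycles on four cores, although the task itself is unchanged. Real platforms with caches, buffers and other arbitration policies need far more detailed analysis.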

Solution?
§ Maybe it is up to you.
§ Did you see the above challenges?