Hotpar pyrprof poster ver10 - USENIX

Bridging the Parallelization Gap: Automating Parallelism Discovery and Planning

Saturnino Garcia, Donghwan Jeon, Chris Louie, Sravanthi Kota Venkata, and Michael Bedford Taylor
Computer Science and Engineering Department
University of California, San Diego


The Problem

Parallelization gap created by lack of parallelism discovery / planning tools.

A taxonomy of parallelization:
  1. Discovery
  2. Planning
  3. Enabling Transforms (SUIF, Polaris, RawCC)
  4. Code Generation (OpenMP, Cilk++, etc.)
  5. Runtime Management

Discovery and Planning make up the Parallelization Gap, which is poorly filled by serial profiling tools (e.g. gprof).


Shortcomings of Current Tools

Problem 1: High work coverage does not correlate with parallelizability.

  A sample gprof output:

    %   cumulative   self              self     total
   time   seconds   seconds    calls  ms/call  ms/call  name
   33.34      0.02     0.02     7208     0.00     0.00  open
   16.67      0.03     0.01      244     0.04     0.12  offtime
   16.67      0.04     0.01        8     1.25     1.25  memccpy
   16.67      0.05     0.01        7     1.43     1.43  write

  Programmer examines region E and finds out it is not easily parallelizable.

Problem 2: In planning, region structure and parallelization status must be considered.

  [Figure: Static Region Graph over regions A, B, C, D. Legend: serial region, parallel region, directly reachable.]

  Parallelizing region D would not be profitable when C is already parallelized.


Bridging the Gap: pyrprof

  1. Follow time-tested gprof usage model
  2. Profile not only work but also parallelism
  3. Leverage region structure and parallelization status

[Figure: pyrprof workflow. Recovered labels: src, pyrprof-cc (LLVM-based), pyrprof-lib, instrumented binary, trace, pyrprof, parallelization plan, programmer, feedback (exclusion).]


Discovery Stage

Phase 1: Region ID and Instrumentation

  Build static region graph
    - Node: region (loops / functions)
    - Edge: direct reachability
  Static region graph is used during planning stage.

  pyrprof-cc
    - Instruments source code with calls to pyrprof profiling library functions
    - LLVM used to inline and highly optimize instrumented code

  pyrprof-lib
    - Supports tracking of control and data dependences
    - Redundant profile data compressed using a dictionary-based technique

  [Figure: Static Region Graph with nodes A, B, C, D. Legend: loop region, func region.]

Phase 2: Dynamic Critical Path Calculation

  For each dynamic region, build control / data flow graph (CDFG)
    - Critical Path: longest path through CDFG
    - Work: cost of operations in a region
    - Data dep: use shadow memory
    - Control dep: tracked with special stack

  Parallelism = Work / Critical Path

  [Figure: Dynamic Region Graph with nodes A0, B0, C0, D0, D1, ..., Dn. Legend: instruction, control / data dependence, critical path.]


Planning Stage

Goal: provide an ordered list of regions ranked by the impact of their parallelization.

Execution Time Estimation Model

  time(r, profile, P, E) = estimated execution time of the input program
    r       : region to parallelize
    profile : data from discovery stage
    P       : set of parallelized regions
    E       : set of excluded regions

Proposed Algorithm: Iteratively select the region that minimizes the total execution time.

  [Figure: Example Static Region Graph with regions A, B, C, D; serial exec time = 1000. Legend: serial region, excluded region, directly reachable.]

Estimated Times for Potential Parallelizations

  Region   Step 1     Step 2      Step 3
           P = {}     P = {C}     P = {B, C}
  B        800        400
  C        600
  D        700        590         390

  Parallelization of C reduces the benefit of parallelizing D.

Parallelization Plan

  Rank   Region   time()   Cum. GNTES   Incr. GNTES
  0      serial   1000     1.0          1.0
  1      C        600      1.66         1.66
  2      B        400      2.5          1.5
  3      D        390      2.56         1.03

  GNTES       = Guaranteed Not To Exceed Speedup
  Cum. GNTES  = GNTES compared to the serial version
  Incr. GNTES = GNTES compared to the previous step


Case Study: mpeg encoder (ALPBench)

  $> pyrprof mpeg.trace –exclude=exclude.txt –n 5
  The following regions will be excluded from recommendations: D, E

  Rank  ID  Cum   Incr  File        Lines      Function             Type
  1     A   3.14  3.14  motion.c    208 - 220  ptmotion estimation  loop
  2     B   4.40  1.40  motion.c    211 - 220  ptmotion estimation  loop
  3     G   5.50  1.25  transfrm.c  176 - 233  pttransform          loop
  4     H   7.17  1.30  transfrm.c  249 - 305  ptitransform         loop
  5     C   9.60  1.34  putpic.c    376 - 612  ptputpict            loop

Iterative Workflow

  Iteration  Excluded Regions  Top 5 Regions     Confirmed Parallelizable  Action
  1          {}                {A, E, B, D, I}   {A}                       exclude E
  2          {E}               {A, B, D, G, H}   {A, B}                    exclude D
  3          {D, E}            {A, B, G, H, C}   {A, B, G, H, C}           done

Planning Effectiveness Comparison

  Region  gprof  pyrprof (initial)  pyrprof (interactive)
  A       7      1                  1
  B       8      3                  2
  G       27     41                 3
  H       31     29                 4
  C       35     35                 5
  F       42     13                 6
  D       3      4                  excluded
  E       5      2                  excluded
  Avg     28.4   23.8               3.8

  Rank: position in which region appears in tool output.
  Avg: average rank of 5 exploited regions in ALP.
  Legend: exploited region in parallel version of ALP; not exploited in ALP, but parallelizable; difficult to parallelize.


Availability

pyrprof is available for free download at: http://parallel.ucsd.edu/pyrprof
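
The Phase 2 relation Parallelism = Work / Critical Path can be illustrated on a toy dependence graph. The sketch below is not pyrprof's implementation: the graph, the node names, and the function work_and_critical_path are hypothetical, and it assumes the CDFG is acyclic with one cost per operation.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def work_and_critical_path(cost, deps):
    """Return (work, critical_path) for an acyclic dependence graph.

    cost: operation -> cost of that operation
    deps: operation -> set of operations it depends on
    Both structures are illustrative stand-ins for a CDFG.
    """
    work = sum(cost.values())  # Work: total cost of all operations
    finish = {}  # earliest finish time of each operation
    graph = {node: deps.get(node, ()) for node in cost}
    for node in TopologicalSorter(graph).static_order():
        start = max((finish[p] for p in deps.get(node, ())), default=0)
        finish[node] = start + cost[node]
    # Critical path: cost of the longest dependence chain
    critical_path = max(finish.values(), default=0)
    return work, critical_path


# Hypothetical region: b and c both depend on a, and d depends on b and c.
cost = {"a": 1, "b": 2, "c": 3, "d": 1}
deps = {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
work, critical_path = work_and_critical_path(cost, deps)
parallelism = work / critical_path
print(work, critical_path, parallelism)
```

Here b and c can overlap, so the critical path (a, c, d) is shorter than the total work, and the ratio of the two bounds how much a parallel version of the region could gain.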

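The proposed planning algorithm (iteratively select the region that minimizes the total execution time) can be sketched with the numbers from the poster's Planning Stage example. This is only an illustration under stated assumptions: the lookup table ESTIMATES stands in for time(r, profile, P, E), and the names estimate_time and plan are hypothetical, not part of pyrprof.

```python
# Hypothetical stand-in for time(r, profile, P, E): estimated program times
# copied from the poster's "Estimated Times for Potential Parallelizations"
# table, keyed by the set P of already parallelized regions.
ESTIMATES = {
    frozenset(): {"B": 800, "C": 600, "D": 700},
    frozenset({"C"}): {"B": 400, "D": 590},
    frozenset({"B", "C"}): {"D": 390},
}
SERIAL_TIME = 1000  # serial exec time in the poster's example


def estimate_time(region, parallelized):
    """Look up the estimated program time if `region` is parallelized next."""
    return ESTIMATES[frozenset(parallelized)][region]


def plan(regions, excluded=frozenset()):
    """Greedy plan: repeatedly pick the region giving the lowest estimate.

    Returns rows of (rank, region, time, cumulative GNTES, incremental GNTES).
    The table above only covers the selection order of the poster's example.
    """
    candidates = set(regions) - set(excluded)
    parallelized = set()
    previous_time = SERIAL_TIME
    rows = []
    while candidates:
        best = min(sorted(candidates),
                   key=lambda r: estimate_time(r, parallelized))
        t = estimate_time(best, parallelized)
        cum_gntes = SERIAL_TIME / t      # compared to the serial version
        incr_gntes = previous_time / t   # compared to the previous step
        rows.append((len(rows) + 1, best, t, cum_gntes, incr_gntes))
        parallelized.add(best)
        candidates.remove(best)
        previous_time = t
    return rows


rows = plan(["B", "C", "D"])
for rank, region, t, cum, incr in rows:
    print(f"{rank} {region} {t} {cum:.2f} {incr:.2f}")
```

With these estimates the selection order is C, B, D, matching the poster's Parallelization Plan. Passing excluded={"D"} mimics the exclusion feedback of the iterative workflow and leaves only C and B in the plan.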