Hotpar pyrprof poster ver10 - USENIX
Bridging the Parallelization Gap: Automating Parallelism Discovery and Planning
Saturnino Garcia, Donghwan Jeon, Chris Louie, Sravanthi Kota Venkata, and Michael Bedford Taylor
Computer Science and Engineering Department
University of California, San Diego

The Problem

Parallelization gap created by lack of parallelism discovery / planning tools.

A taxonomy of parallelization:
1. Discovery
2. Planning
3. Enabling Transforms: SUIF, Polaris, RawCC
4. Code Generation: OpenMP, Cilk, Cilk++
5. Runtime Management etc.
Parallelization Gap: Discovery and Planning (poorly filled by serial profiling tools, e.g. gprof)

Discovery Stage

Phase 1: Region ID and Instrumentation
• Build static region graph
  - Node: region (loops / functions)
  - Edge: direct reachability
• Static region graph is used during planning stage
• Tool flow: src -> pyrprof-cc (LLVM-based) -> instrumented binary (+ pyrprof-lib) -> trace
• pyrprof-cc instruments source code with calls to pyrprof profiling library functions
• LLVM used to inline and highly optimize instrumented code
• pyrprof-lib supports tracking of control and data dependences
  - Data dep: use shadow memory
  - Control dep: tracked with special stack

Phase 2: Dynamic Critical Path Calculation
• For each dynamic region, build control / data flow graph (CDFG)
• Critical Path: longest path through CDFG
• Work: cost of operations in a region
• Parallelism = Work / Critical Path
• Redundant profile data compressed using a dictionary-based technique

[Figure: Static Region Graph (regions A, B, C, D) and Dynamic Region Graph (region instances A0, B0, C0, D0, D1, ..., Dn), showing loop regions, function regions, instruction control / data dependences, and the critical path.]
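
The Phase 2 definitions can be illustrated with a minimal Python sketch. It treats one region's CDFG as a DAG of operations, sums the operation costs to get the work, takes the longest cost-weighted dependence chain as the critical path, and divides the two. The operation names, costs, and dependence edges are hypothetical, and the function names are chosen for illustration only; this is not pyrprof's implementation.

```python
from functools import lru_cache


def work(costs):
    """Work: total cost of all operations in the region."""
    return sum(costs.values())


def critical_path(costs, deps):
    """Critical path: longest cost-weighted path through the CDFG.

    costs maps operation -> cost; deps maps operation -> operations it
    depends on (control or data). The graph is assumed to be acyclic.
    """
    @lru_cache(maxsize=None)
    def finish(op):
        # Earliest finish = own cost + latest finish among predecessors.
        return costs[op] + max((finish(p) for p in deps.get(op, ())), default=0)

    return max(finish(op) for op in costs)


def parallelism(costs, deps):
    """Parallelism = Work / Critical Path."""
    return work(costs) / critical_path(costs, deps)


# Hypothetical region: two loads feed a multiply; an add uses one load;
# a store waits for both the multiply and the add.
costs = {"load_a": 1, "load_b": 1, "mul": 3, "add": 1, "store": 1}
deps = {"mul": ["load_a", "load_b"], "add": ["load_a"], "store": ["mul", "add"]}

print(work(costs), critical_path(costs, deps), parallelism(costs, deps))
```

In this toy region the work is 7 and the longest chain (load, mul, store) costs 5, so the parallelism is 1.4: even with unlimited processors the region could not run more than 1.4 times faster.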

Bridging the Gap - pyrprof

1. Follow time-tested gprof usage model
2. Profile not only work but also parallelism
3. Leverage region structure and parallelization status

[Figure: src -> pyrprof-cc -> instrumented binary -> trace -> pyrprof -> programmer, with feedback (exclusion) from the programmer back to pyrprof.]

Sample run:

$> pyrprof mpeg.trace –exclude=exclude.txt –n 5
The following regions will be excluded from recommendations: D, E

Rank  ID  Cum GNTES  Incr GNTES  File        Lines    Function            Type
1     A   3.14       3.14        motion.c    208-220  ptmotionestimation  loop
2     B   4.40       1.40        motion.c    211-220  ptmotionestimation  loop
3     G   5.50       1.25        transfrm.c  176-233  pttransform         loop
4     H   7.17       1.30        transfrm.c  249-305  ptitransform        loop
5     C   9.60       1.34        putpic.c    376-612  ptputpict           loop

GNTES = Guaranteed Not To Exceed Speedup
Cum. GNTES = GNTES compared to the serial version
Incr. GNTES = GNTES compared to the previous step
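
Because Cum. GNTES is measured against the serial version and Incr. GNTES against the previous step, each cumulative value should equal the previous cumulative value times the current incremental value, up to rounding. The Python sketch below checks this on the five rows of the sample pyrprof output (A, B, G, H, C). The function name and the 2% tolerance are assumptions made for illustration.

```python
# (region ID, Cum. GNTES, Incr. GNTES) as printed in the sample output.
ROWS = [
    ("A", 3.14, 3.14),
    ("B", 4.40, 1.40),
    ("G", 5.50, 1.25),
    ("H", 7.17, 1.30),
    ("C", 9.60, 1.34),
]


def check_cumulative(rows, tolerance=0.02):
    """Return (ID, predicted Cum. GNTES, within_tolerance) for each row.

    The prediction is previous Cum. GNTES * current Incr. GNTES, starting
    from 1.0 for the serial version. The tolerance absorbs the two-decimal
    rounding of the printed values.
    """
    results = []
    previous_cum = 1.0
    for region, cum, incr in rows:
        predicted = previous_cum * incr
        relative_error = abs(predicted - cum) / cum
        results.append((region, predicted, relative_error <= tolerance))
        previous_cum = cum
    return results


for region, predicted, ok in check_cumulative(ROWS):
    print(f"{region}: predicted {predicted:.2f}, consistent={ok}")
```

All five rows are consistent under this reading, which is why the last row's Cum. GNTES of 9.60 can be interpreted as the upper bound on speedup after parallelizing all five recommended regions.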

Planning Stage

Goal: provide an ordered list of regions ranked by the impact of their parallelization.

Execution Time Estimation Model

time(r, profile, P, E) = estimated execution time of the input program
  r : region to parallelize
  profile : data from discovery stage
  P : set of parallelized regions
  E : set of excluded regions

Proposed Algorithm: iteratively select the region that minimizes the total execution time.

[Figure: Example Static Region Graph with regions A, B, C, and D, marking serial, parallel, and excluded regions; serial exec time = 1000.]

Estimated Times for Parallelization Plan

Region  Step 1 (P = {})  Step 2 (P = {C})  Step 3 (P = {B, C})
B       800              400
C       600
D       700              590               390

Potential Parallelizations

Rank  Region  time()  Cum. GNTES  Incr. GNTES
0     serial  1000    1.0         1.0
1     C       600     1.66        1.66
2     B       400     2.5         1.5
3     D       390     2.56        1.03

Cum. GNTES = GNTES compared to the serial version
Incr. GNTES = GNTES compared to the previous step
sg
g1
{}{A,E, B, D
, I}
{A}
exclud
eE
correlatewith
parallelizab
ility
{}
{}
{}
ld
correlate with
parallelizab
ility
2{E}
{A,B,D
,G,H
}{A,B
}exclud
eD
py
Pllli
tiPl
2{E}
{A, B, D
,G, H
}{A, B
}exclud
e D
Estim
ated
Times
for
Parallelization Plan
3{D
E}{A
BG
HC}
{ABG
HC}
done
%lti
lf
lf
ttl
Estim
ated
Tim
es fo
r 3
{D, E}
{A,B, G
, H, C}
{A, B, G
, H, C}
done
% cumulative self self
total
PotentialParallelizations
Cum.
Incr.
time
seconds
seconds
calls
ms/call
ms/call
name
Potential Parallelizations
Rank
Region
time()
Cum.
GNTES
Incr.
GNTES
time seconds seconds
calls ms/call ms/call name
aeg
oe()
GNTES
GNTES
Plan
ning
Effectiven
essCo
mpa
rison
Programmer
33.34 0.02 0.02 7208 0.00 0.00 open
Region
Step
1Step
2Step
30
il
1000
10
10
Plan
ning
Effectivene
ss Com
parison
Programmer
p1667
003
001
244
004
012
offtime
Region
Step
1Step
2Step
30
serial
1000
1.0
1.0
exam
ines
region
E16.67 0.03 0.01 244 0.04 0.12 offtime
pyrprof
B80
040
01
C60
0166
166
exam
ines re
gion
E
16.67
0.04
0.01
81.25
1.25
memccpy
Region
gprof
pyrprof
pyrprofinteractive
B80
040
01
C60
01.66
1.66
andfin
dsou
titis
16.67 0.04 0.01 8 1.25 1.25 memccpy
1667
005
001
7143
143
it
Region
gprof
Initial
pyrprofinteractive
C60
0and fin
ds out it is
16.67 0.05 0.01 7 1.43 1.43 write
Initial
C60
02
B40
02.5
1.5
note
asily
D70
059
039
000
55
not e
asily
llli
bl…
A7
11
D70
059
039
03
D39
02.56
1.03
parallelizable
Al
ft
tA
71
13
D39
02.56
1.03
pA sam
ple gprofo
utpu
tP=
{}P={C}
P={B,C}
pgp
pB
83
2P=
{}P = {C}
P = {B, C}
Pllli
tiB
83
2Parallelization
blG
2741
3Cu

Case Study - mpeg encoder (ALPBench)

Iterative Workflow

Iteration  Excluded Regions  Top 5 Regions    Confirmed Parallelizable  Action
1          {}                {A, E, B, D, I}  {A}                       exclude E
2          {E}               {A, B, D, G, H}  {A, B}                    exclude D
3          {D, E}            {A, B, G, H, C}  {A, B, G, H, C}           done

Planning Effectiveness Comparison

Rank: position in which region appears in tool output.

Region  gprof  pyrprof (Initial)  pyrprof (interactive)
A       7      1                  1
B       8      3                  2
G       27     41                 3
H       31     29                 4
C       35     35                 5
F       42     13                 6
D       3      4                  excluded
E       5      2                  excluded
Avg     28.4   23.8               3.8

Avg: average rank of 5 exploited regions in ALP.
Legend: exploited region in ALP; not exploited in ALP, but parallelizable; difficult to parallelize.
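
The Avg row of the Planning Effectiveness Comparison (28.4 for gprof, 23.8 for the initial pyrprof run, 3.8 for interactive pyrprof) can be reproduced with a short Python sketch. Which five regions count as exploited in ALP is an assumption here: A, G, H, C, and F are used because that subset reproduces the reported averages, and the per-region ranks are the values as best they can be read from this poster.

```python
# region -> (gprof rank, pyrprof initial rank, pyrprof interactive rank)
# Assumption: these five regions are the ones exploited in ALP.
EXPLOITED_RANKS = {
    "A": (7, 1, 1),
    "G": (27, 41, 3),
    "H": (31, 29, 4),
    "C": (35, 35, 5),
    "F": (42, 13, 6),
}


def average_ranks(ranks):
    """Average each tool's rank column over the given regions."""
    count = len(ranks)
    columns = zip(*ranks.values())  # one tuple of ranks per tool
    return tuple(sum(column) / count for column in columns)


gprof_avg, initial_avg, interactive_avg = average_ranks(EXPLOITED_RANKS)
print(f"gprof={gprof_avg:.1f} initial={initial_avg:.1f} "
      f"interactive={interactive_avg:.1f}")
```

A lower average rank means the regions that were actually worth parallelizing appear nearer the top of the tool's output, so the programmer examines fewer unprofitable regions first.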

Availability

pyrprof is available for free download at:
http://parallel.ucsd.edu/pyrprof