Heiko Schröder, 2003

Post on 19-Jan-2016


Routing? Sorting? Image processing? Sparse matrices?

Reconfigurable meshes!

Heiko Schröder, 2003 Reconfigurable mesh 2

Reconfigurable architectures

• FPGAs

• reconfigurable multibus

• reconfigurable networks (Transputers, PVM)

• dynamically reconfigurable mesh

Aim: efficiency

special purpose --> general purpose architectures

Contents

1.) Motivation for the reconfigurable mesh

2.) Routing (and sorting):

• better than PRAM

• better than mesh

3.) Image processing

4.) Sparse matrix multiplication

5.) Bounded bus length

PRAM

[Figure: two rows of nodes numbered 0 to 9 joined through a permuted row; a cut marks the bisection.]

Diameter: O(1); bisection width: Θ(n)

Variants: EREW, CRCW

Mesh/Torus

Diameter: Θ(√n); bisection width: Θ(√n)

2D mesh

Hypercube

[Figure: hypercubes of dimension 0 to 4; nodes are labelled with binary addresses (0, 1; 00 to 11; 000 to 111).]

Diameter: O(log n); bisection width: Θ(n)

Reconfigurable mesh

reconfigurable mesh = mesh + interior connections

15 positions

diameter 1 !!

low cost

Global OR

[Figure: the bit row 1 0 0 0 0 1 0 on a bus; the marked processors (*) signal on the bus and the result is the OR (“V”).]

Time: O(1) on RM -- Ω(log n) on EREW-PRAM
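The global OR above can be modelled in a few lines. This is only a sketch under the assumption that every processor holding a 1 drives one shared bus in the same cycle; the name `global_or` is illustrative and not taken from the slides.

```python
def global_or(bits):
    """Model one bus cycle: the bus is high if any processor drives it."""
    bus = False
    for b in bits:      # in hardware all processors act in the same cycle
        if b:
            bus = True  # a processor holding a 1 writes to the shared bus
    return int(bus)


print(global_or([1, 0, 0, 0, 0, 1, 0]))
```

The loop is sequential only because this is a software model; on the reconfigurable mesh the whole row settles in one step.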

Prefix sum

[Figure: the bit row 0 1 1 0 1 0 0 1 1 1 on an array with rows numbered 0 to 9; the signal (*) leaves at row 6.]

Fast but expensive

Time: O(1)
Area: Θ(n x n)
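A minimal sketch of how the exit row encodes the count, assuming the usual staircase configuration: a column holding a 1 shifts the signal up by one row, and a column holding a 0 passes it straight through. The function name is hypothetical.

```python
def staircase_prefix_sums(bits):
    """Follow one signal through the array, column by column.

    The row at which the signal leaves column j equals the number of
    ones among bits[0..j], i.e. the prefix sum.
    """
    row = 0
    prefix = []
    for b in bits:
        if b:
            row += 1        # a 1 connects the bus one row higher
        prefix.append(row)  # a 0 passes the signal straight through
    return prefix


bits = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]
print(staircase_prefix_sums(bits)[-1])  # exit row for the slide's bit row
```

For the bit row on the slide the signal leaves at row 6, which matches the figure.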

Modulo 3 counter

[Figure: the bit row 1 0 1 1 1 0 on a 3-row array; the signal (*) leaves at row 1, i.e. the count is 1 mod 3.]

Time: O(1) on RM -- Ω(log n / log log n) on CRCW-PRAM

Modulo k² counter (ranking)

• Two-digit numbers in base k represent all numbers smaller than k².

• 1.) Determine x mod k (= least significant digit).

• 2.) Count the number of “wraps” (= most significant digit).

[Figure: the bit row 1 0 1 1 1 0; the signal (*) leaves at row 1, i.e. the count is 1 mod k.]

--> modulo k² counting in 2 steps on a k x k² array
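The two-digit scheme can be sketched as follows. This is a software model, not the bus configuration itself; `count_mod_k2` is a hypothetical name, and the loop stands in for the two constant-time bus steps.

```python
def count_mod_k2(bits, k):
    """Count the ones in `bits` modulo k*k using two base-k digits."""
    lsd = 0    # row of the signal on a cyclic k-row staircase (count mod k)
    wraps = 0  # how often the signal wrapped from row k-1 back to row 0
    for b in bits:
        if b:
            lsd += 1
            if lsd == k:
                lsd = 0
                wraps += 1
    msd = wraps % k  # second step: count the wraps, again modulo k
    return msd * k + lsd


print(count_mod_k2([1, 0, 1, 1, 1, 0], 3))
```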

Enumeration / prefix sum

1 1 1 1 1 1 1 1
1 2 1 2 1 2 1 2
1 2 3 4 1 2 3 4
1 2 3 4 5 6 7 8

time: O(log n)

wire efficiency ! -- (compared with tree)
1/2 number of processors
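The four rows of the table can be reproduced with a doubling scheme. The sketch below assumes n is a power of two and that every position holds an item; the function name is illustrative.

```python
def enumerate_by_doubling(n):
    """Return the rank vector after each doubling step (n a power of two)."""
    ranks = [1] * n
    history = [ranks[:]]
    size = 1
    while size < n:
        # In every block of 2*size items, the upper half adds the number
        # of items in the lower half, which is `size`.
        ranks = [r + size if (i // size) % 2 == 1 else r
                 for i, r in enumerate(ranks)]
        history.append(ranks[:])
        size *= 2
    return history


for row in enumerate_by_doubling(8):
    print(*row)
```

The number of doubling steps is log2(n), which is where the O(log n) time comes from.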

Permutation routing - 2 steps

n x n

2 steps !!!

Kunde’s all-to-all mapping

Sorting:
sort blocks
all-to-all (columns)
sort blocks
all-to-all (rows)
o-e-sort blocks

Sorting in constant time

[Figure: array partitioned into blocks, with dimension labels n^(2/3) and n^(1/3); one block is marked.]

Complete sort:
sort blocks
all-to-all (2)
sort blocks
all-to-all (2)
o-e-sort blocks

Sort blocks:
broadcast (1)
broadcast (1)
rank (2)


• better than PRAM --- but useless!!!

Kunde’s all-to-all mapping

[Figure: n x n array divided into blocks; dimension label n^(2/3).]

Vertical all-to-all

Horizontal all-to-all

Use of bus -- no conflict

Steps per phase: 1, 2, 3, ..., k/2, ..., 3, 2, 1

Total: (k/2)² steps

Sorting in optimal time (Kunde / Schröder)

(k/2)² steps, k = n^(1/3)

each step takes n^(1/3) time --> T = n/4

x 2

T = n/2 (all-to-all)

Sorting:
sort blocks (O(n^(2/3)))
all-to-all (n/2)
sort blocks (O(n^(2/3)))
all-to-all (n/2)
o-e sort blocks (O(n^(2/3)))
(snake-like order of blocks)

time: n + o(n)

[Figure annotations: "x 2", "/2"]
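The arithmetic behind T = n/2 can be checked numerically. The sketch below takes k as the input and sets n = k³, so that k = n^(1/3) holds exactly; it assumes k is even, and the function name is hypothetical.

```python
def all_to_all_terms(k):
    """Return (n, steps, one_pass, both_passes) for even k and n = k**3."""
    n = k ** 3
    half = k // 2
    # Bus phases need 1, 2, ..., k/2, ..., 2, 1 steps.
    steps = sum(range(1, half + 1)) + sum(range(1, half))
    one_pass = steps * k        # each step takes n^(1/3) = k time units
    both_passes = 2 * one_pass  # the "x 2" on the slide
    return n, steps, one_pass, both_passes


n, steps, one_pass, both_passes = all_to_all_terms(8)
print(n, steps, one_pass, both_passes)
```

For k = 8 this gives 16 = (k/2)² steps, n/4 for one pass and n/2 for the complete all-to-all, in line with the figures on the slide.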

Why optimal?

Sorter for n keys

Bisection of data with k wires

Sorting time ≥ n/k

Use of theorem

1.) n keys on a k x k RM: time ≥ n/k

Proof: Wherever the data is stored, there is always a bisection of length k; this can be demonstrated by sweeping from left to right through the array. Q.e.d.

2.) n x n keys on an n x n RM: time ≥ n.

Proof: trivial

n + o(n)

Optimal --- but ...

Enumeration / prefix sum

1 1 1 1 1 1 1 1
1 2 1 2 1 2 1 2
1 2 3 4 1 2 3 4
1 2 3 4 5 6 7 8

time: O(log n)

wire efficiency ! -- (compared with tree)
1/2 number of processors

ABCD-routing

• move and smooth

[Figure: the mesh divided into four quadrants A, B, C and D.]

Row-major enumeration of A, B, C and D packets within each quadrant in time 4 log n. Determine the destination position of each packet.

Elementary steps

[Figure: packets numbered 1 to 10 shown in their initial placement and after each of the steps "move", "smooth" and "collect".]

Time analysis

[Figure: the A, B, C and D packets during move, smooth and collect.]

time: 3 x n/2

T = 3n + o(n)

T < 2n

4 destination squares: time 3n + 4 log n

16 destination squares: time 2n + 16 log n

64 destination squares: time 12/7 n + 64 log n

mesh diameter: 2n
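The three bounds can be compared numerically. The sketch below assumes base-2 logarithms (the slide does not state the base) and uses a hypothetical helper `routing_time`.

```python
import math

# Leading coefficients taken from the slide, keyed by number of destination squares.
COEFF = {4: 3.0, 16: 2.0, 64: 12 / 7}


def routing_time(n, squares):
    """Evaluate coeff * n + squares * log2(n) for one of the three variants."""
    return COEFF[squares] * n + squares * math.log2(n)


for n in (1024, 4096, 65536):
    times = {s: round(routing_time(n, s)) for s in COEFF}
    print(n, times, "mesh diameter:", 2 * n)
```

Under this assumption the logarithmic term still dominates the gain for small n; the 64-square variant only drops below the mesh diameter 2n once n is large enough.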

Enough of routing/sorting

Constant factor! Can we do better? What kind of problems?

Image processing
Sparse problems!

Image processing

• Border following

• Edge detection

• Component labeling

• Skeletons

• Transforms

Component labelling

Object
Define border (candidates)
Set bus

While own label is not received:
1.) Candidates break the bus and send their label
    a) clockwise
    b) anti-clockwise
2.) Candidates switch off and restore the bus if they see a smaller label

Time: O(1) -- O(log n)
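The elimination loop above can be simulated on a single closed border. This is a sketch under stated assumptions: labels are distinct, the border is modelled as a ring, and each candidate sees exactly the labels of its two nearest surviving candidates. The function name is illustrative.

```python
def label_component(border_labels):
    """Simulate the candidate elimination on one closed border.

    border_labels: distinct labels of the border processors in bus order.
    Returns (winning_label, rounds).
    """
    candidates = list(border_labels)  # every border processor starts as a candidate
    rounds = 0
    while True:
        rounds += 1
        if len(candidates) == 1:
            # The label of the only remaining candidate travels around the
            # whole ring and returns to it: "own label is received".
            return candidates[0], rounds
        m = len(candidates)
        survivors = []
        for i, label in enumerate(candidates):
            left = candidates[(i - 1) % m]   # label arriving anti-clockwise
            right = candidates[(i + 1) % m]  # label arriving clockwise
            if label < left and label < right:
                survivors.append(label)      # saw no smaller label, stays on
        candidates = survivors


print(label_component([5, 3, 8, 1, 9, 4]))
```

A candidate survives a round only if it is smaller than both neighbours, so at most half of the candidates remain after each round; this is consistent with the O(log n) worst case on the slide.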

Transforms

• Wavelet transform: time log n on RM -- time n on mesh

• FFT: time n on RM and mesh

• Hough transform: time m x log n on RM -- time m x n on mesh

Systolic matrix multiplication

[Figure: matrices A and B are fed into the array holding C; the cell in row i and column j accumulates c_ij.]

$c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$

time: n

Sparse matrix multiplication

A x B = C

$c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$

Time: n (n x n mesh)

A and B column sparse: Θ(k²)
A and B row sparse: Θ(k²)
A row sparse, B column sparse: Θ(k²)
A column sparse, B row sparse: k√n

Unlimited bus length

• ring broadcast

[Figure: ring broadcast of three items (1, 2, 3) over three steps.]

A row-sparse, B column-sparse

Repeat r times
Begin
    horizontal ring broadcast a_ik
    Repeat c times
        vertical ring broadcast b_kj
End.

[Figure: arrays A, B and C; r is marked on A and c on B.]
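The nested broadcast loop above can be imitated in software. This is a minimal sketch, not the mesh algorithm itself: it assumes that in round (p, q) row i of the mesh sees the p-th non-zero of row i of A, column j sees the q-th non-zero of column j of B, and processor (i, j) multiplies them when their inner indices agree. All names are hypothetical.

```python
def rc_sparse_multiply(A, B):
    """Multiply row-sparse A by column-sparse B; return (C, rounds)."""
    n = len(A)
    # Non-zeros of each row of A as (k, a_ik), of each column of B as (k, b_kj).
    rows = [[(k, A[i][k]) for k in range(n) if A[i][k] != 0] for i in range(n)]
    cols = [[(k, B[k][j]) for k in range(n) if B[k][j] != 0] for j in range(n)]
    r = max((len(x) for x in rows), default=0)
    c = max((len(x) for x in cols), default=0)
    C = [[0] * n for _ in range(n)]
    rounds = 0
    for p in range(r):            # horizontal ring broadcast of the p-th a_ik
        for q in range(c):        # vertical ring broadcast of the q-th b_kj
            rounds += 1
            for i in range(n):
                if p >= len(rows[i]):
                    continue
                ka, a = rows[i][p]
                for j in range(n):
                    if q >= len(cols[j]):
                        continue
                    kb, b = cols[j][q]
                    if ka == kb:  # indices meet at processor (i, j)
                        C[i][j] += a * b
    return C, rounds


A = [[1, 0, 2], [0, 3, 0], [0, 0, 4]]
B = [[5, 0, 0], [0, 6, 0], [7, 0, 8]]
print(rc_sparse_multiply(A, B))
```

The round counter equals r x c, matching the k² entry for this case when r = c = k; the inner loops over i and j stand for work that all processors do in parallel.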

A and B column-sparse

Repeat c1 times
Begin
    horizontal ring broadcast a_ik (meets b_kj)
    Repeat c2 times
        vertical ring broadcast product to final position
End.

[Figure: arrays A and B/C (transposed, "T"); c1 is marked on A and "c2 elements" on B/C; indices i, j and k are marked.]

Lower bound (c, r), c = r = k

$c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$

[Figure: matrices A, B and C for k = 3 and n = 48, with regions of side √(nk).]

t ≥ √(nk)

Splitting the problem

Repeat k times
Begin
    vertical ring broadcast
    Repeat s times
        horizontal ring broadcast
End.

[Figure: arrays A and B/C (transposed, "T"); the first s elements are marked in A and "k B-elements" in B/C.]

A = A_s + A_r

C = A_s B + A_r B

C = C_s + C_r

time: ks

CR

[Figure: array A with the first s elements marked.]

A has nk non-zero elements.
A_r has at most nk/s non-zero rows.
For s = √n, A_r has at most k√n non-zero rows.

A_s B is an RR-problem; it takes time k√n.

A = A_s + A_r
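The split A = A_s + A_r can be sketched as below, assuming that A_s keeps the first s non-zero elements of every row and A_r holds the remainder. The helper names are illustrative and not from the slides.

```python
def split_by_row(A, s):
    """Return (A_s, A_r): first s non-zeros of each row, and the rest."""
    n = len(A)
    A_s = [[0] * n for _ in range(n)]
    A_r = [[0] * n for _ in range(n)]
    for i, row in enumerate(A):
        seen = 0
        for j, v in enumerate(row):
            if v != 0:
                if seen < s:
                    A_s[i][j] = v
                else:
                    A_r[i][j] = v
                seen += 1
    return A_s, A_r


def nonzero_rows(M):
    """Number of rows that contain at least one non-zero element."""
    return sum(1 for row in M if any(v != 0 for v in row))


# Column-sparse example with k = 1 non-zero per column, n = 4, s = 2.
A = [[1, 2, 3, 4], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
A_s, A_r = split_by_row(A, 2)
print(A_s[0], A_r[0], nonzero_rows(A_r))
```

A row appears in A_r only if it holds more than s non-zeros in A, so with nk non-zeros in total there can be at most nk/s such rows, which is the bound stated above.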

A_r B: calculating products

[Figure: A_r and B/C (transposed, "T"), with labels "k A-elements" and "k B-elements".]

time: k²

elementsk n2

Column sum

[Figure: rows i-1, i and i+1 and column j; the sum is formed in row i.]

time: log n

only k√n elements per column

routing time: k√n

Routing within columns

routing time: k√n

Reconfigurable architectures

Reconfigurable mesh?

constant diameter!

No!!!

Physical laws!

Physical limits

c = 300 000 km/s

• 30 cm/ns

• on chip: 1 cm/ns

• --> bounded bus length

good idea!

Bounded broadcast

[Figure: items 1, 2 and 3 spreading step by step along a ring with bounded bus length.]

time: k + n/l

Creating main stations

[Figure: stations 1 2 3 1 2 3 1 2 3 along the ring; each item is then copied to its group of positions.]

time: k

A row-sparse, B column-sparse

Create main stations 1,…,k for A and B (time: n/l+k)

For i=1,…,k do
Begin
    horizontal ring broadcast i of A
    For j=1,…,k do
        vertical ring broadcast j of B
End.

[Figure: arrays A, B and C; k is marked on A and on B.]

A and B column-sparse

Create main stations 1,…,k for A (time: n/l+k)

For i=1,…,k do
Begin
    horizontal ring broadcast i
    k bounded vertical broadcasts of products
    merging new products
End.

[Figure: arrays A and B/C (transposed, "T"); k is marked on A and "k elements" on B/C; index i is marked.]

Remove minor stations

[Figure: items 1, 2 and 3 along the ring before and after the minor stations are removed.]

Results

Time: n (n x n mesh)

First value: unbounded bus; second value: bus length bounded by l.
A and B column sparse: Θ(k²); Θ(k² + 2n/l)
A and B row sparse: Θ(k²); Θ(k² + 2n/l)
A row sparse, B column sparse: Θ(k²); Θ(k² + n/l)
A column sparse, B row sparse: 3k√n (+ 11n/l)

• image processing

• sorting

• routing

• load balancing

better than the mesh!

(Kunde, Middendorf, Schmeck, Schröder, Turner)

(Kapoor, Kunde, Kaufmann, Schroeder, Sibeyn)

• The RM is in some cases “better” than PRAM
• The RM is always at least as “good” as mesh
• The RM is often “better” than the mesh

Fault tolerant on-board computing

10 km/s
1 image/s
100 Mbit/image
4000 s/orbit
400 Gbit/orbit
download: 400 Mbit/orbit

On-board image analysis and compression

[Figure labels: 800 km; Singapore]
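The figures above imply how much the data has to shrink on board. A short check of that arithmetic, using only the numbers from the slide (the variable names are illustrative):

```python
images_per_s = 1
mbit_per_image = 100
s_per_orbit = 4000
download_mbit_per_orbit = 400

# Data acquired during one orbit, in Mbit (400 000 Mbit = 400 Gbit).
acquired_mbit_per_orbit = images_per_s * mbit_per_image * s_per_orbit

# Factor by which on-board analysis and compression must reduce the data.
reduction = acquired_mbit_per_orbit / download_mbit_per_orbit

print(acquired_mbit_per_orbit, reduction)
```

The acquired volume is 1000 times the download capacity, which is the motivation for on-board image analysis and compression.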

Due to radiation:
• Single event upsets (many)
• latch-ups (extra hardware)
• total loss (rare at 800 km)

1 task per processor
several tasks/instruments/sources per processor
1 task per 3 to 4 processors

16 processors (+ spares?)
fault tolerant reconfigurable network

Methods currently used

shadow processors
majority voting

Byzantine systems
ASTRIUM, deep space

1 CAN
2 CANs

• Industrial spec.

• mil-spec.
• radiation tolerant
• radiation hardened

386 is modern

Fault tolerance through reconfiguration in regular networks

A simple solution with high fault tolerance (torus)

Every fault pattern that does not contain a 2x2 array of faulty PEs survives.

PS(7) = 0.7

[Figure legend: processor; data source / instrument; “atomic fault pattern”]
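The survival condition can be expressed as a small check. The sketch assumes that the 2x2 "atomic fault pattern" may also wrap around the edges of the torus; the slides do not spell this out, and the function name is hypothetical.

```python
def survives(faulty, rows, cols):
    """Return True if no 2x2 block of the torus consists only of faulty PEs.

    faulty: set of (row, col) positions of faulty processing elements.
    """
    for r in range(rows):
        for c in range(cols):
            block = {
                (r, c),
                (r, (c + 1) % cols),
                ((r + 1) % rows, c),
                ((r + 1) % rows, (c + 1) % cols),
            }
            if block <= faulty:  # all four PEs of this block are faulty
                return False
    return True


print(survives({(0, 0), (0, 1), (1, 0)}, 4, 4))
print(survives({(0, 0), (0, 1), (1, 0), (1, 1)}, 4, 4))
```

Counting the surviving patterns for a given number of faults, by enumeration or sampling, would give an estimate of the survival probability PS; that step is not shown here.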


To the right

up

Replacement paths

Replace to the right -- 1 fault per row

Faulty processors


To the right

Replacement paths


Preserving horizontal connections Preserving vertical connections

spares Replacement paths

Two separate networks for horizontal and vertical connections

Number of switches: 2, 4, 6, 30
Wire area: 0, 2.8, 3.8, 4

[Figure: switch with ports N, E, S, W, P and I.]

kNp 1

1

[Figure: plot of PS (up to 1) against the number of faults (1 to 16).]

Nemo

… an array of SHARCs to provide a throughput of 160 Mb/s.
… 2.5 billion floating point operations per second.
… first demonstration of real-time image processing in space.

Image cube from a 30 km wide swath of Korea’s coastline. (Launch: 2001?)

1.) You need to know which algorithm you want to use. In image processing (unless you use the FFT; wavelets can also be a problem) you can usually assume that every calculation depends only on data in the close neighbourhood. As a result, at any stage of the processing you need to hold only the data of 3 neighbouring planes in memory.

2.) You have several choices for processing the data (each time you do all the processing related to a single plane in parallel). You can slice the cube into horizontal planes or into vertical planes (sometimes it is desirable to cut "diagonally").

3.) It is important that you read every data element only once.

4.) Such data processing is called systolic, i.e. you move the raw data through the architecture with constant speed and constant direction.

In the picture below I have cut the data cube vertically; obviously there are many directions for slicing vertically.

If the memory of the processors is large enough, you might be able to hold the complete cube in memory; then you can also process FFTs and wavelets easily.
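Points 1.) to 4.) can be illustrated with a streaming loop that keeps only three neighbouring planes in memory and reads each plane once. This is a sketch under assumptions: planes are plain lists of lists, and the neighbourhood operation (`vertical_max`) is an arbitrary stand-in for a real image-processing kernel.

```python
from collections import deque


def process_cube(planes, combine):
    """Stream planes through a 3-plane window, reading each plane once.

    planes:  iterable yielding 2-D planes (lists of lists) in slicing order.
    combine: function of (previous, current, next) returning an output plane.
    """
    window = deque(maxlen=3)  # the oldest plane is dropped automatically
    for plane in planes:
        window.append(plane)
        if len(window) == 3:
            yield combine(*window)


def vertical_max(prev, cur, nxt):
    """Example kernel: element-wise maximum over the three planes."""
    return [[max(a, b, c) for a, b, c in zip(ra, rb, rc)]
            for ra, rb, rc in zip(prev, cur, nxt)]


cube = ([[i]] for i in range(5))  # five 1x1 planes, produced one at a time
print(list(process_cube(cube, vertical_max)))
```

Because the input is a generator, no plane can be read twice, and memory use stays at three planes regardless of the depth of the cube.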

The satellite efficiency cube

Compression ratio (CR = 4, loss-less)

Segmentation gain (SG = 16; 1/16 of a useful image is useful)

Classification gain (CG = 5; 1 in 5 images contains useful information)

[Figure: cube with origin (0,0,0); corner labels U=.2, U=.8, U=1, U=4, U=16, U=32, U=64; further labels "Not likely" and "LOSSY=60".]

Our aim: high performance via COTS
16 processors (+ spares), off-the-shelf,
connected via a fault tolerant reconfigurable network

In X-SAT restricted to image processing

Mesh/torus

[Figure labels: processors; fault tolerant mesh; on-board]

[Figure labels: switch; current communication; FPGA; ctrl (h/v, o/e, r/w); instructions to PEs; link to PE]

spares

C3 -- torus

spares

Replacement algorithm exists for up to 4 faults.
Reconfiguration software runs on FPGA.
Could be repaired within << 1 sec.

[Figure labels: ctrl (h/v, o/e, r/w); instructions to PEs; diagnostic; set switches]

4 FPGAs, each connected to k*(k+1) SA-processors

k+1 horizontal and vertical connections plus diagonals

Theorem: Given a 2x2 array of FPGAs, each connected to k²+k processors, with k+1 vertical and horizontal connections and 1 connection in the diagonals, the processors can be connected to a 2k x 2k mesh as long as the number of working processors is at least 4k².

Proof: There are 3 cases, which we treat separately:
1. Two neighbouring FPGAs have more than k² working processors.
2. No two neighbouring FPGAs have more than k² working processors, but two opposite FPGAs have more than k² working processors.
3. Only one FPGA has more than k² processors.
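The case distinction can be written down as a small classifier. The sketch assumes the four counts are given in cyclic order around the 2x2 array, so that consecutive entries are neighbours and entries two apart are opposite; the function name and the return value 0 for the remaining trivial situation are illustrative choices.

```python
def proof_case(k, counts):
    """Classify working-processor counts of the 4 FPGAs into the proof cases.

    counts: four numbers in cyclic order around the 2x2 array.
    Returns None if fewer than 4*k*k processors work (theorem not applicable),
    0 if no FPGA exceeds k*k (then every FPGA has exactly k*k), else 1, 2 or 3.
    """
    if sum(counts) < 4 * k * k:
        return None
    big = [c > k * k for c in counts]
    if any(big[i] and big[(i + 1) % 4] for i in range(4)):
        return 1  # two neighbouring FPGAs exceed k*k
    if (big[0] and big[2]) or (big[1] and big[3]):
        return 2  # only two opposite FPGAs exceed k*k
    if any(big):
        return 3  # exactly one FPGA exceeds k*k
    return 0


print(proof_case(8, [72, 72, 56, 56]))
print(proof_case(8, [72, 56, 72, 56]))
print(proof_case(8, [72, 64, 60, 60]))
```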

Case 1: red and green are greater than k²

Figure 1 shows all possible combinations for red and green having more than k² processors (in the drawing k=8). The light red and green areas show the minimal number of working processors. If more than k² processors are working in the red and green FPGAs, they are added in the order indicated by the arrows in the dark red and dark green areas. These places are otherwise occupied by processors belonging to the yellow and blue areas.

The border between yellow and blue is determined by the 4 numbers and is within the orange area. The maximal number for the yellow or green area is k²+k. If red also has this many elements, then the yellow area needs to be extended to the right into the orange area. The maximal size of yellow plus orange is (k-1)(k+1)+2(k-2) = k²+2k-5 = k(k+1)+k-5, which is greater than or equal to k(k+1) for k ≥ 5. Please note that due to the above inequality, for k=8 (as shown in the drawing) the length of the left and right arrow can be reduced by 3 each. It is easy to see from the drawing that for any possible sum of red and yellow this can be done with the length of the yellow/blue borderline not exceeding k+1. Remark: The length of the border between orange and blue is k+1. Also, at most one diagonal connection is required.

Let the sum of red and yellow be smaller than 2k(k+1), i.e. smaller than or equal to 2k²+2k-1. The maximal number of red, yellow (assuring that no border is longer than k+1) and orange elements is 2k²+3k-5, which is sufficient to cover all 2k²+2k-1 elements as long as 3k-5 ≥ 2k-1, i.e. k ≥ 4. The case where red and yellow together have 2k²+2k elements can be solved easily as shown in Figure …

Figure 1

[Figure captions: "All cases with only 3 FPGAs > 0"; "All cases with 4 FPGAs > 0 under case 1 above"]

Case 2: Two FPGAs have more than 64 processors (k² for k=8), but no two neighbouring FPGAs do. Let red and blue have between k²+1 and k²+k working processors.

Case 2a: Assume that either green or yellow has more than k²-k elements. Let us assume that green has at least k²-k+1 elements (and, due to the general assumption of case 2, at most k² elements). If yellow has more than k²-k elements, we produce the mirror image.

Under these assumptions red can have up to k²+k elements and green can have up to k² elements, as indicated by the horizontal arrows (see Figure 2). Green plus blue can range from 2k²-k+2 to 2k²+k. Thus 2k-4 ≥ k needs to be satisfied, thus k ≥ 4.

Case 2b: Neither green nor yellow is greater than k²-k; then green and yellow have to be exactly k²-k, and red and blue need to be k²+k, in order to have 4k² elements. This can be solved as shown in Figure 3.

Figure 2, Figure 3

Case 3: Only red is larger than k². All other colours are at least k²-k and at most k².

Case 3a: No colour is k²-k. Yellow occupies the yellow area plus orange if required (see Figure 4). Red occupies the bright red area plus the rest of orange plus the dark red in the order indicated by the two vertical arrows. Blue occupies the light blue and dark blue area in the order indicated by the arrow.

Case 3b: One has k²-k elements. This can either be a neighbour of red (say green), see Figure 5, or it can be opposite red, see Figure 6.

This completes the proof that for k > 4 there is always a solution, as long as at least 4k² processors are active.

Figure 4, Figure 5, Figure 6

For the case of more than 4 FPGAs we also assume that we have k+1 horizontal and vertical connections plus one diagonal connection. We also attach k²+k processors to every FPGA.


27x27 = 729; 8x90 = 720; 21 yellow (9 lower bound)

24x24 = 576; 8x72 = 576; 8 yellow (0 lower bound)

30x30 = 900; 8x110 = 880; 29 yellow (20 lower bound)