Heiko Schröder, 2003


ROUTING? Sorting? Image Processing? Sparse Matrices?

Reconfigurable Meshes !


Reconfigurable architectures

• FPGAs

• reconfigurable multibus

• reconfigurable networks (Transputers, PVM)

• dynamically reconfigurable mesh

Aim: efficiency

special purpose --> general purpose architectures


contents

1.) Motivation for the reconfigurable mesh

2.) Routing (and sorting):
• better than PRAM
• better than mesh

3.) Image processing

4.) Sparse matrix multiplication

5.) Bounded bus length


PRAM

[Figure: processors 0 to 9 connected to memory cells 0 to 9, with a cut through the connections]

diameter O(1), bisection width $\Theta(n)$

EREW, CRCW


Mesh/Torus

2D mesh: diameter $\Theta(\sqrt{n})$, bisection width $\Theta(\sqrt{n})$


Hypercube

[Figure: hypercubes of dimension 0-D to 4-D, nodes labelled with binary addresses]

diameter O(log n), bisection width $\Theta(n)$


reconfigurable mesh

reconfigurable mesh = mesh + interior connections

15 positions

diameter 1 !!

low cost


global OR

[Figure: a row of PEs holding 1 0 0 0 0 1 0 on a common bus; the signal * yields the OR “V”]

Time: O(1) on RM -- $\Omega(\log n)$ on EREW-PRAM


Prefix sum

[Figure: input bits 0 1 1 0 1 0 0 1 1 1 along the columns, rows numbered 0 to 9; the signal * leaves the array in row 6]

Fast but expensive

Time: O(1)

Area: $\Theta(n \times n)$
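The counting idea on this slide can be traced sequentially. The sketch below is only an illustration under assumptions that are not in the slides: the signal enters in row 0, each column holding a 1 shifts it by one row, and the function name `count_ones_by_staircase` is hypothetical. On the mesh all switches are set at once, so the loop only follows the path of the signal.

```python
def count_ones_by_staircase(bits):
    """Trace a signal through an (n+1) x n array of switches.

    Assumption: a column holding a 1 connects its bus one row further,
    a column holding a 0 passes the signal straight through.
    Returns the exit row (the number of ones) and the row reached
    after every column (the prefix sums).
    """
    row = 0
    prefix = []
    for bit in bits:
        if bit:
            row += 1          # the bus steps to the next row in this column
        prefix.append(row)    # row in which the signal leaves this column
    return row, prefix


total, prefix = count_ones_by_staircase([0, 1, 1, 0, 1, 0, 0, 1, 1, 1])
print(total)   # exit row of the signal
print(prefix)  # prefix sums, one per column
```

The input is the bit pattern shown on the slide; the exit row is the total number of ones.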


Modulo 3 counter

[Figure: input bits 1 0 1 1 1 0 on a 3-row array; the signal * shows the result 1 mod 3]

Time: O(1) on RM -- $\Omega(\log n / \log \log n)$ on CRCW-PRAM


modulo $k^2$ counter (ranking)

• Two-digit numbers in base k represent all numbers smaller than $k^2$.

• 1.) Determine x mod k (= lsd).

• 2.) Count the number of “wraps” (= msd).

[Figure: input bits 1 0 1 1 1 0 and the signal *; result: 1 mod k]

--> modulo $k^2$ counting in 2 steps on a k x $k^2$ array
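The two steps can be sketched as two passes of the same wrap-around counter. This is a sequential illustration, not the mesh implementation; the function names `mod_k_pass` and `count_mod_k_squared` are hypothetical, and treating the wrap positions as the input of the second pass is an assumption about how the "wraps" are counted.

```python
def mod_k_pass(bits, k):
    """One pass of a modulo-k counter.

    The signal moves one row per 1 and wraps from row k-1 back to row 0.
    Returns the exit row and a 0/1 list marking the columns where a wrap occurred.
    """
    row = 0
    wraps = []
    for bit in bits:
        wrapped = 0
        if bit:
            row += 1
            if row == k:      # wrap-around: back to row 0
                row = 0
                wrapped = 1
        wraps.append(wrapped)
    return row, wraps


def count_mod_k_squared(bits, k):
    """Count the ones in bits modulo k*k using two modulo-k passes."""
    lsd, wraps = mod_k_pass(bits, k)   # step 1: x mod k
    msd, _ = mod_k_pass(wraps, k)      # step 2: number of wraps, again mod k
    return msd * k + lsd


print(count_mod_k_squared([1] * 7, 3))
```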


enumeration / prefix sum

1 1 1 1 1 1 1 1
1 2 1 2 1 2 1 2
1 2 3 4 1 2 3 4
1 2 3 4 5 6 7 8

time: O(log n)

wire efficiency ! -- (compared with tree)

1/2 number of processors
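The four rows on this slide are the states of a doubling scheme. A minimal sketch, assuming the number of PEs is a power of two and that in each step the last PE of the left half broadcasts its value to the right half of its block; the function name `enumerate_by_doubling` is hypothetical.

```python
def enumerate_by_doubling(flags):
    """Return the list of states of a log n step enumeration.

    flags[i] is 1 if PE i holds a packet, else 0. After the last step,
    each PE holds the number of packets up to and including its position.
    """
    n = len(flags)            # assumed to be a power of two
    rank = list(flags)
    history = [list(rank)]
    size = 1
    while size < n:
        for start in range(0, n, 2 * size):
            # the last PE of the left half holds the total of that half
            left_total = rank[start + size - 1]
            for i in range(start + size, start + 2 * size):
                rank[i] += left_total   # one broadcast on a bus segment
        size *= 2
        history.append(list(rank))
    return history


for state in enumerate_by_doubling([1] * 8):
    print(*state)
```

With eight packets this prints the four rows shown on the slide, one per step.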


permutation routing - 2 steps

n x n

2 steps !!!


Kunde’s all-to-all mapping

Sorting:
• sort blocks
• all-to-all (columns)
• sort blocks
• all-to-all (rows)
• o-e-sort blocks


sorting in constant time

[Figure: mesh divided into blocks; labels $n^{2/3}$ and $n^{1/3}$]

Complete sort:
• sort blocks
• all-to-all (2)
• sort blocks
• all-to-all (2)
• o-e-sort blocks

Sort blocks:
• broadcast (1)
• broadcast (1)
• rank (2)


• better than PRAM --- but useless!!!


Kunde’s all-to-all mapping

[Figure: n x n mesh; label $n^{2/3}$]


vertical all-to-all


horizontal all-to-all


Use of bus -- no conflict

1 step, 2 steps, 3 steps, ..., k/2 steps, ..., 3 steps, 2 steps, 1 step

$(k/2)^2$ steps


sorting in optimal time (Kunde / Schröder)

$(k/2)^2$ steps, $k = n^{1/3}$

each step takes $n^{1/3}$ time --> T = n/4

x 2 --> T = n/2 per all-to-all

Sorting:
• sort blocks ($O(n^{2/3})$)
• all-to-all (n/2)
• sort blocks ($O(n^{2/3})$)
• all-to-all (n/2)
• o-e sort blocks ($O(n^{2/3})$)
(snake-like order of blocks)

time: n + o(n)
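The arithmetic behind these figures can be written out as follows. This is a reconstruction from the numbers on the slides, not a derivation taken from the source; the subscripts are labels chosen here, and the factor 2 is applied as the slide states it, without a claim about its origin.

```latex
\begin{align*}
1 + 2 + \dots + \tfrac{k}{2} + \dots + 2 + 1 &= \left(\tfrac{k}{2}\right)^{2}
  && \text{(number of bus steps)}\\
T_{\text{pass}} &= \left(\tfrac{k}{2}\right)^{2} \cdot n^{1/3}
  = \tfrac{n^{2/3}}{4} \cdot n^{1/3} = \tfrac{n}{4}
  && \text{with } k = n^{1/3}\\
T_{\text{all-to-all}} &= 2 \cdot \tfrac{n}{4} = \tfrac{n}{2}\\
T_{\text{sort}} &= 2 \cdot \tfrac{n}{2} + 3 \cdot O\!\left(n^{2/3}\right) = n + o(n)
\end{align*}
```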


Why optimal?

Sorter for n keys

Bisection of data with k wires

--> Sorting time $\ge n/k$


Use of theorem

1.) n keys on a k x k RM: time $\ge n/k$

Proof: Wherever the data is stored, there is always a bisection of length k -- this can be demonstrated by sweeping from left to right through the array. Q.e.d.

2.) n x n keys on an n x n RM: time $\ge n$.

Proof: trivial


n + o(n)

Optimal --- but ...


enumeration / prefix sum

1 1 1 1 1 1 1 1
1 2 1 2 1 2 1 2
1 2 3 4 1 2 3 4
1 2 3 4 5 6 7 8

time: O(log n)

wire efficiency ! -- (compared with tree)

1/2 number of processors


ABCD-routing

• move and smooth

[Figure: the four quadrants A, B, C and D of the mesh]

Row-major enumeration of A, B, C and D packets within each quadrant in time 4 log n.

Determine destination position of each packet.


elementary steps

[Figure: ten numbered packets in a quadrant, illustrating the three elementary steps “move”, “smooth” and “collect”]


time analysis

A B C D

move smooth

A B C D

collect

time: 3 x n/2

T=3n+o(n)


T < 2n

4 destination squares -- time: 3n + 4 log n

16 destination squares -- time: 2n + 16 log n

64 destination squares -- time: 12/7 n + 64 log n

mesh-diameter: 2n


enough of routing/sorting

Constant factor! Can we do better? What kind of problems?

Image processing

Sparse problems!


Image processing

• Border following

• Edge detection

• Component labeling

• Skeletons

• Transforms


Component labelling

Object

Define border (candidates)

Set bus

While own label is not received:
1.) Candidates break the bus and send their label
   a) clockwise
   b) anti-clockwise
2.) Candidates switch off and restore the bus if they see a smaller label

Time: O(1) -- O(log n)
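The loop above can be simulated round by round. The following is a sketch under assumptions not stated on the slide: labels are distinct, all border PEs start as candidates, and in each round a candidate sees exactly the labels of the nearest remaining candidate in each direction. The function name `elect_smallest_label` is hypothetical.

```python
def elect_smallest_label(labels):
    """Simulate the candidate elimination on a closed border bus.

    labels: distinct labels of the border PEs in clockwise order.
    Returns the surviving label and the number of rounds, counting the
    final round in which the last candidate receives its own label.
    """
    active = list(range(len(labels)))
    rounds = 0
    while True:
        rounds += 1
        m = len(active)
        if m == 1:
            # the bus is a closed ring again: the label returns to its sender
            return labels[active[0]], rounds
        survivors = []
        for pos, idx in enumerate(active):
            clockwise = labels[active[(pos + 1) % m]]
            anticlockwise = labels[active[(pos - 1) % m]]
            # a candidate stays active only if it saw no smaller label
            if labels[idx] < clockwise and labels[idx] < anticlockwise:
                survivors.append(idx)
        active = survivors


print(elect_smallest_label([5, 3, 8, 1, 9, 4, 7, 2]))
```

Since only local minima survive a round, at most half of the candidates remain each time, which is consistent with the O(log n) bound on the slide.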


Transforms

• Wavelet transform: time log n on RM -- time n on mesh

• FFT: time n on RM and mesh

• Hough transform: time m x log n on RM -- time m x n on mesh


systolic matrix multiplication

[Figure: matrices A and B streaming through the array that accumulates C; PE (i, j) holds $c_{ij}$]

$c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$

time: n


sparse matrix multiplication

A x B = C

$c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$

Time: n (n x n mesh)

• A and B column sparse: $k^2$
• A and B row sparse: $k^2$
• A row sparse, B column sparse: $k^2$
• A column sparse, B row sparse: $k\sqrt{n}$


unlimited bus length

• ring broadcast

[Figure: ring broadcast of the items 1, 2 and 3 along a row]


A row-sparse, B column-sparse

Repeat r times
Begin
   horizontal ring broadcast $a_{ik}$
   Repeat c times
      vertical ring broadcast $b_{kj}$
End.

[Figure: A with r elements per row, B with c elements per column, and the result C]
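The nested loop can be simulated sequentially as a check on the idea. This is a sketch, not the authors' implementation: it assumes that the t-th non-zero element of row i of A reaches every PE in row i, that the u-th non-zero element of column j of B reaches every PE in column j, and that PE (i, j) multiplies the two when their inner indices agree. The function name is hypothetical.

```python
def multiply_row_sparse_by_column_sparse(A, B):
    """Simulate the r x c rounds of ring broadcasts for C = A * B.

    A has at most r non-zero elements per row, B at most c per column.
    Returns C and the number of inner rounds (r * c).
    """
    n = len(A)
    # non-zero elements per row of A as (k, a_ik), per column of B as (k, b_kj)
    rows_of_a = [[(k, A[i][k]) for k in range(n) if A[i][k] != 0] for i in range(n)]
    cols_of_b = [[(k, B[k][j]) for k in range(n) if B[k][j] != 0] for j in range(n)]
    r = max((len(row) for row in rows_of_a), default=0)
    c = max((len(col) for col in cols_of_b), default=0)
    C = [[0] * n for _ in range(n)]
    for t in range(r):                  # horizontal ring broadcast of a_ik
        for u in range(c):              # vertical ring broadcast of b_kj
            for i in range(n):          # on the mesh all PEs (i, j) act in parallel
                for j in range(n):
                    if t < len(rows_of_a[i]) and u < len(cols_of_b[j]):
                        ka, a = rows_of_a[i][t]
                        kb, b = cols_of_b[j][u]
                        if ka == kb:    # matching inner index k
                            C[i][j] += a * b
    return C, r * c


A = [[1, 2, 0], [0, 0, 3], [4, 0, 0]]
B = [[1, 0, 0], [0, 1, 0], [2, 0, 1]]
print(multiply_row_sparse_by_column_sparse(A, B))
```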


A and B column-sparse

Repeat c1 times
Begin
   horizontal ring broadcast $a_{ik}$ (meets $b_{kj}$)
   Repeat c2 times
      vertical ring broadcast product to final position
End.

[Figure: A with c1 elements per column and $B/C^T$ with c2 elements; rows i, k and column j marked]


lower bound (c,r) c=r=k

$c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$

[Figure: matrices A, B and C for k = 3 and n = 48, with blocks of side $\sqrt{nk}$]

$t \ge \sqrt{nk}$


splitting the problem

$A = A_s + A_r$

$C = A_s B + A_r B$

$C = C_s + C_r$

Repeat k times
Begin
   vertical ring broadcast
   Repeat s times
      horizontal ring broadcast
End.

time: ks

[Figure: A with the “first s” elements marked; $B/C^T$ with k B-elements]


CR

$A = A_s + A_r$

[Figure: A with the “first s” elements marked]

A has nk non-zero elements.

$A_r$ has at most nk/s non-zero rows.

For $s = \sqrt{n}$, $A_r$ has at most $k\sqrt{n}$ non-zero rows.

$A_s B$ is an RR-problem; it takes time $k\sqrt{n}$.
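The split can be illustrated with a few lines of Python. The sketch assumes that "first s" means the first s non-zero elements of each row (an interpretation, not stated explicitly on the slides); the function names are hypothetical.

```python
def split_first_s_per_row(A, s):
    """Split A into A_s (first s non-zero elements of each row) and A_r (the rest)."""
    n = len(A)
    A_s = [[0] * n for _ in range(n)]
    A_r = [[0] * n for _ in range(n)]
    for i in range(n):
        seen = 0
        for j in range(n):
            if A[i][j] != 0:
                if seen < s:
                    A_s[i][j] = A[i][j]
                else:
                    A_r[i][j] = A[i][j]
                seen += 1
    return A_s, A_r


def nonzero_rows(M):
    """Number of rows of M that contain at least one non-zero element."""
    return sum(1 for row in M if any(x != 0 for x in row))


# column-sparse example with n = 4, k = 1 and s = 2 (= sqrt(n))
A = [[1, 1, 1, 0],
     [0, 0, 0, 1],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
A_s, A_r = split_first_s_per_row(A, 2)
print(nonzero_rows(A_r))
```

A row of A_r is non-zero only if the corresponding row of A holds more than s elements. Since A has at most nk non-zero elements in total, at most nk/s rows can do so, which is the bound used on the slide.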


$A_r B$: calculating products

[Figure: $A_r$ with k A-elements and $B/C^T$ with k B-elements]

time: $k^2$

elementsk n2


column sum

[Figure: rows i-1, i, i+1 and column j; row i marked]

time: log n

only $k\sqrt{n}$ elements per column --> routing time: $k\sqrt{n}$


routing within columns

routing time: $k\sqrt{n}$


Reconfigurable architectures

Reconfigurable mesh ?

constant diameter !

No !!!

Physical laws!


Physical limits

• c = 300 000 km/s = 30 cm/ns

• on chip: 1cm/ns

• --> bounded bus length

good idea !


bounded broadcast

[Figure: broadcast of the items 1, 2 and 3 over bus segments of bounded length]

time: k + n/l


creating main stations

[Figure: main stations 1, 2, 3 placed periodically along a row]

time: k


A row-sparse, B column-sparse

Create main stations 1,…,k for A and B (time: n/l+k)

For i=1,…,k do
Begin
   horizontal ring broadcast i of A
   For j=1,…,k do
      vertical ring broadcast j of B
End.

[Figure: A with k elements per row, B with k elements per column, and the result C]


A and B column-sparse

Create main stations 1,…,k for A (time: n/l+k)

For i=1,…,k do
Begin
   horizontal ring broadcast i
   k bounded vertical broadcasts of products
   merging new products
End.

[Figure: A with k elements per column and $B/C^T$ with k elements; row i marked]


remove minor stations

[Figure: the items 1, 2 and 3 along a row before and after the minor stations are removed]


results

Time: n (n x n mesh)

• A and B column sparse: $k^2$ (bounded bus length l: $k^2 + 2n/l$)
• A and B row sparse: $k^2$ (bounded bus length l: $k^2 + 2n/l$)
• A row sparse, B column sparse: $k^2$ (bounded bus length l: $k^2 + n/l$)
• A column sparse, B row sparse: $3k\sqrt{n}$ (+ 11n/l)

• image processing
• sorting
• routing
• load balancing

better than the mesh !

(Kunde, Middendorf, Schmeck, Schröder, Turner)

(Kapoor, Kunde, Kaufmann, Schroeder, Sibeyn)

• The RM is in some cases “better” than PRAM
• The RM is always at least as “good” as mesh
• The RM is often “better” than the mesh


Fault tolerant On-board computing

• 10 km/s
• 1 image/s
• 100 Mbit/image
• 4000 s/orbit
• 400 Gbit/orbit
• download: 400 Mbit/orbit

On-board image analysis and compression

800 km

Singapore
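The figures on this slide imply how much the data has to be reduced on board. A small calculation using only the numbers given above; the variable names are chosen here for illustration.

```python
# figures taken from the slide
images_per_second = 1
mbit_per_image = 100
seconds_per_orbit = 4000
download_mbit_per_orbit = 400

# data acquired during one orbit, in Mbit (400 000 Mbit = 400 Gbit)
acquired_mbit_per_orbit = images_per_second * mbit_per_image * seconds_per_orbit

# factor by which on-board analysis and compression must reduce the data
reduction_factor = acquired_mbit_per_orbit / download_mbit_per_orbit

print(acquired_mbit_per_orbit, reduction_factor)
```

The acquired volume matches the 400 Gbit/orbit on the slide, so only about one part in a thousand can be downloaded per orbit.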


Due to radiation:
• Single event upsets (many)
• latch ups (extra hardware)
• total loss (rare at 800 km)

• 1 task per processor
• several tasks/instruments/sources per processor
• 1 task per 3 to 4 processors

16 processors (+ spares?)

fault tolerant reconfigurable network


Methods currently used

• shadow processors
• majority voting

Byzantine systems ASTRIUM, deep space

• 1 CAN
• 2 CANs

• Industrial spec.
• mil-spec.
• radiation tolerant
• radiation hardened

386 is modern


Fault tolerance through reconfiguration in regular networks


Every fault pattern that does not contain a 2x2 array of faulty PEs survives.

PS(7)=0.7

A simple solution with high fault tolerance (torus)

processor, data source, instrument

“atomic fault pattern”
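The survival criterion on this slide is easy to test for a given fault pattern. A minimal sketch, assuming that 2x2 blocks wrapping around the torus edges also count and that the torus has at least 2 rows and 2 columns; the function name `contains_2x2_fault_block` is hypothetical.

```python
def contains_2x2_fault_block(faulty, rows, cols):
    """Check whether a fault pattern contains a 2x2 array of faulty PEs.

    faulty: set of (row, col) positions of faulty PEs on a rows x cols torus.
    Assumption: blocks that wrap around the torus edges are included.
    """
    for r in range(rows):
        for c in range(cols):
            block = [
                (r, c),
                (r, (c + 1) % cols),
                ((r + 1) % rows, c),
                ((r + 1) % rows, (c + 1) % cols),
            ]
            if all(pe in faulty for pe in block):
                return True
    return False


# three scattered faults on a 4 x 4 torus: no 2x2 block of faulty PEs
print(contains_2x2_fault_block({(0, 0), (1, 1), (2, 2)}, 4, 4))
```

According to the slide, a pattern for which this check returns False survives.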


To the right

up

Replacement paths

Replace to the right -- 1 fault per row

Faulty processors


To the right

Replacement paths


Preserving horizontal connections Preserving vertical connections

spares Replacement paths

Two separate networks for horizontal and vertical connections


Number of switches: 2, 4, 6, 30

Wire area: 0, 2.8, 3.8, 4

[Figure: switch with ports N, E, S, W, P and I]

[Plot: PS against # faults, 1 to 16]


… an array of SHARCs to provide a throughput of 160 Mb/s.
… 2.5 billion floating point operations per second.
… first demonstration of real-time image processing in space.

image cube from a 30 km wide swath of Korea’s coastline. (Launch: 2001?)

Nemo


1.) You need to know which algorithm you want to use. In image processing (unless you use FFT; wavelets can also be a problem) you can usually assume that every calculation depends only on data in the close neighbourhood. As a result, at any stage of the processing you need to hold in memory only the data of 3 neighbouring planes.

2.) You have several choices for processing the data (each time, you do all the processing related to a single plane in parallel). You can slice the cube either into horizontal planes or into vertical planes (sometimes it is desirable to cut “diagonally”).

3.) It is important that you read every data element only once.

4.) Such data processing is called systolic, i.e. you move the raw data through the architecture with constant speed and constant direction.

In the picture below I have cut the data-cube vertically -- obviously there are many directions of slicing vertically.

If the memory of the processors is large enough, you might be able to hold the complete cube in memory; then you can also process FFTs and wavelets easily.
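Points 1.), 3.) and 4.) can be sketched as a sliding window over the planes of the cube. This is an illustration under assumptions made here: the planes arrive one at a time, the per-plane operation needs exactly the previous, current and next plane, and the two boundary planes produce no output. The names `process_cube_systolically` and `process` are hypothetical.

```python
from collections import deque


def process_cube_systolically(planes, process):
    """Stream planes through a window of 3 neighbouring planes.

    planes: iterable yielding one plane at a time (each plane is read once).
    process: function (previous, current, following) -> result for 'current'.
    """
    window = deque(maxlen=3)   # only 3 planes are held in memory at any time
    results = []
    for plane in planes:
        window.append(plane)   # the oldest plane drops out automatically
        if len(window) == 3:
            results.append(process(window[0], window[1], window[2]))
    return results


# toy cube: five 1 x 1 planes; the operation sums the three neighbouring values
cube = ([[value]] for value in range(5))
print(process_cube_systolically(cube, lambda a, b, c: a[0][0] + b[0][0] + c[0][0]))
```

Because the input is consumed as a generator, no plane is read twice, and the data moves through the window in one direction only.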


The satellite efficiency cube

Compression ratio (CR=4, loss-less)

Segmentation gain (SG=16, 1/16 of a useful image is useful)

Classification gain (CG=5, 1 in 5 images contain useful information)

[Figure: cube with origin (0,0,0) spanned by CR, SG and CG; labels U=.2, U=.8, U=1, U=4, U=16, U=32, U=64, LOSSY=60 and “Not likely”]


Our aim: High performance via COTS

16 off-the-shelf processors (+ spares) connected via a fault tolerant reconfigurable network

In X-SAT restricted to image processing

Mesh/torus


on-board processors

fault tolerant mesh


switch

current communication

FPGA

ctrl: h/v, o/e, r/w

Instructions to PEs

link to PE


spares

C3 -- torus

spares

Replacement algorithm exists for up to 4 faults.

Reconfiguration software runs on FPGA.

Could be repaired within << 1 sec.


ctrl: h/v, o/e, r/w

Instructions to PEs

Diagnostic

set switches


4 FPGAs, connected to k(k+1) SA-processors each

k+1 horizontal and vertical connections plus diagonals

Theorem: Given a 2x2 array of FPGAs, each connected to $k^2+k$ processors, with k+1 vertical and horizontal connections and 1 connection in the diagonals, the processors can be connected to a 2k x 2k mesh as long as the total number of working processors is at least $4k^2$.


Proof: There are 3 cases, which we treat separately:
1. Two neighbouring FPGAs have more than $k^2$ working processors.
2. No two neighbouring FPGAs have more than $k^2$ working processors, but two opposite FPGAs have more than $k^2$ working processors.
3. Only one FPGA has more than $k^2$ working processors.

Case 1: red and green are greater than $k^2$.

Figure 1 shows all possible combinations for red and green having more than $k^2$ processors (in the drawing k=8). The light red and green areas show the minimal number of working processors. If more than $k^2$ processors are working in the red and green FPGAs, they are added in the order indicated by the arrows in the dark red and dark green areas. These places are otherwise occupied by processors belonging to the yellow and blue areas.

The border between yellow and blue is determined by the 4 numbers and is within the orange area. The maximal number for the yellow or green area is $k^2+k$. If red also has this many elements, then the yellow area needs to be extended to the right into the orange area. The maximal size of yellow plus orange is $(k-1)(k+1)+2(k-2) = k^2+2k-5 = k(k+1)+k-5$, which is greater than or equal to $k(k+1)$ for $k \ge 5$. Please note that due to the above inequality for k=8 (as shown in the drawing) the length of the left and right arrow can be reduced by 3 each. It is easy to see from the drawing that for any possible sum of red and yellow this can be done with the length of the yellow/blue borderline not exceeding k+1. Remark: The length of the border between orange and blue is k+1. Also, at most one diagonal connection is required.

Let the sum of red and yellow be smaller than $2k(k+1)$, i.e. smaller than or equal to $2k^2+2k-1$. The maximal number of red, yellow (assuring that no border is longer than k+1) and orange elements is $2k^2+3k-5$, which is sufficient to cover all $2k^2+2k-1$ elements as long as $3k-5 \ge 2k-1$, i.e. $k \ge 4$. The case where red and yellow together have $2k^2+2k$ elements can be solved easily as shown in Figure …

Figure 1
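The two thresholds used in Case 1 can be checked numerically. The sketch below only evaluates the expressions quoted in the proof; the function names are chosen here for illustration.

```python
def yellow_plus_orange(k):
    """Maximal size of the yellow plus orange area, as stated in the proof."""
    return (k - 1) * (k + 1) + 2 * (k - 2)


def smallest_k(condition, k_max=100):
    """Smallest k in 1..k_max-1 for which the condition holds."""
    return min(k for k in range(1, k_max) if condition(k))


# yellow plus orange must be able to hold k(k+1) elements
k_area = smallest_k(lambda k: yellow_plus_orange(k) >= k * (k + 1))

# 2k^2 + 3k - 5 places must cover 2k^2 + 2k - 1 elements
k_cover = smallest_k(lambda k: 3 * k - 5 >= 2 * k - 1)

print(k_area, k_cover)
```

Both conditions are monotone in k, so they hold for every k at or above the printed thresholds.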



All cases with only 3 FPGAs > 0

All cases with 4 FPGAs > 0 under case 1 above


Case 2: Two FPGAs have more than $k^2$ processors (64 in the drawing, where k=8), but no two neighbouring FPGAs do. Let red and blue have between $k^2+1$ and $k^2+k$ working processors.

Case 2a: Assume that either green or yellow has more than $k^2-k$ elements. Let us assume that green has at least $k^2-k+1$ elements (and, due to the general assumption of case 2, at most $k^2$ elements). If yellow has more than $k^2-k$ elements, we produce the mirror image.

Under these assumptions red can have up to $k^2+k$ elements and green can have up to $k^2$ elements, as indicated by the horizontal arrows (see Figure 2). Green plus blue can range from $2k^2-k+2$ to $2k^2+k$. Thus $2k-4 \ge k$ needs to be satisfied, i.e. $k \ge 4$.

Case 2b: If neither green nor yellow is greater than $k^2-k$, then green and yellow have to be exactly $k^2-k$ and red and blue need to be $k^2+k$ in order to have $4k^2$ elements. This can be solved as shown in Figure 3.

Figure 2 Figure 3


Case 3: Only red is larger than $k^2$. All other colours are at least $k^2-k$ and at most $k^2$.

Case 3a: No colour is $k^2-k$. Yellow occupies the yellow area plus orange if required (see Figure 4). Red occupies the bright red area plus the rest of orange plus the dark red in the order indicated by the two vertical arrows. Blue occupies the light blue and dark blue area in the order indicated by the arrow.

Case 3b: One colour has $k^2-k$ elements. This can either be a neighbour of red (say green), see Figure 5, or it can be opposite red, see Figure 6.

This completes the proof that for k > 4 there is always a solution, as long as at least $4k^2$ processors are active.

Figure 4 Figure 5 Figure 6


For the case of more than 4 FPGAs we also assume that we have k+1 horizontal and vertical connections plus one diagonal connection. We also attach $k^2+k$ processors to every FPGA.


27 x 27 = 729

8 x 90 = 720

21 yellow (9 lower bound)

