
ASIC SYSTEM LAB./AJOU UNIV.

t��÷Wkh VLSI �c��

�»¿�� ÃïÃO´�?

ßo D�

Contents

● Digital Signal Processing
● Basic Architectures for DSP Algorithms
● Comparison with Microprocessors
● Fixed-Point DSP Chips : DSP56100 (Motorola)
● Multimedia DSP Chips
  ◆ MediaProcessor
  ◆ TriMedia
● Trends of Future DSPs
● VLSI Architectures for Communications
  ◆ Fast Fourier Transform
  ◆ Viterbi Decoder
  ◆ Reed-Solomon Decoder
  ◆ Equalizer

What is Digital Signal Processing?

● Analog Signal vs. Digital Signal
  ◆ Analog Signal : Continuous Time and Continuous Amplitude
  ◆ Discrete Time Signal : Discrete Time and Continuous Amplitude
  ◆ Digital Signal : Discrete Time and Discrete Amplitude
● Advantages of Digital Signal Processing
  ◆ Guaranteed Accuracy
    → Specified by Sampling Rate, Word Length and Algorithm
    → Independent of Time, Temperature, Humidity
  ◆ Low Sensitivity to Noise; Errors are Correctable
  ◆ Digital System : Smaller, Cheaper, Lower Power because of VLSI
  ◆ Flexibility of System : Reprogrammable
  ◆ Reliable & Predictable
● Disadvantages
  ◆ Finite Sampling Rate & Word Length Problem
  ◆ Wide Bandwidth for Data Transfer

Why Digital Signal Processor?

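[Figure: Analog Domain vs. Digital Domain. Analog systems use separate blocks (Low-pass Filter, High-pass Filter, Amplifier, Convolver, Fourier Transform). In the digital system an A/D Converter, a DSP running Many Algorithms, and a D/A Converter replace them. Signal path: Analog Signal, Digital Signal, Analog Signal.]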

DSP Algorithms

● Convolution
  $y[n] = \sum_{k=0}^{\infty} h[k]\,x[n-k]$
  ◆ Basic Output Sequence of LTI Digital Systems

● Correlation
  $y[k] = \sum_{n=0}^{\infty} x_1[n]\,x_2[n+k]$
  ◆ Signal Matching

● Discrete Fourier Transform (DFT) & Fast Fourier Transform (FFT)
  $X[k] = \sum_{n=0}^{N-1} x[n]\exp(-j2\pi kn/N)$
  $X[k] = \sum_{n=0}^{N/2-1} x[2n]\,W_N^{2nk} + \sum_{n=0}^{N/2-1} x[2n+1]\,W_N^{(2n+1)k}$, where $W_N^{k} = \exp(-j2\pi k/N)$
  ◆ Spectral Analysis of Signals
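The convolution and correlation sums above can be sketched directly for finite-length sequences. This is only an illustrative sketch: the function names `convolve` and `correlate` are hypothetical, and samples outside the given lists are assumed to be zero.

```python
def convolve(h, x):
    """y[n] = sum_k h[k] * x[n-k], treating samples outside the lists as zero."""
    y = [0] * (len(h) + len(x) - 1)
    for n in range(len(y)):
        for k in range(len(h)):
            if 0 <= n - k < len(x):
                y[n] += h[k] * x[n - k]   # one multiply-accumulate per term
    return y


def correlate(x1, x2, max_lag):
    """y[k] = sum_n x1[n] * x2[n+k] for lags k = 0..max_lag."""
    return [
        sum(x1[n] * x2[n + k] for n in range(len(x1)) if n + k < len(x2))
        for k in range(max_lag + 1)
    ]
```

Both loops reduce to repeated multiply-accumulate steps, which is the operation the DSP datapath is built around.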

DSP Algorithms (cont.)

● Z-Transform
  $X(z) = \sum_{n=-\infty}^{\infty} x[n]\,z^{-n}$
  ◆ System and Signal Analysis

● Finite Impulse Response (FIR) Filtering
  $y[n] = \sum_{k=0}^{N-1} h[k]\,x[n-k]$
  ◆ Linear Phase and Stable Response Filtering

● Infinite Impulse Response (IIR) Filtering
  $y[n] = \sum_{k=0}^{N} a_k\,x[n-k] - \sum_{k=1}^{M} b_k\,y[n-k]$
  ◆ Sharper Cutoff Filtering than FIR with the Same Number of Taps
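The IIR difference equation can be written as a short loop. This is a minimal sketch under the assumption that the feed-forward sum starts at k = 0 and the feedback sum at k = 1; `iir_filter` is a hypothetical name, and `b[0]` is an unused placeholder so that list indices match the subscripts.

```python
def iir_filter(a, b, x):
    """y[n] = sum_{k>=0} a[k]*x[n-k] - sum_{k>=1} b[k]*y[n-k]; b[0] is ignored."""
    y = []
    for n in range(len(x)):
        acc = sum(a[k] * x[n - k] for k in range(len(a)) if n - k >= 0)
        acc -= sum(b[k] * y[n - k] for k in range(1, len(b)) if n - k >= 0)
        y.append(acc)
    return y
```

With `b = [0]` the feedback term vanishes and the same function computes an FIR filter.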

Basic Architecture for DSP Algorithms

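[Figure: basic DSP architecture. Program Control and Address Generation units; Instruction Memory on the Inst. bus; X Data Memory and Y Data Memory on separate X/Y Data buses and X/Y Address buses; Multiplier feeding the Adder & Accumulator, computing A = X*Y + A.]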

Comparisons with Microprocessors

● Harvard Architecture
  ◆ X & Y Data Memories, Instruction Memory
● Multi-Bus Structure
  ◆ Minimize Bottleneck Problem
● Three Separate Parallel Units
  ◆ Data Calculation Unit
  ◆ Program Control Unit
  ◆ Address Generation Unit
  <Example> MAC x1, y1, A    X:(R0)+, y1    X:(R3)+, x1
● On-chip Peripherals
  ◆ A/D and D/A Converter, PLL, DMA, Host Interface, SIO and PIO Ports, Timer, Viterbi Accelerator, etc.

Comparisons with Microprocessors (cont.)

● Data Calculation Unit
  ◆ MAC Unit : Multiply and Accumulate in a Single Inst. Cycle
  ◆ Extended ALU and Accumulator
    → Prevent Overflow and Support Multiprecision
  ◆ Barrel Shifter
    → Variable Length Shift within One Cycle
    → Multi-precision and Scaling Operations
  ◆ Sine or Cosine ROM Table for DFT, FFT, DCT Algorithms
● Program Control Unit
  ◆ Fast Interrupt Service for Real-time Applications
  ◆ Multiple Level Hardware Stack for Nested Hardware Do Loop
● Address Generation Unit
  ◆ Many Memory Address Registers
    → Various Addressing Modes
    → Linear, Modulo (filtering), Bit-reverse (FFT), Offset
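The modulo and bit-reverse addressing modes can be illustrated in software. This is a sketch of the address arithmetic only, not of any particular chip's address generation unit; `bit_reverse` and `modulo_next` are hypothetical helper names.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index (FFT input/output reordering)."""
    out = 0
    for _ in range(bits):
        out = (out << 1) | (index & 1)
        index >>= 1
    return out


def modulo_next(addr, step, base, length):
    """Advance addr by step inside the circular buffer [base, base + length)."""
    return base + (addr - base + step) % length
```

For an 8-point FFT, `[bit_reverse(i, 3) for i in range(8)]` gives the order 0, 4, 2, 6, 1, 5, 3, 7, and `modulo_next` wraps a filter delay-line pointer back to the start of its buffer.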

Multimedia DSPs

● Superscalar, VLIW, SIMD, Multithreading Architectures
  ◆ Multiple Functional Units
  ◆ Large and Multi-port Register Files
● Handle Various Data Types
  ◆ Four Packed Data Types
    → Packed Bytes, Packed Words, Packed Double Words and Packed Quad Words

[Figure: layouts of Packed Bytes, Packed Words, Packed Double Words and Packed Quad Words]

Multimedia DSPs (cont.)

● Load/Store Units

◆ Block Load/Store Scheme

◆ Various Addressing Modes

◆ Big- or Little-endian Addressing Modes

● Packed Operations for Group Data

ex) Packed Addition : Adds two sets of Packed Words and Clips (Saturates) each Result to the Maximum Value if there is an Overflow

      a3    | a2    | a1    | 7FFFh
    + b3    | b2    | b1    | 0001h
    = a3+b3 | a2+b2 | a1+b1 | 7FFFh
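A software model of the packed addition example may help. The sketch below assumes four signed 16-bit lanes packed into one 64-bit integer; the function name `packed_add_sat16` is hypothetical and does not correspond to any specific instruction mnemonic.

```python
def packed_add_sat16(a, b, lanes=4):
    """Add packed signed 16-bit lanes, clipping each lane to [-8000h, 7FFFh]."""
    result = 0
    for i in range(lanes):
        x = (a >> (16 * i)) & 0xFFFF
        y = (b >> (16 * i)) & 0xFFFF
        # Sign-extend each 16-bit lane to a Python integer.
        x = x - 0x10000 if x & 0x8000 else x
        y = y - 0x10000 if y & 0x8000 else y
        s = max(-0x8000, min(0x7FFF, x + y))   # saturate instead of wrapping
        result |= (s & 0xFFFF) << (16 * i)
    return result
```

The lowest lane reproduces the slide's case: 7FFFh + 0001h stays at 7FFFh rather than wrapping to 8000h.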

● Switching Network

◆ Deal with Mixed-Precision Data

◆ Rearrange, Expand, Pack, Merge

● Compression for MPEG-2 (Motion Estimation)

ex) SAD (Sum of Absolute Differences)

    a1 a2 a3 a4
    b1 b2 b3 b4
    SAD = |a1-b1| + |a2-b2| + |a3-b3| + |a4-b4|
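As a rough illustration of how SAD is used in motion estimation, the sketch below scores candidate blocks against a reference block and keeps the closest one. The names `sad` and `best_match`, and the representation of a block as a flat list of pixel values, are assumptions made for this example.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(abs(a - b) for a, b in zip(block_a, block_b))


def best_match(reference, candidates):
    """Index of the candidate block with the smallest SAD to the reference."""
    return min(range(len(candidates)), key=lambda i: sad(reference, candidates[i]))
```

A packed SAD instruction computes the per-pixel absolute differences of one such row in a single cycle.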

● Multiple Operations in One Inst. Cycle

ex) Group-Multiply-and-Add : Multiply four Packed Bytes and Add four Packed Words

      a     | b     | c     | d
    x e     | f     | g     | h
    + i     | j     | k     | l
    = a*e+i | b*f+j | c*g+k | d*h+l

Special DSP Instructions

● Multiply and Accumulate Instruction (MAC)
  ◆ Major Operation of DSP Algorithms
● Normalization Instruction
  ◆ Normalize Extended Value in ALU and Accumulator
● Various Arithmetic and Logical Shift Instructions
  ◆ Multi-precision Data Operations
● Hardware Do Loop Instruction
  ◆ Useful for Do Loop Type Algorithms
● Block Data Move Instruction
  ◆ Use Instruction Memory in Single Data Memory DSPs
● Stand-by Instruction
  ◆ Low Power in Mobile Communications

Special Multimedia DSP Instructions

● Computation
  ◆ Partitioned Add/Subtract
  ◆ Partitioned Multiply
  ◆ Partitioned Compare
  ◆ Group-Multiply-and-Add
● Data Format Conversions
  ◆ Pixel Expand
  ◆ Pixel Packing
  ◆ Pixel Merge
● Compression
  ◆ Pixel Distance (SAD)

[Figures: < Partitioned Add/Subtract >, < Pixel Expand >]

Commercial Fixed-Point DSP Chips

Chip            : DSP56100 | DSP1610 | TMS320C5x | ADSP2100 | OAK | D950CORE | uPD77017
Company         : Motorola | AT&T | Texas Instruments | Analog Devices | DSP Group | SGS-Thomson | NEC
Data/Microcode  : 16/16 | 16/16 | 16/16 | 16/24 | 16/16 | 16/16 | 16/32
Inst. Set       : 87 | 48 | 124 | 31 | NA | NA | 57
Pipeline Depth  : 3 | 3 | 4 | 2 | 3 | 3 | 3
Mem Size        : 64Kx16 | 64Kx16 | 64Kx16 | 16Kx24, 16Kx16 | 64Kx16 | 64Kx16 | 64Kx16
(row label lost): 2Kx16 | 512x16 | 9Kx16 | 2Kx24 | x | x | (12K+256)x32
(row label lost): 6Kx16x2, 2Kx16, 2Kx16, 1Kx16, 1056x16 (Dual), 8Kx16 (Dual), 4Kx16 (Dual)
Cache Size      : x | 15x16 | x | 16x24 | x | x | x
(row label lost): 6 | 5 | 5 | 5 | 6 | 6 | 3
Acc. Size       : 2x40 | 2x36 (Buf. 2x36) | 32 (Buf. 32) | 40 | 4x36 | 2x40 | x
Shifter         : 1, 4, 16 Hardwired | 36 Barrel Shifter | 16 Barrel Shifter | 32 Barrel Shifter | 32 Barrel Shifter | 40 Barrel Shifter | 40 Barrel Shifter
ALU Size        : 32 | 36 | 32 | 16x40 | 36 | 40 | 40

Address register rows (extraction order; column alignment not recoverable):
Data Mem Regs   : R0-R3 (4x16); R0-R3 (4x16); AR0-AR7 (8x16); I0-I7 (8x14); AX0-1, AY0-1 (4x16); X pointer, Y pointer; General Purpose Point Registers (8x16)
Offset          : N0-N3 (4x16); j, k (2x16); INDX; M0-M7 (8x14); DX0-3, DY0-3 (4x16); BX, MX, BY, MY (4x16)
Modulo          : L0-L7 (8x14); CBSR1-2, CBER1-2 (4x16); rb, re (2x16); M0-M3 (4x16)
General Register Bank (8x40)

DSP56100 Features

● Performance : 66 MIPS @ 15 ns
● Instruction/Data Width : 16/16
● Multi-bus Structure (Program : 2, Data : 4)
● Pipeline Stage : Fetch, Decode, Execute
● Hardware Stack Levels : 15 x 32 bit
● Fast Interrupt Processing
● Hardware Loop Structure
  ◆ LA (16 bit), LC (16 bit)
● Accumulation Width : 2 x 40 bit
● 1, 4, 16 Hardware Shifter
● Modulo, Bit-reverse Addressing
● Division, Double-precision Multiplication Instruction

DSP 56100 Architecture

Data ALU

Data ALU (cont.)

● Two 40 bit Accumulators : 2 x 32 bit Accumulator Registers, 2 x 8 bit Accumulator Extension Registers
● MAC Unit
  ◆ 16x16 Multiplier with 32 bit Product
  ◆ Arithmetic Operation : 40 bit Result
  ◆ Logical Operation : 16 bit Result
  ◆ ZB Multiplexer
● Accumulator Shifter, Output Shifter
● Data Shifter / Limiter : Scaling, Limiting
● Data ALU Arithmetic and Rounding
  ◆ Fractional, Integer, Multiprecision Arithmetic Support
  ◆ Rounding : Convergent, Two's Complement Rounding
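The two rounding modes differ only when the discarded bits are exactly one half. The sketch below assumes an integer accumulator whose low `frac_bits` bits are dropped; the function names are hypothetical and the code models the arithmetic, not the chip's rounding hardware.

```python
def round_twos_complement(acc, frac_bits=16):
    """Add one half LSB and truncate: ties always round toward +infinity."""
    return (acc + (1 << (frac_bits - 1))) >> frac_bits


def round_convergent(acc, frac_bits=16):
    """Round to nearest; exact ties go to the even result (no long-term bias)."""
    half = 1 << (frac_bits - 1)
    rounded = (acc + half) >> frac_bits
    if acc & ((1 << frac_bits) - 1) == half:   # discarded bits are exactly 0.5
        rounded &= ~1                          # clear the LSB to force an even value
    return rounded
```

With `frac_bits=16`, the value 0x28000 (2.5) rounds to 3 with two's complement rounding and to 2 with convergent rounding.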

Program Control Unit

[Figure: Program Control Unit block diagram. Program Counter, Loop Address, Loop Count, Stack Pointer, OMR, SR, 32 x 15 Hardware Stack, Interrupts, Control; connected to the Address and Data buses and the Global Data Bus.]

Program Control Unit

● Program Address Generation
● Instruction Decoding
● Hardware Do Loop Control
● Interrupt Control
● Components
  ◆ Program Counter (PC)
  ◆ Loop Address (LA) : Address of the End of the Loop
  ◆ Loop Counter (LC) : Number of Iterations
  ◆ Status Register (SR)
  ◆ Operating Mode Register (OMR)
  ◆ Stack Pointer (SP)
  ◆ System Stack : Stores PC and SR for Subroutine Calls and Long Interrupts

Fast Interrupt

Address Generation Unit

Address Generation Unit (cont.)

● Effective Address Calculation
● Perform Linear, Modulo, Bit-reverse Addressing
● Components
  ◆ Address Register File (Rn), Offset Register File (Nn), Modifier Register File (Mn), Temporary Address Register
    → Where : n = 0 ~ 3
  ◆ AGU Status Register
  ◆ PC Relative Addressing Unit
  ◆ Secondary Offset Adder Unit
  ◆ Modulo Arithmetic Unit : Offset Adder, Modulo Adder, Reverse Carry Adder

DSP56100 Instruction Set

● Number of Instructions : 87
● Arithmetic Instructions : Within Data ALU
  ◆ Add/Sub Group : ADC, ADD, SBC, SUB, SUBL, DEC, DEC24, INC, INC24
  ◆ Mul/Div Group : IMPY, MPY, MPYR, MPY(su,uu), DIV
  ◆ MAC Group : MAC, MACR, DMAC, MAC(su,uu), IMAC
  ◆ Shift Group : ASL, ASL4, ASR, ASR4, ASR16, NORM
  ◆ Transfer Group : Tcc, TFR, TFR2, TST, TST2, SWAP
  ◆ ABS, CLR, EXT, ZERO, etc.
● Logical Instructions
  ◆ AND, EOR, NOT, OR, LSL, LSR, ROL, ROR
  ◆ ANDI, ORI : Immediate Program Controller Register
● Bit Field Manipulation Instructions
  ◆ BFCLR, BFSET, BFCHG, BFTSTL, BFTSTH

DSP56100 Instruction Set (cont.)

● Move Instructions
  ◆ LEA : Load Effective Address
  ◆ MOVE, MOVE(C), MOVE(I), MOVE(M), MOVE(P), MOVE(S)
● Program Control Instructions
  ◆ Bcc, BSR, BRA, BScc : Branch Instruction
  ◆ Jcc, JMP, JSR, JScc : Jump Instruction
  ◆ REP, REPcc : Repeat Instruction
  ◆ DO, DO FOREVER, ENDDO : Loop Instruction
  ◆ BRKcc : Conditional Exit from Hardware Loop
  ◆ DEBUG, DEBUGcc : Debug Mode Instruction
  ◆ RTI, RTS : Return Instruction (Interrupt, Subroutine)
  ◆ NOP, STOP, WAIT, SWI

FIR Filter Implementation Example

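FIR Filter Segment

    MOVE  #XADDR, R0                               ; R0 points to the sample (delay line) buffer
    MOVE  #K-1, M0                                 ; modulo-K addressing for R0
    MOVE  X:INPUT, X:(R0)                          ; store the newest input sample
    MOVE  #CADDR, R3                               ; R3 points to the coefficient table
    MOVE  #K-1, M3                                 ; modulo-K addressing for R3
    CLR   A            X:(R0)+, y1                 ; clear accumulator, fetch first sample
    MOVE  X:(R3)+, x1                              ; fetch first coefficient
    REP   #K                                       ; repeat the next instruction K times
    MAC   x1, y1, A    X:(R0)+, y1    X:(R3)+, x1  ; multiply-accumulate with two parallel fetches
    RND   A                                        ; round the accumulator
    MOVE  A, X:OUTPUT                              ; write the filter output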
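The same structure can be modelled in a high-level language to show what the modulo-addressed delay line and the repeated MAC are doing. This is an illustrative sketch, not a translation of the assembly: the class name `CircularFir` and its fields are hypothetical.

```python
class CircularFir:
    """FIR filter y[n] = sum_k h[k] * x[n-k] using a circular delay line."""

    def __init__(self, coeffs):
        self.coeffs = list(coeffs)
        self.delay = [0.0] * len(self.coeffs)   # K-sample buffer (modulo-K region)
        self.pos = 0                            # slot that receives the next sample

    def step(self, sample):
        k = len(self.coeffs)
        self.delay[self.pos] = sample           # newest sample overwrites the oldest
        acc = 0.0
        idx = self.pos
        for c in self.coeffs:                   # one MAC per tap, like REP #K / MAC
            acc += c * self.delay[idx]
            idx = (idx - 1) % k                 # modulo address update
        self.pos = (self.pos + 1) % k
        return acc
```

Feeding a unit impulse through `CircularFir([1, 2, 3])` returns the coefficients in order, which is a quick sanity check for any FIR implementation.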

SDSP 56116

● Ó�� 3�t�h 16 �w �Ô K×Ï DSP (◆ Motorolak� DSP56116 (� Døs �(» ÷�◆ ó¿h Ã�ï� �ð �÷Wk Î I/O ïd� ×H�S 4« ?G

● Performance : 20 MIPS @ 40 MHz
● Instruction/Data Width : 16/16
● Multi-bus Structure (Program : 2, Data : 4)
● Pipeline Stage : Fetch, Decode, Execute
● Hardware Stack Levels : 15 x 32 bit
● Fast Interrupt Processing
● Hardware Loop Structure : LA (16 bit), LC (16 bit)
● Accumulation Width : 2 x 40 bit
● 1, 4, 16 Hardware Shifter
● Modulo, Bit-reverse Addressing
● Power Down Mode : STOP, WAIT

Multimedia DSPs

● Architecture Features
  ◆ VLIW : Multiple Functional Units
  ◆ SIMD : Partitioned Operations for Multiple Data
  ◆ Multithread : Multiple Threads Executed in Parallel
  ◆ Vector Processor : Vectorized Operations
● Multimedia DSP
  ◆ MediaProcessor (MicroUnity) : Multithreading
  ◆ TriMedia (Philips) : VLIW
  ◆ Mpact (Chromatic) : VLIW, SIMD, Vector Processor
  ◆ TMS320C6x (Texas Instruments) : VLIW

TMS320C67x

● Features
  ◆ Advanced Very Long Instruction Word (VLIW) Architecture : VelociTI
  ◆ Performance
    → 1 GFLOPS - single precision
    → 420 MFLOPS - double precision
    → 1336 MIPS @ 167 MHz
  ◆ Eight Highly Independent Functional Units :
    → Four ALUs (Floating- and Fixed-Point), Two ALUs (Fixed-Point)
    → Two Multipliers (Floating- and Fixed-Point)
    → Load-Store Architecture with 32 32-bit Registers
    → Instruction Packing Reduces Code Size
    → All Instructions Conditional
  ◆ 8/16/32-bit Data Support
  ◆ Six 32-bit Floating-point Instructions per Cycle
  ◆ Two 32-Bit General-Purpose Timers
  ◆ Flexible PLL Clock Generator

TMS320C67x (cont.)

[Figure: TMS320C67x block diagram. Program Memory (32-bit address, 256-bit data); Data Memory (32-bit address, 8-, 16-, 32-bit data); Program fetch, Instruction dispatch, Instruction decode; Data Path A with Register file A and units .L1 .S1 .M1 .D1; Data Path B with Register file B and units .D2 .M2 .S2 .L2; Control registers, Control logic, Emulation, Interrupts; Additional Peripherals : Timers, Serial Ports, External Memory Interface.]

TMS320C67x (cont.)

● Instruction Set
  ◆ Hardware Support for IEEE 754 Single-Precision Instructions
  ◆ Hardware Support for IEEE 754 Double-Precision Instructions
  ◆ Byte-Addressable (8-, 16-, 32-Bit Data)
  ◆ 32-Bit Address Range
  ◆ 8-Bit Overflow Protection
  ◆ Saturation
  ◆ Bit-field Extract, Set, Clear
  ◆ Bit-counting
  ◆ Normalization

Philips TriMedia

[Figure: TriMedia block diagram]

● VLIW Architecture
  ◆ Five RISC Operations per Clock at 100 MHz
  ◆ Register File : 15 Reads / 5 Writes
  ◆ Crossbar Network
  ◆ Instruction Coding
    → Uncompressed RISC Instruction Encoding : 42 bit
● Performance : 2 to 4 BOPS @ 100 MHz
● Interface
  ◆ PCI Master/Slave Bridge (400 Mbps)
  ◆ Digital Camera, Video Encoder, Stereo Audio ADC/DAC
  ◆ V.34 Modem Analog Front End or ISDN Terminal

Philips TriMedia (cont.)

● 27 Functional Units
  ◆ 5 Constants, 5 Integer ALUs, 2 Load/Store Units, 2 Shifters, 3 Branch Units, 2 Integer/FP Multipliers, 2 FP ALUs, 1 FP Compare, 1 FP Sqrt/Div, 2 DSP ALUs, 2 DSP Multipliers
  ◆ Number of Decoders : 5
    → The 27 Functional Units are Classified into 5 Groups to Reduce Decoder Size
● VLIW Instruction Size
  ◆ Uncompressed : 42 bit x 5 = 210 bit
  ◆ Compressed : 32 bit (Huffman Coding)
● DMA Mastering Video & Audio I/O Units (Data Prefetch) Configuration
  ◆ Video/Audio DMA In, Out
  ◆ VLD (Variable Length Decoder) Coprocessor
  ◆ Image Coprocessor (MPEG-1, MPEG-2)

ó¿h Multimedia; Ã� �Ô K×Ï DSP (MDSP)

[Figure: Fixed-point DSP, Multimedia DSP and MDSP (Portable Multimedia) placed on Multimedia and Portability axes]

● ï3 DSP (� ÷ÛÏ◆ ÿ¯·Ós Ãh DSP (

à £×� ïd ß�w; kh3 �¿C

à Die +ï¿ +� ÃäKg¿ +£

à ó¿h �hÀ¬� À(� £

◆ �Ô K×Ï DSP (à �ÿ ¯o� Wk� × �£

à /3ï çc¿ �ÔÀ3£

à ÿ¯·Ós �÷Wk� �gçÀ3£

● ÛÛ� MDSP� x�

◆◆ ¿¿¿¿¿¿¿¿, , ¿Ã¿Ãää��¿Ã¿Ãää�� óó¿¿óó¿¿ tt��tt��hhÀ¬À¬��ÛÛ��hhÀ¬À¬��ÛÛ�� ÿÿ¯̄··ÓÓssÿÿ¯̄··ÓÓss ��÷÷WWkk��÷÷WWkk��

ÀÀ((��ÀÀ((�� ÛÛÛÛÛÛÛÛ◆ ÿ¯·Ós /3� Wkïß + �ÔK×Ï DSP ïß

MDSP Features

● ¿¿, ¿Ãä �Ô K×Ï DSP �s● SIMD + Vector Processing + DSP ïß Ðh● �ð, �d Î À� �÷Wk; Ã� ¿�À /3� ÷3 : 8, 16, 32, 40 �w● cÃ, ãw�+ + Packing ãw�+

◆ gçÀ7 /3� ��, `ï ¯o� Ô�● Ç�cï Ë; ï· +/ (12 × 32 �w + 4 × 40 �w)● 9§� Ë3Ã;7 : F ➞ D ➞ R ➞ BS ➞ Ex1 ➞ Ex2 ➞ Ex3 ➞ Ex4 ➞ WB● 8Û� Nested ��§s FOR §Ã ��● Memory-to-Memory ¯o ��● 32 �w �3+� ��● Barrel +Ãï, Prescaler● £P� /3� s�Ç° g�● Ãh Døs �� : Viterbi, MODEM ,´K× °�● Power ÇÏ� Ã� Døs : WAIT, IDLE

MDSP Architecture

● 3Û� ß�w : �ë ×H◆ DPU (Data Processing Unit)

à DPU0ÿ DPU1û� +ð

◆ PCU (Program Control Unit)◆ AGU (Address Generation

Unit)à Memory-to-memory ¯o

● Block Move◆ �×S Ç�cï Ë;,/3�S �gk� ¿d

◆ ¯oï� �� 4Û�ã;¯�; load ' 5%� �5HVXOW %XV�

[Figure: MDSP block diagram. Two DPUs (each with VMPY and Vadder/VALU units and a Packing unit), Switching Network, Register File, AGU, PCU, Data Memory, Program Memory. Buses: DRB (Data Read Bus), DAB (Data Address Bus), PAB (Program Address Bus), PDB (Program Data Bus), RB (Result Bus).]

What Should We Do for the Next Century?

● Lots of Circuit-Level Work
  ◆ High Speed Clock
  ◆ Low Power, Low Cost
● Parallel Programmable DSP Architectures
  ◆ Employ VLIW / RISC Superscalar (RISC-SS) Architecture
    → High Speed Coupled with Parallel Execution
    → Good Compiler Efficiency
    → Poor Code Density (VLIW) vs. Good Code Density (RISC-SS)
    → High Power (VLIW) vs. Low Power (RISC-SS)
    → Difficult (VLIW) vs. Easy (RISC-SS) to Program by Hand
  ◆ High Level Languages Suitable for Parallel Architectures
● Architecture Driven Algorithms for Multimedia Functions
● Hardware / Software Co-design Approach should be used for Optimized Systems

Multimedia Terminal Should Have

● 2 MPEG-2 Codecs : 8 GOPS
● 2 CG Generators : 4 GOPS
● Stereo Echo Canceler : 4 GOPS
● Background Removal : 4 GOPS
  Total : 20 GOPS
● Future DSP Chips should be
  ◆ Low Price
  ◆ Programmable
  ◆ 20 GOPS DSP Chip in the Year 2000

MPU History

[Figure: MPU performance history, 1980 to 2000]

Programmable DSP Chips

[Figure: programmable DSP chip generations, 1980 to 2000, with marks at 1984, 1988, 1992 and 1996]

● Fast Fourier Transform (FFT) Algorithm

◆ Fast version of Discrete Fourier Transform (DFT)

◆ Reduce Computation

◆ FFT Method : Radix-2, Radix-4

● Example : Orthogonal Frequency Division Multiplexing (OFDM)

Fast Fourier Transform

[Figure: OFDM system block diagram using IFFT and FFT with parallel/serial conversion]

● Radix-2 Butterfly Algorithm
  $OUT_0 = IN_0 + IN_1$
  $OUT_1 = (IN_0 - IN_1)\,W_N^{k}$

● Radix-4 Butterfly Algorithm
  $OUT_0 = (IN_0 + IN_2) + (IN_1 + IN_3)$
  $OUT_1 = [(IN_0 - IN_2) - j(IN_1 - IN_3)]\,W_N^{k}$
  $OUT_2 = [(IN_0 + IN_2) - (IN_1 + IN_3)]\,W_N^{2k}$
  $OUT_3 = [(IN_0 - IN_2) + j(IN_1 - IN_3)]\,W_N^{3k}$

  where $W_N^{k} = e^{-j2\pi k/N}$
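The radix-2 butterfly can be checked numerically by applying it recursively and comparing against a direct DFT. The sketch below is a decimation-in-frequency FFT written for clarity rather than speed; the function names are hypothetical and the input length is assumed to be a power of two.

```python
import cmath


def fft_radix2_dif(x):
    """Radix-2 decimation-in-frequency FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    half = n // 2
    # Butterfly: OUT0 = IN0 + IN1, OUT1 = (IN0 - IN1) * W_N^k
    top = [x[i] + x[i + half] for i in range(half)]
    bottom = [(x[i] - x[i + half]) * cmath.exp(-2j * cmath.pi * i / n)
              for i in range(half)]
    out = [0] * n
    out[0::2] = fft_radix2_dif(top)      # even-indexed outputs X[0], X[2], ...
    out[1::2] = fft_radix2_dif(bottom)   # odd-indexed outputs X[1], X[3], ...
    return out


def dft(x):
    """Direct O(N^2) DFT used only as a reference."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]
```

The interleaving of even and odd outputs in the recursion plays the role that bit-reversed addressing plays in an in-place hardware implementation.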

FFT Algorithm


● Butterfly Architecture

● Number of butterflies (N-point) : $\frac{N}{2}(\log_2 N - 1)$
● Number of complex adders : $N(\log_2 N - 1)$
● Number of complex multipliers : $\frac{N}{2}(\log_2 N - 1)$

Radix-2 FFT Architecture

[Figure: radix-2 butterfly datapath with input latches, a twiddle-factor PROM and two multipliers (MUL1, MUL2)]

Radix-4 FFT Architecture

● Butterfly Architecture

● Number of butterflies (N-point) : $\frac{N}{2}(\log_4 N - 1)$
● Number of complex adders : $N(\log_4 N - 1)$
● Number of complex multipliers : $\frac{3N}{4}(\log_4 N - 1)$

[Figure: radix-4 butterfly datapath]

Comparison between Radix-4 and Radix-2

● Algorithm Comparisons (data : complex numbers)
  - Radix-4 reduces the number of additions and multiplications compared with radix-2

● Architecture Comparisons
  - A radix-4 butterfly architecture is more complex than a radix-2 one
  - However, as N increases, the computation time of radix-4 is about half that of radix-2

                        | Radix-2        | Radix-4
  (row label lost)      | log2N-1        | log4N-1
  Number of adders      | N(log2N-1)     | N(log4N-1)
  Number of multipliers | N/2(log2N-1)   | 3N/4(log4N-1)

FFT Processor

● Cached 1K-Point FFT Processor
  → Reference : Bevan M. Baas, "A Low-Power, High-Performance, 1024-Point FFT Processor," IEEE Journal of Solid-State Circuits, vol. 34, no. 3, March 1999
● Features :
  ◆ FFT Algorithm : radix-2 algorithm
  ◆ 0.7 µm CMOS process
  ◆ Clock Frequency : 173 MHz at 3.3 V

[Figure: chip block diagram. Main Memory of 8 banks x 128 x 36-bit SRAM, I/O Interface, Clock, Chip Controller, 16 x 40-bit Caches (Cache0A, Cache0B, Cache1A, Cache1B), two 20-bit Multipliers, 24-bit Adders and Subtractors, two 256 x 40-bit ROMs.]

FFT Processor (cont.)

● The FFT Processor for ADSL Transceivers
  → Reference : Chin-Liang Wang and Ching-Hsien Chang, "A Novel DHT-based FFT/IFFT Processor for ADSL Transceivers," in Proc. IEEE International Symposium on Circuits and Systems, May 1999
● Features :
  ◆ Used in DMT-Based ADSL Transceivers
  ◆ 512-Point FFT Processor
  ◆ 0.6 µm CMOS technology
  ◆ Chip Area : 4838 µm x 4032 µm
  ◆ Clock Frequency : over 40 MHz

[Figure: processor block diagram]

● Viterbi Decoding Procedure
  ◆ Branch Metric Calculation (BMC)
    → Calculate Hamming Distance or Euclidean Distance
  ◆ Path Metric Calculation (PMC)
    → Accumulate the BM onto the PM of the Previous Survivor Path (the one with the smaller PM of the two paths)
  ◆ Add - Compare - Select (ACS)
    → Add : PM + BM
    → Compare : Compare the Two Candidate PMs
    → Select : Select the Smaller PM
  ◆ Trace-Back (TB)
    → The TB Depth is a Design Parameter
    → Usually, TB Depth = K x 5 or K x 6
    → Once the TB Memory is Filled to the TB Depth, Trace Back through it and Decode the Received Code
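The BMC, ACS and trace-back steps can be put together in a small hard-decision decoder. This is a sketch under stated assumptions: a K = 3, r = 1/2 code with generator polynomials 7 and 5 (octal), which the slides do not specify, and a trace-back over the whole received block instead of a fixed TB depth. All names are hypothetical.

```python
G = (0b111, 0b101)          # assumed generator polynomials (7, 5 octal)
K = 3                       # constraint length
N_STATES = 1 << (K - 1)


def parity(v):
    return bin(v).count("1") & 1


def encode(bits):
    """Rate-1/2 convolutional encoder; returns one 2-bit symbol per input bit."""
    state, out = 0, []
    for b in bits:
        reg = (b << (K - 1)) | state
        out.append(tuple(parity(reg & g) for g in G))
        state = reg >> 1
    return out


def viterbi_decode(symbols):
    inf = float("inf")
    pm = [0] + [inf] * (N_STATES - 1)          # path metrics, start in state 0
    history = []                               # trace-back memory
    for r in symbols:
        new_pm = [inf] * N_STATES
        prev = [None] * N_STATES
        for s in range(N_STATES):
            if pm[s] == inf:
                continue
            for b in (0, 1):
                reg = (b << (K - 1)) | s
                nxt = reg >> 1
                expected = tuple(parity(reg & g) for g in G)
                bm = sum(e != x for e, x in zip(expected, r))   # BMC: Hamming distance
                cand = pm[s] + bm                               # Add
                if cand < new_pm[nxt]:                          # Compare, Select
                    new_pm[nxt] = cand
                    prev[nxt] = (s, b)
        history.append(prev)
        pm = new_pm
    state = min(range(N_STATES), key=lambda s: pm[s])           # best final state
    decoded = []
    for prev in reversed(history):                              # Trace-Back
        state, bit = prev[state]
        decoded.append(bit)
    return decoded[::-1]
```

Encoding a short message, flipping one received bit and decoding again recovers the original message, which shows the error-correcting effect of selecting the smaller path metric at every state.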

VITERBI Decoding

● Punctured Code : a Modified Coding Scheme
  ◆ Increases Code Rate
  ◆ Decreases Coding Gain (c.f. Coding Gain is $10\log_{10}(P_{\text{without FEC}}/P_{\text{with FEC}})$)
  ◆ Example : r = 3/4 Punctured Convolutional Code

[Figure: rate 1/2 encoder output punctured to rate 3/4]

Punctured Code

● Trellis Diagram for PMC
  ◆ Example : K = 3, r = 1/2 Convolutional Code
  ◆ Branch Metric is the Hamming Distance (Hard Decision, Number of Differing Bits) or the Euclidean Distance (Soft Decision, Difference of Decimal Code) between the Received Code and the Branch Word

Trellis Diagram

[Figure: trellis diagram with path metrics]

● Viterbi Decoder Architecture
  ◆ Depuncture Logic : Used if the Received Code is a Punctured Code
  ◆ BMC : Hard/Soft Decision
  ◆ ACS : After ACS, Store the PMs in the PM Memory
  ◆ TB : Trace-Back

[Figure: block diagram with Depuncture Logic, Branch Metric Calculate, Add Compare Select, Path Metric Memory and Trace Back Memory]

VLSI Architectures for VITERBI Algorithm

If Hard decision, x is 1-bit
If Soft decision, x is 3-bit

If upper path is smaller, TB stores 0
If lower path is smaller, TB stores 1

● Serial ACS Viterbi Decoder Architecture
  ◆ Minimum Gate Count
  ◆ Maximum Delay

[Figure: serial ACS datapath with PM memory and Compare & Select units, shown at Time 0 and Time 1]

Reference : US patent 4,536,878

● Parallel ACS Viterbi Decoder Architecture
  ◆ Minimum Delay
  ◆ Maximum Gate Count
  ◆ High Routing Complexity
  ◆ No PM Memory Needed

[Figure: parallel ACS datapath, from the BMC to the Trace-Back Memory]

Reference : US patent 4,614,933

RS Encoder

● Composed of an LFSR (Linear Feedback Shift Register)
● (n, k) RS Encoder
  ◆ n : # of code symbols
  ◆ k : # of message symbols
  ◆ gi : coefficients of the generator polynomial

RS Decoding

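● Decoding Procedure
  ◆ Syndrome (Error Pattern) Calculation
    → S1, S2, ..., S2t
  ◆ Error Locator Polynomial Calculation
  ◆ Error Location Calculation
    → Find the Roots of the Error Locator Polynomial
  ◆ Error Value Calculation
    → Uses the Syndromes and the Coefficients and Roots of the Error Locator Polynomial
  ◆ Error Correction
    → Received Symbol XOR Error Value => Corrected Symbol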

RS Decoder

[Figure: RS decoder block diagram]

- Design of the Syndrome Calculation Block and the Error Location Calculation Block
- Composed of XORs, Finite Field Multipliers and Shift Registers

Error Locator Polynomial Calculation

● Euclid Algorithm
  ◆ Registers, Finite Field Multiplier, etc.

< Hardware Complexity >
- Register : 4 x t (t is the error correction capability)
- Finite Field Multiplier : 2 x t
- MUXs, XORs
- Large Gate Count, High Speed

Reference : "Reed-Solomon Euclid Algorithm Decoder Having a Process Configurable Euclid Stack," U.S. Patent 5,170,399, Dec. 8, 1992.

Channel Model

● ISI(Intersymbol Interference)

◆ Band-limited Channel Distortion (Wired Channel)

◆ Multipath Fading (Wireless Channel)

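● Equalizer : Discrete Time Filter that Compensates for Channel Distortion

[Figure: PSF, Channel and Equalizer in cascade, with the pulse response sampled at -3T, -2T, -T, 0, T, 2T, 3T]

PSF : Pulse Shaping Filter

$y(n) = \sum_{k=0}^{M-1} w(k)\,x(n-k)$

y(n) : Equalizer Output
w(k) : Tap Coefficient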

Criteria of Equalizer

● Frequency Bandwidth
  ◆ Baseband, Passband
● Sampling Time
  ◆ Symbol - One Sample/symbol
  ◆ Fractional Symbol - Two or More Samples/symbol
● Coefficient Characteristics
  ◆ Fixed, Adaptive
● Architecture
  ◆ Transversal, DFE, Lattice

Transversal Structure

● Simplest Type

● Small Gate Count, Low Speed

● Low Power Consumption

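[Figure: tapped delay line. The Input passes through delay elements T; the tap outputs are weighted by C0, C1, C2, ..., Cn-2, Cn-1 and summed to give the Output Yn.]

$y(n) = \sum_{k=0}^{N-1} C_k\,x(n-k)$

T : Sample Time
Ck : Tap Coefficient

Register : N
Multiplier : N
Adder : N-1
N : Number of Taps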

Decision-Feedback Structure

● Good Performance in Presence of Severe ISI● Moderate Gate Count, Power Consumption● Low Speed

[Figure: decision-feedback equalizer. Input Data passes through the Feedforward Filter (taps C-k1, C-k1+1, ..., C0); the Feedback Filter is driven by Decision Data from the Decision Device, or by Training Data; the two filter outputs are combined ahead of the Decision Device.]

Register : N
Multiplier : N
Adder : N-1
N : Number of Taps

Lattice Structure

● High Power Consumption

● Large Gate Count, High Speed

$f_m(n) = f_{m-1}(n) + k_m^{*}\,b_{m-1}(n-1)$

$b_m(n) = b_{m-1}(n-1) + k_m\,f_{m-1}(n)$
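The two lattice recursions can be run stage by stage for each input sample. The sketch below assumes the usual starting condition f_0(n) = b_0(n) = x(n), which the slide's figure suggests but does not state; `lattice_filter` is a hypothetical name and `k[m]` stands for the reflection coefficient of stage m+1.

```python
def lattice_filter(x, k):
    """Return the forward and backward outputs of the last lattice stage."""
    delayed_b = [0] * len(k)              # delayed_b[m] holds b_m(n-1)
    f_out, b_out = [], []
    for sample in x:
        f = b = sample                    # assumed: f_0(n) = b_0(n) = x(n)
        for m, km in enumerate(k):
            prev_b = delayed_b[m]         # b_m(n-1)
            delayed_b[m] = b              # keep b_m(n) for the next sample
            # f_{m+1}(n) = f_m(n) + k* b_m(n-1);  b_{m+1}(n) = b_m(n-1) + k f_m(n)
            f, b = f + km.conjugate() * prev_b, prev_b + km * f
        f_out.append(f)
        b_out.append(b)
    return f_out, b_out
```

Each stage costs two multiplications and two additions per sample, which is consistent with the higher multiplier and adder counts listed for this structure.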

y(n)ÿ Transversal» Øÿ ×�

K2 KM-1

InputXn

Stage1 Stage2 StageM-1

f0(n) f1(n) f2(n) fM-1(n)

bM-1(n)b2(n)b1(n)b0(n)

Register : N

Multiplier : 2N

Adder : 2N + 1

N : Number of Taps

Comparisons of Tap Update Algorithms

                            | ZF     | LMS    | RLS
Hardware Complexity         | Low    | Low    | High
Speed                       | Medium | Low    | High
Power Consumption           | Low    | Low    | High
Error Correction Capability | Low    | Medium | High

Tap Update ��kW ��

H/W implementation vs. S/W implementation

              | H/W implementation | S/W implementation
Design Time   | High               | Low
Design Cost   | High               | Low
Flexibility   | Low                | High
Upgradability | Low                | High

H/W implementation vs. S/W implementation (cont.)

● Many algorithms can be implemented on a high-performance DSP
● Task allocation is a very important issue in H/W and S/W codesign
● As the performance of DSP chips increases, much hardwired logic can be integrated into programmable chips
  ◆ Example : Qualcomm's MSM 3000, etc.
