PCI Express DMA Engine for the Active Buffer Project in the CBM Experiment
Wenxue Gao, Andreas Kugel, Reinhard Männer, Holger Singpiel, Andreas Wurz
University of Mannheim
DPG Tagung (DPG meeting), Gießen
14 March 2007
Contents
• Introduction
• Block Diagram
• Implementation
• Performance
Introduction – CBM Experiment
CBM TSR, Jan. 2006
Introduction – PCI Express
• 2.5 Gbps per lane
• Point-to-point
• TLP (Transaction Layer Packet)
– Posted: MWr (Memory Write Request), …
– Non-posted: MRd (Memory Read Request), …
– Completion: CplD, Cpl, …
– Message: Msg
PCI Express – Posted TLP (MWr, …)
[Animation: Host and Endpoint, each with Rx, Tx, and transaction layer (Trn.). The posted write requests MWr1, MWr2, and MWr3 are issued back to back and move across the link one after another.]
PCI Express – Non-posted TLP (MRd, …)
[Animation: Endpoint and Host, each with Rx, Tx, and transaction layer (Trn.). The read requests MRd1 and MRd2 are issued; the completions CplD1 and CplD2 come back and are matched to MRd1 and MRd2 through Tag[7:0].]
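The tag matching shown in this animation can be sketched in a few lines. Each outstanding MRd carries a tag, and a returning CplD is assigned to its request by that tag, so completions may arrive in any order. The class and method names are hypothetical, the 8-bit width is taken from the Tag[7:0] label, and the model assumes one CplD per MRd; it is an illustration, not the authors' implementation.

```python
class TagTable:
    """Toy model of a requester's tag table (all names are illustrative)."""

    def __init__(self, tag_bits=8):
        self.num_tags = 1 << tag_bits
        self.outstanding = {}  # tag -> (address, length) of the pending read
        self.next_tag = 0

    def issue_read(self, address, length):
        """Allocate a free tag for a new MRd and remember the request."""
        for _ in range(self.num_tags):
            tag = self.next_tag
            self.next_tag = (self.next_tag + 1) % self.num_tags
            if tag not in self.outstanding:
                self.outstanding[tag] = (address, length)
                return tag
        raise RuntimeError("no free tag: too many outstanding reads")

    def complete(self, tag):
        """Match a CplD to its MRd by tag and release the tag."""
        return self.outstanding.pop(tag)


table = TagTable()
t1 = table.issue_read(0x1000, 64)    # MRd1
t2 = table.issue_read(0x2000, 128)   # MRd2
# Completions may return out of order: here CplD2 arrives before CplD1.
req2 = table.complete(t2)
req1 = table.complete(t1)
```

Because the lookup is keyed by tag alone, the order of arrival does not matter; the table only has to guarantee that a tag is not reused while its read is still pending.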
Introduction – SG DMA
• SG (Scatter/Gather)
– Multiple-descriptor chain
• Full duplex
– Downstream: Host → Endpoint
– Upstream: Endpoint → Host
• "Done" state
– Status register
– Interrupt
[Figure: Host and Endpoint connected by a Downstream and an Upstream path.]
Block Diagram
[Figure: PCIe Transaction Layer Interface with Rx and Tx; Rx Resolution; Channel Buffer; Upstream DMA Channel, Downstream DMA Channel, and PIO Channel; Tag RAM; Tx Arbitrator; Memory (BRAM + FIFO + Registers).]
Channel Buffer
• TLP channel FIFO
– Width = 128
– Depth = 15
• TLP without payload
– Everything fits in one word
• TLP with payload
– Local address
– Additional information
[Figure: 128-bit channel buffer word for Rx and Tx. Bit boundaries 127 | 95 | 63 | 31 | 0; fields from high to low: LAdr (or unused, xxxx), Hdr2, Hdr1, Hdr0.]
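The 128-bit word layout of the channel buffer (LAdr, Hdr2, Hdr1, Hdr0 at bit boundaries 127, 95, 63, 31, 0) can be illustrated with plain integer arithmetic. The function names are hypothetical, and treating each field as exactly 32 bits is an assumption read off the bit labels on the slide.

```python
MASK32 = 0xFFFFFFFF


def pack_word(hdr0, hdr1, hdr2, ladr=0):
    """Pack three 32-bit header DWs and a 32-bit local address into 128 bits.

    Assumed layout: bits 127..96 = LAdr, 95..64 = Hdr2, 63..32 = Hdr1,
    31..0 = Hdr0. ladr stays 0 when the top field is unused (xxxx).
    """
    for field in (hdr0, hdr1, hdr2, ladr):
        if not 0 <= field <= MASK32:
            raise ValueError("field does not fit in 32 bits")
    return (ladr << 96) | (hdr2 << 64) | (hdr1 << 32) | hdr0


def unpack_word(word):
    """Return (hdr0, hdr1, hdr2, ladr) from a 128-bit channel buffer word."""
    return (
        word & MASK32,
        (word >> 32) & MASK32,
        (word >> 64) & MASK32,
        (word >> 96) & MASK32,
    )


word = pack_word(0x11111111, 0x22222222, 0x33333333, ladr=0x0000ABCD)
fields = unpack_word(word)
```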
Implementation – Splitting DMA Transfers
• A request must not cross a 4 KB boundary
• Address/length combination
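The 4 KB rule can be sketched as a small splitting routine: a transfer is cut into requests so that no single address/length combination crosses a 4 KB boundary. The function name and the maximum payload of 128 bytes are assumptions for illustration; the slides do not state the payload size used.

```python
def split_transfer(address, length, max_payload=128, boundary=4096):
    """Split one DMA transfer into (address, size) requests.

    Each request is at most max_payload bytes and never crosses a
    multiple of boundary. max_payload=128 is an assumed value.
    """
    if length < 0 or max_payload <= 0:
        raise ValueError("length must be >= 0 and max_payload > 0")
    chunks = []
    while length > 0:
        to_boundary = boundary - (address % boundary)
        size = min(length, max_payload, to_boundary)
        chunks.append((address, size))
        address += size
        length -= size
    return chunks


# 256 bytes starting 64 bytes below a 4 KB boundary.
chunks = split_transfer(0x0FC0, 256)
```

In this example the first request is shortened to 64 bytes so that it ends exactly at 0x1000; the remaining 192 bytes follow as requests of 128 and 64 bytes.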
Implementation – Confirming "Done"
• When is the DMA finished?
– A "Done" state is required
• CplDs for different MRds do not arrive in order
– Possible solutions
  • Read the Tag RAM
  • Count CplDs
  • Channel buffer empty
  • Trigger on the last tag (x)
• Fill a bitmap
– 128-bit register for 7-bit tags
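One plausible reading of the bitmap approach is sketched below: a 128-bit register holds one bit per 7-bit tag, a bit is set when an MRd is issued and cleared when its CplD arrives, and the DMA is done once all requests are out and the register is zero. The class, the `all_issued` flag, and the set/clear polarity are assumptions; the slides only state that a 128-bit register is used for 7-bit tags.

```python
class TagBitmap:
    """128-bit bitmap with one bit per 7-bit tag (illustrative model)."""

    NUM_TAGS = 128  # 2**7 tags

    def __init__(self):
        self.bits = 0            # models the 128-bit register
        self.all_issued = False  # set once the last MRd has been sent

    def _check(self, tag):
        if not 0 <= tag < self.NUM_TAGS:
            raise ValueError("tag must fit in 7 bits")

    def issue(self, tag):
        """Mark a tag as outstanding when its MRd is sent."""
        self._check(tag)
        self.bits |= 1 << tag

    def complete(self, tag):
        """Clear the tag when its CplD has been received."""
        self._check(tag)
        self.bits &= ~(1 << tag)

    def done(self):
        """Done when every request is out and no tag is still pending."""
        return self.all_issued and self.bits == 0


bitmap = TagBitmap()
bitmap.issue(3)
bitmap.issue(100)
bitmap.all_issued = True
bitmap.complete(100)           # CplDs arrive out of order
still_busy = not bitmap.done()
bitmap.complete(3)
finished = bitmap.done()
```

Unlike triggering on the last tag, this check does not depend on the order in which the completions return.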
Performance Parameters
• Target device
– Virtex4 XC4VFX60-11ff672
• FFs
– 9,834 of 50,560 (19 %)
• LUT4s
– 11,464 of 50,560 (22 %)
• RAMB16
– 58 of 232 (25 %)
• Slices
– 9,426 of 25,280 (37 %)
• Frequency (trn_clk)
– 250 MHz
• Latency (transaction layer)
– PIO: 52 ns (MRd → CplD)
– DMA: 80 ns (DMA "Start" → Tx TLP)
• Theoretical bandwidth
– 2 Gbps x4 = 8 Gbps, bi-directional
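The theoretical bandwidth figure can be reproduced with simple arithmetic. First-generation PCI Express uses 8b/10b line coding, so a 2.5 Gbps lane carries 2 Gbps of data, and four lanes give 8 Gbps per direction. The function name is hypothetical; protocol overhead from TLP headers and flow control is deliberately not modeled.

```python
def data_rate_mbps(line_rate_mbps=2500, lanes=4, data_bits=8, code_bits=10):
    """Data rate per direction after 8b/10b coding, in Mbps.

    Integer arithmetic avoids floating-point rounding. TLP header and
    link-layer overhead are ignored, so this is an upper bound.
    """
    return line_rate_mbps * data_bits // code_bits * lanes


per_lane = data_rate_mbps(lanes=1)  # 2000 Mbps = 2 Gbps
x4_link = data_rate_mbps(lanes=4)   # 8000 Mbps = 8 Gbps
```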
4-Lane Tests
[Figure: Bandwidth (Mbps, axis 0 to 7000) versus packet length (bytes, 4096 to 524288) for PIO Write, DMA Write, PIO Read, and DMA Read.]
Open Questions
• Smaller channel buffer
– 64 bits are usually enough instead of 128 bits
• Better error handling
– Currently partly incomplete
– Avoid overwriting of CplDs
– Time-out
• Tag recycling
• Higher bandwidth for downstream DMA
Summary
• PCI Express advantages
– Parallelism
– Scalability
• Virtual channels
– 2 DMA channels
– 1 PIO channel
• Xilinx solution
– 62.5 MHz for x1
– 250 MHz for x4
x4-ABB
Design Summary
--------------
Logic Utilization:
• Number of Slice Flip Flops: 9,834 out of 50,560 19%
• Number of 4 input LUTs: 11,464 out of 50,560 22%
Logic Distribution:
• Number of occupied Slices: 9,426 out of 25,280 37%
• Total Number 4 input LUTs: 12,993 out of 50,560 25%
– Number used as logic: 11,464
– Number used as a route-thru: 643
– Number used for Dual Port RAMs: 202
– Number used as Shift registers: 684
• Number of bonded IPADs: 18 out of 62 29%
• Number of bonded OPADs: 16 out of 24 66%
• Number of bonded IOBs: 1 out of 352 1%
• Number of BUFG/BUFGCTRLs: 5 out of 32 15%
– Number used as BUFGs: 4
– Number used as BUFGCTRLs: 1
• Number of FIFO16/RAMB16s: 58 out of 232 25%
– Number used as FIFO16s: 0
– Number used as RAMB16s: 58
• Number of DSP48s: 2 out of 128 1%
• Number of DCM_ADVs: 1 out of 12 8%
• Number of GT11s: 8 out of 16 50%
• Number of GT11CLKs: 1 out of 8 12%
X4 Test
DMA Process
• Buffer descriptor
– SA (Source Address)
– DA (Destination Address)
– NXA (Next Descriptor Address)
– Length (length in bytes)
– Control (control register)
• Start/Stop command
– Upstream: MWr + MRd (dex)
– Downstream: MRd
• Detecting the Busy/Done states
– Status register
– Interrupt (Msg)
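The descriptor fields listed above (SA, DA, NXA, Length, Control) suggest a linked list that the engine follows from one descriptor to the next. The sketch below walks such a chain in software. How the last descriptor is marked is not stated on the slides, so ending the chain at NXA = 0 is an assumption, as are the dictionary standing in for host memory and all function and constant names.

```python
from dataclasses import dataclass


@dataclass
class Descriptor:
    sa: int       # source address
    da: int       # destination address
    nxa: int      # address of the next descriptor
    length: int   # transfer length in bytes
    control: int  # control register value (meaning not modeled here)


END_OF_CHAIN = 0  # assumption: NXA == 0 terminates the chain


def walk_chain(memory, first_address, max_descriptors=1024):
    """Follow NXA links and return the (sa, da, length) transfers in order."""
    transfers = []
    address = first_address
    for _ in range(max_descriptors):
        desc = memory[address]
        transfers.append((desc.sa, desc.da, desc.length))
        if desc.nxa == END_OF_CHAIN:
            return transfers
        address = desc.nxa
    raise RuntimeError("descriptor chain too long or cyclic")


# Two scattered host buffers gathered into one contiguous endpoint region.
memory = {
    0x100: Descriptor(sa=0x8000, da=0x0000, nxa=0x200, length=4096, control=0),
    0x200: Descriptor(sa=0xC000, da=0x1000, nxa=0, length=2048, control=0),
}
transfers = walk_chain(memory, 0x100)
```

The `max_descriptors` limit is a defensive choice in this model so that a corrupted or cyclic chain raises an error instead of looping forever.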
Block Diagram
[Figure: detailed block diagram. Recoverable labels: Rx; Rx Resolution; DMA Upstream Engine with Registers (US: MWr, MRd, Msg); DMA Downstream Engine with Registers (DS: MRd, Msg); Tag RAM; Memory (BRAM + Registers + FIFO) with Rd and Wr ports; request and completion paths MWr_usp, MRd_usd, MRd_dsd, MRd_dsp, MRd, MWr, CplD, Cpl/D; Tx Arbitrator; Tx.]
Verification
• PIO + DMA ($random)
– Transaction length
– Address pair
– Chain length (DMA)
– Descriptor address (DMA)
– Flow control: *_rdy_n
• Output checking
– tsof/teof
– Data
– Descriptor splitting
[Figure: Root and Endpoint with Downstream (Write) and Upstream (Read) paths, labeled 1 and 2.]
Memory Space
• BRAM
– 16 KB
• FIFO
– 32 x 32
– Loop-back
• Registers
– Write / Read
– Control / Status
• Possible extensions
– DDR (similar to BRAM)
– GbE (similar to FIFO)
[Figure: BRAM, Registers, and loop-back FIFO (IFIFO, OFIFO), each with Wr and Rd ports.]
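The loop-back FIFO in the memory space can be modeled as a bounded queue in which written words come back in the same order when read. Reading "32 x 32" as 32 words of 32 bits is an assumption, and the class name and the return conventions for full and empty are illustrative only.

```python
from collections import deque


class LoopbackFifo:
    """Bounded FIFO model: words written are read back in order."""

    def __init__(self, depth=32, width=32):
        self.depth = depth              # assumed: 32 entries
        self.mask = (1 << width) - 1    # assumed: 32-bit words
        self.words = deque()

    def write(self, value):
        """Store one word; return False if the FIFO is full."""
        if len(self.words) >= self.depth:
            return False
        self.words.append(value & self.mask)
        return True

    def read(self):
        """Return the oldest word, or None if the FIFO is empty."""
        return self.words.popleft() if self.words else None


fifo = LoopbackFifo()
accepted = [fifo.write(i) for i in range(33)]  # one more than the depth
read_back = [fifo.read() for _ in range(32)]
```

A model like this is enough to check ordering and full/empty behavior of a loop-back test before real hardware is involved.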