
Lehrstuhl für Rechnertechnik und Rechnerorganisation (LRR), Fakultät für Informatik, Technische Universität München

Course IN2075: Microprocessors

Superscalarity

8 Jan 2018

Carsten Trinitis


Superscalarity

Parallel Execution & ILP

Instruction Level Parallelism

● Superscalarity:

– identification of independent instructions by hardware.

Superscalarity

● Pipeline stages:
– Parallel fetch
– Decode
– Fetch data
– Issue (assign instructions to functional units)
  – In-order issue: e.g. Pentium I
  – Out-of-order issue (since the P6 generation: Pentium Pro, Pentium II)
– Process data
– Write data

Superscalar Execution

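● Pure pipelining:
– Every unit exists once in the processor.
– In every cycle the first stage is filled.
– One instruction is started per clock cycle.

[Figure: pipeline timing diagram over time. Each instruction passes through the stages FI, DI, FD, PD, WD (fetch instruction, decode instruction, fetch data, process data, write data); the next instruction starts one cycle later.]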

Superscalar Execution

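● Replicate functional units.
● Create several pipelines in the microprocessor.
● Start several instructions concurrently.
● Potential speed-up: number of replicas.

[Figure: superscalar timing diagram over time. Several instructions pass through the stages FI, DI, FD, PD, WD in the same cycles.]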
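The two timing diagrams can be turned into a small back-of-the-envelope calculation. The sketch below is only an idealised model, assuming five stages, no conflicts of any kind and a fully occupied issue width; the function names `pipelined_cycles` and `superscalar_cycles` are made up for this illustration.

```python
import math


def pipelined_cycles(n_instructions: int, n_stages: int = 5) -> int:
    """Ideal scalar pipeline: the first instruction needs n_stages cycles,
    every further instruction finishes one cycle after its predecessor."""
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1


def superscalar_cycles(n_instructions: int, issue_width: int, n_stages: int = 5) -> int:
    """Ideal superscalar pipeline: issue_width instructions start per cycle,
    so only ceil(n / issue_width) groups travel through the stages."""
    if n_instructions == 0:
        return 0
    groups = math.ceil(n_instructions / issue_width)
    return n_stages + groups - 1


if __name__ == "__main__":
    n = 100
    scalar = pipelined_cycles(n)
    dual = superscalar_cycles(n, issue_width=2)
    print(f"scalar: {scalar} cycles, 2-way superscalar: {dual} cycles")
    print(f"speed-up: {scalar / dual:.2f} (upper bound: 2)")
```

For long instruction streams the speed-up approaches the issue width, which is the "number of replicas" bound from the slide. Real code stays below it because of the conflicts discussed on the following slides.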

Implementation Issues

● Normally not implemented as N separate pipelines.
● Flexible assignment of free units at each stage.
● Possible:
– Different numbers of units at some stages
– Specialised units (load/store, arithmetic)
● Important property: "issue width"
– The maximum number of instructions that can be issued in one clock cycle.

● Name: superSCALAR, in contrast to vector processors (each instruction still operates on scalar operands).

Dealing with Conflicts

● Similar problems as with standard pipelining:
– Need to observe dependencies
– Need to observe control flow changes

● Additional problem:
– Resource conflicts when accessing functional units
– Assignment and reservation of units
– Need for complex hardware

● Possible optimisation:
– Out-of-order issue / execution
– Instructions with open dependencies wait

From a Software Point of View

● Again similar to pipelining:
– Arrange code to avoid dependencies
– Try to group instructions correctly
– Reduce the task of the dynamic scheduler
– Task of the compiler backend

● Utilise software tools:
– Play with compiler options
– For some architectures, additional tools exist
  – E.g. Parallel Studio XE from Intel
  – Useful for performance debugging of kernels

The Intel Pentium II

Intel Sandy Bridge

AMD Bulldozer

Superscalar Pipelines (Overview)

● Conflicts:
– Structural conflicts
  – E.g. 2 integer units, one FPU, and one load/store unit
– Control flow conflicts
– Data conflicts
  – Read after write (RAW)
  – Write after read (WAR, commonly due to reordering)
  – Write after write (WAW)
  – Techniques to handle data conflicts:
    – Scoreboarding, 1964, Control Data Corporation (CDC)
    – Tomasulo Algorithm, Register Renaming, 1967, IBM 360/91
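A structural conflict can be illustrated with the unit mix from the example above (2 integer units, one FPU, one load/store unit). The following sketch is a simplified model, not a description of any particular processor: it assumes in-order issue within a cycle and that every unit accepts one instruction per cycle. The names `UNITS` and `issue_this_cycle` are hypothetical.

```python
from collections import Counter

# Unit mix taken from the slide's example; the class names are made up here.
UNITS = {"int": 2, "fpu": 1, "ldst": 1}


def issue_this_cycle(wanted, units=UNITS):
    """Return the instruction classes that can be issued in one cycle.

    `wanted` lists the unit class each waiting instruction needs, in program
    order. Issue stops at the first instruction that finds no free unit
    (structural conflict), because this toy model issues strictly in order.
    """
    free = Counter(units)
    issued = []
    for kind in wanted:
        if free[kind] == 0:
            break  # structural conflict: no free unit of this class
        free[kind] -= 1
        issued.append(kind)
    return issued


if __name__ == "__main__":
    print(issue_this_cycle(["int", "int", "ldst", "fpu"]))  # all four fit
    print(issue_this_cycle(["int", "fpu", "fpu", "int"]))   # second FPU op blocks
```

In the second call the second floating-point instruction finds the only FPU taken, so it and everything behind it must wait for the next cycle.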

Simplest Case: In-Order Execution

● Examples:
– Intel Pentium I, Atom, simpler ARM architectures

● Principle:
– N (usually 2) instructions are fetched at the same time.
– If a RAW conflict is detected, the execution of one of the instructions is delayed (NOP inserted).
– WAW conflicts usually only occur in conjunction with a RAW:
  (1) MUL r1 ← r2,r3
  (2) ADD r4 ← r1,r5 // RAW with line (1)
  // r1 is no longer needed and could be reused:
  (3) SUB r1 ← r10,1 // WAW with line (1)
– WAR conflicts usually do not occur.
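The in-order principle can be sketched as a pairing rule. The code below is a deliberately simplified model: it ignores instruction latencies (it assumes each issue group has completed before the next one starts) and only checks RAW and WAW inside a group, since WAR does not arise when instructions stay in order. The function name `dual_issue_groups` and the tuple encoding `(dest, sources)` are assumptions made for this example.

```python
def dual_issue_groups(program, width=2):
    """Group instructions for in-order issue.

    `program` is a list of (dest, sources) tuples in program order.
    An instruction joins the current group only if the group is not full
    and it has no RAW or WAW conflict with a member of that group;
    otherwise it is delayed and opens the next group.
    """
    groups = []
    for idx, (dest, srcs) in enumerate(program):
        group = groups[-1] if groups else None
        blocked = group is None or len(group) >= width or any(
            program[j][0] in srcs      # RAW on a member of the current group
            or program[j][0] == dest   # WAW on a member of the current group
            for j in group
        )
        if blocked:
            groups.append([idx])
        else:
            group.append(idx)
    return groups


if __name__ == "__main__":
    # The three instructions from the slide; the immediate 1 is not a register.
    program = [
        ("r1", ("r2", "r3")),   # (1) MUL r1 <- r2,r3
        ("r4", ("r1", "r5")),   # (2) ADD r4 <- r1,r5
        ("r1", ("r10",)),       # (3) SUB r1 <- r10,1
    ]
    print(dual_issue_groups(program))
```

Under these assumptions (2) cannot be paired with (1) because of the RAW on r1, while (3) can be issued together with (2).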

Out-of-Order Execution: Motivation

(1) MUL r1, r2, r3

(2) SUB r4, r1, r5 // RAW data conflict with (1)

(3) MUL r6, r2, r3 // resource conflict with (1)

(4) SUB r1, r10, 1 // WAW data conflict with (1), WAR data conflict with (2)

● Problem:
– Line (2) has to wait for the result of line (1) (e.g. 6 cycles).
– Line (3) has to wait for only 1 cycle!
– If WAR and WAW conflicts could be resolved, line (4) could also start after 1 cycle.
– Without reordering of instructions, lines (3) and (4) have to wait unnecessarily.
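The data conflicts annotated in the four-line listing can be checked mechanically. The sketch below compares register names for every pair of instructions; it covers data conflicts only (the resource conflict between the two MULs depends on the number of multipliers and is not modelled). `PROGRAM` and `data_conflicts` are names chosen for this illustration.

```python
from itertools import combinations

# line number -> (opcode, destination, source registers); immediates are omitted
PROGRAM = {
    1: ("MUL", "r1", ("r2", "r3")),
    2: ("SUB", "r4", ("r1", "r5")),
    3: ("MUL", "r6", ("r2", "r3")),
    4: ("SUB", "r1", ("r10",)),
}


def data_conflicts(earlier, later):
    """Return the set of data conflicts the later instruction has with the earlier one."""
    _, dest_e, srcs_e = earlier
    _, dest_l, srcs_l = later
    found = set()
    if dest_e in srcs_l:
        found.add("RAW")  # later reads what earlier writes
    if dest_l in srcs_e:
        found.add("WAR")  # later overwrites what earlier still has to read
    if dest_l == dest_e:
        found.add("WAW")  # both write the same register
    return found


def all_conflicts(program):
    """Map (earlier line, later line) to its non-empty set of data conflicts."""
    result = {}
    for a, b in combinations(sorted(program), 2):
        found = data_conflicts(program[a], program[b])
        if found:
            result[(a, b)] = found
    return result


if __name__ == "__main__":
    for (a, b), kinds in all_conflicts(PROGRAM).items():
        print(f"({b}) conflicts with ({a}): {', '.join(sorted(kinds))}")
```

Only the RAW between (1) and (2) is a true dependency. The WAW and WAR conflicts involving (4) stem from the reuse of the name r1, which is why they can in principle be resolved.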

Out of Order Execution: Overview

● Two implementations:
– Scoreboarding
  – Simple, centralised approach
  – Register file is extended by status bits
– Register renaming & Tomasulo Algorithm
  – More powerful, decentralised approach
  – Each functional unit (FU) is extended by so-called reservation stations

Scoreboarding

● Register file is extended by status bits:
– Valid: cleared if the register will be written to.
– Read-tag: stores the number of the FU waiting for a result which will be written to this register.
– Pref-tag: records the order of read / write; important to discriminate between RAW and WAR.

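Scoreboarding (example)

[Figures: step-by-step scoreboard example; only the annotations survive.]
● Register A is no longer valid once an instruction that writes A has been issued.
● A and C are required for SUB; the SUB unit waits until register A has been written.
● The write-back to C must wait for the read by SUB.
● Instruction (4) cannot be issued.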

Scoreboarding - Properties

● Handles RAW and WAR dependencies.
● Execution / write-back of the respective instructions is delayed.
● For each register, only one dependency can be stored.
● Further dependencies block execution.
● WAW dependencies block execution.
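These properties can be mimicked with a toy model. The sketch below is not the CDC scoreboard and does not use the valid / read-tag / pref-tag encoding from the slides: it simply tracks pending writers and unread operands in Python dictionaries (and therefore can store more than one dependency per register). All class and method names are hypothetical. It shows the three rules: RAW delays the operand read, WAR delays the write-back, WAW blocks issue.

```python
class Scoreboard:
    """Toy scoreboard tracking pending writes and outstanding operand reads."""

    def __init__(self):
        self.pending_writer = {}  # register -> id of the issued instruction that will write it
        self.dest = {}            # instruction id -> destination register
        self.waiting_for = {}     # instruction id -> ids of producers it still waits for (RAW)
        self.unread = {}          # instruction id -> source registers not read yet
        self.next_id = 0

    def can_issue(self, dest):
        # WAW: a register with a pending write cannot get a second writer.
        return dest not in self.pending_writer

    def issue(self, dest, srcs):
        if not self.can_issue(dest):
            raise RuntimeError(f"WAW on {dest}: issue is blocked")
        ident = self.next_id
        self.next_id += 1
        self.waiting_for[ident] = {
            self.pending_writer[s] for s in srcs if s in self.pending_writer
        }
        self.unread[ident] = set(srcs)
        self.dest[ident] = dest
        self.pending_writer[dest] = ident
        return ident

    def can_read_operands(self, ident):
        # RAW: every producer of a source operand must have written back.
        return not self.waiting_for[ident]

    def read_operands(self, ident):
        if not self.can_read_operands(ident):
            raise RuntimeError("RAW: operands are not ready yet")
        self.unread[ident] = set()

    def can_write_back(self, ident):
        # WAR: no older instruction may still have to read the old value.
        dest = self.dest[ident]
        return all(dest not in regs
                   for older, regs in self.unread.items() if older < ident)

    def write_back(self, ident):
        if not self.can_write_back(ident):
            raise RuntimeError("WAR: an older instruction has not read yet")
        del self.pending_writer[self.dest[ident]]
        for producers in self.waiting_for.values():
            producers.discard(ident)


if __name__ == "__main__":
    sb = Scoreboard()
    mul = sb.issue("r1", ["r2", "r3"])   # MUL r1 <- r2,r3
    sub = sb.issue("r4", ["r1", "r5"])   # SUB r4 <- r1,r5 (RAW on r1)
    add = sb.issue("r5", ["r6", "r7"])   # hypothetical ADD r5 <- r6,r7 (WAR with SUB)
    sb.read_operands(add)
    print(sb.can_read_operands(sub))     # SUB still waits for r1
    print(sb.can_write_back(add))        # ADD must not overwrite r5 yet
    print(sb.can_issue("r1"))            # a second write to r1 is blocked
    sb.read_operands(mul)
    sb.write_back(mul)
    sb.read_operands(sub)
    print(sb.can_write_back(add))        # SUB has read r5, ADD may write back
```

The ADD instruction is not part of the lecture's listing; it was added here only to provoke a WAR situation like the one annotated in the figures.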

Tomasulo Algorithm - Goals

● Get rid of WAR and WAW conflicts by register renaming.

● RAW conflicts still require a delay, though.

● A RAW conflict no longer waits for write-back: results are forwarded directly from FU to FU.

● Execution can also be blocked due to resource conflicts.
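The effect of register renaming can be shown on the four instructions from the motivation slide. The sketch below illustrates the general renaming idea only: it hands out an unbounded supply of fresh names p0, p1, ..., whereas the Tomasulo algorithm renames onto reservation station tags and real hardware has a finite number of them. The function name `rename_registers` is an assumption of this example.

```python
from itertools import count


def rename_registers(program):
    """Rename architectural registers so that every write gets a fresh name.

    `program` is a list of (dest, sources) tuples in program order.
    Sources are translated with the mapping that is current at that point,
    which preserves RAW dependencies; every destination gets a new name,
    which removes WAR and WAW conflicts.
    """
    fresh = (f"p{i}" for i in count())
    mapping = {}
    renamed = []
    for dest, srcs in program:
        for s in srcs:
            if s not in mapping:
                mapping[s] = next(fresh)  # value that existed before the program
        new_srcs = tuple(mapping[s] for s in srcs)
        mapping[dest] = next(fresh)       # fresh name for every write
        renamed.append((mapping[dest], new_srcs))
    return renamed


if __name__ == "__main__":
    program = [
        ("r1", ("r2", "r3")),   # (1) MUL r1, r2, r3
        ("r4", ("r1", "r5")),   # (2) SUB r4, r1, r5
        ("r6", ("r2", "r3")),   # (3) MUL r6, r2, r3
        ("r1", ("r10",)),       # (4) SUB r1, r10, 1
    ]
    for line, (dest, srcs) in enumerate(rename_registers(program), start=1):
        print(f"({line}) {dest} <- {', '.join(srcs)}")
```

After renaming, (2) still reads the name written by (1), so the RAW conflict remains. Instruction (4) writes a name that no earlier instruction reads or writes, so its WAW and WAR conflicts are gone.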

Reservation Stations

Summary

● Superscalar processors can issue more than one instruction per cycle.

● Two varieties:
– In-order: one conflict blocks the overall stream.
– Out-of-order: a conflict only blocks one instruction.

● Tomasulo Algorithm is used in out-of-order superscalar processors.
– Detects and avoids WAW and WAR conflicts
– Detects RAW and resource conflicts

● Limitation in practice:
– Number of independent instructions found: ≈ 4-6
