
DISSERTATION

An Energy Aware Framework for Mobile Computing

carried out for the purpose of obtaining the academic degree of Doctor of Technical Sciences

submitted to the
Technische Universität Wien
Faculty of Electrical Engineering and Information Technology

by

Dipl.-Ing. Naeem Zafar Azeemi
Brigittenauer Lände 224/6643, 1200 Wien
born in Karachi, Pakistan, on August 14, 1968
Matriculation number: 0327346

October 6, 2007

Advisor

Univ.Prof. Dipl.-Ing. Dr.techn. Markus Rupp
Technische Universität Wien
Institute of Communications and Radio Frequency Engineering

Examiner

Univ.Prof. Dr.phil.nat. Christoph Grimm
Technische Universität Wien
Institute of Computer Technology

To Amra, Mukashfa and Kunza

ABSTRACT

Since their inception, energy dissipation has been a critical issue for mobile computing systems. Although a large research investment in low-energy circuit design and hardware-level energy management has led to more energy-efficient architectures, there is a growing realization that the contribution to energy conservation should be considered more rigorously at higher levels of the system, such as operating systems and applications.

This dissertation puts forth the claim that energy-aware compilation to improve application quality, in terms of both execution time and energy consumption, is essential for high-performance embedded system design in mobile computing. Our work is a design paradigm shift from the logic gate being the basic silicon computation unit to an instruction running on an embedded processor. Multimedia DSP processors are the most attractive choice for mobile computing system design because they deliver high data throughput at low energy. They use instruction-level parallelism (ILP) in programs to execute more than one primitive instruction at a time. In this work, we exploit the parallelism slacks unraveled by the native multimedia DSP compilers. We propose an iterative compilation environment to optimize a given 'C' source code. The contribution of our framework is the collaboration of an application profile monitor (APM) with an optimization engine in native multimedia DSP Software Development Environments (SDE). We propose to monitor application behavior at all levels (static, compilation, scheduling, linking, and execution). These APMs are later used in an optimization engine to speculate optimal code transformation schemes. These schemes are applied successively across the basic code blocks. We propose two methods for the selection of optimization schemes: Gradient Mode Iterative Compilation (GMIC) and Multicriteria Stochastic Iterative Compilation (MSIC). Both schemes are tested on several multimedia applications obtained from diverse domains such as video transcodecs (MPEG2, H-264L), audio transcodecs (G-723, MP3), and bioinformatics (Glimmer, Fgene), to name a few.

Finally, we propose the characterization of application-architecture correlations that support our claim that an ideal performance of a mobile computing system demands a perfect match between hardware capability and program behavior. We present results for 20 multimedia applications evaluated on the TriMedia DSP 1300, the Blackfin DSP ADSP533, and the PIII-850 embedded processor.

Keywords: Energy Aware, Source-to-Source, Multimedia Processor, Workload Characterization.


ZUSAMMENFASSUNG (SUMMARY)

Energy consumption has been a decisive factor ever since mobile computing systems came into existence. Although numerous research results have already led to hardware solutions with low energy consumption, it has become clear that energy savings at higher levels, such as operating systems and applications, should be given greater consideration.

This dissertation shows that energy-aware compilation reduces execution time and is therefore an essential criterion for ensuring an efficient embedded system for mobile data processing. Our work deals with a new design paradigm that no longer concentrates on individual logic gates as the basic design elements, but instead on individual instructions on an embedded processor. Digital signal processors for multimedia applications are the most cost-effective solution for a mobile data processing system, ensuring optimal data throughput time at low energy demand. They exploit the instruction-level parallelism (ILP) of programs in order to execute several primitive instructions at the same time. In this dissertation, program parallelization is captured with a special monitor. Furthermore, we propose stepwise compilation to optimize the given program code in 'C'. A further contribution is a programming environment for the analysis of applications and their optimization. Here, program behavior is monitored at several levels (static level, compilation, scheduling, linking, and during execution). These analyses are subsequently used by an optimization program to determine an optimal compiler configuration. This work presents two different methods for selecting the optimization options, namely a gradient method and a stochastic method. Both methods are tested with various multimedia applications from different domains, such as video coding (MPEG2, H-264L), audio coding (G-723, MP3), and bioinformatics (Glimmer, Fgene).

Finally, we propose metrics for capturing the correlation between application and hardware, which support our claim that ideal performance of a mobile data processing system can only be achieved when hardware capacity and program behavior match perfectly. The capability of these metrics is demonstrated on the TriMedia DSP 1300, Blackfin DSP ADSP533, and PIII-850 processors.

Keywords: Energy-aware, source code transformation, embedded systems, multimedia processors, mobile computing, workload characterization

ACKNOWLEDGEMENTS

I would like to thank my teacher Khwaja Shamsuddin Azeemi and my parents, who have had a positive effect on me personally and to whom I owe a debt of gratitude for helping in one way or another to influence the person I am today.

First and foremost, I thank my supervisor Dr. Markus Rupp for his consistent efforts to invoke my inherent skills to accomplish this task successfully. I appreciate his bottomless patience for technical review and the substantive comments that improved the readability of the dissertation.

Thanks to my sister Farhi, and brothers Waseem and Nadeem, who provide encouragement in the face of every seemingly impossible task that I face.

Thanks to Afsar, Sobia, Shams Sahib, Ana Eliza and Liana for their love, support and great understanding, especially during vulnerable moments.

Thanks to my friends, colleagues and acquaintances: Bastian, Martin at the Christian Doppler Laboratory; Sabine from Vienna; Naveed and Saima from Boston; Nadeem and family from San Francisco; Amir Malik and family from Korea for their kind assistance and facilitation during the last 45 months.

I would like to acknowledge valuable technical support from Dr. Arpad Scholtz at the Institute of Communications and Radio Frequency Engineering, Dr. Stefan Mahlknecht at the Institute of Computer Technology and Aneesa Sultan at the Vienna Bio Center.

I am also grateful to Dr. Christoph Grimm for his time and patience in reviewing this manuscript.

CONTENTS

1 Introduction
  1.1 Motivation
    1.1.1 Mobile Embedded System Constraints
    1.1.2 IC Fabrication Technology Constraints
    1.1.3 Battery Technology Constraints
    1.1.4 Architecture-Application Correlation Slacks
  1.2 Design Space Exploration
  1.3 Thesis Outline

2 Energy-Cycle Aware Compilation Framework (ECACF)
  2.1 Energy Saving Techniques - A Review
    2.1.1 Fabrication level power reduction
    2.1.2 Processor level power reduction
    2.1.3 EDA tools level power reduction
    2.1.4 Compiler level power reduction
    2.1.5 Low power data structures
    2.1.6 Idle mode power reduction
    2.1.7 Power reduction in distributed computing systems
    2.1.8 Power reduction in communication systems
    2.1.9 Battery aware power reduction
  2.2 Multimedia DSPCPU Architecture
    2.2.1 Multimedia Processor Execution Model
    2.2.2 Multimedia Processor Operations Overview
  2.3 Workload Description
    2.3.1 Multimedia Applications
    2.3.2 Bioinformatics Workload
  2.4 Energy Cycle Aware Compilation Framework Methodology
    2.4.1 Application Expression Profile
  2.5 Experimental Setup
    2.5.1 Related Work for Energy Measurement
    2.5.2 Proposed Methodology
  2.6 Conclusions

3 Gradient Mode Iterative Compilation (GMIC)
  3.1 GMIC Architecture
    3.1.1 Performance Qualifier Measurement
    3.1.2 Code Block Queuing
    3.1.3 Code Block Expression Profile
    3.1.4 Transformation Scheme
  3.2 Implementation
  3.3 Example: Optimization of an MPEG-1 encoder
  3.4 Discussion
  3.5 Conclusions

4 Multicriteria Stochastic Iterative Compilation (MSIC)
  4.1 Introduction
  4.2 Model Development
    4.2.1 Objects and Constraints
    4.2.2 Case Study I - Arbitrary Application
    4.2.3 Case Study II - Nonlinear Interpolative Vector Quantization (NLIVQ)
  4.3 Performance Comparison with GMIC
  4.4 Conclusions

5 Application-Architecture Characterization
  5.1 Terminologies
    5.1.1 Principal Component Analysis (PCA)
    5.1.2 Scree Plot
    5.1.3 Box Plot
    5.1.4 Scatter Plot
    5.1.5 Differential Application Expression Profile (dAEP)
  5.2 Application Characterization
    5.2.1 Case Study 1
    5.2.2 Case Study 2
    5.2.3 Case Study 3
  5.3 Architecture-Centric Application Characterization
  5.4 Conclusions

6 Conclusions

Appendices

A List of Application Expression Profile (AEP) Monitors

B VLIW Descriptor File (VDF) Format

C User Constraints Files (UCF) Format
  C.1 UCF for MPEG-1 encoder example in Section 3.3
  C.2 UCF for NLIVQ example in Section 4.2.3

D Application Attributes

E List of Acronyms

LIST OF FIGURES

1.1 Power consumption for Intel CPUs [1].
1.2 Thermal and power delivery cost in a desktop PC [2].
1.3 Battery technologies and their capacities [3].
1.4 Thesis Structure.

2.1 TriMedia VLIW instruction [4].
2.2 TriMedia functional unit assignment [4].
2.3 Transformation methodology.
2.4 Vertical application profile layers.
2.5 Experimental setup for instruction/program current measurement [5].
2.6 Proposed experimental setup for application current measurement at processor and memory.
2.7 Current consumption for vector quantization (VQ) application execution life cycle.
2.8 CPU core current consumption versus address range for VQ application.
2.9 Memory current consumption versus address range for G-728 audio transcodec.
2.10 CPU core current consumption versus address range for G-728 audio transcodec.
2.11 CPU peripheral current consumption versus address range for G-728 audio transcodec.

3.1 Gradient mode Iterative Compilation Methodology (GMIC).
3.2 Fraction of JPMO CB in an MPEG-1 application; the code blocks are numbered from fb01 to fb34.
3.3 Fraction of JPMO contributed by code blocks in an MPEG-1 application (a window view for seven blocks).
3.4 GMIC algorithm.
3.5 Heuristic track of CT-Tuple for an MPEG-1 encoder application.
3.6 Heuristic track of CTxy tuple for FFT application.
3.7 Heuristic track of CTxy tuple for IDCT application.
3.8 Heuristic track of CTxy tuple for T64 application.
3.9 Heuristic track of CTxy tuple for M100 application.
3.10 Heuristic track of CTxy tuple for H-264L application.

4.1 A simplified view of framework with multicriteria methodology extension.
4.2 Simplified Genetic Algorithm Model [6].
4.3 Development of fitness function for Case Study 1 in TS1 and TS2.
4.4 Fraction of IPC for Case Study 1 in TS1 and TS2.
4.5 Fraction of IPC and Energy overlapping for Case Study 1 in TS1 and TS2.
4.6 Fraction of CPU cycles for CB life time (CBLT) in NLIVQ application (25 CB are numbered from F01 to F25).
4.7 Development of the fitness function for NLIVQ.
4.8 Fraction of IPC for NLIVQ.
4.9 Fraction of energy saving for NLIVQ.
4.10 Fraction of functional unit utilization for NLIVQ.

5.1 Scatter plot for 20 applications at the TriMedia processor.
5.2 PCA Scree plot for 20 applications at the TriMedia processor.
5.3 PCA box plot for 20 applications at the TriMedia processor.
5.4 PCA biplot for 20 applications at the TriMedia processor.
5.5 Scatter plot for 20 applications at the Blackfin processor.
5.6 PCA biplot for 20 applications at the Blackfin processor.
5.7 Scatter plot for 20 applications at the PIII 850 processor.
5.8 PCA biplot for 20 applications at the PIII 850 processor.
5.9 Differential AEP across three hardware platforms.
5.10 PCA biplot for 20 applications across the TriMedia processor and the Blackfin processor.
5.11 PCA biplot for 20 applications across the Blackfin processor and the PIII 850 processor.
5.12 PCA biplot for 20 applications across the TriMedia processor and the PIII 850 processor.

LIST OF TABLES

2.1 Energy reduction techniques for embedded system design.
2.2 Multimedia Benchmarks (Speech Transcodecs).
2.3 Multimedia Benchmarks (Video Transcodecs).
2.4 Multimedia Benchmarks (Audio Transcodecs).
2.5 Generic DSP application Benchmarks [7].
2.6 Test Vectors Characterization.
2.7 Bio-Computation Applications Benchmark.

3.1 Transformation Schemes.
3.2 Gradient Table.

4.1 CBLT in CPU cycles for NLIVQ.
4.2 Achieved CPU cycles (%) in ECHCB of NLIVQ application for TS04, TS07, TS09.
4.3 Sum of absolute difference for TS04, TS07, TS09.
4.4 Performance comparison between GMIC and MSIC.

5.1 MPEGdec profile for successive transformations [8].

D.1 Pseudonyms for 20 applications.
D.2 AEP for optimized 20 applications at the TriMedia processor.
D.3 AEP for optimized 20 applications at the Blackfin processor.
D.4 AEP for optimized 20 applications at the PIII 850 processor.
D.5 dAEP for optimized 20 applications across the TriMedia and the Blackfin processors.
D.6 dAEP for optimized 20 applications across the Blackfin and the PIII 850 processors.
D.7 dAEP for optimized 20 applications across the TriMedia and the PIII 850 processors.

1 INTRODUCTION

1.1 Motivation

The growing trend towards untethered ubiquitous computing entails many performance-related issues. The ideal performance of a mobile computing system demands a perfect match between architecture capability and program behavior. Architecture performance can be enhanced with better hardware technology, innovative small-geometry Integrated Circuit (IC) features, and efficient resource management [9]. In the same vein, the demand for multimedia functions on handheld devices requires enormous computation power to handle large data and program sizes. Efficient architecture utilization, in terms of both energy dissipation and execution time, and optimal application firmware are two important performance metrics for these embedded systems. Optimal architecture utilization is hampered by several design limitations, such as high-level system design constraints, fabrication-level constraints, and battery technology constraints. They are discussed next in more detail.

1.1.1 Mobile Embedded System Constraints

Mobile embedded systems (MES) present unique challenges and opportunities for system-level low-energy designs, e.g.,

• MES are usually severely energy constrained. In particular, handheld devices, airborne, and spaceborne systems are typically battery-operated and therefore have a limited energy budget [10]. MES are also typically more time-constrained than portable embedded or general-purpose systems. Therefore, the challenge is to save energy while guaranteeing temporal constraints.

• Some MES applications such as avionics, robotics, and deep space missions require systems with small form factors, which in turn mandates low heat dissipation. Since heat is a byproduct of energy dissipation, low-energy system design ensures a more reliable system by limiting the heat produced.

• MES are typically over-designed to ensure that the temporal deadline guarantees are still met even if all tasks take up their Worst-Case Execution Time (WCET). Since, in the average case, tasks do not require their WCET, the redundancy in hardware design in MES makes them energy inefficient.

In short, system-level techniques can decrease this energy dissipation through the use of energy-aware task scheduling algorithms while preserving the temporal constraints of the system.

1.1.2 IC Fabrication Technology Constraints

Integrated circuits in their various incarnations consume some amount of electric power. This power is dissipated both by the action of the switching devices contained in the IC (such as transistors) and as heat due to the resistivity of the electrical circuits. This is a major consideration in the design of microprocessors and the embedded systems they are used in [11]. Figure 1.1 shows the power consumption for the Intel series of processors produced over the last two decades [1]. The horizontal axis shows the advancement in IC fabrication technology in terms of chip geometry (in nanometers), while power dissipation is plotted in Watts. Each point is marked with two numbers, showing chip geometry and power consumption, respectively. Points lying on the same vertical line, such as (350,43) and (350,34.8), show processors in the same technology but with different performance; e.g., (350,43) and (350,34.8) correspond to the PII 300 MHz and the PII 233 MHz, respectively. Similarly, the P4 3 GHz was fabricated at 130 nm and dissipates 81.9 W, while the later P4 EE 3.40 GHz is fabricated at the smaller 90 nm geometry and dissipates 83.9 W; it was further improved for a higher operating frequency (P4 EE 3.73 GHz) at the same geometry, but at the penalty of an increase in power consumption to 115 W. The increasing trend towards special purpose core processors has further reduced the geometry down to 65 nm, with a power consumption of 130 W (for the Intel Core 2 Extreme QX6700). Readers are encouraged to read [1] [12] [13] for a detailed view of power versus technology trends realized by various CPU manufacturers.

Attempts to shape the power-geometry envelope (shown as a shoe in Figure 1.1) reach their limits at a fabrication technology of 50 nm, where leakage current starts dominating the power consumption (discussed further in Chapter 2). Although special purpose core processors are implemented at 50 nm [14] [12] with a power consumption of 14.5 W (shown at the bottom of the heel in Figure 1.1), their operating frequency is limited to 130 MHz, which is not sufficient to meet the current demand for multimedia processing. The designers' goal of achieving a low-leakage 'heel' in the power-geometry shoe is associated with a high power cost. This cost has two components. The first is the thermal cost, which is associated with keeping the devices below the specified operating temperature limits. Maintaining the integrity of packaging at higher temperatures also requires expensive solutions. The second component is the on-board power delivery cost, which is related to the on-board decoupling capacitances and interconnects associated with the power distribution network. Moreover, the growing trend towards driving the CPU at lower operating voltage and higher frequency increases the magnitude of the current drawn by the CPU. This exacerbates resistive and inductive noise problems and leads to a significant increase in system cost.

Fig. 1.1: Power consumption for Intel CPUs [1].

Figure 1.2 gives an idea of the range of dollar amounts associated with the above costs for different system components [2]. As can be seen, when the system power is in the 35-40 W range, the cost of each additional Watt tends to grow above $1/W per chip. Designers have already pushed the fabrication limits to achieve low-energy design goals [15]. E.g., shrinking the integrated circuit geometry below 50 nm doubles the leakage current as compared to 65 nm. Such issues heighten the need to consider low-energy design more rigorously at higher levels of the system hierarchy [5].

1.1.3 Battery Technology Constraints

The energy constraints on mobile devices are becoming increasingly tight as complexity and performance requirements continue to be pushed by user demand [16]. Processor speeds have doubled approximately every 18 months, as predicted by Moore's law [17]. While processor speed and energy consumption have increased rapidly, the corresponding improvement in battery technology has been slow. In fact, battery capacity has increased by a factor of less than four in the last three decades [3] [18].


Fig. 1.2: Thermal and power delivery cost in a desktop PC [2].

Figure 1.3 shows the current state of the art in battery technology. The increase in battery capacity is held back by the limits of ionization chemistry [3] [19]. The design target of batteries with a long life span and small size is hard to achieve. E.g., although Ni-MH is lighter in weight than Ni-Cd, it requires a longer recharging time. In the same vein, Li-Ion batteries are more promising owing to their higher energy density, large number of charging cycles, little memory effect, and longer shelf life, but their higher cost and the additional external protection required against discharging inhibit their wide low-cost use. In short, the technological constraints on realizing a high-capacity, small-size battery highlight the importance of low-energy considerations.

1.1.4 Architecture-Application Correlation Slacks

Traditionally, optimal MES performance is sought by focusing on the underlying hardware architecture. This ignores the fact that it is the software executing on a CPU that determines its energy consumption. The execution time and energy consumption of a program on any parallel processor depend not only on the composition of operations contained within the program, but also on the ability of users to express the parallelism at the correct granularity level for the processor. Therefore, to fairly compare the cycle-energy performance of two applications on a given processor, two different mappings of the applications will be required, one for each application. An integrated approach that considers energy-cycle performance at the architecture as well as the application level is essential for energy-efficient application development.

Fig. 1.3: Battery technologies and their capacities [3].

1.2 Design Space Exploration

Program behavior is difficult to predict due to its heavy dependence on application and run-time conditions [20] [21]. For mobile computing, application performance can be optimized by using parallel hardware architectures, such as Very-Long Instruction Word (VLIW) architectures [22] [23]. VLIW architectures are a suitable alternative for exploiting instruction-level parallelism (ILP) in programs, that is, for executing more than one basic (primitive) instruction at a time. These processors contain multiple functional units. They fetch from the instruction cache a Very-Long Instruction Word containing several primitive instructions, and dispatch the entire VLIW for parallel execution. These capabilities are exploited by compilers, which generate code in which independent primitive instructions executable in parallel are grouped together. The processors have relatively simple control logic because they perform neither dynamic scheduling nor reordering of operations (unlike most contemporary superscalar processors, which do). The instruction set of a VLIW architecture tends to consist of simple (RISC-like) instructions. The compiler must assemble many primitive operations into a single "instruction word" such that the multiple functional units are kept busy, which requires enough ILP in a code sequence to fill the available operation slots.
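The packing task a VLIW compiler faces can be illustrated with a small sketch. It is not the scheduler of any native DSP compiler; it is a minimal greedy list scheduler under simplifying assumptions: every operation takes one cycle, any operation fits any slot, and the issue width `slots` defaults to a hypothetical five. The names `pack_vliw`, `slot_utilization`, and `deps` are illustrative only.

```python
from typing import Dict, List, Set


def pack_vliw(deps: Dict[str, Set[str]], slots: int = 5) -> List[List[str]]:
    """Greedily pack single-cycle operations into VLIW instruction words.

    deps maps each operation name to the set of operations whose results it
    needs. An operation may issue only after all of its producers have issued
    in an earlier word. slots is the assumed issue width.
    """
    remaining = dict(deps)
    done: Set[str] = set()
    words: List[List[str]] = []
    while remaining:
        # Operations whose producers all completed in earlier words.
        ready = sorted(op for op, d in remaining.items() if d <= done)
        if not ready:
            raise ValueError("unsatisfiable or cyclic dependency")
        word = ready[:slots]
        words.append(word)
        done.update(word)
        for op in word:
            del remaining[op]
    return words


def slot_utilization(words: List[List[str]], slots: int = 5) -> float:
    """Fraction of operation slots filled across all instruction words."""
    if not words:
        return 0.0
    return sum(len(w) for w in words) / (slots * len(words))


if __name__ == "__main__":
    # a*b + c, then store: only the three loads are mutually independent.
    ops = {
        "ld_a": set(), "ld_b": set(), "ld_c": set(),
        "mul": {"ld_a", "ld_b"}, "add": {"mul", "ld_c"}, "st": {"add"},
    }
    words = pack_vliw(ops)
    print(words, slot_utilization(words))
```

In this toy dependency chain most slots stay empty after the first word, which is the situation the text describes: without enough ILP in the code sequence, the functional units cannot be kept busy.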

In mobile computing software design, the conventional software development environment (for compilation and machine code generation) cannot be used. In conventional methods, execution time and code size are the primary considerations, while energy dissipation is addressed only as an afterthought in the final design; this inevitably leads to an expensive cooling mechanism and eventually increases the overall system cost while reducing reliability. The software perspective on power consumption has been the subject of the work in [24], where a detailed instruction-level power model of the Intel 486DX2 was built, and the impact of software on CPU power and energy consumption, together with software optimizations to reduce them, was studied. It is well known that the number of useful instructions, i.e., those actually executed, differs from the number of instructions in the static code, because the execution flow determines the number of useful instructions according to the input data. Therefore, computing the total energy consumed merely by adding the energy consumption of the individual instructions does not yield the actual energy consumption of the program, contrary to the claim in [24].
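The gap between static and executed instructions can be made concrete with a small sketch. It illustrates only the first part of the argument, namely that a sum over the static code differs from a sum weighted by input-dependent execution counts. The per-instruction energy values, block names, and execution counts are invented for illustration and are not measurements from this work or from [24]; the function names are hypothetical.

```python
from typing import Dict, List

# Assumed per-instruction base energy costs in nanojoules (illustrative only).
ENERGY_NJ: Dict[str, float] = {"load": 4.0, "mul": 3.0, "add": 1.0, "branch": 2.0}

# Static code: basic blocks and the instructions each one contains.
BLOCKS: Dict[str, List[str]] = {
    "init": ["load", "load"],
    "loop": ["load", "mul", "add", "branch"],
    "exit": ["add"],
}


def block_energy(block: str) -> float:
    """Energy of one pass through a basic block."""
    return sum(ENERGY_NJ[instr] for instr in BLOCKS[block])


def static_energy() -> float:
    """Sum over the static code: each instruction counted exactly once."""
    return sum(block_energy(b) for b in BLOCKS)


def dynamic_energy(exec_counts: Dict[str, int]) -> float:
    """Sum weighted by how often each block actually executes."""
    return sum(block_energy(b) * n for b, n in exec_counts.items())


if __name__ == "__main__":
    print(static_energy())                                       # static code only
    print(dynamic_energy({"init": 1, "loop": 3, "exit": 1}))     # short input
    print(dynamic_energy({"init": 1, "loop": 100, "exit": 1}))   # long input
```

The same static code yields very different totals depending on how many loop iterations the input data triggers, so an estimate has to be driven by the execution profile rather than by the code listing alone.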

In this thesis we propose a framework in which software applications optimally utilize the hardware architecture to deliver energy-cycle performance within user-defined constraints. Our energy-aware framework [25] meets this demand by incorporating the following features in a native multimedia DSP compilation environment.

1) The framework transforms the legacy application source code into optimal ’C’ source

code, taking advantage of different slacks appearing in the application-to-binary devel-

opment hierarchy.

2) Unlike conventional techniques, ’C’ source code is iteratively compiled for different

performance goals both in terms of execution time as well as energy dissipation.

3) We developed post-profiling techniques published in [26] to evaluate the application

performance not only at compilation layer (as conventional compiler does) but also at

scheduling layer, linker layer, machine code generation layer and finally at loader layer.

4) We measure the real-time performance of applications running on actual hardware.

These measured parameters are further used to tune the transformation scheme of the

legacy software application.

5) We tested our framework on different applications that belong to diverse industrial domains such as audio transcodecs [27], video transcodecs [8], speech codecs, and bioinformatics applications [28] [29].

6) The work is further extended in [30] [27] to characterize application-architecture correlations, which are well suited for a pre-design assessment of an embedded system design. Such a characterization answers the question of whether a given hardware architecture is an appropriate choice for a given multimedia software application.

Note that the terms power consumption and energy consumption are often interchanged. It is important to distinguish between the two when discussing programs running on mobile systems. Mobile systems run

on limited energy available in a battery. Therefore, the energy consumed by the system

or by the software running on it, determines the length of the battery life.

This thesis is based on the following publications.

• N. Zafar Azeemi, A. Sultan ”Characterization of Bioinformatics Applications on

Multimedia Processor”, in Proc. IEEE Cairo International Biomedical Engineering

Conference (CIBEC ’06), pages BI06-BI09, 195 - 200, Cairo, Egypt, December,

2006.

• N. Zafar Azeemi ”Handling Architecture-Application Dynamic Behavior in Set-

top Box Applications”, in Proc. IEEE International Conference on Information

and Automation (ICIA ’06), pages 195 - 200, Colombo, Sri Lanka, December,

2006.

• N. Zafar Azeemi, A. Sultan, A. Muhammad ”Parameterized Characterization of

Bioinfomatics Workload on SIMD Architecture”, in Proc. IEEE International Con-

ference on Information and Automation (ICIA ’06), pages 189 - 194, Colombo,

Sri Lanka, December, 2006.

• N. Zafar Azeemi ”Multicriteria Energy Efficient Source Code Compilation for De-

pendable Embedded Applications”, in Proc. IEEE International Conference on

Information Technology (IIT ’06), Dubai, UAE, November, 2006.

• N. Zafar Azeemi ”Compiler Directed Battery-Aware Implementation of Mobile Ap-

plications”, in Proc. IEEE 2nd International Conference on Emerging Technologies

(ICET ’06), pages 151 - 156, Peshawar, Pakistan, November, 2006.

• N. Zafar Azeemi ”A Multiobjective Evolutionary Approach for Constrained Joint

Source Code Optimization”, in Proc. ISCA 19th International Conference on Com-

puter Application in Industry (CAINE ’06), pages 175 - 180, Las Vegas, Nevada,

USA, November, 2006.

• N. Zafar Azeemi ”Probabilistic Iterative Compilation for Source Optimization of

Embedded Programs”, in Proc. 2006 IEEE International SoC Design Conference

(ISOCC ’06), pages 323 - 328, Seoul, Korea, October, 2006.

• N. Zafar Azeemi, M. Rupp ”Multicriteria Low Energy Source Level Optimization of

Embedded Programs”, in Proc. Tagungsband zur Informationstagung Mikroelek-

tronik (ME ’06) IEEE Austria, pages 150 - 158, Vienna, Austria, October, 2006.

• N. Zafar Azeemi ”Architecture-Aware Hierarchical Probabilistic Source Optimiza-

tion”, in Proc. ISCA 19th International Conference on Parallel and Distributed

Computing Systems (PDCS ’06), pages 90-95, San Francisco, USA, September,

2006.

• N. Zafar Azeemi ”Power Aware Framework for Dense Matrix Operations in Mul-

timedia Processors”, in Proc. IEEE 9th International Multi-topic Conference (IN-

MIC ’05), Karachi, Pakistan, December, 2005.

• N. Zafar Azeemi, M. Rupp ”Energy-Aware Source-to-Source Transformations for

a VLIW DSP Processor”, in Proc. IEEE 17th International Conference on Micro-

electronics (ICM ’05), pages 133 - 138, Islamabad, Pakistan, December, 2005.

• N. Zafar Azeemi ”A Framework for Architecture Based Energy-Aware Code Trans-

formations in VLIW Processors”, in Proc. International Symposium on Telecom-

munication (IST ’05), pages 393 - 398, Shiraz, Iran, September, 2005.

1.3 Thesis Outline

This thesis is organized in five chapters, as shown in Figure 1.4. A brief description of

each chapter is given below.

Chapter 1: We discuss the different design limitations, such as high level system design

constraints, fabrication level constraints, battery technology constraints etc. We explore

the design slacks that exist in contemporary work [31] [24] [5] for energy aware code

optimization. We explain the thesis structure and provide a detailed list of contributions.

Chapter 2: This chapter lays the necessary foundation for the development of our

energy cycle aware iterative compilation framework. Our methodology optimizes a soft-

ware application for energy consumption, execution time as well as efficient hardware

architecture utilization. As compared to [5] [32] [33] [34], we elaborate our method

for generic multimedia processors. Unlike [35] [36], we define a software application in terms of its architectural behavior. We provide a simplified overview of typical multimedia processors. Although various multimedia operation models are presented in [37] [31] [38] [39] [40], their complexity prevents them from being readily usable in a real-time optimization environment. We use a simplified multimedia operation model developed in [4] that views the instruction set in terms of load/store operations, compute operations, special register operations and control flow operations.

Fig. 1.4: Thesis Structure.

Measuring the energy consumed by an application on a real-time platform is a first step in the design of any energy-constrained embedded system and can be used to estimate

the battery lifetime of the system. The experimental setup proposed in [5] [32] [41]

for instruction/program current measurement, addressing modes, immediate operands,

and exhaustive characterization is very time consuming. We present here a measure-

ment platform that is generic and applicable to most off-the-shelf available multimedia

processors. It is based on current measurement at both processor and memory input

lines. Unlike the instruction based energy model presented in [42] [24], we propose a

simplified energy consumption model based on code blocks. We present a step-by-step procedure for measuring the energy consumption of a software application on a target hardware architecture. In contrast to [24] [32] [41], we apply our framework to two major application domains, multimedia and bioinformatics. The multimedia application

set consists of encoders and decoders (transcodecs) encompassing three media types -

speech, video, and audio (music), whereas, we categorize the basic functionality offered

by all bioinformatic tools into four groups. They are pattern recognition algorithms, rule

based analysis, biological data bases and biological taxonomy. The results published in [28] [29] reveal the usefulness of our framework across diverse application domains.

Several energy reduction opportunities at design level are also presented.

Chapter 3: Our energy cycle aware compilation framework is powered by a source

code transformation engine. Unlike [43] [42] [24], we implement our scheme by first

investigating the ’C’ source code of the application for cycle- and energy-taxing blocks, based on trace data collected while profiling the application as described in Chapter 2.

Here, we present a novel heuristic that searches the solution space for an optimal source

code transformation scheme. We demonstrate that the algorithm executes a solution

and evaluates the energy-time tradeoff based on a user-defined metric. Based on the

evaluation, it selects the next solution to be evaluated. The heuristic terminates when

desired objectives are achieved. Our gradient mode iterative compilation scheme has

two salient features. First, it requires queuing code blocks such that blocks with similar expression profiles, which are most likely to benefit from the same transformation scheme, are grouped together.

Second, it completes in a discrete number of steps based on the number of code blocks,

whereas schemes mentioned by Sinha et al. in [33] and Tiwari et al. in [5] offer searches

that grow exponentially as the number of code blocks increases. We also expose our

scheme by analyzing a video encoding application (MPEG-1 encoder). Further merits

and demerits of the scheme are also explained in different application scenarios.

Chapter 4: The gradient mode iterative compilation proposed in the previous chapter belongs to a class of compilation termed feedback directed compilation. It brings relatively small improvement, as it effectively restricts itself to trying different back-end optimizations. The major impediment to such an approach is the heuristic search technique

itself. Unlike [32] [41], in this chapter we consider the optimization problem as a single

task, where all desired aims have to be taken into account simultaneously. We present

a new method, which is based on the optimization of a multicriteria objective function, where the desired aims of architecture-based energy-cycle optimization are formulated as penalty terms of this objective function. Further, we describe how the maximization of

the objective function can be achieved by using a Genetic Algorithm (GA). The interface

of the proposed methodology to our energy cycle aware compilation framework is also

explained. We also present the details of our methodology, e.g., selection of constraints, development of the fitness function, and formation of the Hertz matrix. We discuss two multimedia

applications in depth to elaborate the advantage of the algorithm.

Chapter 5: In this chapter we introduce the concept of application-architecture char-

acterization with the help of our ECACF and multivariate statistics techniques. To our

knowledge this is a first attempt to obtain such characterization from the application

expression profiles.

The application-architecture correlation is a bidirectional process matching algorithmic

structure with hardware architecture and vice versa. The programmer will benefit from this efficient mapping and produce better source codes. Applications of similar functionality may yield similar Application Expression Profiles (AEP), and hence can be suitable for similar hardware platforms. We explore the fact that despite the simplicity of our

methodology, the analysis of large matrices provided by an application expression profile

under different levels of transformation at different architectures is not trivial and re-

quires an advanced knowledge of discovery processes. To this end, we introduce a new

methodology to evaluate the application portability using multivariate statistics. We

demonstrate how box plot, scree plot, and PCA biplots can be used to characterize an

application at a given hardware architecture. We expose the minutia of methodology by

exploring the AEP across three different hardware platforms at diversified applications.

Finally, we demonstrate how dAEP can be used to find out the legacy code portability

across platforms.

2. ENERGY-CYCLE AWARE COMPILATION

FRAMEWORK (ECACF)

Miniaturization of computing systems is finding applications in special areas such as

hand-held computation, tiny robots, guidance systems in automated vehicles, to name

just a few. Also, these systems or their users move from place to place. Because of

their small size and their mobility requirement, they are powered by batteries of low

rating. In order to avoid frequent recharging and/or replacement of the batteries, there

is significant interest in low-energy system design. Energy consumption is an area of

growing concern in system design. It leads to a variety of system related issues, such as battery life, thermal limits, packaging constraints, and cooling options [44]. Though energy is actually consumed by the hardware, energy consumption can be reduced, apart from using low-energy electronics, by suitably manipulating the software. This is because the hardware activities are controlled through the software. Let a program X run for T seconds to achieve its goal, VCC be the supply voltage of the system, and I be the average current in Amperes drawn from the power source for T seconds. We can rewrite T as T = N × τ, where N is the number of clock cycles and τ is the clock period. Then, the amount of energy consumed by X to achieve its goal is given by E = VCC × I × N × τ joules. Since for a given hardware both VCC and τ are fixed, E ∝ I × N. However, at the application level, it is more meaningful to talk about T than N, and therefore we express energy as E ∝ I × T. This expression is the foundation of our ECACF. It captures the main idea in the design of energy-efficient software, which is to reduce both T and I. The running time (average case) of an algorithm gives a measure of T. However, to compute I, one must consider the current drawn during each clock cycle. This is illustrated in Section 2.5.
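
As a small worked example of the energy expression, the sketch below evaluates E = VCC × I × N × τ for one operating point. The supply voltage, average current, cycle count and clock period are hypothetical values chosen for illustration, not measurements from our platform.

```python
def program_energy_joules(vcc_volts, avg_current_amps, cycles, clock_period_s):
    """Return E = VCC * I * N * tau in joules."""
    run_time_s = cycles * clock_period_s  # T = N * tau
    return vcc_volts * avg_current_amps * run_time_s


# Hypothetical operating point: 3.3 V supply, 200 mA average current,
# one million cycles at a 10 ns clock period (100 MHz), i.e. T = 10 ms.
example_energy = program_energy_joules(3.3, 0.2, 1_000_000, 10e-9)
```

Under these assumptions the program consumes about 6.6 mJ. Halving either the cycle count or the average current halves the energy, which is the E ∝ I × T relation used throughout the framework.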

Given the fact that power is the rate of energy consumption, in this thesis, we refer to

power and energy interchangeably. Low power design is a complex endeavor requiring

a broad range of strategies from floor planning on silicon substrate to the design of

application software. Table 2.1 lists several strategies for achieving energy efficiency in an energy-conscious system design. In the following section, we review some of these strategies.

Power Reduction Strategies                   MES Design Levels
Fabrication Level Power Reduction            Low level
Processor Level Power Reduction              Intermediate level
EDA Tools Level Power Reduction              High level
Compiler Level Power Reduction               High level
Low Power Data Structures                    High level
Idle Mode Power Reduction                    Intermediate level
Power Reduction in Distributed Computing     High level
Power Reduction in Communication Systems     High level
Battery Aware Power Reduction                High level

Tab. 2.1: Energy reduction techniques for embedded system design.

2.1 Energy Saving Techniques - A Review

We review a wide spectrum of strategies, shown in Table 2.1, ranging from the hardware fabrication process to energy efficient communication systems. Energy savings due to different approaches are, in the best case, multiplicative. E.g., in an IDCT application implemented in [44] [45] [46] [47], a 30% energy saving from low-energy electronics together with a 23% saving from compiler techniques will yield a total energy saving of (1 - (1 - 0.30)(1 - 0.23)) × 100% = 46.1%.

However, generally the total energy saving is less, say, in this example 34%, because the various energy saving strategies may adversely affect each other.
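
The best-case figure can be reproduced with a few lines of Python. This sketch assumes that the individual savings are independent, so that the remaining energy fractions multiply; the function name is ours and purely illustrative.

```python
def combined_saving(savings):
    """Best-case total saving when independent savings compose multiplicatively."""
    remaining = 1.0
    for saving in savings:
        remaining *= 1.0 - saving  # fraction of energy still consumed
    return 1.0 - remaining


# 30% from low-energy electronics and 23% from compiler techniques.
best_case = combined_saving([0.30, 0.23])
```

Here `best_case` evaluates to approximately 0.461, i.e. 46.1%. Any interaction between the strategies lowers the real figure below this bound.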

2.1.1 Fabrication level power reduction

The power consumption in a CMOS digital circuit is expressed as [48]

$$P = (C_L V_{DD}^{2} f_p) + (I_{SC} V_{DD}) + (I_{leakage} V_{DD}) \qquad (2.1)$$

where VDD is the supply voltage, fp is the output switching frequency, CL is the output

capacitance load, ISC is the short circuit current pulse, generated when both n- and

p-transistors are briefly turned on during the output switching, and Ileakage is the leakage

current. The first term on the right-hand side of the power equation is the dominant factor [48]. It is expected that power savings of two orders of magnitude can be

achieved using low-power electronics. About half of the power reduction will come from

architecture changes and management of switching activity (fp). The other half of

power reduction will come from using advanced materials technology to allow reduction

of VDD to 1 V or below from 5 or 3.5 V while also reducing CL [48] [49].
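
The quadratic dependence of the dominant term of Eq. (2.1) on the supply voltage can be checked numerically. The sketch below is only an illustration: the load capacitance and switching frequency are hypothetical, and the short-circuit and leakage terms are set to zero by default.

```python
def cmos_power_watts(c_load_f, vdd_v, f_switch_hz, i_sc_a=0.0, i_leak_a=0.0):
    """Evaluate Eq. (2.1): switching, short-circuit and leakage terms."""
    switching = c_load_f * vdd_v ** 2 * f_switch_hz
    short_circuit = i_sc_a * vdd_v
    leakage = i_leak_a * vdd_v
    return switching + short_circuit + leakage


# Hypothetical load: 10 pF switched at 100 MHz, other terms neglected.
p_at_5v = cmos_power_watts(10e-12, 5.0, 100e6)
p_at_1v = cmos_power_watts(10e-12, 1.0, 100e6)
ratio = p_at_5v / p_at_1v  # quadratic dependence on the supply voltage
```

With these assumed values, lowering the supply voltage from 5 V to 1 V reduces the switching power by a factor of 25, before any reduction in load capacitance or switching activity is counted.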

2.1.2 Processor level power reduction

Mobile embedded systems require small form factors, and hence processors designed for high-end desktops are not suitable for such applications. Havinga et al. in [50] show that

microprocessors can account for up to 33% of a typical notebook power budget, which

is around 15W. Therefore, processor designers include a number of features to reduce

power consumption. E.g., in TriMedia processor TM130x [4] and Blackfin processor

ADSP533S some of the power reduction features are dynamic idle-time shutdown of

separate execution units, low-power cache design, and power considerations for standard

cells, data-path elements, and clocking. The processor also supports three static power

management modes doze, nap, and sleep [51]. These modes reduce power at a global

level when the processor is idle for an extended period of time. Since CMOS circuits

consume power during the charging and discharging of capacitances, reducing switching

activity saves power. At the architecture-level, two strategies to reduce switching activi-

ties are Gray code addressing and cold scheduling of instructions [52] [53]. Experimental

results show that cold scheduling reduces switching by 20 ∼ 30%. The advantage of Gray code over binary code is that each sequential memory access changes the address by only one bit. Thus, a significant number of bit switches can be eliminated using Gray code

addressing. Also, by decomposing a finite-state machine into several submachines, [54]

suggest that it is possible to selectively turn off portions of a circuit, thereby reducing

the switching activities. Tiwari et al. [31] have studied the idea of shutting off parts of

a logic circuit that are not needed in a particular computation on a per-clock-cycle basis.

This saves the power used in all the useless transitions in those parts of the circuit. Burd

et al. in [55] and Govilak et al. in [56] have suggested that power consumption in a

CPU can be reduced by dynamically changing its operating frequency and voltage. The role of prediction and of smoothing in dynamic speed-setting policies is further discussed in [57]. Havinga and Smit [50] propose energy saving by exploiting

locality of reference with dedicated, optimized modules. The idea of locality of reference

is to offload as much work as possible from the CPU to programmable modules that are

placed in the data streams.
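
The benefit of Gray code addressing mentioned above can be illustrated by counting address-bus bit toggles for a run of sequential accesses. The following sketch uses the standard binary-reflected Gray code; the 16-address run is an arbitrary example of ours, not a workload from the cited studies.

```python
def to_gray(value):
    """Binary-reflected Gray code of a non-negative integer."""
    return value ^ (value >> 1)


def bit_toggles(sequence):
    """Total number of bit flips between consecutive words of a sequence."""
    return sum(bin(a ^ b).count("1") for a, b in zip(sequence, sequence[1:]))


addresses = list(range(16))  # sequential accesses 0, 1, ..., 15
binary_toggles = bit_toggles(addresses)
gray_toggles = bit_toggles([to_gray(a) for a in addresses])
```

For this run, binary addressing toggles 26 address bits in total, while Gray code addressing toggles exactly one bit per access, 15 in total.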

2.1.3 EDA tools level power reduction

The design of low-power systems cannot be achieved without good power-conscious

EDA tools. EDA tools are used at all levels of hardware design: behavioral, architectural,

logic and physical. For a detailed exposition of power-conscious EDA tools, the reader

is referred to tutorials by [58] [59] [14].


2.1.4 Compiler level power reduction

Compiler design techniques contribute to energy saving in several ways [60] [61]. Kolson

and Nicolau [62] [40] [63] address the problem of allocating memory to variables in em-

bedded DSP (digital signal processing) software. The goal is to maximize simultaneous

data transfers from different memory banks to registers [64] [65] [66]. In several DSP

applications mentioned in [67] [68], two registers are loaded with the required data and

an arithmetic operation is performed. Loading two registers with a single double transfer

instruction draws a little more current than a move instruction. Both the instructions

take one clock cycle each. However, energy is saved by using the double transfer, be-

cause the double transfer instruction loads the two registers in one clock cycle, whereas

we need two clock cycles to sequentially load the registers. Experimental results for a

few applications on a Blackfin DSP processor in [30] show that up to 47% of energy

can be saved by this approach. Instructions with memory operands have much higher

energy costs than instructions with register operands [30]. This suggests that energy

can be saved by suitably assigning the live variables of a program to registers. But, a

processor has only a small number of registers. When the number of simultaneous live

variables is larger than the number of available registers, some of the variables must be

spilled to memory. Register assignment for loop variables is important because loops

are typically executed many times. Algorithms for optimal register assignment to loop

variables are presented in [69] [70] [71] [62]. These algorithms can be included in the code generation part of a compiler.
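
The double-transfer argument above follows directly from E ∝ I × N. The sketch below compares two sequential single-register moves with one double transfer; the current values are hypothetical and serve only to show the form of the calculation, they are not the measurements reported in [30].

```python
def relative_energy(current_ma, cycles):
    """Energy up to a constant factor (fixed VCC and clock period): I * N."""
    return current_ma * cycles


def double_transfer_saving(move_ma, double_ma):
    """Fractional saving of one double transfer over two single moves."""
    two_moves = relative_energy(move_ma, 2)      # two cycles, one register each
    one_double = relative_energy(double_ma, 1)   # one cycle, both registers
    return 1.0 - one_double / two_moves


# Hypothetical currents: a double transfer draws 10% more than a single move.
saving = double_transfer_saving(move_ma=100.0, double_ma=110.0)
```

With these assumed currents the saving is 45%: the slightly higher current of the double transfer is outweighed by the halved cycle count.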

2.1.5 Low power data structures

Kondo et al. [72] propose a method of implementing set data types with minimum power

consumption. In a programming language, one can implement the set data type using a

variety of concrete data structures such as arrays, pointer arrays, linked list and binary

tree [73]. Thus, to implement the set operations, such as locate, insert, and remove

a record from a set, one has to manipulate the memory elements in a concrete data

structure as proposed in [74] [75] [33] [42]. It is the memory accesses in the process

of set operations that actually consume power. Thus, the power consumption in set

operations is a function of the number of memory elements used in implementing a set

data type, the number of read and write operations performed in the implementation,

and some logic details such as capacitance of memory elements, voltage level, and

frequency of operation. The concrete data structures are compared on the basis of a

filling factor, which is the fraction of the locations that would be filled if implementation

is in arrays [76] [77] [78]. It has been shown that for different levels of filling factor,

different concrete data structures lead to low values of the power cost function. E.g.,

for filling factors greater than 60%, arrays are better in implementing energy efficient

set data types [72].

2.1.6 Idle mode power reduction

The doze mode is an innovative approach to conserving energy [79] [80] [81] [60]. It is

very attractive in a communication environment where a mobile system may occasionally

send or receive messages. In the doze mode, the clock speed is reduced and no user

process is executed. Rather, a mobile host simply waits for any incoming message. Upon

receiving a message, the host resumes its normal mode of operation. The energy saving

due to this mode depends on the local computations on a mobile and the pattern of

communication between a mobile and a support station [82]. Simulation studies in [41]

show that energy saving due to this mode spreads over a wide range of 2 ∼ 98%.

2.1.7 Power reduction in distributed computing systems

Agent based computation is a relatively new idea in distributed computing [83] [81]

[84]. General agent-based distributed computing systems have been designed using the

concept of Linda’s tuple space [85]. Wei et al. [86] discuss how energy-efficient

distributed algorithms in a mobile computing environment can be designed using a tuple

space managed on the fixed network of a mobile system. Lin et al. [22] propose a power

efficient commit protocol which supports conventional two-phase commit services. A

distributed autonomous system called Noah (Network oriented application harmony), built at the Mitsubishi laboratory, has been proposed in [87]. Though the purpose of Noah is not to save energy, it demonstrates how agent based systems can be built using a tuple space as the medium for process communication. By shifting most of the workload to peer fixed hosts, the load, the power consumption and the number of messages exchanged via expensive wireless links in a mobile host are greatly reduced.

2.1.8 Power reduction in communication systems

The receiver subsystem of a mobile station need not be active all the time [88]. Most

digital cellular and cordless systems provide power cycling at the mobile units. Mobile

stations can periodically relax (power cycle) their receivers as a means of conserving

energy. Since the receiver of a mobile unit is not continuously ready to receive messages

from the local support station (base station), some kind of coordination between a base

station and a mobile unit is necessary. Salkintzis et al. [89] propose a page-and-answer

protocol. Intuitively, the protocol works as follows:

When a base station has a message for a mobile unit, the base station sends a small

paging packet to the mobile unit. If the mobile unit receives the paging packet, that

is if the mobile receiver is up, the mobile sends an answer packet to the base station.

Obviously, if the paging message is sent at a time when the receiver is powered off, no

answer packet is generated by the mobile and the base station will once again page the


mobile after some time. Upon receiving an answer packet, the base station sends the

desired message to the mobile unit.
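
The page-and-answer exchange described above can be sketched as a simple retry loop. This is our own illustration of the intuition, not the protocol specification of [89]: the function names, the `max_pages` limit and the callable that models the receiver power state are all assumptions.

```python
def deliver_message(receiver_is_up, message, max_pages=10):
    """Page the mobile until it answers, then send the message.

    receiver_is_up: function(page_number) -> bool, a hypothetical stand-in
    for the power-cycling state of the mobile receiver.
    Returns (pages_sent, delivered_message_or_None).
    """
    for page_number in range(1, max_pages + 1):
        # The base station sends a small paging packet.
        if receiver_is_up(page_number):
            # The mobile answers; the base station sends the desired message.
            return page_number, message
        # No answer: the base station pages again after some time.
    return max_pages, None
```

For example, with a receiver that is powered up only for every third page, `deliver_message(lambda n: n % 3 == 0, "hello")` returns `(3, "hello")`: two pages are lost while the receiver is off, and the third one is answered.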

Kravets and Krishnan [90] propose power saving by selectively choosing short periods

of time to suspend communications and shut down the communication device. Applying

this method to a transport protocol and using three simulated communication patterns,

they have achieved up to an 83% saving in the energy consumed by the communication

system. Chlamtac et al. [91] address the problem of wireless access protocols which

include an energy constraint and develop three energy conserving protocols for various

loads: grouped-tag TDMA, directory, and pseudorandom. Singh et al. [92] argue that

there is a need for using power-aware metrics, such as minimize energy consumed per

packet, minimize variance in node power levels, maximize time to network partition, etc.,

in the design of power efficient routing protocols. They show that using these metrics in a shortest-cost routing algorithm reduces the cost/packet of routing packets by 5 ∼ 30% over shortest-hop routing.

2.1.9 Battery aware power reduction

Chiasserini and Rao [18] have shown how battery behavior can be exploited to prolong

battery life. In particular, they identify the phenomenon of charge recovery that takes

place under pulsed discharge conditions as a mechanism that can be exploited to enhance

the capacity of an energy cell. The bursty nature of many data traffic sources suggests

that there might be a natural fit between the two. Bai and Lai [93] implement methods that let a low-power CPU efficiently perform computation-intensive tasks, such as graphic image processing and display. Their methods include reducing

the computation complexity of bitmap file processing, using fixed-point math instead

of floating point math, prestoring the table of trigonometric functions, and using a few

lines of assembly language code in the inner loop of graphic image processing program

to improve its performance. These methods lead to a speed up of the programs by a

factor of three to six.

In [44], we argue that mobile application development requires us to rethink the concept of an algorithm from the viewpoint of battery life. Instead of asking for the best result, a user may say:

’Give me the best result you can find, using no more than X units of resource R.’

Or, one can let the system make the tradeoff between fidelity and resource consumption

by saying:

’Give me the best result you can obtain economically.’

2.2 Multimedia DSPCPU Architecture

A multimedia processor is a media processor for high-performance multimedia applications that deal with high-quality video and audio. Typically, an extended general-purpose CPU (called the DSPCPU) makes it capable of implementing a variety of

multimedia algorithms from popular multimedia standards such as MPEG-1 and MPEG-

2. The key features behind this powerful processor are as follows:

• A general-purpose VLIW processor core coordinates all the on-chip activities.

In addition to implementing the non-trivial parts of multimedia algorithms, this

processor runs a small real-time operating system that is driven by interrupts from

the other units.

• DMA-driven multimedia input/output units that operate independently and that

properly format data to make software media processing efficient.

• DMA-driven multimedia coprocessors that operate independently and in parallel

with the DSPCPU to perform operations specific to important multimedia algo-

rithms.

• A high-performance bus and memory system that provides communication between

the processing units.

• A flexible external bus interface.

A typical multimedia processor is based on a three-level hierarchy of operators:

• Instructions

• Operations

• RISC operations

One instruction may contain five operations as depicted in Figure 2.1. Each operation

may execute multiple arithmetic operations. E.g., for TriMedia DSP processor TM130x,

one such operation is the command IFIR(a, b). This command contains a total of three

arithmetic operations: Two multiplications and one addition (aHI × bHI + aLO × bLO).

Up to five operations including two IFIR commands can be issued in each machine

cycle. The ability of TriMedia’s VLIW architecture to execute multiple operations in

parallel gives it a big advantage over traditional RISC and CISC architectures found in

current mass-market microprocessors.
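
The arithmetic performed by one IFIR(a, b) operation can be modeled in a few lines. The sketch assumes that each argument is a 32-bit register value holding two signed 16-bit halves; this operand layout is our reading of the expression aHI × bHI + aLO × bLO and should be checked against [4].

```python
def _signed16(value):
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def ifir(a, b):
    """Model of IFIR(a, b): aHI * bHI + aLO * bLO on 32-bit register values."""
    a_hi, a_lo = _signed16(a >> 16), _signed16(a)
    b_hi, b_lo = _signed16(b >> 16), _signed16(b)
    return a_hi * b_hi + a_lo * b_lo  # two multiplications and one addition
```

For instance, `ifir((2 << 16) | 3, (4 << 16) | 5)` yields 2 × 4 + 3 × 5 = 23 in what the processor issues as a single operation.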


Fig. 2.1: TriMedia VLIW instruction [4].

2.2.1 Multimedia Processor Execution Model

The multimedia processor provides a large set of general purpose registers, generally named r0, r1, and so on. In addition to the hardware program counter PC,

there are a few user-accessible special purpose registers to hold CPU branch addresses.

The CPU issues one long instruction every clock cycle. Each instruction consists of

several operations (five operations for the TM1300 microprocessor) [4]. Each operation

is comparable to a RISC machine instruction, except that the execution of an operation

is conditional upon the content of a general purpose register. Examples of operations

are:

IF r10 iadd r11 r12 → r13 (if r10 true, add r11 and r12 and write sum in r13)

IF r10 ld32d(4) r15 → r16 (if r10 true, load 32 bits from mem[r15+4] into r16)

IF r20 jmpf r21 r22 (if r20 true and r21 false, jump to address in r22)

Each operation has a specific, known execution latency (in clock cycles). For example,

in case of TM1300, iadd takes 1 cycle. This means that the result of an iadd operation

started in clock cycle ’i’ is available for use as an argument to operations issued in cycle

’i+1’ or later. The other operations issued in cycle ’i’ cannot use the result of iadd.

Similarly the ld32d operation has a latency of 3 cycles. The result of an ld32d operation

started in cycle ’j’ is available for use by other operations issued in cycle ’j+3’ or later.

Branches, such as the jmpf example above have three delay slots. This means that if a

branch operation in cycle ’k’ is taken, all operations in the instructions in cycle k+1, k+2

and k+3 are still executed. In the above examples, r10 and r20 control the conditional

execution of the operations. This is also referred to as guarding, where r10 and r20

contain the guard of the operation.
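
The latency and delay-slot rules above amount to simple cycle arithmetic, sketched below. The latency table contains only the two operations quoted in the text for the TM1300; the function names are ours.

```python
# Latencies in clock cycles as stated in the text for the TM1300.
LATENCY = {"iadd": 1, "ld32d": 3}
BRANCH_DELAY_SLOTS = 3


def earliest_use_cycle(operation, issue_cycle):
    """First cycle in which another operation may consume the result."""
    return issue_cycle + LATENCY[operation]


def delay_slot_cycles(branch_cycle):
    """Cycles whose instructions still execute after a taken branch."""
    return [branch_cycle + n for n in range(1, BRANCH_DELAY_SLOTS + 1)]
```

An iadd issued in cycle 10 can feed an operation issued in cycle 11, an ld32d issued in cycle 10 can feed one issued in cycle 13, and a branch taken in cycle 20 still lets the instructions of cycles 21, 22 and 23 execute.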

The implementation of the architecture restricts the choice of operations that can be performed in parallel or can be packed into an instruction. For example, the DSPCPU in

TM1300 allows no more than two load/store class operations to be packed into a single

instruction, shown in Figure 2.2. Also, no more than five results (of previously started

operations) can be written during any one cycle. The packing of operations is not nor-

mally performed by the programmer. Instead, the instruction scheduler takes care of

converting the parallel intermediate format code into packed instructions ready for the

assembler. The rules are formally described in the VLIW Description File (VDF) used

by the instruction scheduler and other tools.
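
To illustrate how such packing restrictions shape the instruction stream, the sketch below packs a list of operations in program order under two of the limits stated above: at most five operations per instruction and at most two load/store operations per instruction. It is a deliberately simplified model of ours, not the TriMedia instruction scheduler: data dependencies, latencies, functional unit slots and the write-back limit are all ignored.

```python
MAX_OPS = 5          # operations per VLIW instruction (TM1300)
MAX_LOAD_STORE = 2   # load/store class operations per instruction


def pack(operations):
    """Greedy in-order packing of (name, is_load_store) pairs into instructions."""
    instructions, current = [], []
    for name, is_load_store in operations:
        mem_count = sum(1 for _, is_mem in current if is_mem)
        full = len(current) == MAX_OPS
        mem_full = is_load_store and mem_count == MAX_LOAD_STORE
        if full or mem_full:
            instructions.append(current)  # close the current instruction
            current = []
        current.append((name, is_load_store))
    if current:
        instructions.append(current)
    return instructions
```

Packing three loads followed by four adds gives two instructions under this model: the third load cannot join the first two, so it opens a second instruction that the four adds then fill.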

Fig. 2.2: TriMedia functional unit assignment [4].

2.2.2 Multimedia Processor Operations Overview

In this section we present a brief overview of the multimedia processor instruction set.

Readers are encouraged to refer to [4] for details.

Conditional Execution: In multimedia processor architectures, all operations are op-

tionally ’guarded’. A guarded operation executes conditionally, depending on the value

in the ’guard’ register. For example, a guarded add is written as:

IF R23 iadd R14 R10 → R13.

This should be taken to mean if R23 then R13 ← R14 + R10. The ’if R23’ clause

controls the execution of the operation based on the LSB of R23. Hence, depending

on the LSB of R23, R13 is either unchanged or set to contain the integer sum of R14

and R10. Guarding applies to all TM1300 operations, except the iimm and uimm (load-

immediate) operations. Guarding controls the effect on all programmer visible state of

the system, i.e. register values, memory content, exception raising and device state.
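The semantics of the guarded add can be sketched in a few lines of Python. This is only a behavioral model under stated assumptions: the register file is represented as a dictionary, the function name is hypothetical, and the 32-bit wrap-around of the sum is assumed rather than taken from the text.

```python
def guarded_iadd(regs, guard, src1, src2, dst):
    """Behavioral model of `IF guard iadd src1 src2 -> dst`."""
    if regs[guard] & 1:  # only the LSB of the guard register is tested
        # 32-bit wrap-around is an assumption of this sketch
        regs[dst] = (regs[src1] + regs[src2]) & 0xFFFFFFFF
    return regs


regs = {"R10": 5, "R14": 7, "R13": 99, "R23": 2}  # LSB of R23 is 0
guarded_iadd(regs, "R23", "R14", "R10", "R13")
print(regs["R13"])  # 99, the destination is unchanged

regs["R23"] = 3     # LSB of R23 is 1
guarded_iadd(regs, "R23", "R14", "R10", "R13")
print(regs["R13"])  # 12
```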

Load and Store Operations: Memory is byte addressable. Loads and stores have to

be naturally aligned, i.e. a 16-bit load or store must target an address that is a multiple

of two. A 32-bit load or store must target an address that is a multiple of four. For


TM1300, the BSX bit in the PCSW (program control status word) register determines

the byte order of loads and stores (see, e.g., ld32 and st32 in Appendix A of [4]). Only

32-bit load and store operations are allowed to access MMIO registers in the MMIO

address aperture. The results are undefined for other loads and stores. A load from

a non-existent MMIO register returns an undefined result. A store to a non-existent

MMIO register times out and then does not happen. There are no other side effects of

an access to a nonexistent MMIO register. The state of the BSX bit has no effect on

the result of MMIO accesses. Loads are allowed to be issued speculatively. Loads that

are outside the range of valid data memory addresses for the active process return an

implementation dependent value and do not generate an exception. Misaligned loads

also return an implementation dependent value and do not generate an exception.
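The natural alignment rule stated above amounts to a simple divisibility test. The following sketch uses a hypothetical helper name and models only the address rule, not the implementation-dependent value that a misaligned load returns.

```python
def is_naturally_aligned(address, size_bytes):
    """True if `address` is a multiple of the access size in bytes."""
    return address % size_bytes == 0


print(is_naturally_aligned(0x1002, 2))  # True: a 16-bit access at a multiple of two
print(is_naturally_aligned(0x1002, 4))  # False: a 32-bit access needs a multiple of four
print(is_naturally_aligned(0x1004, 4))  # True
```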

Compute Operations: Compute operations are register-to-register operations. The

specified operation is performed on one or two source registers and the result is written

to the destination register.

Immediate Operations load an immediate constant (specified in the opcode) and produce

a result in the destination register.

Floating-Point Compute Operations are register-to-register operations. The specified

operation is performed on one or two source registers and the result is written to the

destination register. Unless otherwise mentioned, all floating-point operations observe

the rounding mode bits defined in the PCSW register. All floating-point operations whose

names do not end in ’flags’ update the PCSW exception flags. All operations whose names end in ’flags’

compute the exception flags as if the operation were executed and return the flag values

(in the same format as in the PCSW); the exception flags in the PCSW itself remain

unchanged.

Multimedia Operations are special compute operations. They are like normal compute

operations, but the specified operations are not usually found in general purpose CPUs.

These operations provide special support for multi-media applications.

Special-Register Operations: Special register operations operate on special registers,

such as program control status word, branch address holding registers etc.

Control-Flow Operations: Control-flow operations change the value of the program

counter. Conditional jumps test the value in a register, and based on this value, change

the program counter to the address contained in a second register or continue execution

with the next instruction. Unconditional jumps always change the program counter

to the specified immediate address. Control-flow operations can be interruptible or

non-interruptible. The execution of an interruptible jump is the only occasion where a

multimedia processor allows special event handling to take place.


2.3 Workload Description

Our workload consists of two major application domains, multimedia and bioinformatics.

Both use compute and data intensive algorithms. In this section we present in detail the

diversity found in these application domains, that we selected for the rigorous testing of

our ECACF. The variability in the input data streams is also discussed.

2.3.1 Multimedia Applications

The multimedia application set consists of encoders and decoders (transcodecs) encom-

passing three media types - speech, video, and audio (music) - and is summarized in

Table 2.2 to Table 2.5. We obtained codes for these applications from various public

domain sources [94] [95] [96] [21]. The applications were chosen for their importance

in real systems and because we believe them to be representative enough to support the inferences in

this study. We evaluated all our applications with four inputs, summarized in Table 2.6.

Here, we only report results from a single input for each application. We chose the input

that gave the highest (normalized) standard deviation in per frame execution time on

our base system. We call these inputs the default inputs, and list them in the second

column of Table 2.6. Results with the other inputs are similar, both quantitatively and

qualitatively. The G.728, H.263, and MPEG codecs statically distinguish multiple frame

types. G.728 uses an adaptive algorithm, where certain parameters are updated every

four frames. The processing of each frame in a single four-frame cycle is different due

to the calculation of these parameters. Thus, we treat these as different types of frames

(numbered one through four). The H.263 and MPEG codecs use almost the same video

compression scheme. A key difference is that MPEG uses three different types of frames

- I frames do not exploit inter-frame redundancy, P frames exploit inter-frame redun-

dancy using a previous frame, and B frames exploit such redundancy using a previous

and a later frame. Our H.263 codecs do not use B frames. They use a single I frame at

the beginning of the video and P frames for the rest. We do not include the I frame in

our analysis. It takes excessively long to simulate a frame with the MPEG codecs using

the frame sizes specified by the MPEG-2 standard (about 4 to 16 hours per frame for

MPEGenc). We scaled down the frame size to 176x144 pixels so that we could simulate

a reasonable number of frames to assess execution time variability. We ensured that

the scaling did not affect the cache behavior by performing a working set analysis and

running representative experiments with larger frame sizes and different cache sizes. As

the chosen frame size conforms to the H.263 standard, we used the same size for the

H.263 codecs for consistency. Also for consistency, we used the same set of four inputs

for both MPEG and H.263 codecs. These inputs contain a great deal of motion to

stress the applications. H.263 was designed for low bit-rate applications such as video

conferencing (which typically has less motion); therefore, our results from these inputs

represent an upper bound on the expected variability for H.263.


| Application | Description | Input Vector | Sample Rate/Throughput |
| --- | --- | --- | --- |
| GSMenc | Low bit-rate speech coding based on the European GSM 6.10 provisional standard. Uses RPE/LTP (residual pulse excitation/long term prediction) coding at 13 Kb/s. Compresses frames of 160 16-bit samples into 264 bits. | orignova | 20 ms (160 samples), 8 KHz |
| GSMdec | | homemsg | |
| G728enc | High bit-rate speech coding based on the G.728 standard. Uses low-delay CELP (code excited linear prediction) coding at 16 Kb/s. Compresses frames of five 16-bit samples into 10 bits. | lpcqutfe | 625 µs (5 samples), 8 KHz |
| G728dec | | homemsg | |
| G723dec | | homemsg | |
| G729dec | | homemsg | |

Tab. 2.2: Multimedia Benchmarks (Speech Transcodecs).

| Application | Description | Input Vector | Sample Rate/Throughput |
| --- | --- | --- | --- |
| H263enc | Low bit-rate video coding based on the H.263 standard. Primarily uses inter-frame coding (P frames). Widely used for bit-rates less than 64 Kb/s. | orignova | 40 ms, 25 frames/s |
| H263dec | | buggy | |
| H264Lenc | Low bit-rate video coding based on the H.264 standard. Primarily uses inter-frame coding (P frames). Widely used for bit-rates less than 64 Kb/s. | orignova | 40 ms, 25 frames/s |
| H264Ldec | | buggy | |
| MPEGenc | High bit-rate video coding based on the MPEG-2 video coding standard. Uses intra-frame (I) and inter-frame (P, B) coding. Typical bit rate is 1.5-6 Mb/s. | Buggy | 33 ms, 30 frames/s |
| MPEGdec | | flwr | |
| MPEG-1 encoder | High bit-rate video coding based on the MPEG-1 video coding standard. | Buggy | 33 ms, 30 frames/s |
| MPEG-1 encoder | | flwr | |
| NLIVQ | Non linear interpolative vector quantization, image processing codec | cameraman.tif | 512x512 resolution, Gray scale |

Tab. 2.3: Multimedia Benchmarks (Video Transcodecs).

| Application | Description | Input Vector | Sample Rate/Throughput |
| --- | --- | --- | --- |
| MP3enc | Audio decoding based on the MPEG Audio Layer-3 standard. Synthesizes an audio signal out of coded spectral components. Typical bit rate is 16-256 Kb/s. | filter | 26 ms (1151 samples), 44.1 KHz |
| MP3dec | | filter | |

Tab. 2.4: Multimedia Benchmarks (Audio Transcodecs).

| Application | Description |
| --- | --- |
| FFT | Fast Fourier Transform |
| IDCT | Inverse Discrete Cosine Transform |
| T64 | Matrix Transpose 64x64 |
| M100 | Matrix Multiplication 100x100 |

Tab. 2.5: Generic DSP application Benchmarks [7].

| Domain | Test Vector | Description | Features |
| --- | --- | --- | --- |
| Audio | CatSteven | Soft rock song | 2500 frames, average length 65.25 seconds |
| | Sting | Pop song | |
| | Beethoven | 2500 classical piece | |
| Video | Flwr | Drive-by of houses | 450 frames, each 18 seconds for H.263 and 15 seconds for MPEG |
| | Cact | Panoramic view | |
| | Buggy | Buggy race | |
| | Tens | Table tennis match | |
| Speech | Homemsg | An answering message | Average frame size for GSM codecs is 500, for G.72x is 19000, length: 20 seconds |
| | Orignova | Sentences read by different adults | |
| | lpcqutefe | Sentence read by a boy | |

Tab. 2.6: Test Vectors Characterization.

2.3.2 Bioinformatics Workload

Due to a significant increase in biological threats against humans, plants and other

species during the last two decades, there is a growing realization that bioinformatics and

molecular biology equipment should be available in small form factors that can be

readily available in the field [97]. This has led to the development of battery and execution

time efficient handheld devices for bioinformatics applications. Bioinformatics is an

interdisciplinary research area that helps to produce ’sensible’ and ’useful’ information

from the wealth of data that has been produced by the genome sequencing projects.

We categorize the basic functionality offered by all bioinformatics tools into four groups:

1. Algorithms for pattern recognition: probability formulae are used to determine the

statistical similarity between two or more given sequences.

2. Rule-based analysis defines how a mathematical or statistical technique can be applied.

Different sets are defined with a membership, and a set of rules is created to elaborate

associativity. Basic set theory is used to fire a rule.

3. Biological databases are uniformly and efficiently maintained archives of consistent

data that contain information and annotation of DNA and protein sequences, DNA

and protein structures as well as DNA and protein expression profiles [98] [99]. An

important feature of these databases is their simplicity of access and query management.

In addition, some websites [100] [101] [102] provide visualization tools to aid biological

interpretation.

4. Biological taxonomy records the differences in sequences across different classes,

further helping to reduce similarity errors.

We chose the applications for their importance in real systems and because they are representative enough to

support the inferences in this study. They are summarized in Table 2.7. We obtained

codes for these applications from various public domain sources. For lack of space, we

only report their underlying algorithms; details may be found in [99] [97] [102]. The

input databases are obtained from the NIH genetic sequence database ’GenBank’, the NCBI

assembly archive ’Genome Assembly Archive’, the homologous structure alignment database

’HOMSTRAD’, the NIMH-NCI protein-disease database ’PDD’ and ’The Lens’ [100]

[102].

| Application | Pseudonym | Features | Algorithms |
| --- | --- | --- | --- |
| GENESPLICER | A01 | Detect splice sites in the genomic DNA | High accuracy and computationally efficient |
| TIGRSCAN | A02 | DNA modeling | Generalized Hidden Markov Model (GHMM), HMM |
| TRANSTERMIS | A03 | Rho-independent transcriptional terminators | Statistical estimation techniques |
| GENSCAN | A04 | Predict complete gene structure | Search algorithms |
| MUMMER | A05 | Genome sequence alignment | Tree algorithms |
| GLIMMERHMM | A06 | Find gene sequence in eukaryotes | IMM, splice site models, maximal dependence decomposition techniques |
| GENIE | A07 | Gene finder in vertebrate and human DNA | GHMM, Neural Networks |
| FGENE | A08 | Find splice sites, genes, promoters | Linear discriminant analysis |
| GRAIL | A09 | Analysis of DNA sequence | Automated computation |
| GENEMARK | A10 | Find genes in bacterial DNA sequence | Markov chains |
| NetPlaneGene | A11 | Sequence analysis | Neural network |
| GLIMMER | A12 | Coding regions in microbial DNA | Interpolated Markov Models (IMM) |

Tab. 2.7: Bio-Computation Applications Benchmark.


2.4 Energy Cycle Aware Compilation Framework Methodology

The ECACF is shown in Figure 2.3. The source code is processed successively for

static code analysis, post compiler analysis and finally for scheduling analysis. A VLIW

processor descriptor file (VDF) is used to provide architecture information to the compiler,

the scheduler and finally to the machine code generator. The VDF file contains a list of

pseudo and machine operations, latency of the operations, opcodes, slot assignment

schemes, processor operating frequency, instruction cache features (associativity, block

size, number of sets) and main memory features (size, order, read/write latencies). This

file format is compatible with the formats described in [103] [4] [81] [104]. Here, we follow the

same VLIW naming convention as used in [104]. This feature makes our scheme

architecture independent. A list of parameters is generated in each step of the

methodology flow. Intermediate trace files are generated during the code processing

flow to produce the Application Expression Profile (AEP) parameters, such as code size, execution time, number of cache misses (for both

instruction and data caches), data cache conflicts, data bank alignment, highway usage,

scheduling factor and slot utilization. After the simulation, these parameters are used

to compute transformation control factors such as unrolling factor, grafting depth and

blocking metrics. These control factors are further explained in [25]. After

each iteration, all these parameters are recorded again and compared to the preset user

constraints specified in a User Constraint File (UCF). This file contains the desired values

for code size, execution time, energy and allowed percentage of cache misses. Energy is measured

on the target platform (the setup is explained in Section 2.5). All these parameters are fed

back to the transformation cost analyzer. After each successive transformation, it is decided

whether the energy-cycle performance has been optimized or not. The source code is

optimized by undergoing code restructuring schemes known as loop unrolling, decision

tree grafting and loop tiling. Additional benefits are gained by combining traditional

compiler optimization algorithms, such as constant and variable propagation, dead code

elimination, strength reduction, etc.
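The comparison of the recorded parameters against the UCF can be pictured with the following Python sketch. The dictionary keys, the limit values and the function name are hypothetical; the actual UCF format is not specified in this section, and the sketch assumes that every constraint is an upper bound.

```python
def unmet_constraints(measured, constraints):
    """Names of the parameters whose measured value exceeds its UCF limit."""
    return [name for name, limit in constraints.items() if measured[name] > limit]


# Hypothetical UCF limits and the measurements of one iteration.
ucf = {
    "code_size_bytes": 65536,
    "execution_time_ms": 2500.0,
    "energy_mj": 90.0,
    "cache_miss_percent": 5.0,
}
measured = {
    "code_size_bytes": 61440,
    "execution_time_ms": 2800.0,
    "energy_mj": 94.44,
    "cache_miss_percent": 3.2,
}

print(unmet_constraints(measured, ucf))  # ['execution_time_ms', 'energy_mj']
```

An empty list would indicate that all constraints are met and that no further transformation is required.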


Fig. 2.3: Transformation methodology.


2.4.1 Application Expression Profile

From a ’C’ source code to an executable binary, an embedded application has to go

through many tools: the text editor, compiler, scheduler, linker, and the

loader. The urge ’how can I?’ is transformed into the conscious biased perception,

entailed by embedded systems emerging from software-hardware co-design. The software

leads and the hardware follows the technological limitations. The behavior that a software

implementation can express on a given hardware is limited by the freedom offered by the

hardware architecture and by the ability of programmers to code the ’how can I?’. The above

issues indicate that for a ’good’ energy-cycle performance there is a need to gather

more detailed profiles, containing information about system behavior on various levels

as shown in Figure 2.4. The main goal of such vertical profiling is to further improve the

understanding of system behavior through correlation of profile information at different

levels.

Fig. 2.4: Vertical application profile layers.

Hitherto, an executable application development hierarchy is composed of compilation,

scheduling, linking, and binary code generation. Finally, this code is downloaded to

the SDRAM attached to the multimedia processor. Our Application Profile Monitor

(APM) extracts the application behavioral parameters mentioned above. This information

is extracted from the vertical profile layer block shown in Figure 2.4. An

application is profiled both in terms of its static and run time (dynamic) behavior. We call

the way an application expresses itself on a given hardware architecture its Application

Expression Profile (AEP). We characterize an application expression profile using the

following conventions:

1) Name : It describes the name of the profile monitor.

2) Definition: It defines the profile monitor as used in our ECACF.


3) Location: It shows the location of the monitor in the application development hier-

archy such as compilation, scheduling, linking etc.

4) Type : There are two possible types: static or dynamic.

5) Range: The possible range of values a monitor can have.

6) Level: If a parameter is measured directly from the code, it is called a primary monitor;

if it is computed from one or more other parameters, we call it a secondary monitor.

E.g., a primary monitor can be written as:

Name: Processor Frequency

Definition: The operating frequency of the microprocessor

Location: VDF

Type: static

Range: Typical 100MHz - 233MHz (depends on given hardware architecture)

Level: Primary

Similarly, a secondary monitor can be written as:

Name: Scheduling Factor

Definition: Computed by dividing the infinite machine cycle time by the finite

machine cycle time

Location: Transformation Engine and Scheduler

Type: Dynamic

Range: 0 to 1

Level: Secondary

A complete list of profile monitors is provided in Appendix A.
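The six-field monitor convention maps naturally onto a small record type. The sketch below is one possible representation in Python; the class and field names are illustrative assumptions (kind stands for Type and value_range for Range), and the cycle counts passed to the scheduling factor are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProfileMonitor:
    name: str
    definition: str
    location: str
    kind: str         # "static" or "dynamic"
    value_range: str
    level: str        # "primary" or "secondary"


def scheduling_factor(infinite_machine_cycles, finite_machine_cycles):
    """Secondary monitor: infinite-machine cycles divided by finite-machine cycles."""
    return infinite_machine_cycles / finite_machine_cycles


frequency = ProfileMonitor(
    name="Processor Frequency",
    definition="The operating frequency of the microprocessor",
    location="VDF",
    kind="static",
    value_range="100MHz - 233MHz",
    level="primary",
)

print(frequency.level)                # primary
print(scheduling_factor(750, 1000))   # 0.75 for these hypothetical cycle counts
```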

2.5 Experimental Setup

The energy consumed by an application on a real-time platform is the first thing to

know in any energy-constrained embedded system, and it can be used to estimate the

battery lifetime of the system. In this section, we describe an energy measurement

method for a software application running on a real-time multimedia VLIW processor.

The method is described for the Philips TM1300 DSP processor, but it is applicable to other

multimedia processors, e.g., the Blackfin ADSP533S. The measurement framework has

been incorporated into our ECACF, which allows a software application programmer to

measure the real-time energy consumption by running the candidate ’C’ source code.


2.5.1 Related Work for Energy Measurement

The energy consumption of a software application running on target hardware depends

on the processor, data path and instruction set architecture [31]. The switching power

consumed depends linearly on the operating frequency and quadratically on the supply

voltage. Other architectural parameters which strongly affect the energy consumption

of a processor are cache size, datapath width, number of functional units (multipliers,

shifters), register file size, legacy support hardware, multimedia extension support, etc.

In general, it is practically impossible to predict how much energy a software application

will consume on another processor given the energy consumption profile on one processor,

without some prior calibration and measurement on the other one.

Software energy estimation through exhaustive instruction energy profiling was first pro-

posed in [5]. The basic experimental setup used in [5] is shown in Figure 2.5. The

approach proposed in [5] is based on the current measurement drawn by the processor as

it repeatedly executes a certain instruction or sequences of instructions. This is achieved

by putting the sequence in a loop and measuring the current values. The measured val-

ues correspond to the base current cost of instructions. The program is broken up into

basic blocks and the base cost of each instance of a block is obtained by adding the base

cost of instructions in the block. These costs are provided in a base cost table. Tiwari

et al. in [5] obtained a run-time execution profile (instruction trace) for the program.

Using this information the number of times the basic block is executed is determined

and the overall base cost is computed. The effect of the circuit state (inter-instruction

effects) is incorporated by analyzing pairs of instructions. A cache simulation is also

performed to determine stalls and a cache penalty is also added to the final estimate.

The principal disadvantage of this approach is that it involves an elaborate instruction

trace analysis. Assuming an Instruction Set Architecture (ISA) with K instructions, $K^2$

instruction energy profiles have to be obtained to accurately determine base and

inter-instruction costs. Moreover, most instructions have a lot of variations and an exhaustive

characterization is very time consuming.
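The base-cost part of the estimation described above can be sketched as follows. This is a simplified illustration and not the procedure of [5] itself: the cost values are hypothetical, inter-instruction effects and cache penalties are left out, and all names are assumptions.

```python
# Hypothetical base costs per instruction (arbitrary units); real values are measured.
BASE_COST = {"load": 420.0, "add": 300.0, "mul": 350.0}


def block_base_cost(instructions):
    """Base cost of one execution of a basic block."""
    return sum(BASE_COST[op] for op in instructions)


def program_base_cost(blocks):
    """blocks: iterable of (instruction list, execution count) pairs."""
    return sum(block_base_cost(instrs) * count for instrs, count in blocks)


def pairwise_profiles(k):
    """Number of ordered instruction pairs to characterize for K instructions."""
    return k * k


blocks = [(["load", "add"], 10), (["mul"], 3)]
print(program_base_cost(blocks))  # 8250.0
print(pairwise_profiles(200))     # 40000
```

The quadratic growth of the second function illustrates why an exhaustive pairwise characterization quickly becomes impractical.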

2.5.2 Proposed Methodology

Our energy measurement setup, shown in Figure 2.6, is close to that of Figure 2.5. Given that

energy is the time integral of power, i.e., of the voltage-current product, and keeping the voltage fixed, the energy

depends on the current variation in the CPU and memory.

During program execution, the current variation in the CPU and memory depends on

the following major factors:

1. The switching activity caused by instruction execution in CPU and load/store activity

to/from memory.


Fig. 2.5: Experimental setup for instruction/program current measurement [5].

2. Cache misses, which lead to CPU or cache stalls and hence require extra cycles, as

mentioned in [4].

The instantaneous current drawn by a CPU varies rapidly with time showing sharp spikes

around clock edges. This is because at clock edges the processor circuits switch (i.e., get

charged or discharged), resulting in switching current consumption. Expensive hardware

with a lot of measurement bandwidth and low noise is required to accurately track the

CPU instantaneous current consumption. From an energy measurement perspective,

however, we are interested in average current consumption. To a first order, battery

lifetime depends on the average current and the amount of stored energy in the cell.

Measuring average current is simpler and can be achieved by using a current meter.

The current meter averages the instantaneous current over an averaging window and

the corresponding readings are stable average values. The average current itself varies

as the program executes.

Fig. 2.6: Proposed experimental setup for application current measurement at processor and memory.

We captured current variations using an HP54710 oscilloscope. This scope has a sampling rate of

4 Gsamples/second. In addition, some real-time arithmetic functions can be

performed on the input signals of single or multiple channels. As shown in Figure 2.6,

the current is captured in terms of the differential voltage drop across a 0.01 Ω sense resistor

inserted in the current path (only the input to channel 1 is shown; for channels 2 and 3,

the connections for the differential input are similar). The input differential voltage drop

at each channel is divided (using the oscilloscope divide function) by 0.01 Ω to obtain the

current consumption. When an application runs on the target hardware, it

produces switching activity, which leads to current variation on the current paths at the CPU

core voltage input, the CPU peripheral voltage input and the memory supply voltage input.
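The conversion from the measured voltage drop to a current is Ohm's law applied to the sense resistor. A minimal Python sketch follows; the 0.01 Ω value is from the text, while the function names and the sample values are hypothetical.

```python
R_SENSE_OHM = 0.01  # sense resistor value given in the text


def current_from_drop(v_drop_volts, r_sense_ohm=R_SENSE_OHM):
    """Current in amperes from the differential voltage drop across the resistor."""
    return v_drop_volts / r_sense_ohm


def average_current(v_drop_samples):
    """Mean current in amperes over a window of voltage-drop samples."""
    currents = [current_from_drop(v) for v in v_drop_samples]
    return sum(currents) / len(currents)


# Hypothetical voltage drops in volts (0.4 mV, 0.5 mV, 0.6 mV).
samples = [0.0004, 0.0005, 0.0006]
print(f"{average_current(samples) * 1000:.1f} mA")
```

With these sample values the average is about 50 mA, which shows how small the voltage drops across such a resistor are.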

Figure 2.7 shows the CPU core current variation captured for a vector quantization

(VQ) application. It is plotted against the application execution time. The application

is allowed to run until completion, i.e., for 2800 ms. This plot clearly indicates the

current consumption profile at different time instants during the application execution

life cycle. It may be noted that during the application execution lifetime, one or more

code blocks in an application might have been executed twice or more. A basic code

block is a piece of code containing sequential instructions. As mentioned, in the APM,

time capture monitors are added in the original source code. They are inserted at the

beginning and end of each code block to obtain the number of times that a block is

accessed and the duration of its execution. Correspondingly, we generate an address

range versus current consumption plot. Figure 2.8 shows a plot over the address range

0x600000 to 0x603000 at an address interval (step size) of 1024 bytes (note that we use

the prefix ’0x’ to denote hexadecimal values). There are 13 code blocks in the application.

The length of the vertical bar at address 0x601C00 corresponds to the average current

consumption over the address range 0x601C00 to 0x602000 at a step size of 1024 (0x400).


Fig. 2.7: Current consumption for vector quantization (VQ) application execution lifecycle.

Fig. 2.8: CPU core current consumption versus address range for VQ application.


The step size is adjusted according to the granularity of the program basic code block.

It may be 256, 512, 1024, 2048 or 4096 bytes. A code

block may span a varying number of address ranges, e.g., in Figure 2.8 code block

1 (CB01) is executed during address range 0x600000 to 0x600400, while code block 3

(CB03) is executed during address range 0x600800 to 0x600C00.
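The grouping of current samples into address ranges of a fixed step size can be sketched as follows. The base address and step size follow the Figure 2.8 description; the function name and the sample values are hypothetical, and the sketch assumes that each current sample has already been associated with an instruction address.

```python
from collections import defaultdict


def average_current_per_range(samples, base, step=0x400):
    """Map (address, current_mA) samples to {range_start: mean current in mA}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for address, current in samples:
        start = base + ((address - base) // step) * step
        sums[start] += current
        counts[start] += 1
    return {start: sums[start] / counts[start] for start in sorted(sums)}


# Hypothetical samples: two fall into 0x601C00..0x602000, one into the next range.
samples = [(0x601C10, 10.0), (0x601FF0, 14.0), (0x602004, 8.0)]
bars = average_current_per_range(samples, base=0x600000)
for start, mean in bars.items():
    print(hex(start), mean)
```

Each key of the result corresponds to one vertical bar of an address range versus current plot.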

Now we formulate our energy consumption. Let a given program source code $X$ have $m$ code blocks; then

$$X = CB_1, CB_2, CB_3, \ldots, CB_m.$$

The total energy of the program code is the sum of the energy consumed in the individual code

blocks. If the $j$-th code block $CB_j$ is executed from time $t_a$ to $t_b$, then the energy

consumed by the code block is:

$$E_{CB_j} = \int_{t_a}^{t_b} P_{CB_j}\, dt. \tag{2.2}$$

The power consumed by each individual code block is the sum of the power consumed

over the code block address ranges.

For code block $CB_j$, it is $P_{CB_j} = \sum_{i=1}^{n} P_i$, where $n$ is the total number of address ranges (vertical bars) and $P_i$

is the power consumed by the $i$-th address range:

$$P_i = v_c i_{c,i} + v_p i_{p,i} + v_m i_{m,i}. \tag{2.3}$$

$v_m$, $v_c$ and $v_p$ are the operating voltages for memory, processor core and processor peripheral,

respectively. Similarly, $i_{c,i}$, $i_{p,i}$ and $i_{m,i}$ are the current consumptions in the processor

core, the processor peripheral and the memory, respectively, for the corresponding $i$-th

address range. For our experiments we set $v_m = 3.3$ V, $v_c = 2.2$ V and $v_p = 3.3$ V.

We obtain the execution time for each address range from the time base scale of the

oscilloscope.

Example: We capture the current consumption in memory, CPU core and CPU periph-

eral for the address range 0x200400 to 0x202C00 in an audio transcodec G-728 [94] at

a step size 0x400 (it may be noted that the current consumption is captured in terms

of execution time and later we converted it in terms of address range using the pro-

cedure mentioned above). They are shown in Figure 2.9 to Figure 2.11, respectively.

This address range corresponds to the code block (function) ’code book’ in the baseline

source code. The reason behind such a current consumption depends on the coded al-

gorithm. Here we shall restrict our discussion only to elaborate the energy measurement

methodology. There are nine current consumption bars in Figure 2.9 to Figure 2.11


corresponding to nine address ranges. The power dissipation in memory for ’code book’

is $v_m \sum_{k=1}^{9} i_{m,k}$, where $k$ indexes the address ranges in our code block (’code

book’).

Then the power consumed by the ’code book’ in memory is:

$$3.3\,\mathrm{V} \times (4.37 + 8.02 + 0.47 + 1.90 + 2.91 + 2.60 + 6.63 + 7.25 + 0.35)\,\mathrm{mA} = 113.87\,\mathrm{mW}.$$

Similarly, from Figure 2.10 and Figure 2.11 the power consumed in the CPU core and CPU

peripheral is 73.01 mW and 166.83 mW, respectively.

The total power consumption is $P_{total} = P_{memory} + P_{CPUcore} + P_{CPUperipheral}$,

i.e., $P_{total} = 353.70$ mW.

From the oscilloscope time base the sample duration for the ’code book’ is 0.267 seconds.

Thus the total energy consumed by the ’code book’ is 94.44 mJ.

The total energy consumed by the program is obtained by summing the energy con-

sumption of each individual code block.
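The arithmetic of the ’code book’ example can be retraced with a short Python sketch. The currents, rail powers and duration are the values quoted in the example; the function names are illustrative, and the integral of Equation (2.2) is approximated by a constant power over the block duration, which is an assumption of this sketch. Recomputing from the currents as printed gives a memory power of about 113.85 mW, marginally below the quoted 113.87 mW, presumably because the printed currents are rounded.

```python
V_MEM = 3.3  # memory supply voltage in volts, as set in the experiments

# Memory currents in mA for the nine address ranges, as printed in the example.
MEM_CURRENTS_MA = [4.37, 8.02, 0.47, 1.90, 2.91, 2.60, 6.63, 7.25, 0.35]


def rail_power_mw(voltage_v, currents_ma):
    """Power of one supply rail summed over the address ranges (V * mA = mW)."""
    return voltage_v * sum(currents_ma)


def block_energy_mj(rail_powers_mw, duration_s):
    """Energy of a code block assuming constant rail powers (mW * s = mJ)."""
    return sum(rail_powers_mw) * duration_s


p_mem = rail_power_mw(V_MEM, MEM_CURRENTS_MA)
energy = block_energy_mj([113.87, 73.01, 166.83], 0.267)  # rail powers quoted in the text
print(f"memory power: {p_mem:.2f} mW")
print(f"block energy: {energy:.2f} mJ")
```

The energy of a whole program would then be the sum of such block energies over all code blocks.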

Fig. 2.9: Memory current consumption versus address range for G-728 audio transcodec.

2.6 Conclusions

In this chapter we laid the foundation for the development of our energy cycle aware

iterative compilation framework. Our methodology optimizes a software application for

energy consumption, execution time as well as efficient hardware architecture utilization.


Fig. 2.10: CPU core current consumption versus address range for G-728 audio transcodec.

Fig. 2.11: CPU peripheral current consumption versus address range for G-728 audio transcodec.

We elaborate our method for generic multimedia processors and define a software

application in terms of its architectural behavior. We provide a simplified overview of typical

multimedia processors. Unlike conventional complex multimedia operation models, we

use a simplified multimedia operation model that views the instruction set

in terms of load/store operations, compute operations, special register operations and

control flow operations. We elaborate the importance of measuring the energy

consumed by an application on a real-time platform, which is the first thing to know

in any energy-constrained embedded system and can be used to estimate the battery

lifetime of the system. We present a measurement platform that is generic and

applicable to most off-the-shelf multimedia processors. It is based on current

measurement at both the processor and memory input lines. We propose a simplified

energy consumption model based on code blocks. We present a step-by-step procedure

for measuring the energy consumption of a software application on a target hardware

architecture. In contrast to contemporary work, our framework is tested on two major

application domains, multimedia and bioinformatics. The multimedia application

set consists of encoders and decoders (transcodecs) encompassing three media types -

speech, video, and audio (music) - whereas we categorize the basic functionality offered

by all bioinformatics tools into four groups: pattern recognition algorithms,

rule-based analysis, biological databases and biological taxonomy. Moreover, our results

reveal the utility of our framework across diversified application domains.

3. GRADIENT MODE ITERATIVE COMPILATION (GMIC)

In Chapter 2, a framework is presented for executing a single application in several source

transformation settings. The basic idea is to first identify the compute and data intensive

code blocks in programs, which we call Energy Cycle Hungry Code Blocks (ECHCBs),

and then execute a series of experiments, with each ECHCB assigned a predetermined

Transformation Scheme (TS). A simplified flow of the methodology is shown in Figure 3.1.

As explained in Chapter 2, we obtain the application expression in our ECACF, which is

further used by a Transformation Engine block and a Code Evaluation block as shown in

Figure 3.1. Based on the desired objectives, the transformation engine decides whether a

given application should go through successive transformations and hence compilations.

If the energy-cycle constraints specified in the UCF are not met, the transformation engine block

transforms the code according to the Gradient Mode Iterative Compilation (GMIC)

algorithm and provides it to a native Application Build Environment block. This block

produces the machine code for the transformed application source code, which is later

executed on the target platform to obtain the dynamic application expression

profile. The whole process is repeated until the transformed code meets the

desired optimization objective specified in the UCF.

We implement our scheme by first inspecting the 'C' source code of the application for cycle- and energy-taxing blocks, based on trace data collected while profiling the application as described in Chapter 2. For χ code blocks in an application and λ possible TS, there are λ^χ unique solutions, where a solution is an assignment of a transformation scheme to each code block. We present a novel heuristic that searches the solution space and eventually finds solutions (i.e., transformation scheme assignments) that satisfy the desired energy-time tradeoff for a given application. In each step, the heuristic takes one code block and tries to optimize it with the available set of transformation schemes. It proceeds to the next code block only when the current code block is optimized or no more TS are available. The details of the proposed heuristic are elaborated in Section 3.2.

In the following section we explain our profiling technique for determining an efficient TS for each code block. First, we describe our scheme for identifying and prioritizing candidate code blocks in the source code. Second, we discuss the mechanism that collects performance data and the energy-time tradeoff. Finally, we discuss our method for choosing an assignment of TS to each code block.

Fig. 3.1: Gradient Mode Iterative Compilation Methodology (GMIC).

3.1 GMIC Architecture

We use a straightforward programming model, which primarily applies to multimedia

and streaming applications. Specifically, it starts by obtaining a trace of the application

in question, which we call the baseline code. From there, it divides a program into Code

Blocks (CB). A code block is composed of procedure blocks and independent sequential

code blocks. Division is performed by examining the trace and using an ad hoc approach

that conforms to the following principles:

First, all CBs whose essential profile is larger than the profile specified in the UCF are considered. In the UCF they are listed with their cyclomatic complexity, nesting depth, and paths.

Second, priority conflicts are resolved by the weighted values of the CBs: if two CBs have the same access profile, the CB with the highest cyclomatic complexity is considered first, and so on. For this rule, the priority order is access rate, cyclomatic complexity, nesting depth, and paths.
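The priority rule can be illustrated with a lexicographic sort. The following is a minimal sketch, not the framework's implementation: the `CodeBlock` record, its field names, and the sample values are hypothetical, and it assumes that larger values of every criterion mean higher priority (the text states this only for cyclomatic complexity).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeBlock:
    # Hypothetical record; field names are illustrative, not taken from the UCF format.
    name: str
    access_rate: float
    cyclomatic_complexity: int
    nesting_depth: int
    paths: int


def prioritize(blocks):
    """Order blocks by access rate, then complexity, nesting depth, and paths.

    Assumption: a larger value means higher priority for every criterion.
    """
    return sorted(
        blocks,
        key=lambda b: (b.access_rate, b.cyclomatic_complexity, b.nesting_depth, b.paths),
        reverse=True,
    )


# Invented sample data: fb07 and fb17 tie on access rate, so complexity decides.
blocks = [
    CodeBlock("fb07", 0.30, 12, 3, 40),
    CodeBlock("fb17", 0.30, 18, 2, 25),
    CodeBlock("fb31", 0.10, 25, 4, 90),
]
ordered = [b.name for b in prioritize(blocks)]
print(ordered)
```

Here fb17 precedes fb07 because both have the same access rate and fb17 has the higher cyclomatic complexity; fb31 comes last despite its complexity because its access rate is lower.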

3.1. GMIC Architecture 43

Fig. 3.2: Fraction of JPMO CB in an MPEG-1 application; the code blocks are numbered from fb01 to fb34.

3.1.1 Performance Qualifier Measurement

We introduce Joules Per Million of Operations (JPMO) as a performance measure for the selection of candidate CBs. This measure is computed as the average energy consumption of a CB per million operations. We have found JPMO to be effective in identifying energy cycle hungry code blocks. For example, Figure 3.2 shows how JPMO varies across the code blocks of an MPEG-1 video encoder, and Figure 3.3 shows a window of 7 CBs. Here, JPMO clearly helps to partition the code into ECHCBs. These CBs were determined by hand, but there is potential to automate the partitioning.
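As an illustration of how such a ranking could be automated, the sketch below computes JPMO per block and each block's fraction of the summed JPMO. It assumes JPMO is the energy in joules divided by the number of executed operations in millions; the block names and measurements are invented for the example.

```python
def jpmo(energy_joules, operations):
    """Joules per million operations for one code block (assumed definition)."""
    if operations <= 0:
        raise ValueError("operations must be positive")
    return energy_joules / (operations / 1e6)


def jpmo_fractions(profile):
    """Map each block to its share of the summed JPMO over all blocks."""
    scores = {name: jpmo(energy, ops) for name, (energy, ops) in profile.items()}
    total = sum(scores.values())
    return {name: score / total for name, score in scores.items()}


# Hypothetical measurements: block -> (energy in joules, executed operations)
profile = {
    "fb01": (0.8, 4_000_000),
    "fb02": (0.3, 6_000_000),
    "fb03": (0.5, 2_000_000),
}
fractions = jpmo_fractions(profile)
ranked = sorted(fractions, key=fractions.get, reverse=True)
print(ranked)
```

With these numbers fb03 ranks first: it consumes less total energy than fb01 but more energy per operation.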

3.1.2 Code Block Queuing

Our method for transforming ECHCBs requires queuing these blocks such that blocks with a similar expression profile, which most likely benefit from the same TS, are adjacent. This saves search time when finding the transformation scheme for the next candidate code block if it is similar in expression profile to the previous one, because the same TS can be applied to it. Thus, our approach requires distinguishing a code block that has a good energy-time tradeoff from one that does not. That is, we must estimate the effect on the energy consumption and execution time of a block when it is executed under successive transformations. The key here again is JPMO (introduced above), which identifies CBs that have a good energy-time tradeoff.


Fig. 3.3: Fraction of JPMO contributed by code blocks in an MPEG-1 application (a window view of seven blocks).

3.1.3 Code Block Expression Profile

The first step gathers profile data during an execution of the program. The application expression is collected for the individual code blocks, following the scheme described in the next section. The information we collect includes the type of function call and its location (program counter), together with status information (TS, time, energy, etc.) and metrics (number of useful instructions, instruction cache misses, data cache misses, etc.). The extraction of the code block expression profile is explained in Chapter 2.

3.1.4 Transformation Scheme

The dynamics of GMIC are driven by a set of transformation schemes. They are chosen according to their rate of appearance during compilation. We examined a wide range of transformation schemes and grouped them from the highest rate of appearance in compilation to the lowest [68]. We use four sets of transformation schemes (TS1, TS2, TS3, TS4), which are listed in Table 3.1. The second column gives the names of the optimizations belonging to each TS. Some transformation schemes also include one or more lower level transformation schemes. Overall, loop transformation is considered the most beneficial in our framework, but blind use may lead to increased cache misses and eventually to high energy consumption. The third column shows the rate at which each TS appears in conventional DSP compilers. The rate decreases from TS1 to TS4. The fourth column shows the optimization level of each transformation scheme.

The transformations are applied in order of increasing aggressiveness, as shown in the last column of Table 3.1. TS1 is at the lowest optimization level. Our algorithm increases the level of the transformation scheme according to the performance objective defined in the performance tuple ρ(Energy, Cycles, Cache Misses, Functional Unit Utilization). Note that the proposed sequence may not be the best one; we found it efficient across our benchmark applications.

| Transformation Scheme | Optimization Types | Rate | Optimization Level |
|---|---|---|---|
| TS1 | Basic block; Value propagation; Hoisting loop invariant; Variable optimization | Highest | Low |
| TS2 | TS1; Functional block; Loop normalization; Break up large expression trees; Loop optimization | High | Medium |
| TS3 | TS1; Global optimization; Dismantle array instructions; Loop optimization | Medium | High |
| TS4 | TS2; Aggressive decision tree grafting; TS3 | Low | Highest |

Tab. 3.1: Transformation Schemes.

3.2 Implementation

The gradient mode iterative compilation is steered by the algorithm depicted in Figure 3.4. Our ultimate goal is to find a transformation scheme that is 'acceptable' under the energy-cycle constraints. Determining which of two solutions is "better" depends on how a user wants to trade off energy savings against time delay. Given a program partitioned into executable code blocks, we proceed to our method for determining an effective assignment of transformations to code blocks. If there are χ code blocks and λ transformation schemes, then the number of possible solutions (block-transformation assignments) for the program is λ^χ. In general, this search space is too large to explore by brute force. Therefore, the second part of our method is a heuristic that we use to find the "best" solution. The heuristic finds the "best" TS for a CB, then moves on to the next code block. Once it moves on to another CB, the TS for the preceding CB is fixed. Therefore, it is important that the CBs are sorted. Initially, the solution (a vector of TS) is set to the baseline value, all zeroes. The recursive function is invoked on the 0th code block. It executes the program using the next TS for this CB (all other CBs are as before). If the energy-time tradeoff (defined in the UCF) of this new solution is better than that of the current solution, it is accepted. The algorithm then recursively tries the next more aggressive TS on this CB.

The TS is fixed when the new tradeoff is worse than the current one or when no TS remain. After fixing the TS, the heuristic moves on to the next code block. It therefore runs the program at most λ·χ times. After each program execution, the energy and time are measured and compared via the user-defined relationship. For our tests, we use a simple and intuitive evaluation of the tradeoff based on the slope of the line between two solutions. The slope (Ω) is defined as the ratio of energy savings to time delay:

\[ \Omega = \frac{J_k - J_{k+1}}{C_k - C_{k+1}} \tag{3.1} \]

where k and k+1 are two consecutive solutions, J is the energy consumption, and C is the execution time.
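Equation (3.1) can be evaluated directly from two consecutive measurements. The sketch below is illustrative only: the function name, the units, and the handling of an unchanged execution time (reported as an infinite slope) are assumptions, not part of the framework.

```python
import math


def slope(j_k, c_k, j_next, c_next):
    """Energy-cycle slope between consecutive solutions k and k+1 (Eq. 3.1)."""
    d_energy = j_k - j_next
    d_time = c_k - c_next
    if d_time == 0:
        # Assumption: a purely vertical move is reported as an infinite slope,
        # and two identical measurements as a slope of zero.
        return math.copysign(math.inf, d_energy) if d_energy != 0 else 0.0
    return d_energy / d_time


# Hypothetical measurements: energy in joules, time in milliseconds.
omega = slope(j_k=10.0, c_k=100.0, j_next=9.0, c_next=101.0)
print(omega)  # -1.0: one unit of energy saved per unit of added time
```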

The following conventions hold implicitly:

• Ω = -1 (i.e., 45 degrees below the horizontal) means savings and delay are equally weighted.

• Ω = 0 means minimize energy.

• Ω = ∞ means minimize time.

We consider a new solution with a larger slope (in magnitude) than the user-defined limit to be better. We advocate this metric because it is reasonable and easy to visualize.
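The greedy search described in this section can be sketched as follows. This is a simplified illustration under stated assumptions, not a transcription of Figure 3.4: `toy_measure` is a hypothetical stand-in for building and profiling the application, the `GAIN` table is invented, a slope is accepted when it lies strictly inside a user window, and each block restarts from the least aggressive TS.

```python
GAIN = [0.0, 4.0, 6.0, 7.0, 7.5]  # hypothetical cumulative energy saving per TS level


def toy_measure(solution):
    """Hypothetical stand-in for build + profile: returns (energy, cycles)."""
    energy = 100.0 - sum(GAIN[ts] for ts in solution)
    cycles = 1000.0 + sum(solution)  # each TS level costs one time unit
    return energy, cycles


def slope(prev, new):
    """Eq. 3.1 for two (energy, cycles) pairs; None if the time is unchanged."""
    d_time = prev[1] - new[1]
    return None if d_time == 0 else (prev[0] - new[0]) / d_time


def gmic_greedy(num_blocks, num_schemes, measure, window):
    """Greedily assign a TS level (0 = baseline) to each block, in priority order."""
    low, high = window
    solution = [0] * num_blocks
    current = measure(solution)
    runs = 0
    for cb in range(num_blocks):              # blocks assumed pre-sorted by JPMO
        for ts in range(1, num_schemes + 1):  # least to most aggressive
            trial = solution.copy()
            trial[cb] = ts
            result = measure(trial)
            runs += 1
            omega = slope(current, result)
            if omega is None or not (low < omega < high):
                break                         # keep the last accepted TS for this block
            solution, current = trial, result
    return solution, current, runs


solution, current, runs = gmic_greedy(3, 4, toy_measure, window=(-10.0, -1.5))
print(solution, current, runs)
```

In this toy model the slopes for successive TS levels on one block are -4, -2, -1, and -0.5, so the window (-10, -1.5) accepts the first two levels and rejects the third. The search uses 9 program runs, within the λ·χ = 12 bound and far fewer than the λ^χ = 64 assignments of an exhaustive search.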

3.3 Example: Optimization of an MPEG-1 encoder

In this section we elaborate our methodology step-by-step with an example. We study

an MPEG-1 encoder (MPEGencoder) in depth. In the energy-cycle graph shown in

Figure 3.5, the baseline is mentioned as the ’bb’ point, where no optimization is applied.

Note that the higher of two points uses more energy, and the further right of two points takes more time. For all other points, at least one CB is transformed. Each point is labeled with a tuple; e.g., point 21 denotes the 2nd code block with transformation scheme 1.

Our analysis of the JPMO identified five code blocks in MPEGencoder, CB7, CB17,

CB18, CB31, CB33. For convenience, we use their pseudonym CB1, CB2, CB3, CB4,

X = {CB01, CB02, ..., CBm} is an application vector composed of Energy Cycle Hungry Code Blocks (ECHCB) in descending order.
TS = {TS1, TS2, TS3, TS4} is the Transformation Scheme (TS) vector composed of the available transformations.
ρ(Total Energy Consumption, Execution Time, Cache Misses, Functional Unit Utilization) is the performance tuple as mentioned in the user constraint file (UCF).
Ω is the energy-cycle slope between two consecutive solutions.
S_stsCount is the executable application binary that has been transformed 'stsCount' times.
S_0 is the executable application binary that has been optimized by the native compiler for minimum execution time.

Source to source transformation parameters: 1. X, array of ECHCBs; 2. TS, array of available transformation schemes; 3. the slope windows (W); 4. sts_count

S_sts_count ← StS(X, TS, W, sts_count)    // source to source code transformation call to proceed to optimal S_f

Algorithm StS(X, TS, W, sts_count)
    Build the application S_sts_count and obtain AEP
    Compute performance tuple from AEP and store in ρ
    If ρ satisfies UCF
        return X
    endif
    initialize CB_count, TS_count
Next_Iter:
    Get (CB)_CB_count
    Apply (TS)_TS_count
    sts_count++
    Build the application S_sts_count and obtain AEP
    Compute performance tuple from AEP and store in ρ'
    If ρ' satisfies UCF
        return X
    endif
    Compute Ω for ρ and ρ'
    If (Ω ∈ W)    /* if slope belongs to user defined slope limits W */
        /* current TS for current CB is acceptable, so make it an anchor point for next iteration */
        CB_ter = CB_count
        TS_ter = TS_count
    else
        /* the efficiency of current TS for current CB is not satisfactory, therefore maintain the previous TS and get next code block */
        CB_count++
        TS_count = TS_ter
    endif
    If (TS_count++ > TS_max) then    /* if all TS are applied, then proceed to next code block */
        CB_count++
    endif
    If (CB_count > m)    /* all code blocks have been considered for transformation */
        return X
    endif
    Goto Next_Iter

Fig. 3.4: GMIC algorithm.


Fig. 3.5: Heuristic track of CT-Tuple for an MPEG-1 encoder application.

CB5, listed in descending order of JPMO. We also assign a unique tuple to each code block-transformation scheme (CT) assignment, written as CTxy, where x is the index of the candidate CB and y is the index of the TS. In this example x = 1,2,3,4,5 and y = 1,2,3,4. E.g., CT14 means code block 1 optimized with transformation scheme 4. Each point on the graph represents a unique solution of our heuristic; note that it is not an exact solution, but one that satisfies the constraints. Initially, the application is optimized for minimum execution time, without taking into account optimal architecture utilization, energy consumption, etc. This solution is labeled "00", meaning that our heuristic is inactive for the source code. In the figure we have drawn a polygon line connecting 5 points with the bb point to form an Energy Cycle Bay (ECB); these points are "good" choices under a simple slope-based energy-execution time metric. Any solution "inside" the ECB is not a "good" choice according to this metric, except those lying on the lower edge of the bay. Recall that the goal of our profiling algorithm is to do "better" than the baseline. As our chosen metric is the slope between two solutions, Ω, our algorithm selects a unique point on the ECB. This is illustrated in Table 3.2, which shows a CTxy tuple track for only four gradient windows. The steps of the algorithm are shown in the table. The first column shows the user-defined slope window within which the heuristic converges to the closest solution on the ECB. The second column shows the two solutions being compared, and the arrow indicates the direction of the slope. Of these two solutions, the energy and time of the one on the left are known, so only one run of the program is necessary to complete this step. The third column shows the EC slope, and the fourth column indicates whether it lies within the user-defined limit given in column one. The last column shows the final point that is anchored by the heuristic to find the next optimum.

| Slope Window | CTxy Tuple Track | Ω | Direction | Solution |
|---|---|---|---|---|
| Ω < -10 | 00 → 11 | -25.667 | True | 11 is anchored |
| | ⋮ | ⋮ | | |
| | 00 → 52 | -2.48 | False | |
| -10 < Ω < -3 | 11 → 12 | -3 | False | |
| | 11 → 21 | -3.23 | True | 21 is anchored |
| -3 < Ω < -1 | 21 → 31 | -2.7 | True | 31 is anchored |
| -1 < Ω < 0 | 31 → 32 | -1.2 | False | |
| | 31 → 41 | -0.87 | True | 41 is anchored |

Tab. 3.2: Gradient Table.

In the first row of Table 3.2, Ω is steep (large negative number). Such a value favors

time delay over energy savings. In this case and all others, the algorithm selects an

appropriate TS in the order of its optimization level, i.e., from lowest to highest. Thus,

we start with the candidate solution 11. In the first step, the slope from 00 to 52 is

greater than Ω (not as steep); therefore, it is rejected. We back up to the previous

solution (11) and try the next TS. Next, solution 12 is rejected. Thus, the algorithm

selects the next most energy cycle hungry code block 2 for last successful TS 1. Here,

we find the slope Ω, which is sufficiently steep. In this case, 21 is accepted and 12

is rejected. The algorithm moves to the next aggressive transformation scheme 2 for

the same code block 1 and then determines the transformation for the first code block,

ultimately selecting 11. Similarly, in the next case the slope is within the limit, resulting in the CT31 tuple. Following this, from 31 to 32 the slope is slightly outside the limit (steeper), which results in the rejection of the CT32 tuple, followed by the acceptance of the next CT41 tuple. In this section we have focused on elaborating our technique and have ignored the energy-cycle benefits. Although these have implicitly been obtained, we defer their discussion to the next section.
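The accept/reject column of Table 3.2 can be reproduced by testing each slope against its window. The sketch below assumes that window bounds are strict inequalities and represents the open-ended window Ω < -10 with a lower bound of negative infinity; the slope values are those listed in the table.

```python
NEG_INF = float("-inf")

# (window low, window high, track, slope) as listed in Table 3.2
rows = [
    (NEG_INF, -10.0, "00->11", -25.667),
    (NEG_INF, -10.0, "00->52", -2.48),
    (-10.0, -3.0, "11->12", -3.0),
    (-10.0, -3.0, "11->21", -3.23),
    (-3.0, -1.0, "21->31", -2.7),
    (-1.0, 0.0, "31->32", -1.2),
    (-1.0, 0.0, "31->41", -0.87),
]


def in_window(omega, low, high):
    """True if the slope lies strictly inside the user-defined window."""
    return low < omega < high


decisions = {track: in_window(omega, low, high) for low, high, track, omega in rows}
anchored = [track.split("->")[1] for track, ok in decisions.items() if ok]
print(anchored)  # ['11', '21', '31', '41']
```

Under the strict-inequality assumption the slope of exactly -3 for 11 → 12 falls outside the window -10 < Ω < -3, which matches the "False" entry in the table.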

3.4 Discussion

Figure 3.6 to Figure 3.10 show results for the five benchmark applications that we executed: FFT, IDCT, T64, M100, and H264L [105]. We plot all applications according to their average execution time, as explained in Chapter 2. Each graph shows execution time in milliseconds against energy consumption. Depending on the impact of our heuristic on the optimization of these applications, we observe different slope sensitivities in tracking solutions from one anchored CTxy tuple to the next candidate. The applications are discussed below in the context of their track slope sensitivities.

Low EC Gradient Applications: We start our discussion of energy-cycle benefits with Figure 3.5, where the baseline code is at the extreme right of the graph. Our heuristic was aiming for low energy and a low cycle count for the MPEG-1 encoder application. Tracking from the CT00 tuple to the CT11 tuple reduced the energy to 6% with a time penalty of 3%. Similarly, following the whole track, the energy-cycle benefit from the CT00 tuple to the CT41 tuple is -89% / 10%, meaning that energy is decreased by 10 percent while execution cycles are significantly increased, to 89%, compared to the CT00 tuple. Note that the CT00 tuple is the rightmost point on the graph, which aims for minimum execution cycles at maximum energy cost. Therefore the CT41 tuple is a tradeoff between the energy gain and the execution cycle penalty compared to the baseline code. Our algorithm saves 4% energy at a penalty of 5% in execution cycles.

Fig. 3.6: Heuristic track of CTxy tuple for FFT application.

Fig. 3.7: Heuristic track of CTxy tuple for IDCT application.

Fig. 3.8: Heuristic track of CTxy tuple for T64 application.

Figure 3.6 shows the heuristic flow for the Fast Fourier Transform (FFT) algorithm, where the cycle/energy gain is 77%/-10%. Being compute intensive, such algorithms are always energy consuming, but their loop structure makes them a favorable target for energy reduction, especially when they are used for high-order filters. Similar behavior can be observed in Figure 3.7 and Figure 3.8 for the IDCT and T64 applications. Overall, the behavior of the IDCT application varies widely. Each code block benefits from the next TS; hence the slope is steeper. For T64 it is important to consider the array size. Arrays that favor high localization in the on-chip cache memory show less energy dissipation.

High EC Gradient Applications: H264L shows a significant gain in execution cycles as well as energy reduction (see Figure 3.10). Compared to the baseline code, the time penalty is 17% at an energy saving of 32%. H264L is mostly used in handheld devices, where energy saving is of prime importance. Our profiling shows that the H264L source code has a large number of localized procedure calls that fit well in the on-chip cache. In the same vein, the size of the input frame sequences is also well suited to the size of our data cache.

Non-sensitive Applications: In the case of matrix multiplication of order 100 (M100), our heuristic shows no benefit (see Figure 3.9). M100 has only one main procedure call, no communication (except once at the end of the program), and is CPU bound. Therefore, the transformation schemes have no effect on program improvement. In such cases a native compiler is sufficient to produce an optimal application.


Fig. 3.9: Heuristic track of CTxy tuple for M100 application.

Fig. 3.10: Heuristic track of CTxy tuple for H-264L application.

3.5 Conclusions

In this chapter we introduced our slope-directed technique for driving iterative compilation in our energy-aware framework. The 'C' source code is divided into code blocks according to their energy cost. We introduced joules per million of operations as a performance measure for a CB. The execution time and energy consumption of the transformed applications are compared against the user constraints. Once a solution is found for the highest energy cost, our heuristic starts tracking the next available lower energy transformation. Successively, it finds lower energy solutions at the cost of a time penalty.

Our technique is sensitive to the order of the code blocks. Due to greedy search, our

heuristic search for the next available CT tuple is very slow. We improve this in the next

chapter, where optimization objectives are modeled as a multiobjective problem and the

solution space is searched with the help of a genetic algorithm.

4. MULTICRITERIA STOCHASTIC ITERATIVE COMPILATION (MSIC)

4.1 Introduction

In contrast to a general purpose computer, an embedded system typically runs one application for its lifetime. With GMIC as proposed in Chapter 3, only a moderate improvement is achieved, as it effectively restricts itself to trying different back-end optimizations. The major impediment of such an approach is the heuristic search technique itself. In this chapter we consider the optimization problem as a single task in which all desired aims have to be taken into account simultaneously. The new method is based on the optimization of a multicriteria objective function. The desired aims of architecture-based energy-cycle optimization are formulated as penalty terms of this objective function. The maximization of the objective function is achieved using a Genetic Algorithm (GA).

A simplified flow of the methodology is shown in Figure 4.1. As explained in Chapter 2, we obtain the application expression in our ECACF, which is then used by the Transformation Engine block and the MSIC block shown in Figure 4.1. Based on the desired objectives, the transformation engine decides whether a given application should go through successive transformations and hence compilations. If the energy-cycle constraints specified in the UCF are not met, the transformation engine block transforms the code according to the Multicriteria Stochastic Iterative Compilation (MSIC) algorithm and passes it to a native Application Build Environment block. This block produces the machine code for the transformed application source code, which is then executed on the target platform to obtain the dynamic application expression profile. The whole process is repeated until the transformed application meets the desired optimization objective specified in the UCF.

In the next section we formulate source level optimization as a multicriteria problem. We present the details of our methodology, e.g., the selection of constraints, the development of the fitness function, and the formation of the Hertz Matrix (HM). We discuss two multimedia applications in depth to illustrate the advantages of the proposed algorithm.

Fig. 4.1: A simplified view of framework with multicriteria methodology extension.

4.2 Model Development

Multicriteria optimization is very different from single-objective optimization. In the latter, the aim is to obtain the best design, which is usually the global minimum or global maximum depending on the desired objective. In the former, there may not exist one solution that is best with respect to all objectives. Instead there exists a set of solutions that are superior to the rest of the solutions in the search space when all objectives are considered, but inferior to other solutions in the space in one or more objectives. These solutions are known as Pareto-optimal solutions or nondominated solutions [106]. Since genetic algorithms work with a population of points, a number of Pareto-optimal solutions may be captured using GAs. A genetic algorithm belongs to the class of stochastic optimization methods [106] [6] [107]. Although it does not guarantee finding the global optimal solution, the result is typically a good approximation of it. The concept of the GA allows working in parallel with many feasible solutions (individuals) by operating between these solutions. Because it works with many solutions in parallel, the genetic algorithm is less likely to stall in a local optimum and more likely to find the global solution. The algorithm is well suited to our problem, where the objective function is non-smooth, non-differentiable, and discontinuous, because the GA does not demand any of these properties. However, the following two properties regarding the search space and the objective function are required:

• Firstly, every point of the search space must be able to be coded as finite length

string.

• Secondly, every point of the search space must have a positive fitness described

by the objective function.
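Both requirements can be illustrated with a small sketch. It assumes, purely for illustration, that a point of the search space is one TS index per code block (0 meaning untransformed), encoded as a fixed-length digit string, and that raw scores are clamped to a small positive floor; neither choice is prescribed by the framework.

```python
def encode(assignment):
    """Encode one TS index (0-9) per code block as a fixed-length digit string."""
    if any(not 0 <= ts <= 9 for ts in assignment):
        raise ValueError("each TS index must be a single digit")
    return "".join(str(ts) for ts in assignment)


def decode(chromosome):
    """Recover the per-block TS indices from the string."""
    return [int(ch) for ch in chromosome]


def positive_fitness(chromosome, evaluate, floor=1e-9):
    """Clamp a raw score so that every individual has strictly positive fitness."""
    return max(evaluate(decode(chromosome)), floor)


# Hypothetical assignment for five code blocks: TS1, none, TS4, TS2, TS1
chromosome = encode([1, 0, 4, 2, 1])
print(chromosome)  # 10421
```

A strictly positive fitness matters for selection schemes such as roulette wheel selection, where the selection probability is proportional to the fitness value.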

Assume that all possible transformations are known. The assumption is sound because

the optimization space is in practice limited by architectural constraints, e.g., number

of available functional units, or best fit for the code block in cache. By using AEPs the

transformation scheme is determined for every possible code restructuring.

4.2.1 Objects and Constraints

We have two objectives for the optimization:

1. Instructions per cycle (η) and

2. Energy saving (ξ).

For every measured point of the optimization space, it is observed that:

• The successive architecture utilization (in terms of functional units, internal reg-

ister usage, best cache fit) must be greater than a predefined, system dependent

limit (i.e., execution cycle and energy threshold).

• The predecessor transformation scheme must overlap the successor in order to

follow a smooth optimization. The smooth optimization over two samples of code

is defined by the minimum and maximum limits of the transformed code. If the

output profile of the code is between these limits, this point must lie on a smooth

curve for optimization.

The problem is now to find the number γ, γ < Γ, of the Γ transformation possibilities and their resulting code profile (i.e., AEP) that maximizes our two objectives. We formulate this as the following multicriteria optimization problem with two components η and ξ:

\[ \max_{\rho} f(\rho) = \max_{\rho} \left[ \alpha\,\eta(\rho) + \beta\,\xi(\rho) \right] \tag{4.1} \]

subject to an individual ρ and two weighting terms α and β, which are explained below.

Algorithm Flow of GA: We use the GA and consider ρ as an individual. An individual

contains information of the transformation space and the previous iteration. Our population for this multiobjective GA is composed of dominated and nondominated individuals.

The basic line of the algorithm is derived from a steady-state genetic algorithm given in

[6], where only one replacement occurs per generation. The first modification we have

brought in the GA lies in the selection step. The selection phase implements a roulette

wheel selection. The crossover and mutation operators are then applied. The crossover

is applied on both selected individuals, generating one child. The mutation is applied


on the best individual. The best resulting individual is integrated into the population,

replacing the worst ranked individual in the population. Figure 4.2 presents the model of

the algorithm. Initial solutions are randomly generated using a uniform random number

of transformation schemes. As a result, the initial population is spread along the search

space in terms of the number of transformation schemes.
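The steady-state loop described above can be sketched as follows. This is one possible reading, not the implementation of [6] or of the framework: the individual is assumed to be a list of TS indices, the mutation is applied to the child, the better of child and mutated child replaces the worst individual only if it is at least as fit, and `toy_fitness` is an invented objective.

```python
import random


def roulette(population, scores, rng):
    """Pick one individual with probability proportional to its positive fitness."""
    return rng.choices(population, weights=scores, k=1)[0]


def crossover(a, b, rng):
    """One-point crossover of two parents producing a single child."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def mutate(ind, num_schemes, rng):
    """Replace one randomly chosen gene with a random TS index."""
    pos = rng.randrange(len(ind))
    return ind[:pos] + [rng.randrange(num_schemes + 1)] + ind[pos + 1:]


def steady_state_ga(fitness, num_blocks, num_schemes,
                    pop_size=20, generations=200, seed=0):
    rng = random.Random(seed)
    # Uniformly random initial population of TS assignments (0 = untransformed).
    population = [[rng.randrange(num_schemes + 1) for _ in range(num_blocks)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        scores = [fitness(ind) for ind in population]
        child = crossover(roulette(population, scores, rng),
                          roulette(population, scores, rng), rng)
        candidate = max(child, mutate(child, num_schemes, rng), key=fitness)
        worst = min(range(pop_size), key=scores.__getitem__)
        if fitness(candidate) >= scores[worst]:   # at most one replacement per generation
            population[worst] = candidate
        history.append(max(fitness(ind) for ind in population))
    return max(population, key=fitness), history


def toy_fitness(ind):
    """Hypothetical positive fitness that prefers TS2 on every block."""
    return 1.0 + sum(1.0 for ts in ind if ts == 2)


best, history = steady_state_ga(toy_fitness, num_blocks=5, num_schemes=4)
print(best, history[-1])
```

Because the worst individual is replaced only by a candidate that is at least as fit, the best fitness in the population never decreases from one generation to the next.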

Fig. 4.2: Simplified Genetic Algorithm Model [6].

Development of Fitness Function and Selection of Weights: The first term of

the fitness function in Equation (4.1), 0≤ η(ρ) ≤ 1, denotes the achieved fraction

of the Instruction Per Cycle (IPC) for the total transformation space. The second

term 0≤ ξ(ρ) ≤ 1, denotes the fraction of points where the energy saving is fulfilled.

Coefficients 0≤ α ≤ 1, 0≤ β ≤ 1 are weight factors to the criteria and they define

the importance of different criteria with respect to each other, e.g. if α=1 and β=0

the method optimizes only IPC, similarly for α=0.5 and β=0.5, the method optimizes

overlapped IPC and the energy function. The values of α and β depend on the user

requirement associated with available CPU cycles and energy budget for a candidate

application.
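The effect of the weights can be seen in a direct evaluation of Equation (4.1) for a single individual. The function name and the sample values of η and ξ below are illustrative assumptions; only the weighted-sum form and the [0, 1] ranges come from the text.

```python
def weighted_fitness(eta, xi, alpha, beta):
    """Weighted-sum fitness of Eq. 4.1 for one individual."""
    for name, value in (("eta", eta), ("xi", xi), ("alpha", alpha), ("beta", beta)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")
    return alpha * eta + beta * xi


# Hypothetical individual: 60% of the attainable IPC, energy goal met at 30% of points.
ipc_only = weighted_fitness(eta=0.6, xi=0.3, alpha=1.0, beta=0.0)
balanced = weighted_fitness(eta=0.6, xi=0.3, alpha=0.5, beta=0.5)
print(ipc_only, balanced)
```

With α=1 and β=0 the energy term has no influence on the ranking of individuals, whereas α=β=0.5 weights both criteria equally.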

The Choice of Individuals: The individual sample points in the transformation space

are chosen with a uniform probability distribution. They are profiled later by an eval-

uation of the application expression profile at the target architecture. The selected

individual transformations are updated based on their success, i.e., IPC and energy sav-

ing factor of the sequence as a whole. The constraints are modeled as a penalty term

of the fitness function f(ρ). Transformations contributing to better performance are

rewarded while those resulting in performance losses are penalized. Thus, future sample

points are more likely to include previously successful transformations more frequently

and search their neighborhood more intensively.


4.2.2 Case Study I - Arbitrary Application

As an example we solve the following code optimization problem:

We assume that the search space consists of 29 transformation schemes:

• 7 loop transformations,

• 12 variable operations,

• 5 data packaging schemes,

• 5 cache optimizations.

In addition it contains 20,000 transformation points (resolution is controlled by steering

factors such as grafting depth, cache block size etc. [25]). For simplicity, their IPC is

considered only for the useful instructions that are executed during the run time profiling.

The problem is to find among them the optimal transformation scheme, such that it

maximizes the fitness function as mentioned above. We optimized IPC with and without

overlapping energy goals.

Case 1: TS1 (α=1, β=0), only IPC is optimized and

Case 2: TS2 (α=1, β=1), both IPC and E are overlapping goals.

The two transformation schemes are depicted in Figure 4.3 to Figure 4.5 as TS1 (α=1 and β=0) and TS2 (α=1 and β=1). The steps are calculated over 200 generations. Figure 4.3 shows the development of the total fitness (as a fraction of the maximum fitness), Figure 4.4 shows the fraction of IPC, and Figure 4.5 shows the fraction of points where the overlapping IPC and energy conditions are fulfilled. Note that each successive point on these graphs shows the improvement over the baseline version of the same code. They are computed as follows:

\[ f_{norm} = \frac{f - f_{baseline}}{f_{baseline}}, \qquad \eta_{norm} = \frac{\eta - \eta_{baseline}}{\eta_{baseline}}, \qquad \xi_{norm} = \frac{\xi - \xi_{baseline}}{\xi_{baseline}} \]
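The same normalization applies to all three quantities, so a single helper is enough to illustrate it. The sample IPC values below are invented.

```python
def normalized_gain(value, baseline):
    """Relative improvement of a measured quantity over its baseline value."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (value - baseline) / baseline


# Hypothetical IPC per generation, compared against a baseline IPC of 1.2
baseline_ipc = 1.2
ipc_per_generation = [1.2, 1.35, 1.5]
ipc_norm = [normalized_gain(v, baseline_ipc) for v in ipc_per_generation]
print(ipc_norm)
```

A value of 0 corresponds to the baseline itself, and the last entry (approximately 0.25) corresponds to a 25% improvement over the baseline.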

Fig. 4.3: Development of fitness function for Case Study 1 in TS1 and TS2.


Fig. 4.4: Fraction of IPC for Case Study 1 in TS1 and TS2.

Fig. 4.5: Fraction of IPC and Energy overlapping for Case Study 1 in TS1 and TS2.

Figure 4.3 shows that in both cases the fitness values are increasing. The fitness function in TS1 maximizes only the IPC, but energy is implicitly related to the IPC because both are optimized on the same hardware platform. Similarly, many factors contribute to both IPC and energy, such as cache misses, functional unit utilization, and other architectural attributes. The fitness curve for TS2 rises more slowly than that for TS1. As TS1 targets optimized IPC, any optimization of IPC implicitly leads to a reduction in cache misses through appropriate code block sizes, higher functional unit utilization, and an increase in the scheduling factor. The goals are different if optimization is made only for energy saving. An increase in functional unit utilization reduces the energy significantly, but it might lead to an increase in cache misses. In this case the increase in cache misses is due to the compaction of code to achieve higher parallelism and hence to increase the functional unit utilization. The slower rise in IPC for TS2 is observed in Figure 4.4, because here the objectives were both energy saving and IPC maximization. Despite the importance set by α=1 in TS1, the applied optimization schemes did optimize energy, as we expected and as depicted in Figure 4.5.


4.2.3 Case Study II - Nonlinear Interpolative Vector Quantization (NLIVQ)

In this section we consider a more complete source-to-source transformation methodology for NLIVQ, an image compression application from our benchmark. Our aim is to optimize the energy saving such that both IPC and architectural utilization are taken into consideration. The aim of the architectural utilization criterion is to ensure that the on-chip cache and the functional units are used efficiently. To enhance architectural utilization for a software application, its application expression profile for CPU usage is needed. This profile contains the actual CPU utilization of each CB composing the application. We can tabulate each application CB against the percentage of load it places on the CPU relative to the maximum CPU load. E.g., for a CPU operating at 100 MHz, if a code block CB1 needs 40% of the maximum CPU operational time, then CB1 can be said to consume 40 MHz of the CPU. We call such a table a Hertz Matrix (HM); it lists the available Hz (or CPU cycles) for each function or code block. In an HM, the distribution of CPU cycles therefore corresponds to the distribution of the frequency of code blocks inside the application. This correspondence can be maintained by selecting a fixed number of high-frequency code blocks and applying transformation schemes to those code blocks. With a fixed number of code blocks, the proportional distribution of CPU cycles can be calculated. Figure 4.6 shows the CPU usage for different code blocks in the NLIVQ application on a processor running at 145 MHz. The total workload of the application is 14.046% of the available CPU computation power. The individual contribution of each code block can be computed as the ratio of the code block's CPU usage to the total application workload. E.g., the CPU usage share of code block F01 is (0.99767/14.046)*100 = 7.1%. The percentage CPU workload for selected code blocks is shown in Table 4.1.
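The two worked examples above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function names `hertz_matrix` and `workload_share` are our own and are not part of the framework described here.

```python
def hertz_matrix(cpu_mhz, load_percent):
    """Map each code block to the MHz it consumes.

    load_percent: code block name -> percentage of the maximum CPU load.
    """
    return {cb: cpu_mhz * pct / 100.0 for cb, pct in load_percent.items()}


def workload_share(load_percent, total_percent):
    """Share of the application workload caused by each code block, in percent."""
    return {cb: pct / total_percent * 100.0 for cb, pct in load_percent.items()}


# Example from the text: CB1 needs 40% of a CPU running at 100 MHz.
hm = hertz_matrix(100.0, {"CB1": 40.0})

# Example from the text: F01 uses 0.99767% of the CPU,
# while the whole application uses 14.046%.
share = workload_share({"F01": 0.99767}, 14.046)

print(hm["CB1"], round(share["F01"], 1))
```

The first value is the Hertz Matrix entry for CB1 (40 MHz); the second is the 7.1% share quoted for F01.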

Let us suppose that the CPU utilization is divided into several cycle slots based on the CB lifetime (i.e., the time each CB needs on the processor in terms of CPU cycles). It is then easy to calculate the fraction of CPU cycles in each time slot relative to the maximum allowed CPU cycles for the whole application. The aim of the optimization is to find a transformation scheme such that the obtained IPC obeys the available CPU cycles. We call this the architecture utilization optimization problem. We optimize energy and IPC simultaneously, together with functional unit utilization. We formulate the optimization problem as an extension to Equation (4.1) as follows:

$$\max_{\rho} f(\rho) = \max_{\rho} \left[ \alpha\,\eta(\rho) + \beta\,\xi(\rho) + \delta\,\zeta(\rho) \right] \qquad (4.2)$$

subject to individual ρ, where individuals have the same characteristics as explained in

Case Study 1.

Fig. 4.6: Fraction of CPU cycles for CB lifetime (CBLT) in the NLIVQ application (the 25 CBs are numbered from F01 to F25).

The architectural constraint is modeled as a penalty term ζ(ρ), 0 ≤ ζ(ρ) ≤ 1, which measures the observed CPU utilization (in terms of functional unit utilization) for an individual ρ. The other penalty terms are explained in the previous section. When the weights α, β, δ are equal, all objectives are equally important. This means that we look for results that give reasonably good IPC, energy saving and functional unit utilization on the underlying hardware.
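A minimal sketch of the weighted sum in Equation (4.2) is given below. It assumes that η, ξ and ζ are the IPC, energy saving and utilization terms, each already scaled to [0, 1]; the candidate names and scores are hypothetical and serve only to show how the weights change the selected individual.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    alpha: float  # weight of the IPC term eta
    beta: float   # weight of the energy saving term xi
    delta: float  # weight of the utilization term zeta


def fitness(eta, xi, zeta, w):
    """Weighted sum of Equation (4.2) for one individual."""
    return w.alpha * eta + w.beta * xi + w.delta * zeta


def best_individual(population, w):
    """Name of the individual with the highest fitness."""
    return max(population, key=lambda name: fitness(*population[name], w))


# Hypothetical (eta, xi, zeta) scores for three candidate transformation schemes.
population = {
    "rho_a": (0.9, 0.2, 0.5),
    "rho_b": (0.5, 0.9, 0.6),
    "rho_c": (0.6, 0.6, 0.9),
}

equal = Weights(1.0, 1.0, 1.0)      # all objectives equally important
ipc_only = Weights(1.0, 0.0, 0.0)   # only the IPC term counts

print(best_individual(population, equal), best_individual(population, ipc_only))
```

With equal weights the balanced candidate wins, whereas weighting only the IPC term selects the candidate with the highest η regardless of its energy score.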

We use the NLIVQ example to elaborate the concept. NLIVQ is composed of 25 CBs, as shown in Figure 4.6. Based on the CBLT, we considered only the 16 code blocks listed in Table 4.1 for transformation. As they cover 84% of the CPU cycles, this is an appropriate choice. The target for reducing CPU cycles is simply set to a 50% improvement over the original CPU cycles. For example, CB F01 will be optimized for 3.55% of the total CPU cycles (145 MHz). To demonstrate the working of multicriteria optimization, we optimize the fitness function using different weight values (α, β, δ) for IPC, energy saving and functional unit utilization, respectively, to optimize architecture utilization.

Figure 4.7 to Figure 4.10 depict the development of the fitness function (as a fraction of maximum fitness), the fraction of the IPC, the fraction of the energy saving and the fraction of the Functional Unit Utilization (FUU), respectively. As discussed in Case Study 1, these values are plotted as a fraction of their values for the baseline version of the same source code. They are computed as follows:

$$f_{norm} = \frac{f - f_{baseline}}{f_{baseline}}, \quad \eta_{norm} = \frac{\eta - \eta_{baseline}}{\eta_{baseline}}, \quad \xi_{norm} = \frac{\xi - \xi_{baseline}}{\xi_{baseline}}, \quad \zeta_{norm} = \frac{\zeta - \zeta_{baseline}}{\zeta_{baseline}}$$

There were 400 generations for each run, and the runs were repeated several times in order to get statistically reliable results. For brevity, we present only three transformation schemes out of nine. Their weight settings are given below:


Code Block   Actual CPU Cycles (%)   Desired CPU Cycles (%)
F01          7.1%                    3.55%
F03          6.8%                    3.40%
F15          6.6%                    3.31%
F24          6.4%                    3.19%
F16          6.2%                    3.11%
F08          6.0%                    3.11%
F23          5.8%                    3.02%
F21          5.3%                    2.88%
F19          5.3%                    2.67%
F20          5.2%                    2.66%
F17          5.2%                    2.59%
F05          4.9%                    2.46%
F06          3.5%                    1.75%
F14          3.4%                    1.69%
F25          3.4%                    1.68%
F13          3.0%                    1.50%

Tab. 4.1: CBLT in CPU cycles for NLIVQ.

TS04(α=1, β=0, δ=1);

TS07(α=1, β=1, δ=1);

TS09(α=0.6, β=0.1, δ=0.9);

As shown in Figure 4.7, the fitness values do not grow as high in TS04 and TS07 as in TS09. Although energy saving does not contribute to the fitness function in TS04 (β=0) and contributes only weakly in TS09 (β=0.1), it does grow in both. As we discussed in Case Study 1, applying the transformation scheme implicitly affects the energy consumption as well. The ripples in TS07 (see Figure 4.10) reflect the negative impact of applying the optimization scheme on the fraction of energy saving achieved. Figure 4.8 and Figure 4.10 reveal an implicit relation between the FUU and the growth of IPC. This was expected for the NLIVQ algorithm, where the two follow each other almost linearly, but that may not be the case in general, as demonstrated in [30].

A careful weight adjustment may produce the desired results. For TS09, the weights were chosen after many experimental iterations; a random selection of weights may require several compiler iterations. E.g., TS04 (α=1, β=0, δ=1) aims for better IPC and architectural utilization (in terms of FUU), yet we observed very poor development of the fitness function and a low fraction of both IPC and FUU. In TS07 (α=1, β=1, δ=1) all criteria are equally significant, and it shows good results on average across all criteria. We have found that the choice of weights is also sensitive to the underlying application algorithm and the coding style. E.g., for the optimization of a typical MPEG-1 encoder application, the weights were selected as α=0.7, β=0.4, δ=0.1 (discussed in the next section).

Fig. 4.7: Development of the fitness function for NLIVQ.

Fig. 4.8: Fraction of IPC for NLIVQ.


Fig. 4.9: Fraction of energy saving for NLIVQ.

Fig. 4.10: Fraction of functional unit utilization for NLIVQ.


To visualize the optimization results in terms of CPU load, the numerical test results are listed in Table 4.2. The first column lists the candidate code blocks, and the second column shows the desired percentage of CPU load (it corresponds to column three of Table 4.1). The CPU load achieved for the listed CBs under the three schemes TS04, TS07 and TS09 is shown in the third, fourth and fifth columns, respectively. Of the three schemes, TS09 comes closest to the desired values. To see this more clearly, Table 4.3 presents, for each TS, the sum of the absolute differences between the desired and optimized values. Table 4.3 shows that TS09 gives the best result with respect to the architecture utilization target. By this measure, the deviation of TS09 (0.1850) is about 1.5 times smaller than that of TS04 (0.2860).

Code Block   Target CPU Cycles (%)   TS04 (%)   TS07 (%)   TS09 (%)
F01          3.55%                   5.97%      5.47%      1.99%
F03          3.40%                   5.71%      5.24%      1.90%
F15          3.31%                   5.57%      5.10%      1.86%
F24          3.19%                   5.36%      4.92%      1.79%
F16          3.11%                   5.23%      4.79%      1.74%
F08          3.02%                   5.08%      4.66%      1.69%
F23          2.88%                   4.84%      4.44%      1.61%
F21          2.66%                   4.46%      4.09%      1.49%
F19          2.65%                   4.45%      4.08%      1.48%
F20          2.59%                   4.36%      3.99%      1.45%
F17          2.59%                   4.35%      3.99%      1.45%
F05          2.46%                   4.13%      3.79%      1.38%
F06          1.75%                   2.94%      2.70%      0.98%
F14          1.69%                   2.84%      2.60%      0.95%
F25          1.68%                   2.83%      2.59%      0.94%
F13          1.50%                   2.53%      2.31%      0.84%

Tab. 4.2: Achieved CPU cycles (%) in ECHCB of NLIVQ application for TS04, TS07, TS09.

Transformation Schemes    TS04     TS07     TS09
Sum of Abs. Difference    0.2860   0.2271   0.1850

Tab. 4.3: Sum of absolute difference for TS04, TS07, TS09.
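The measure in Table 4.3 can be recomputed from the entries of Table 4.2. The Python sketch below is illustrative only: the function name `sum_abs_diff` is our own, and it assumes that the entries of Table 4.3 are the summed percentage-point differences expressed as fractions (i.e., divided by 100).

```python
# Target and achieved CPU cycles (%) for F01, F03, ..., F13, copied from Table 4.2.
target = [3.55, 3.40, 3.31, 3.19, 3.11, 3.02, 2.88, 2.66,
          2.65, 2.59, 2.59, 2.46, 1.75, 1.69, 1.68, 1.50]
achieved = {
    "TS04": [5.97, 5.71, 5.57, 5.36, 5.23, 5.08, 4.84, 4.46,
             4.45, 4.36, 4.35, 4.13, 2.94, 2.84, 2.83, 2.53],
    "TS07": [5.47, 5.24, 5.10, 4.92, 4.79, 4.66, 4.44, 4.09,
             4.08, 3.99, 3.99, 3.79, 2.70, 2.60, 2.59, 2.31],
    "TS09": [1.99, 1.90, 1.86, 1.79, 1.74, 1.69, 1.61, 1.49,
             1.48, 1.45, 1.45, 1.38, 0.98, 0.95, 0.94, 0.84],
}


def sum_abs_diff(desired, observed):
    """Sum of |desired - observed| over all code blocks, as a fraction."""
    return sum(abs(d - o) for d, o in zip(desired, observed)) / 100.0


sad = {ts: sum_abs_diff(target, values) for ts, values in achieved.items()}
for ts, value in sad.items():
    print(ts, round(value, 4))
```

Under this assumption the sketch gives about 0.286, 0.227 and 0.185 for TS04, TS07 and TS09, which agrees with Table 4.3 to within rounding of the table entries.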

4.3 Performance Comparison with GMIC

Although we discussed our scheme in detail for the NLIVQ image compression application, the method is well suited to other compute- and data-intensive multimedia applications, e.g., MPEG-1, a video codec. The application features are listed in Table 2.3. In this section we optimize an MPEG-1 encoder for our target hardware and report the improvements over the iterative compilation scheme discussed in Chapter 3. Our aim is to optimize the MPEG-1 application for its five most energy-cycle-hungry code blocks among 35 CBs, such that the following constraints are fulfilled:

• The maximum processor speed is 180 MHz,

• The cycles available to the MPEG-1 encoder amount to 120 MHz (60 MHz are reserved for other activities, e.g., user interface, panel display, etc.).

Table 4.4 shows the optimized fractions of IPC, energy saving and functional unit utilization and their improvement after optimization. The objective function in Equation (4.2), with parameters α=0.7, β=0.4, δ=0.1, was used. There were 410 iteration steps for each run, and the runs were repeated several times in order to get statistically reliable results.

MPEG-1 encoder

Parameter               Results (MSIC)   % Improvement of MSIC over GMIC   % Search Time Reduction in MSIC as compared to GMIC
IPC                     87%              15%
Energy Saving           23%              46%                               49%
Functional Unit Util.   77%              7%

Tab. 4.4: Performance comparison between GMIC and MSIC.

The results show that the optimization scheme is beneficial even for compute- and data-intensive multimedia applications. Energy saving and IPC improve by 46% and 15%, respectively, over GMIC. The improvement in functional unit utilization is small because its weight δ was low. The salient feature of this scheme is its faster convergence to a good solution as compared to GMIC; slow convergence is another significant impediment to the implementation of such offline optimization schemes.

4.4 Conclusions

In this chapter we treated source-to-source transformation as a multicriteria optimization problem in which IPC and energy saving are optimized simultaneously. The optimization approach was demonstrated for real-time multimedia applications. The optimized result is more reliable than the result of traditional methods, where off-line compilation is usually performed without considering the architectural benefits. We demonstrated that architecture utilization is an important consideration while satisfying the desired aims for IPC and energy constraints. IPC was taken into consideration in the sense that the IPC obtained increases the target CPU utilization while reducing the energy consumption. Compared to GMIC, the proposed methodology is faster, and the target of the source-to-source transformation is to find an efficient source under given hardware constraints. As constraints for the optimal solution, different kinds of properties were demanded, such as maximum computation power, low energy consumption and effective target hardware utilization in terms of cache, functional units and on-chip registers, in order to obtain a high architecture-application correlation.

5. APPLICATION-ARCHITECTURE CHARACTERIZATION

Embedded systems are software running on hardware. An efficient embedded system is one in which the software application fully utilizes the underlying architecture to deliver optimal energy-cycle performance. The application-architecture correlation is a bidirectional process, matching the algorithmic structure with the hardware architecture and vice versa [108] [109] [110]. The programmer benefits from this efficient mapping and produces better source code. The mapping of algorithms and data structures onto the machine architecture includes processor scheduling, memory maps and inter-processor communications, to name a few. These activities are usually architecture dependent. Optimal mappings are sought for various processor architectures. The implementation of these mappings relies on efficient compiler and operating system support. Parallelism can be exploited at algorithm design time, at program time, at compile time and at run time.

In Chapter 3 and Chapter 4 we illustrated with examples how a native compilation environment for VLIW processors can be exploited for efficient code generation. Efficient code generation means generating code that takes advantage of the benefits offered by the architecture. We show that a multi-layer profile mechanism can be used to optimize the embedded system efficiency. Our energy cycle aware compilation framework (ECACF) is of great interest to designers working in mobile computing embedded system development, whose design goal is to measure the application behavior across different architectures. Applications of similar functionality may yield similar expression profiles, and hence can be suitable for similar hardware platforms. We tested our ECACF on diversified application domains that vary from multimedia to bioinformatics.

Despite the simplicity of our methodology, the analysis of the large matrices of application expression profiles (AEP) under different levels of transformation on different architectures is not trivial and requires advanced knowledge discovery processes. There exist several kinds of representations to express the knowledge that can be extracted from AEP. Knowledge discovery in available data, also known as data mining, is the efficient discovery of previously unknown, valid, potentially useful and understandable patterns in large volumes of data [111]. Patterns in the data can be represented in many different forms, including classification rules, association rules, clusters, sequential patterns, time series, contingency tables and others [112]. Typically, the number of patterns generated is very large, but only a few of these patterns are likely to be of any interest to the domain expert analyzing the data. The reason is that many of the patterns are either irrelevant or obvious, and do not provide new knowledge. To increase the utility, relevance and usefulness of the discovered patterns, techniques are required to reduce the number of patterns that need to be considered. Techniques which satisfy this goal are broadly referred to as interestingness measures [113] [112]. The analysis of relationship measures among variables is a fundamental task at the heart of such interestingness measures.

In this chapter we propose to analyze AEP data with the help of multivariate statistical techniques, in order to determine the application-architecture (A-A) correlation between different applications on one platform and between similar applications across different platforms. We use scatter plots, box plots, scree plots and Principal Component Analysis (PCA) biplots to explore the correlation between an application and the underlying hardware architecture. In the next section we introduce the basic concepts and definitions used in our methodology.

5.1 Terminologies

5.1.1 Principal Component Analysis (PCA):

PCA is used for dimensionality reduction in a data set by retaining those characteristics

of the data set that contribute most to its variance, by keeping lower-order principal

components (e.g., PC1, PC2, PC3) and ignoring higher-order ones (such as PC4, PC5

and higher). Such low-order components often contain the most important aspects of

the data. But this is not necessarily the case, depending on the application. PCA

is an orthogonal linear transformation that transforms the data to a new coordinate

system such that the greatest variance by any projection of the data comes to lie on

the first coordinate (called the first principal component), the second greatest variance

on the second coordinate, and so on. PCA is a way of identifying patterns in data,

and expressing the data in such a way as to highlight their similarities and differences.

Since patterns in data can be hard to find in data of high dimension, where a graphical

representation is not available, PCA is a powerful tool for analyzing data. We use PCA

biplots to visualize the black box impact of compiler and hardware architecture over the

software applications.
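The transformation described above can be sketched in a few lines of plain Python. The sketch assumes a small data set and uses power iteration with deflation on the sample covariance matrix, which is one of several ways to obtain principal components and is not necessarily the procedure used for the results in this chapter. The data values are hypothetical.

```python
import math


def center(rows):
    """Subtract the mean of every column (attribute)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def dominant_eigenpair(mat, iters=1000):
    """Power iteration: largest eigenvalue and unit eigenvector of a symmetric matrix."""
    d = len(mat)
    v = [1.0 + 0.1 * i for i in range(d)]  # arbitrary non-zero start vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [sum(mat[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:  # no variance left to extract
            return 0.0, v
        v = [x / norm for x in w]
    lam = sum(v[i] * mat[i][j] * v[j] for i in range(d) for j in range(d))
    return lam, v


def pca(rows, k):
    """First k principal components as (variance, direction) pairs."""
    cov = covariance(center(rows))
    d = len(cov)
    components = []
    for _ in range(k):
        lam, v = dominant_eigenpair(cov)
        components.append((lam, v))
        # Deflation: remove the component just found from the covariance matrix.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    return components


# Hypothetical two-attribute data lying close to a straight line.
data = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8]]
(var1, pc1), (var2, pc2) = pca(data, 2)
explained = var1 / (var1 + var2)
print(round(explained, 3), pc1, pc2)
```

For data that lie almost on a line, nearly all variance falls on the first component, which is the situation in which the higher-order components can be ignored.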

5.1.2 Scree Plot:

The Scree plot shows the relative fit of each principal component. It does this by

plotting the proportion of the data variance that is fit by each component versus the

component number. The plot shows the relative importance of each component in


fitting the data. The numbers beside the points provide information about the fit

of each component. The first number is the proportion of the data variance that is

accounted for by the component. The second number is the difference in variance from

the previous component. The third number is the total proportion of variance accounted

for by the component and the preceding components.

The Scree plot can be used to aid the decision about how many components are useful. We make this decision by looking for an elbow (bend) in the curve. If there is one (and often there is none), then the components following the bend account for relatively little additional variance and are good candidates to be ignored.
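The three numbers shown beside each point of the Scree plot can be computed as in the following sketch. The function name `scree_table` and the variances are hypothetical, and the sign convention of the second number (previous proportion minus current proportion) is an assumption.

```python
def scree_table(variances):
    """Per component: (proportion, drop from previous proportion, cumulative proportion)."""
    total = sum(variances)
    rows, cumulative, previous = [], 0.0, None
    for var in variances:
        proportion = var / total
        cumulative += proportion
        drop = None if previous is None else previous - proportion
        rows.append((proportion, drop, cumulative))
        previous = proportion
    return rows


# Hypothetical component variances (eigenvalues), sorted in decreasing order.
table = scree_table([4.0, 2.0, 1.0, 1.0])
for number, row in enumerate(table, start=1):
    print(f"PC{number}", row)
```

In this example the drop shrinks to zero after the third component, so the bend lies there and the last component adds no information beyond the third.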

5.1.3 Box Plot:

The Box, Diamond and Dot plot uses boxes, diamonds and dots to form a schematic of a

set of observations. The schematic can give you insight into the shape of the distribution

of observations. Some Box, Diamond and Dot plots have several schematics. These side-

by-side plots help to see if the distributions have the same average value and the same

variation in values.

The plot always displays dots. They are located vertically at the value of the observations

shown on the vertical scale. The dots are ’jittered’ horizontally by a small random amount

to avoid overlap.

The plot can optionally display boxes and diamonds. Boxes summarize information about

the quartiles of the variable distribution. Diamonds summarize information about the

moments of the variable distribution. The box plot is a simple schematic of a variable

distribution. The schematic gives information about the shape of the distribution of the

observations. The schematic is especially useful for determining if the distribution of

observations has a symmetric shape. If the portion of the schematic above the middle

horizontal line is a reflection of the part below, then the distribution is symmetric.

Otherwise, it is not. In the box plot, the center horizontal line shows the median, the

bottom and top edges of the box are at the first and third quartile, and the bottom and

top lines are at the 10th and 90th percentile. Thus, half the data are inside the box,

half outside. Also, 10% are above the top line and another 10% are below the bottom

line. The width of the box is proportional to the total number of observations.
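The five positions that define the box plot described above can be obtained with the Python standard library, as sketched below. The observations are hypothetical, and the exact percentile values depend on the interpolation method (here the default of `statistics.quantiles`), which may differ from the plotting tool used for the figures.

```python
import statistics


def box_summary(values):
    """10th percentile, first quartile, median, third quartile and 90th percentile."""
    deciles = statistics.quantiles(values, n=10)   # nine cut points
    quartiles = statistics.quantiles(values, n=4)  # three cut points
    return {
        "p10": deciles[0],
        "q1": quartiles[0],
        "median": statistics.median(values),
        "q3": quartiles[2],
        "p90": deciles[8],
    }


# Hypothetical observations of one attribute across nine applications.
summary = box_summary([12.0, 15.0, 11.0, 19.0, 14.0, 30.0, 13.0, 16.0, 18.0])
print(summary)
```

If the distance from the median to `q3` and `p90` clearly differs from the distance to `q1` and `p10`, the distribution is not symmetric.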

5.1.4 Scatter Plot:

The scatter plot matrix is designed to display the relationship between all pairs of several

variables. The plot matrix consists of plot cells containing little scatter plots formed from

a pair of variables. The variables are represented by the X-axis and Y-axis of each plot

cell. The observed values on the two variables are represented by points in the little


scatter plot. Each point represents the values for one observation on two variables.

Normally distributed variables will have scatter plots which have the greatest density in

the middle, are roughly elliptical in shape, and have no obvious outliers. The scatter

plot matrix can be used as a control panel for selecting variables, pairs of variables and

triples of variables.

5.1.5 Differential Application Expression Profile (dAEP):

An application may behave differently in the following scenarios:

1. The same application is executed on two different platforms, and

2. Two different versions of the same application, compiled with different optimization settings, are executed on the same platform.

In both scenarios we obtain two application expression profiles. We call the difference in performance between the two platforms an architecture-centric differential application expression profile, while the performance difference between the two versions is called a compiler-centric differential application expression profile.

An example of a compiler-centric dAEP is shown in Table 5.1. The table shows the application expression profile (code size, execution time, energy consumption, slot utilization, etc.) across different transformation schemes from Iter-1 to Iter-7. Each transformation iteration (Iter-1 to Iter-7) shown in Table 5.1 is given as the percent change relative to the original profile of the baseline version of MPEGdec. E.g., the code size grows by 15% in Iter-1 and by 87% in Iter-7, while the first iteration reduces the execution time by 6% (see -6% in the Iter-1 column). Similarly, the energy consumption decreases by 1% (see -1% in the Iter-1 column). The Iter-7 column shows the optimal application expression profile improvement over the baseline version. We call it the dAEP as compared to the baseline version.

Relative Measures   Iter-1  Iter-2  Iter-3  Iter-4  Iter-5  Iter-6  Iter-7
CodeSize            15%     28%     13%     26%     72%     79%     87%
ExecutionTime       -6%     -14%    -19%    -50%    -66%    -73%    -80%
EnergyConsump.      -1%     -8%     -4%     -14%    -19%    -21%    -23%
SlotUtilization     17%     19%     54%     45%     64%     70%     77%
SchedulingFactor    4%      4%      10%     17%     36%     40%     44%
HighwayUsage        94%     182%    221%    319%    327%    359%    395%
InstrucCacheMiss    -6%     -13%    -9%     -18%    -28%    -30%    -33%

Tab. 5.1: MPEGdec profile for successive transformations [8].
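Each entry of such a table is a percent change relative to the baseline profile. A minimal sketch of this computation follows; the raw profile values are hypothetical and were chosen only so that the result matches the first three rows of the Iter-1 column.

```python
def daep(baseline, variant):
    """Percent change of each attribute relative to the baseline profile."""
    return {key: (variant[key] - baseline[key]) / baseline[key] * 100.0
            for key in baseline}


# Hypothetical raw profiles in arbitrary units (not measured values).
baseline = {"CodeSize": 200.0, "ExecutionTime": 50.0, "EnergyConsump.": 400.0}
iter1 = {"CodeSize": 230.0, "ExecutionTime": 47.0, "EnergyConsump.": 396.0}

delta = daep(baseline, iter1)
print({key: round(value) for key, value in delta.items()})
```

The same function yields an architecture-centric dAEP when the two profiles come from the same application on two platforms instead of two compiled versions on one platform.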


5.2 Application Characterization

Our objective is to characterize software applications on three hardware platforms. We choose 20 applications from our benchmark set mentioned in Chapter 2. We use application pseudonyms instead of full names, as listed in Table D.1. We optimized these applications for the following processors:

1. Philips TriMedia Processor PNX1302

2. Analog Devices Blackfin ADSP533S

3. Intel PIII 850 embedded processor.

We obtain the AEP with the help of our energy cycle aware compilation framework. We

choose eight attributes in order to characterize applications at each hardware architec-

ture.

• Cache Miss (CMISS)

• Code Size (CODESIZE)

• Highway Usage (HIUSE)

• Slot Utilization (SLTUTIL)

• Register Usage (REGUSE)

• Scheduling Factor (SCHFAC)

• Cycle Efficiency (CYCEFF)

• Energy Saving (ENSAVING)

To find any potential relation between these attributes, we plot them on a scatter plot for the 20 applications. A visual inspection for direct or indirect relationships between the attributes guides the characterization procedure. Later, we draw a PCA biplot to explore further the impact of the compiler and the underlying hardware architecture on these applications. We explain this with three case studies on the above-mentioned hardware platforms.
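The visual inspection of the scatter plot matrix can be complemented by a numerical check. The sketch below computes the Pearson correlation coefficient for every pair of attributes; this is an additional illustration and not part of the procedure described here, and the attribute values are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient; NaN if one attribute does not vary."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def pairwise_correlation(profile):
    """Correlation for every pair of attributes, as in a scatter plot matrix."""
    return {(a, b): pearson(profile[a], profile[b])
            for a, b in combinations(sorted(profile), 2)}


# Hypothetical attribute values for four applications.
profile = {
    "CYCEFF": [10.0, 20.0, 30.0, 40.0],
    "ENSAVING": [5.0, 10.0, 15.0, 20.0],
    "CMISS": [8.0, 6.0, 4.0, 2.0],
}
corr = pairwise_correlation(profile)
print(corr)
```

A coefficient near +1 corresponds to a rising line in the little scatter plot, a coefficient near -1 to a falling line, and an attribute with no variability (a vertical or horizontal line) gives an undefined coefficient.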

5.2.1 Case Study 1

We obtain eight application attributes for the TriMedia processor. It is a media processor

for high-performance multimedia applications that deals with high-quality video and

audio. Typically, an extended general-purpose CPU (called the DSPCPU) makes it

capable of implementing a variety of multimedia algorithms from popular multimedia


standards such as MPEG-1 and MPEG-2 [4]. A scatter plot for our applications is shown in Figure 5.1; the underlying values are listed in Table D.2, Appendix D. Figure 5.1 displays the relationship between all pairs of application attributes. The plot matrix consists of little scatter plots, each formed from a pair of attributes for the applications A01-A20. The two attributes are represented by the X-axis and Y-axis of each plot cell, and each point represents the values of one application on the two attributes. We draw a line to signify any potential relation between the two attributes. The vertical lines in the little scatter plots of REGUSE versus ENSAVING and CYCEFF show that REGUSE varies less than ENSAVING and CYCEFF. Similarly, a linear relation exists between ENSAVING and CYCEFF. Although SCHFAC and SLTUTIL are known to be linearly related [110], the inverse relation between the two in the little scatter plot shows the inefficiency of the compiler in exploiting the parallelism offered by the TriMedia platform. SLTUTIL versus ENSAVING and SLTUTIL versus CYCEFF each show a linear relation. This is expected, because an increase in parallelism increases the cycle efficiency as well as the energy saving [2].

A preliminary analysis of the scatter plot indicates a potential relation between the application profiles on the TriMedia architecture. We analyze it further with PCA, which we compute for the data in Table D.2, Appendix D. To identify the number of necessary principal components, we plot them on a box plot, as shown in Figure 5.3. The first principal component PC1 shows the maximum variability, whereas PC2 and PC3 are the next largest principal components. All principal components and their proportional contributions to the variability are depicted as a bar plot in Figure 5.2. Though this plot is a discontinuous function, a dotted line is drawn between the PCs to highlight the Scree plot elbow (bend). It shows that PC1 and PC2 are sufficient to represent the variability in the application expression profiles for the TriMedia platform. We plot the PCA on a biplot to further explore the applications' behavior. The biplot, shown in Figure 5.4, is drawn with the help of PC1, PC2 and PC3, which together cover approximately 90% of the data variability.

PCA is generally used to reduce the data dimension; here, we focus on the qualitative analysis of the biplot. To the best of our knowledge, this is the first attempt to explore application expression profiles and application-architecture correlations on PCA biplots. We first explain how we analyze the biplot shown in Figure 5.4.

• Application names are shown as solid dots.

• Thick lines show the Application Expression Vectors (AEV); they correspond to the eight application attributes.

• Thin lines show the principal components (PC1, PC2, PC3).

• Though the biplot is a three-dimensional plot, it is depicted here in such a way that it shows the maximum association between AEVs, PCs and applications.

Fig. 5.1: Scatter plot for 20 applications at the TriMedia processor.

The spread of application dots around the PCs and AEVs shows how much an application has benefited from the architecture. The plane formed by all of them corresponds to the architectural liberty offered to the compiler as well as to the application in terms of AEVs. An embedded system runs an application binary, which is the outcome of an application build flow environment (see Figure 2.4). The PCA biplot helps to reveal potential shortcomings in the compiler as well as in the application coding.


Fig. 5.2: PCA Scree plot for 20 applications at the TriMedia processor.

Fig. 5.3: PCA box plot for 20 applications at the TriMedia processor.


Fig. 5.4: PCA biplot for 20 applications at the TriMedia processor.


All the AEVs heading in the same direction support each other, e.g.,

HIUSE and SLTUTIL;

ENSAVING, SCHFAC and REGUSE;

All the AEVs heading in opposite directions negate each other, e.g.,

CYCEFF and CMISS;

Applications close to AEVs support them, e.g., A05, A12, A18 and A03 exploit the parallelism offered by the TriMedia platform, as the vectors HIUSE and SLTUTIL point in the same direction as ENSAVING and CYCEFF. Applications A09 and A15 are the most energy efficient, while A04, A09, A14 and A20 are cycle efficient. Despite the aggressive transformation scheme, the applications A17, A01, A06, A13, A16 and A02 are not able to exploit the architectural benefits. The increase in cache misses has led to a decrease in cycle efficiency, since these applications are located exactly opposite to the CYCEFF expression vector. Similarly, applications A19, A07, A06, A13 and A02 are energy inefficient. These applications are dominated by branch operations and hence incur a higher number of cache misses, which eventually leads to more energy consumption. This is because the TriMedia architecture lacks a branch prediction unit. Our ECACF has produced very compact code for applications A04 and A11. These applications take advantage of TriMedia custom operations [4], which offer many single commands to perform array operations on data streams. For the data manipulation in many algorithms, however, 32-bit data and operations are wasteful of expensive silicon resources. Important multimedia applications, such as the decompression of MPEG video streams, spend significant amounts of execution time dealing with eight-bit data items. Using 32-bit operations to manipulate small data items makes inefficient use of the 32-bit execution hardware. If these 32-bit resources are used instead to operate on four eight-bit data items simultaneously, performance improves by a significant factor with only a tiny increase in implementation cost.
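The idea of operating on four eight-bit items inside one 32-bit word can be illustrated in software. The sketch below is a generic illustration with hypothetical helper names; it does not reproduce any TriMedia custom operation, and it uses wrap-around arithmetic per lane, whereas media instruction sets often provide saturating variants.

```python
LANE_HIGH_BITS = 0x80808080  # most significant bit of each 8-bit lane
LANE_LOW_BITS = 0x7F7F7F7F   # remaining seven bits of each lane


def pack_u8x4(b3, b2, b1, b0):
    """Pack four unsigned bytes into one 32-bit word (b0 is the lowest lane)."""
    return (b3 << 24) | (b2 << 16) | (b1 << 8) | b0


def unpack_u8x4(word):
    """Split a 32-bit word into its four unsigned bytes (highest lane first)."""
    return ((word >> 24) & 0xFF, (word >> 16) & 0xFF, (word >> 8) & 0xFF, word & 0xFF)


def add_u8x4(a, b):
    """Add four 8-bit lanes at once; each lane wraps modulo 256, no carry crosses lanes."""
    # Add the low seven bits of every lane; the sum cannot overflow a lane.
    low = (a & LANE_LOW_BITS) + (b & LANE_LOW_BITS)
    # Restore the top bit of every lane without propagating a carry.
    return (low ^ ((a ^ b) & LANE_HIGH_BITS)) & 0xFFFFFFFF


x = pack_u8x4(1, 2, 3, 250)
y = pack_u8x4(10, 20, 30, 10)
result = unpack_u8x4(add_u8x4(x, y))
print(result)
```

One 32-bit addition plus a few logical operations replaces four separate byte additions; the lowest lane shows the wrap-around (250 + 10 gives 4).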

Our aim in characterizing applications on TriMedia is mainly concerned with the porting issue. There is an increasing trend towards assembling off-the-shelf hardware and porting applications from Independent Software Vendors (ISV). To find an application that suits the target hardware, the PCA biplot proved to be a useful tool. As analyzed above, applications such as A17, A01, A06, A13, A16 and A02 are not well suited to the TriMedia architecture because they are dominated by branch operations, while applications such as A05, A12, A18, A10, A09, A14 and A20 are very well suited to TriMedia processors. We conclude that applications dominated by large matrix operations, data streaming and localized operations perform better in terms of both cycle efficiency and energy saving.


Fig. 5.5: Scatter plot for 20 applications at the Blackfin processor.

5.2.2 Case Study 2

We optimized our 20 applications for the Blackfin processor. The results are listed in Table D.3, Appendix D. The relationship between the application attributes is shown in Figure 5.5. The vertical lines in the little scatter plots of REGUSE versus ENSAVING, CYCEFF and CODESIZE show that REGUSE varies less than ENSAVING, CYCEFF and CODESIZE. Similarly, a linear relation exists between ENSAVING and CYCEFF. This behavior is similar to what we observed for TriMedia (Case Study 1). A linear relation is also observed between CMISS and CODESIZE, though in general no relation between the two need be apparent, because cache missing (CMISS) is a run-time behavior of the application, while code size (CODESIZE) is a static feature. In our opinion this behavior is an outcome of the Blackfin compiler. During optimization, it increases the size of the code to handle branch operations. Attempts to increase the spatial access in iterative function calls and the temporal access in multi-folded loops result in an increase in code size. Though this reduces the cycle count, the increase in cache misses leads to an increase in energy consumption. Apparently, the Blackfin compiler exploits branching for better cycle performance at the cost of energy performance.

Fig. 5.6: PCA biplot for 20 applications at the Blackfin processor.

The PCA biplot in Figure 5.6 shows a different response for all applications compared to Figure 5.4 for the TriMedia architecture. The expression vectors CODESIZE, HIUSE and CMISS are very well correlated with each other as well as with PC2, as already discussed above. In contrast, SLTUTIL, ENSAVING, SCHFAC, CYCEFF and REGUSE are well correlated with each other as well as with PC1, which corresponds to the maximum variability in terms of architectural usage by the applications. The biplot shows that the Blackfin processor offers better performance for A03, A14, A10, A18, A20, A01, A02 and A09, while applications on the left side of the biplot, such as A12, A07, A19, A04, A11, A17, A13 and A05, do not exploit any architectural benefits offered by the processor. Most of these applications are data dominated and involve pointer operations. This points to the poor ability of native compilers to handle pointer operations. If the aim is to port these applications to such a processor, it is recommended to transform the underlying algorithm into small functions and localized array operations. Although the energy saving shown in Table D.3 is not very promising, in practice Blackfin is known as an energy-efficient processor. We assume that the energy performance in practice is gained by using the Power Management Unit (PMU) available in the Blackfin processor. It may be noted that our ECACF optimizes a given application explicitly at the source code level (i.e., source-to-source transformation). During optimization iterations we always turn the native power optimization unit off. The primary advantage of this methodology is that we first optimize the application binary and can later reduce energy further by scheduling the power management units.
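The computation behind a PCA biplot such as Figure 5.6 can be sketched in a few lines. The example below standardizes four attributes (SLTUTIL, CMISS, ENSAVING, CYCEFF) for ten applications from Table D.3, builds their correlation matrix, and extracts only the first principal component by power iteration. Power iteration, the restriction to four attributes and ten rows, and all function names are assumptions chosen to keep the sketch dependency-free; the statistical package actually used to produce the biplots is not specified here.

```python
from math import sqrt

# SLTUTIL, CMISS, ENSAVING, CYCEFF for A02..A11, taken from Table D.3.
NAMES = ["SLTUTIL", "CMISS", "ENSAVING", "CYCEFF"]
ROWS = [
    [0.980, 0.257, 0.484, 0.737],    # A02
    [0.605, 0.462, 0.203, 0.375],    # A03
    [0.268, 0.460, 0.016, -0.221],   # A04
    [0.378, 0.217, 0.143, 0.171],    # A05
    [0.535, 0.108, 0.289, 0.415],    # A06
    [0.061, 0.351, -0.063, -0.287],  # A07
    [0.498, 0.235, 0.183, 0.254],    # A08
    [0.825, 0.000, 0.520, 0.784],    # A09
    [0.769, 0.186, 0.345, 0.534],    # A10
    [0.068, 0.475, -0.093, -0.145],  # A11
]


def standardize(rows):
    """Centre every column and scale it to unit sample variance."""
    columns = []
    for col in zip(*rows):
        mean = sum(col) / len(col)
        std = sqrt(sum((v - mean) ** 2 for v in col) / (len(col) - 1))
        columns.append([(v - mean) / std for v in col])
    return [list(row) for row in zip(*columns)]


def correlation_matrix(z):
    """Correlation matrix of already standardized data."""
    n, p = len(z), len(z[0])
    return [[sum(z[k][i] * z[k][j] for k in range(n)) / (n - 1)
             for j in range(p)] for i in range(p)]


def first_component(corr, iterations=200):
    """Dominant eigenpair of the correlation matrix via power iteration."""
    p = len(corr)
    v = [1.0 / sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(corr[i][j] * v[j] for j in range(p))
                     for i in range(p))
    return eigenvalue, v


z = standardize(ROWS)
eigenvalue, loadings = first_component(correlation_matrix(z))
scores = [sum(l * x for l, x in zip(loadings, row)) for row in z]  # PC1 scores
explained = eigenvalue / len(NAMES)  # share of total variance carried by PC1

for name, loading in zip(NAMES, loadings):
    print(f"{name:9s} PC1 loading {loading:+.2f}")
print(f"PC1 explains {explained:.0%} of the variance")
```

In a biplot, the loadings give the direction of each expression vector along PC1 and the scores give the position of each application. Attributes whose loadings share a sign point in the same direction, which is how the grouping of ENSAVING and CYCEFF shows up.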

5.2.3 Case Study 3

We optimized our 20 applications for a general-purpose Intel PIII 850 processor; the results are listed in Table D.4, Appendix D. This processor is implemented with a baseline version of the VLIW architecture [4]. The relationship between the application attributes is shown in Figure 5.7. Unlike for the TriMedia and Blackfin processors, we do not observe a large correlation between the attributes. For the sake of completeness, we also show their PCA biplot (see Figure 5.8). There is a large variability in the application spread. We do not observe any cluster of applications that exploits any of the eight attributes explicitly. Moreover, code size is a big issue in the PIII native compilation environment. The HIUSE expression vector points exactly opposite to that of CMISS; this indicates an optimal use of internal buses, which reduces cache misses and eventually energy. A closer look at applications A18, A14 and A13 reveals that, being audio codecs, they perform most of their operations in local loops on small chunks of data. The sizes of the instruction blocks and data blocks are well correlated with the instruction and data caches. We assume that this feature stems from the native compiler, which ensures a local optimization rather than a global or inter-procedural one.

5.3 Architecture-Centric Application Characterization

In the previous section, we explored the application variability on a given hardware architecture. In this section we explore application portability across platforms. We observe the differential application expression profile (dAEP) between our three target hardware platforms. The basic idea is depicted in Figure 5.9. The absolute difference between the AEPs of an application on two different platforms is used as the dAEP. As the difference in application behavior is taken across two platforms, we call it the architecture-centric application expression profile. We obtain the dAEP for the following scenarios:


Fig. 5.7: Scatter plot for 20 applications at the PIII 850 processor.

1. Across the TriMedia processor and the Blackfin processor for 20 applications.

2. Across the Blackfin processor and the PIII 850 processor for 20 applications.

3. Across the TriMedia processor and the PIII 850 processor for 20 applications.

We analyze the behavior of the eight attributes explained in the previous section. Table D.5, Table D.6 and Table D.7 in Appendix D list the dAEP for the three scenarios. The PCA biplots for each scenario are shown in Figure 5.10, Figure 5.11 and Figure 5.12, respectively.
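As a worked example of the dAEP definition, the following Python sketch takes the AEP of application A02 on the TriMedia processor (Table D.2) and on the Blackfin processor (Table D.3) and forms the attribute-wise absolute difference. The function name `daep` and the rounding to three decimals are illustrative assumptions.

```python
ATTRIBUTES = ["SCHFAC", "REGUSE", "HIUSE", "SLTUTIL",
              "CMISS", "CODESIZE", "ENSAVING", "CYCEFF"]

# AEP of application A02, copied from Table D.2 and Table D.3.
AEP_TRIMEDIA_A02 = [0.04, 0.02, 0.09, 0.59, 0.39, 0.24, 0.16, 0.26]
AEP_BLACKFIN_A02 = [0.140, 0.111, 0.100, 0.980, 0.257, 0.201, 0.484, 0.737]


def daep(aep_a, aep_b):
    """Attribute-wise absolute difference between two AEPs."""
    return [round(abs(a - b), 3) for a, b in zip(aep_a, aep_b)]


DAEP_A02 = daep(AEP_TRIMEDIA_A02, AEP_BLACKFIN_A02)
for name, value in zip(ATTRIBUTES, DAEP_A02):
    print(f"{name:9s} {value:.3f}")
```

The resulting values agree with the A02 row of Table D.5 to within roughly 0.005, which is consistent with the two-decimal precision of Table D.2.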

Fig. 5.8: PCA biplot for 20 applications at the PIII 850 processor.

Fig. 5.9: Differential AEP across three hardware platforms.

Applications close to the AEVs are those that suit both platforms; e.g., A01, A06 and A09 on average achieve better cycle efficiency on both the TriMedia and the Blackfin processor (see Figure 5.10). Similarly, A14, A04 and A11 are not well suited to either platform, due to a higher cache miss rate. The application clusters on the left are not suited for portability, as they perform well on only one of the platforms; e.g., A10 and A15 are energy and cycle efficient on the TriMedia processor but show poor performance on the Blackfin processor. Here, the biplot clearly identifies the cluster of applications well suited for portability across the two platforms.

Fig. 5.10: PCA biplot for 20 applications across the TriMedia processor and the Blackfin processor.

Figure 5.11 shows that A12, A14, A01, A08, A20 and A09 are both energy and cycle efficient on both the Blackfin and PIII 850 processors, while the application cluster A02, A18, A05, A11, A10 and A15 is not well suited for portability.

Fig. 5.11: PCA biplot for 20 applications across the Blackfin processor and the PIII 850 processor.

Figure 5.12 shows a high portability between the TriMedia and PIII 850 processors. The AEVs SLTUTIL, REGUSE, SCHFAC, ENSAVING and CYCEFF are very close to each other and point in the same direction. Their contribution to PC1 is also very high. The applications in their vicinity, e.g., A11, A18, A02, A06, A17, A04, A03, A20, A09 and A12, are very well suited for portability, while the application cluster on the left, containing A01, A07, A08, A15, A10, A05, A14, A13 and A19, performs well on only one of the two platforms and shows poor portability.


Fig. 5.12: PCA biplot for 20 applications across the TriMedia processor and the PIII 850 processor.

5.4 Conclusions

In this chapter we show that our energy cycle aware compilation framework (ECACF) is of great interest to designers working in mobile computing embedded system development whose design goal is to measure application behavior across different architectures. Applications of similar functionality may yield similar expression profiles, and hence can be suitable for similar hardware platforms. We introduce a new methodology to evaluate application portability using multivariate statistics. We demonstrate how box plots, scree plots and PCA biplots can be used to characterize an application on a given hardware architecture. We expose the details of our methodology by exploring the AEPs of diversified applications across three different hardware platforms. Finally, we demonstrate how the dAEP can be used to assess legacy code portability across platforms.

6 CONCLUSIONS

In this thesis we propose a framework, where software applications optimally utilize

the hardware architecture to deliver energy-cycle performance within user defined con-

straints. Our energy aware framework in [25] meets the demand by incorporating the

following features in native multimedia DSP compilation environments.

1) The framework transforms the legacy application source code into optimal ’C’ source code, taking advantage of different slacks appearing in the application-to-binary development hierarchy.

2) Unlike conventional techniques, the ’C’ source code is iteratively compiled for different performance goals in terms of both execution time and energy dissipation.

3) Our post-profiling techniques published in [26] evaluate the application performance not only at the compilation layer (as a conventional compiler does) but also at the scheduling layer, the linker layer, the machine code generation layer and finally the loader layer.

4) We measure the real-time performance of the application running on actual hardware. These measured parameters are further used to tune the transformation scheme of the legacy software application.

5) We tested our framework on different applications that belong to diversified industrial domains such as audio transcodecs [27], video transcodecs [8], speech codecs, and bioinformatics [28] [29].

6) The work is further extended in [30] [27] to characterize the application-architecture correlation, which is well suited for a pre-design assessment of an embedded system design. It answers the question of whether a given hardware architecture is an appropriate choice for a given multimedia software application.


APPENDICES

A. LIST OF APPLICATION EXPRESSION PROFILE (AEP) MONITORS

Name: Processor Frequency

Definition: The operating frequency of a multimedia processor

Location: VDF

Type: Static

Range: Typical 100MHz - 233MHz (depends on given hardware architecture)

Level: Primary

· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·

Name: Execution Time

Definition: The total execution time of a software application for a given input test

vector.

Location: Target HW

Type: Dynamic

Range: Measured in seconds

Level: Primary

· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·

Name: Energy

Definition: Amount of energy consumed by the software application for a given input

test vector.

Location: Target HW

Type: Dynamic

Range: Measured in milli joules (mJ)

Level: Primary


· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·

Name: Scheduling Factor

Definition: This factor is computed by dividing the infinite machine cycle time by the finite machine cycle time [110] [114] [115].

Location: Transformation Engine and Scheduler

Type: Dynamic

Range: 0 to 1

Level: Secondary

· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·

Name: Average cycles on finite machine

Definition: The finite machine cycle time averaged according to the probabilities of

execution of the block of code [110] [114] [115].

Location: Target HW

Type: Dynamic

Range: Measured in cycles

Level: Secondary

· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·

Name: Useful Issues per Cycle

Definition: Useful operations issued dynamically per number of dynamic instructions [110] [114] [115].

Location: Target HW

Type: Dynamic

Range: Integer

Level: Primary

· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·

Name: Slot Utilization.

Definition: The finite machine cycle time averaged according to the probabilities of

execution of the block of code [110] [114] [115].

Location: Transformation Engine

Type: Dynamic

Range: Measured in percentage


Level: Primary

· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·

Name: Ideal cycles

Definition: We provide the estimated infinite machine cycle time for static code.

Location: Target HW

Type: Dynamic

Range: Measured in cycles

Level: Primary

· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·

Name: JPMO

Definition: Joules per million operations, computed as the measured energy divided by the number of operations in millions.

Location: Target Hardware

Type: Dynamic

Range: Measured in joules per million of operations

Level: Secondary

· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·

Name: OPC

Definition: Operations per cycle, computed as number of operations per total number

of executed cycles [110].

Location: Scheduler

Type: Dynamic

Range: Measured in operations per cycle (integer)

Level: Secondary

· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·

Name: IPC

Definition: Instructions per cycle, computed as the number of instructions per total number of executed cycles [110].

Location: Native simulator (for e.g., tmSim for TriMedia TM130x)

Type: Dynamic


Range: Measured in instruction operations per cycle (integer)

Level: Secondary

· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·

Name: Architecture Affinity Number (AAN)

Definition: It is computed as:

\[
\mathrm{AAN} = \frac{\text{number of operations}_{\text{static}}}{\text{number of instructions}_{\text{static}} \times \text{number of issue slots}} \tag{A.1}
\]

Location: Transformation Engine

Type: Dynamic

Range: 0 to 1

Level: Secondary

· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·

Name: Register Usage

Definition: Maximum number of live registers at any time during program execution [110].

Location: Scheduler

Type: Dynamic

Range: Integer number (depends on VLIW architecture)

Level: Primary

· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
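Several of the secondary monitors listed above are simple ratios of primary measurements. The sketch below collects them as small Python functions that follow the definitions given in this appendix (Scheduling Factor, JPMO, OPC and Equation A.1 for AAN). The function names and the sample numbers are hypothetical and serve only to illustrate the arithmetic.

```python
def scheduling_factor(infinite_machine_cycles, finite_machine_cycles):
    """Infinite machine cycle time divided by finite machine cycle time."""
    return infinite_machine_cycles / finite_machine_cycles


def jpmo(energy_joules, operations):
    """Joules per million operations."""
    return energy_joules / (operations / 1_000_000)


def opc(operations, executed_cycles):
    """Operations per executed cycle."""
    return operations / executed_cycles


def aan(static_operations, static_instructions, issue_slots):
    """Architecture Affinity Number, Equation (A.1)."""
    return static_operations / (static_instructions * issue_slots)


# Hypothetical measurements for one code block on a 5-issue-slot machine.
print(scheduling_factor(400, 800))   # ideal schedule needs half the cycles
print(jpmo(0.014, 2_000_000))        # 14 mJ spent on two million operations
print(opc(1200, 400))
print(aan(300, 100, 5))
```

With five issue slots, an AAN of 1 would mean that every static instruction fills all slots with operations, so values near 1 indicate code that matches the width of the machine.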

The following profile monitors are obtained with the help of the tool ’csource’ from [116].

Name: Code Size

Definition: Size of the executable binary

Location: Linker

Type: Static

Range: Integer

Level: Primary

· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·

Name: Nesting

Definition: Maximum nesting level of control constructs


Location: Compiler

Type: Static

Range: Integer

Level: Primary

· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·

Name: Paths

Definition: Number of possible paths, not counting abnormal exits or gotos

Location: Compiler

Type: Static

Range: Integer

Level: Primary

· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·

Name: Cyclomatic

Definition: The measure of the complexity of a function’s decision structure. The

cyclomatic complexity is also the number of basis, or independent, paths through a

module. Also sometimes called the McCabe Complexity after its originator.

Location: Compiler

Type: Static

Range: Integer

Level: Primary

· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·

Name: Modified

Definition: Cyclomatic, except that the individual case statements are not counted; the entire switch counts as 1.

Location: Compiler

Type: Static

Range: Integer

Level: Primary

· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·

Name: Strict


Definition: Cyclomatic, except that logical operators are counted as 1

Location: Compiler

Type: Static

Range: Integer

Level: Primary

· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·

Name: Essential

Definition: Measure of the amount of unstructured code in a function

Location: Compiler

Type: Static

Range: Integer

Level: Primary

· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
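The Cyclomatic monitor above can be illustrated with McCabe's formula M = E - N + 2P, where E is the number of edges, N the number of nodes and P the number of connected components of the control flow graph. The Python sketch below applies it to a hand-built graph of a loop containing an if/else. The graph, the node names and the function name are hypothetical; how the ’csource’ tool derives its counts internally is not covered here.

```python
# Control flow graph of a while loop whose body is an if/else.
# Each key is a basic block; each value lists its successor blocks.
CFG = {
    "entry": ["loop_cond"],
    "loop_cond": ["if_cond", "exit"],   # loop test: continue or leave
    "if_cond": ["then", "else"],        # branch inside the loop body
    "then": ["join"],
    "else": ["join"],
    "join": ["loop_cond"],              # back edge to the loop test
    "exit": [],
}


def cyclomatic_complexity(cfg, components=1):
    """McCabe complexity M = E - N + 2P of a control flow graph."""
    nodes = len(cfg)
    edges = sum(len(successors) for successors in cfg.values())
    return edges - nodes + 2 * components


print(cyclomatic_complexity(CFG))
```

The graph has two decision points (the loop test and the if), so the complexity is one more than that, in line with the reading of cyclomatic complexity as the number of independent paths through a module.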

B VLIW DESCRIPTOR FILE (VDF) FORMAT

Our ECACF is generic with respect to the VLIW architecture. In this section we explain

the structure of our VDF, which is similar to [4]. In order to compile for a specific

target machine, the compilation tools are parameterized through a textual description

known as the VLIW description format and can be integrated as shown in Figure 2.3.

As entries in VDF are generic, any VLIW processor description can be added into our

VDF. Different fields of VDF are explained below:

Operation:

The operation section defines operation names and the properties associated with them.

The section consists of the reserved keyword OPERATIONS, followed by any number of

operation groups. Each operation group consists of the arity, operation properties, and

operation names in the operation group.

E.g.,

UNARY PARAMETRIC (UNSIGNED 0 TO 127) iaddi isubi

indicates both iaddi and isubi take a single argument, and both operations contain a

parameter that is unsigned and in the range 0 to 127.

Pseudo-Operation:

The pseudo-operation section consists of the reserved keyword PSEUDO-OPERATIONS,

followed by any number of pseudo-operation mapping rules. Each rule consists of the

tree operation name, followed by a string in quotation marks that defines the mapping,

ending with a semicolon to terminate the rule entry. The string defines how an operation

is expanded. Each use of the pseudo-operation is rewritten to the form specified by the

string. E.g., the following string defines the iles operation (integer less than) as igtr (integer

greater than), with its arguments swapped.

iles ”igtr 21” ;
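The effect of such a mapping rule can be sketched as follows. The Python fragment below rewrites a use of iles into igtr with its two arguments swapped, as described above. The table `PSEUDO_OPS`, its tuple encoding of the argument order, and the function `expand` are hypothetical constructs for illustration; they do not reproduce the VDF string syntax.

```python
# Hypothetical encoding of a mapping rule: target operation plus the order
# in which the original arguments are passed on (0-based indices).
PSEUDO_OPS = {
    "iles": ("igtr", (1, 0)),  # a < b is rewritten as b > a
}


def expand(operation, arguments):
    """Rewrite a pseudo-operation; real operations pass through unchanged."""
    if operation not in PSEUDO_OPS:
        return operation, list(arguments)
    target, order = PSEUDO_OPS[operation]
    return target, [arguments[i] for i in order]


print(expand("iles", ["r1", "r2"]))
print(expand("iadd", ["r1", "r2"]))
```

Because the rewrite happens before scheduling, the machine only needs a functional unit that implements igtr.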

Unit Type:

The unit type section defines a functional unit type in the machine (such as data mem-

ory unit, integer arithmetic/logic unit, floating point divider unit, and so forth). It

consists of the keyword FUTYPE, followed by the name of the unit type, followed by


unit type properties, then the keyword OPERATIONS, followed by a list of all operations

implemented in that functional unit type. E.g.,

FUTYPE shifter LATENCY 1 OPERATIONS asli roli asri lsri asl rol asr lsr ;

Target Machine:

This section describes the target machine configuration. The ISSUESLOTS entry defines

the number of issue slots in the machine. The REGISTERS entry declares the size of

the register file. The WRITEBUSES entry defines the number of writeback buses used

to write back the results of computations into the register file.

E.g., a typical description for the TriMedia architecture [4] is as follows:

MACHINE

ISSUESLOTS 5

REGISTERS 128

WRITEBUSES 5

FUTYPE const SLOT 1 2 3 4 5

FUTYPE alu SLOT 1 2 3 4 5

FUTYPE dmem SLOT 4 5

FUTYPE shifter SLOT 1 2

FUTYPE dspalu SLOT 1 3

FUTYPE branch SLOT 2 3 4
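A machine description of this kind is easy to read into a data structure. The following Python sketch parses the MACHINE section shown above and reports which functional unit types can issue in each slot. The parser, its dictionary layout and the lower-cased key names are assumptions made for illustration and are not the ECACF implementation.

```python
MACHINE_TEXT = """\
MACHINE
ISSUESLOTS 5
REGISTERS 128
WRITEBUSES 5
FUTYPE const SLOT 1 2 3 4 5
FUTYPE alu SLOT 1 2 3 4 5
FUTYPE dmem SLOT 4 5
FUTYPE shifter SLOT 1 2
FUTYPE dspalu SLOT 1 3
FUTYPE branch SLOT 2 3 4
"""


def parse_machine(text):
    """Parse a MACHINE section into scalar entries and a unit-to-slots map."""
    machine = {"units": {}}
    for line in text.splitlines():
        tokens = line.split()
        if not tokens or tokens[0] == "MACHINE":
            continue
        if tokens[0] == "FUTYPE":
            # Layout: FUTYPE <name> SLOT <slot numbers...>
            machine["units"][tokens[1]] = [int(t) for t in tokens[3:]]
        else:
            machine[tokens[0].lower()] = int(tokens[1])
    return machine


def units_per_slot(machine):
    """Invert the map: for every issue slot, list the unit types it offers."""
    slots = {slot: [] for slot in range(1, machine["issueslots"] + 1)}
    for name, unit_slots in machine["units"].items():
        for slot in unit_slots:
            slots[slot].append(name)
    return slots


machine = parse_machine(MACHINE_TEXT)
for slot, names in units_per_slot(machine).items():
    print(slot, names)
```

The inverted view makes resource conflicts visible; for instance, in this description memory operations can only be issued in slots 4 and 5.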

Instruction Format:

The instruction format section consists of the reserved IFORMAT

keyword, followed by the bitfields subsection, and then the opcodes subsection. The

bitfields subsection consists of the keyword BITFIELDS and six bitfield length specifiers.

These bitfield specifiers can be in any order, though the assembler always packs the

bitfields in a particular order. This section specifies bitfield size (in bits).

E.g., the IFORMAT section description for the TriMedia architecture [4] is as follows:

OPCODES

iimm 95

uimm 95

iadd 12

isub 13

imax 15


imin 14

igtr 17

igeq 16

ieql 37

nop 255

Readers are encouraged to refer to [103] [4] [81] [104] for further details about the entries of the VDF structure.


C. USER CONSTRAINTS FILES (UCF) FORMAT

The UCF format has the following fields:

Processor Operating Frequency: The actual operating frequency of the processor, although the processor could be driven at a much higher frequency. (Range = processor dependent)

Main Memory Size: The size of the attached main memory, although the memory that can be attached to the processor chip may be larger. (Range = processor dependent)

Slot Utilization: The percentage of slot utilization for an application; the higher the percentage, the longer the application compilation time. Our ECACF does not necessarily meet this parameter, because it is directly related to the parallelism offered by the application itself. (Range = 0 to 100%)

Total CPU Load: The workload offered by an application to the CPU. The user sets this parameter based on the constraints on available CPU cycles, to increase CPU productivity. (Range = 0 to 100%)

Total Energy Dissipation: The energy consumed by an application. The user sets this parameter based on the constraints on the available battery budget, to increase the energy saving. (Range = 0 to 100%)

Scheduling Factor: It is associated with CPU utilization; briefly, the higher the number, the greater the CPU utilization. It contributes significantly to energy saving. (Range = 0 to 1)

Tree Depth: The number of times a tree can be replicated at the execution exit. It increases the parallelism in a VLIW processor and reduces the execution time. (Range = 0 to 1)

Unfolding Depth: The number of times a loop can be unfolded (unrolled). (Range = 1, 2, 4, 8)

Search Time Out: The time-out for the search algorithm. (Range = user dependent)

Search Generations: The maximum number of generations, used only in our MSIC algorithm. (Range = user dependent)

Transformation Schemes: The set of transformation schemes to be used by our ECACF; this entry is generally provided by the user. (Range = user dependent)

C.1 UCF for MPEG-1 encoder example in Section 3.3

Processor Operating Frequency: 180 MHz

Main Memory Size: 32 Mbyte

Slot Utilization: 80%

Total CPU Load: 40%

Total Energy Dissipation: 14,000 mJoules

Scheduling Factor: 0.5

Tree Depth: 0.5

Unfolding Depth: 8

Search Time Out: Manual

Search Generations: None

Transformation Schemes: TS1, TS2, TS3, TS4
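The field ranges given at the start of this appendix lend themselves to a simple consistency check before an optimization run. The Python sketch below encodes the numeric entries of the C.1 example and tests them against those ranges. The dictionary keys, the `violations` function and the decision to check only the fields with explicit numeric ranges are assumptions for illustration, not a description of how the ECACF reads its UCF.

```python
# Numeric entries of the UCF in Section C.1 (MPEG-1 encoder example).
UCF_MPEG1 = {
    "slot_utilization_pct": 80,
    "total_cpu_load_pct": 40,
    "scheduling_factor": 0.5,
    "tree_depth": 0.5,
    "unfolding_depth": 8,
}

# Closed intervals taken from the field descriptions of the UCF format.
RANGES = {
    "slot_utilization_pct": (0, 100),
    "total_cpu_load_pct": (0, 100),
    "scheduling_factor": (0, 1),
    "tree_depth": (0, 1),
}
ALLOWED_UNFOLDING_DEPTHS = {1, 2, 4, 8}


def violations(ucf):
    """Return a list of human-readable range violations (empty if valid)."""
    problems = []
    for key, (low, high) in RANGES.items():
        if not low <= ucf[key] <= high:
            problems.append(f"{key}={ucf[key]} outside [{low}, {high}]")
    if ucf["unfolding_depth"] not in ALLOWED_UNFOLDING_DEPTHS:
        problems.append(f"unfolding_depth={ucf['unfolding_depth']} not allowed")
    return problems


print(violations(UCF_MPEG1))
print(violations(dict(UCF_MPEG1, unfolding_depth=3, scheduling_factor=1.5)))
```

Returning a list of messages rather than raising on the first error lets a user correct all out-of-range entries in one pass.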

C.2 UCF for NLIVQ example in Section 4.2.3

Processor Operating Frequency: 145 MHz

Main Memory Size: 16 Mbyte

Slot Utilization: 80%

Total CPU Load: 5%

Total Energy Dissipation: 5,000 mJoules

Scheduling Factor: 0.5

Tree Depth: 0.5

Unfolding Depth: 8

Search Time Out: None

Search Generations: 700

Transformation Schemes: TS01, TS02, TS03,..., TS17

D APPLICATION ATTRIBUTES

Application Pseudonyms   Description   Domain

A01 G728enc Speech

A02 GENESSPLICER Bioinformatics

A03 TRIGRSCAN Bioinformatics

A04 MPEGdec Video

A05 H263enc Video

A06 M100 General

A07 G728dec Speech

A08 NLIVQ Image Processing

A09 GENIE Bioinformatics

A10 H263dec Video

A11 M64 General

A12 MPEGenc Video

A13 GSMdec Speech

A14 GSMenc Speech

A15 GRAIL Bioinformatics

A16 G723enc Speech

A17 G723dec Speech

A18 MP3enc Audio

A19 G728enc Speech

A20 MP3dec Audio

Tab. D.1: Pseudonyms for 20 applications.


Applications SchFac RegUse HiUse SltUtil Cmiss CodeSize EnSaving CycEff

A01 0.08 0.04 0.16 0.5 0.43 0.33 0.14 0.15

A02 0.04 0.02 0.09 0.59 0.39 0.24 0.16 0.26

A03 0.02 0 0.19 0.66 0.34 0.28 0.21 0.27

A04 0.15 0.09 0.22 0.36 0.2 0.18 0.23 0.16

A05 0.08 0.04 0.39 0.8 0.35 0.29 0.36 0.33

A06 0.07 0.05 0.26 0.24 0.38 0.28 0.06 -0.13

A07 0.12 0.02 0.15 0 0.39 0.18 -0.08 -0.24

A08 0.15 0.07 0.02 0.7 0.2 0.44 0.34 0.5

A09 0.12 0.02 0.05 0.59 0.13 0.21 0.29 0.46

A10 0.06 0.02 0.3 0.98 0.26 0.34 0.45 0.56

A11 0.1 0.02 0.06 0.27 0.07 0.05 0.16 0.22

A12 0.02 0.02 0.24 0.74 0.23 0.43 0.32 0.34

A13 0.12 0 0.1 0.36 0.39 0.35 0.07 0.07

A14 0.01 0 0.17 0.85 0.03 0.01 0.43 0.67

A15 0.15 0.11 0.09 0.77 0.45 0.16 0.3 0.48

A16 0.2 0.16 0.24 0.08 0.17 0.47 0.16 -0.13

A17 0.13 0.04 0.34 0.15 0.26 0.46 0.09 -0.21

A18 0.15 0.11 0.37 0.61 0.2 0.39 0.38 0.25

A19 0.17 0 0.03 0.08 0.49 0.36 -0.1 -0.18

A20 0.11 0.07 0.16 0.84 0.06 0.15 0.48 0.67

Tab. D.2: AEP for optimized 20 applications at the TriMedia processor.



Applications SchFac RegUse HiUse SltUtil Cmiss CodeSize EnSaving CycEff

A02 0.140 0.111 0.100 0.980 0.257 0.201 0.484 0.737

A03 0.197 0.055 0.052 0.605 0.462 0.122 0.203 0.375

A04 0.032 0.024 0.320 0.268 0.460 0.364 0.016 -0.221

A05 0.049 0.027 0.126 0.378 0.217 0.107 0.143 0.171

A06 0.125 0.001 0.175 0.535 0.108 0.001 0.289 0.415

A07 0.020 0.011 0.234 0.061 0.351 0.246 -0.063 -0.287

A08 0.075 0.018 0.040 0.498 0.235 0.407 0.183 0.254

A09 0.197 0.106 0.012 0.825 0.000 0.280 0.520 0.784

A10 0.080 0.017 0.090 0.769 0.186 0.274 0.345 0.534

A11 0.148 0.011 0.081 0.068 0.475 0.071 -0.093 -0.145

A12 0.008 0.007 0.386 0.023 0.271 0.434 -0.027 -0.412

A13 0.115 0.022 0.357 0.314 0.241 0.116 0.166 0.024

A14 0.159 0.097 0.224 0.620 0.284 0.152 0.322 0.361

A15 0.158 0.117 0.263 0.795 0.445 0.493 0.355 0.333

A16 0.098 0.029 0.072 0.380 0.121 0.215 0.194 0.250

A17 0.073 0.051 0.159 0.265 0.313 0.266 0.075 -0.021

A18 0.160 0.008 0.315 0.783 0.167 0.079 0.423 0.539

A19 0.190 0.052 0.095 0.080 0.434 0.303 -0.034 -0.154

A20 0.044 0.033 0.328 0.885 0.174 0.155 0.443 0.534

Tab. D.3: AEP for optimized 20 applications at the Blackfin processor.



Applications SchFac RegUse HiUse SltUtil Cmiss CodeSize EnSaving CycEff

A02 0.106 0.013 0.022 0.428 0.236 0.295 0.155 0.296

A03 0.012 0.004 0.288 0.193 0.459 0.455 -0.042 -0.368

A04 0.045 0.025 0.089 0.862 0.481 0.081 0.256 0.597

A05 0.11 0.048 0.165 0.207 0.313 0.18 0.06 -0.037

A06 0.02 0.002 0.361 0.354 0.185 0.314 0.165 -0.025

A07 0.133 0.081 0.362 0.011 0.145 0.139 0.096 -0.239

A08 0.19 0.113 0.092 0.557 0.489 0.37 0.201 0.278

A09 0.095 0.065 0.296 0.384 0.485 0.229 0.099 -0.054

A10 0.187 0.005 0.066 0.772 0.077 0.398 0.42 0.763

A11 0.128 0.1 0.178 0.496 0.438 0.19 0.18 0.2

A12 0.13 0.083 0.367 0.874 0.056 0.104 0.549 0.752

A13 0.18 0.032 0.358 0.445 0.311 0.201 0.227 0.148

A14 0.12 0.016 0.348 0.593 0.215 0.393 0.308 0.272

A15 0.075 0.032 0.299 0.589 0.156 0.048 0.313 0.403

A16 0.095 0.036 0.017 0.632 0.002 0.222 0.357 0.672

A17 0.012 0.006 0.083 0.916 0.314 0.309 0.33 0.668

A18 0.192 0.12 0.359 0.347 0.005 0.324 0.356 0.198

A19 0.169 0.052 0.259 0.309 0.402 0.065 0.112 0.039

A20 0.028 0.018 0.25 0.251 0.286 0.435 0.065 -0.16

Tab. D.4: AEP for optimized 20 applications at the PIII 850 processor.



Applications SchFac RegUse HiUse SltUtil Cmiss CodeSize EnSaving CycEff

A02 0.101 0.096 0.014 0.39 0.13 0.035 0.326 0.479

A03 0.182 0.052 0.135 0.051 0.118 0.155 0.008 0.11

A04 0.12 0.068 0.102 0.095 0.261 0.183 0.213 0.384

A05 0.036 0.015 0.265 0.421 0.129 0.185 0.216 0.157

A06 0.055 0.053 0.085 0.295 0.269 0.284 0.234 0.547

A07 0.099 0.009 0.088 0.057 0.042 0.066 0.021 0.047

A08 0.073 0.053 0.018 0.197 0.04 0.031 0.16 0.242

A09 0.073 0.087 0.043 0.239 0.126 0.074 0.228 0.324

A10 0.016 0 0.211 0.212 0.074 0.063 0.103 0.023

A11 0.05 0.01 0.017 0.201 0.402 0.02 0.249 0.366

A12 0.012 0.009 0.145 0.719 0.046 0.006 0.347 0.751

A13 0.008 0.022 0.254 0.046 0.144 0.239 0.092 0.041

A14 0.144 0.096 0.053 0.229 0.257 0.138 0.11 0.309

A15 0.008 0.007 0.178 0.022 0.008 0.333 0.053 0.143

A16 0.101 0.128 0.167 0.298 0.045 0.252 0.034 0.377

A17 0.053 0.012 0.185 0.118 0.05 0.194 0.011 0.191

A18 0.011 0.1 0.055 0.175 0.031 0.313 0.043 0.285

A19 0.025 0.048 0.068 0.001 0.056 0.062 0.068 0.025

A20 0.071 0.032 0.165 0.047 0.115 0.004 0.037 0.141

Tab. D.5: dAEP for optimized 20 applications across the TriMedia and the Blackfin processors.



Applications SchFac RegUse HiUse SltUtil Cmiss CodeSize EnSaving CycEff

A02 0.034 0.099 0.078 0.552 0.021 0.095 0.329 0.442

A03 0.185 0.051 0.237 0.412 0.003 0.334 0.245 0.744

A04 0.013 0.002 0.231 0.595 0.022 0.284 0.24 0.818

A05 0.062 0.021 0.039 0.171 0.096 0.073 0.083 0.208

A06 0.105 0.001 0.186 0.181 0.078 0.313 0.123 0.44

A07 0.113 0.07 0.128 0.05 0.205 0.107 0.159 0.048

A08 0.115 0.095 0.052 0.059 0.255 0.038 0.018 0.025

A09 0.101 0.041 0.284 0.441 0.485 0.051 0.421 0.839

A10 0.107 0.012 0.024 0.003 0.109 0.125 0.075 0.229

A11 0.021 0.088 0.097 0.428 0.037 0.119 0.273 0.346

A12 0.121 0.077 0.019 0.851 0.215 0.329 0.576 1.164

A13 0.064 0.009 0.001 0.13 0.07 0.086 0.061 0.124

A14 0.039 0.081 0.124 0.026 0.069 0.24 0.014 0.089

A15 0.083 0.084 0.036 0.206 0.288 0.445 0.042 0.069

A16 0.003 0.007 0.055 0.252 0.118 0.007 0.163 0.422

A17 0.061 0.045 0.076 0.652 0 0.042 0.255 0.689

A18 0.032 0.113 0.044 0.436 0.163 0.245 0.067 0.341

A19 0.021 0 0.164 0.23 0.032 0.237 0.145 0.194

A20 0.016 0.015 0.078 0.634 0.112 0.28 0.378 0.694

Tab. D.6: dAEP for optimized 20 applications across the Blackfin and the PIII 850 processors.



Applications SchFac RegUse HiUse SltUtil Cmiss CodeSize EnSaving CycEff

A02 0.068 0.003 0.064 0.162 0.151 0.059 0.003 0.037

A03 0.003 0.002 0.102 0.464 0.115 0.178 0.253 0.634

A04 0.107 0.067 0.129 0.499 0.283 0.101 0.027 0.434

A05 0.026 0.006 0.226 0.592 0.033 0.112 0.299 0.366

A06 0.049 0.052 0.1 0.114 0.191 0.029 0.11 0.107

A07 0.014 0.061 0.216 0.007 0.247 0.041 0.18 0

A08 0.043 0.042 0.069 0.139 0.294 0.069 0.142 0.218

A09 0.028 0.045 0.242 0.202 0.36 0.022 0.193 0.514

A10 0.122 0.012 0.235 0.209 0.183 0.062 0.028 0.206

A11 0.029 0.078 0.114 0.226 0.365 0.139 0.024 0.02

A12 0.109 0.068 0.126 0.133 0.169 0.324 0.229 0.413

A13 0.056 0.031 0.255 0.084 0.074 0.154 0.153 0.082

A14 0.105 0.015 0.177 0.255 0.188 0.379 0.124 0.398

A15 0.075 0.077 0.213 0.184 0.297 0.111 0.011 0.074

A16 0.104 0.121 0.222 0.55 0.163 0.245 0.197 0.799

A17 0.114 0.033 0.261 0.77 0.05 0.152 0.244 0.879

A18 0.043 0.013 0.011 0.261 0.193 0.068 0.024 0.057

A19 0.004 0.048 0.232 0.231 0.087 0.299 0.213 0.219

A20 0.086 0.047 0.088 0.587 0.227 0.284 0.415 0.835

Tab. D.7: dAEP for optimized 20 applications across the TriMedia and the PIII 850 processors.

E LIST OF ACRONYMS

AEP Application expression profile

AEV Application expression vector

APM Application profile monitor

BSX Byte sex (little endian or big endian)

CB Code block

CISC Complex instruction set computer

CMISS Cache miss

CODESIZE Code size

CPU Central processing unit

CYCEFF Cycle efficiency

dAEP Differential application expression profile

DSP Digital signal processor

DSPCPU DSP CPU

EC Energy cycle

ECACF Energy cycle aware compilation framework

ECB Energy cycle bay

ECHCB Energy cycle hungry code block

EDA Electronic design automation

ENSAVING Energy saving

FFT Fast Fourier transform

FUU Functional unit utilization

GA Genetic algorithm

GMIC Gradient mode iterative compilation

HIUSE Highway usage

HM Hertz Matrix

IC Integrated circuits

IDCT Inverse discrete cosine transform

ILP Instruction level parallelism

IPC Instruction per cycle

ISA Instruction set architecture

ISV Independent software vendor

JPMO Joules per million of operations

LSB Least significant bit


M100 Matrix of order 100x100

M64 Matrix of order 64x64

MES Mobile embedded systems

MMIO Memory mapped input output

MSIC Multicriteria stochastic iterative compilation

NI-CD Nickel cadmium

NI-MH Nickel metal hydride

NLIVQ Non linear vector quantization

OPC Operation per cycle

PC Personal computer

PCA Principal component analysis

PCSW Program counter status word

PDA Personal digital assistant

PMU Power management unit

REGUSE Register usage

RISC Reduced instruction set computer

SCHFAC Scheduling factor

SDRAM Synchronous dynamic random access memory

SLTUTIL Slot utilization

TS Transformation scheme

UCF User constraint file

VDF VLIW descriptor file

VLIW Very long instruction word

VQ Vector quantization

WCET Worst case execution time

BIBLIOGRAPHY

[1] “Desktop CPU Power Consumption Guide,”

http://www.techarp.com/.

[2] “Intel Processor Chronicle,” http://developer.intel.com/design/.

[3] Natibo, “Rechargeable Battery/Systems for Communication/Electronic Applica-

tions,” http://www.acq.osd.mil/ott/natibo/docs/BatryRpt-2.pdf.

[4] P. Electronic, “TM1300 Data Book,” North America Corporation, vol. Oct., 1999.

[5] V. Tiwari and S. Malik, “Power Analysis of Embedded Software: A First Approach

to Software,” in Proceedings of IEEE Transactions on VLSI Systems, vol. 2, Dec.

1994.

[6] M. Lorenz, T. Draeger, R. Leupers, P. Marwedel, and G. P. Fettweis, “Low-Energy

DSP Code Generation Using a Genetic Algorithm,” in Proceedings of the IEEE on

Computer Design 2001, Austin, Texas, Jan. 2001.

[7] A. V. Oppenheim and R. Schafer, Discrete Time Signal Processing. New Jersey:

Prentice Hall, 1989.

[8] N. Z. Azeemi, “Probabilistic Iterative Compilation for Source Optimization of

Embedded Programs,” in Proceeding of the IEEE 2006 International SoC Design

Conference, Seoul, Korea, Oct. 2006, pp. 323 – 328.

[9] J. L. Hennessy and D. A. Patterson, Computer Architecture A Quantitative Ap-

proach, 2nd ed. Kluwer Academic Publishers, 1995.

[10] “Advanced Configuration and Power Interface Specification,”

http://www.teleport.com/ acpi.

[11] “The StarCore DSP,” http://www.starcore-dsp.com/.

[12] Techsharp, “Intel StrongARM Processors,” http://developer.intel.com/design/strong/.

[13] “Intel SpeedStep Technology,” http://www.intel.com/mobile/pentiumIII/ist.htm.


[14] Intel StrongARM SA-1110 Microprocessor - Advanced Developer’s Manual. Intel

Corp., Jun. 2006.

[15] C. Small, “Shrinking Devices Puts the Squeeze on System Packaging,” in EDN

39(4), Feb. 1994, pp. 41–46.

[16] V. Gutnik and A. P. Chandrakasan, “An Embedded Power Supply for Low-Power

DSP,” in Proceedings of IEEE Transactions on VLSI Systems, ser. 4, vol. 5, Dec.

1997, pp. 425–435.

[17] “Moores Law,”

http://www.intel.com/intel/museum/25anniv/hof/moore.htm.

[18] C. Chiasserini and R. Rao, “Pulsed battery discharge in communication devices,”

in MOBICOM (1999), 1999, pp. 88–95.

[19] J. Eager, “Advances in Rechargeable Batteries Spark Product Innovation,” in

Proceedings of the 1992 Silicon Valley Computer Conference, Santa Clara, Aug.

1992, pp. 243–253.

[20] D. Stepner, N. Rajan, and D. Hui, “Embedded Application Design Using a Real-

Time OS,” in Proceedings of DAC 1999, New Orleans, 1999, pp. 151–156.

[21] P. S. R. Diniz, Adaptive Filtering Algorithms and Practical Implementation.

Kluwer Academic, 1997.

[22] A. Hoffmann and H. Meyr, Architecture Exploration for Embedded Processors

with LISA. Kluwer Academic Publishers, 2002.

[23] “Advanced RISC Machines Architectural Reference Manual,” Prentice Hall, vol.

Advanced RISC Machines Ltd., 1996.

[24] V. Tiwari, S. Malik, and T. L. A. Wolfe, “Instruction Level Power Analysis and Optimization of Software,” in Journal of VLSI Signal Processing, vol. 13(2), August 1996.

[25] N. Z. Azeemi and M. Rupp, “Energy-Aware Source-to-Source Transformations for a VLIW DSP Processor,” in Proceedings of the IEEE 17th International Conference on Microelectronics, Islamabad, Pakistan, Dec. 2005, pp. 133–138.

[26] ——, “Multicriteria Low Energy Source Level Optimization of Embedded Pro-

grams,” in Proceedings of Tagungsband zur Informationstagung Mikroelektronik

06 IEEE Austria, Vienna, Austria, October, 2006, pp. 150–158.

[27] N. Z. Azeemi, “Multicriteria Energy Efficient Source Code Compilation for Dependable Embedded Applications,” in Proceedings of the IEEE International Conference on Information Technology IIT 2006, Dubai, UAE, Nov. 2006.


[28] N. Z. Azeemi, A. Sultan, and A. Muhammad, “Parameterized Characterization of Bioinformatics Workload on SIMD Architecture,” in Proceedings of the IEEE International Conference on Information and Automation 2006, Colombo, Sri Lanka, Dec. 2006, pp. 189–194.

[29] N. Z. Azeemi and A. Sultan, “Characterization of Bioinformatics Applications on Multimedia Processor,” in Proceedings of the IEEE Cairo International Biomedical Engineering Conference 2006, Cairo, Egypt, Dec. 2006.

[30] N. Z. Azeemi, “Compiler Directed Battery-Aware Implementation of Mobile Applications,” in Proceedings of the IEEE 2nd International Conference on Emerging Technologies 2006, Peshawar, Pakistan, Nov. 2006, pp. 151–156.

[31] V. Tiwari, D. Singh, S. Rajgopal, G. Mehta, R. Patel, and F. Baez, “Reducing

Power in High-performance Microprocessors,” in Proceedings of the 35th Design

Automation Conference, San Francisco, CA USA, Jun. 1998.

[32] “JouleTrack - A Web Based Tool for Software Energy Profiling,”

http://dry-martini.mit.edu/JouleTrack/.

[33] A. Sinha and A. Chandrakasan, “Energy Aware Software,” in Proceedings of the

XIII International Conference on VLSI Design, Calcutta, Jan. 2000.

[34] ——, “JouleTrack - A Web Based Tool for Software Energy Profiling,” in Proceedings of the 38th Design Automation Conference, Las Vegas, Jun. 2001.

[35] ——, “Operating System and Algorithmic Techniques for Energy Scalable Wire-

less Sensor Networks,” in Proceedings of the Second International Conference on

Mobile Data Management, Hong-Kong, Jan. 2001.

[36] ——, “Energy Efficient Real-Time Scheduling,” in Proceedings of the Interna-

tional Conference on Computer Aided Design (ICCAD), San Jose, Nov. 2001.

[37] U. Thoeni, Programming real-time multicomputers for signal processing.

Prentice-Hall, 1994.

[38] R. B. Lee, “Subword Parallelism with MAX-2,” in IEEE Micro, ser. 4, vol. 16,

Aug. 1996, pp. 51–59.

[39] “The Intel XScale Microarchitecture,”

http://developer.intel.com/design/intelxscale/.

[40] “eCos Users Guide,”

http://sources.redhat.com/ecos/docs-latest/pdf/user-guides.pdf.


[41] M. Mehendale, A. Sinha, and S. D. Sherlekar, “Low Power Realization of FIR

Filters Implemented Using Distributed Arithmetic,” in Proceedings of Asia South

Pacific Design Automation Conference, Yokohama, Japan, Feb. 1998.

[42] A. Sinha, A. Wang, and A. Chandrakasan, “Algorithmic Transforms for Efficient

Energy Scalable Computation,” in Proceedings of the 2000 IEEE International

Symposium on Low-Power Electronic Design (ISLPED 00), Italy, Aug. 2000.

[43] W. Fornaciari, P. Gubian, D. Sciuto, and C. Silvano, “Power Estimation of Em-

bedded Systems: A Hardware/Software Codesign Approach,” IEEE Transactions

on Very Large Scale Integration (VLSI) Systems, vol. 6, pp. 266–275.

[44] N. Z. Azeemi, “A Framework for Architecture Based Energy-Aware Code Transformations in VLIW Processors,” in Proceedings of the IEEE International Symposium on Telecommunications 2005, Shiraz, Iran, 2005, pp. 393–398.

[45] N. Ahmed, T. Natarajan, and K. R. Rao, “Discrete Cosine Transform,” in IEEE Transactions on Computers, vol. 23, Jan. 1974, pp. 90–93.

[46] W. Chen, C. H. Smith, and S. C. Fralick, “A Fast Computational Algorithm for the Discrete Cosine Transform,” in IEEE Transactions on Communications, vol. 25, Sep. 1977, pp. 1004–1009.

[47] L. McMillan and L. A. Westover, “A Forward-Mapping Realization of the Inverse

Discrete Cosine Transform,” in Proceedings of the Data Compression Conference

(DCC 92), Mar. 1992, pp. 219–228.

[48] A. Chandrakasan, S. Sheng, and R. W. Brodersen, “Low-Power CMOS Design,” in IEEE Journal of Solid State Circuits, Apr. 1992, pp. 472–484.

[49] A. Chandrakasan and R. Brodersen, “Low Power CMOS Design,” IEEE Press,

1998.

[50] P. J. M. Havinga and G. J. M. Smit, “Octopus embracing the energy efficiency of

handheld multimedia computers,” in Proceedings of MOBICOM, 1999, pp. 77–87.

[51] M. B. Srivastava, A. P. Chandrakasan, and R. W. Brodersen, “Predictive System Shutdown and Other Architectural Techniques for Energy Efficient Programmable Computation,” in Proceedings of IEEE Transactions on VLSI Systems, ser. 1, vol. 4, Mar. 1996, pp. 42–54.

[52] H. Zhang and J. Rabaey, “Low-Swing Interconnect Interface Circuits,” in Pro-

ceedings of the International Symposium on Low Power Electronics and Design

1998, 1998, pp. 161–166.

[53] W. Athas et al., “Low Power Digital Systems Based on Adiabatic Switching Principles,” in IEEE Transactions on VLSI Systems, ser. 4, vol. 2, Dec. 1994.


[54] S. H. Chow, Y. Ho, and T. Hwang, “Low power realization of finite state machines: a decomposition approach,” in ACM Transactions on Design Automation of Electronic Systems, Jul. 1996, pp. 315–340.

[55] T. Burd et al., “A Dynamic Voltage Scaled Microprocessor System,” in Proceedings of International Solid State Circuits Conference 2000, 2000, pp. 294–295.

[56] K. Govil, E. Chan, and H. Wasserman, “Comparing Algorithms for Dynamic Speed Setting of a Low-Power CPU,” in Proceedings of the ACM International Conference on Mobile Computing and Networking, 1995, pp. 13–25.

[57] K. Govil, E. Chan, and H. Wasserman, “Comparing algorithms for dynamic speed-setting of a low-power CPU,” in Proceedings of MOBICOM, 1995, pp. 13–25.

[58] R. Min, T. Furrer, and A. P. Chandrakasan, “Dynamic Voltage Scaling Techniques

for Distributed Microsensor Networks,” in Proceedings of the IEEE Computer

Society-Workshop on VLSI (WVLSI 00), Apr. 2000.

[59] “Intel StrongARM SA-1100 Microprocessor Developer’s Manual,”

http://developer.intel.com/design/strong/manuals/278088.htm.

[60] “AMD K6 PowerNOW,”

http://www.amd.com/products/cpg/mobile/powernow.html.

[61] “AMPS Operating System and Software,” http://gatekeeper.mit.edu.

[62] D. J. Kolson, A. Nicolau, and N. Dutt, “Optimal register assignment to loops

for embedded code generation,” in ACM Transactions on Design Automation of

Electronic Systems, vol. 1(2), Apr. 1996, pp. 251–279.

[63] “eCos Reference Manual,”

http://sources.redhat.com/ecos/docs-latest/pdf/ecos-ref.pdf.

[64] “The µITRON API,”

http://sources.redhat.com/ecos/docs-latest/ref/ecos-ref.a.html.

[65] “The EL/IX Homepage,” http://sources.redhat.com/elix/.

[66] “eCos Downloading and Installation,”

http://sources.redhat.com/ecos/getstart.html.

[67] M. O. Tokhi, Parallel Computing for Real-time Signal Processing and Control.

Springer, 2003.


[68] T. V. K. Gupta, R. E. Ko, and R. Baruna, “Compiler-directed Customization

of ASIP Cores,” in International Symposium on Hardware/Software Co-Design,

2002, pp. 97–102.

[69] W. Horn, “Some Simple Scheduling Algorithms,” Naval Research Logistics Quarterly, vol. 21, 1974.

[70] K. Ramamritham and J. A. Stankovic, “Dynamic Task Scheduling in Distributed

Hard Real-Time Systems,” in Proceedings of IEEE Software, ser. 3, vol. 1, Jul.

1984.

[71] F. Yao, A. Demers, and S. Shenker, “A Scheduling Model for Reduced CPU

Energy,” in Proceedings of IEEE Annual Foundations of Computer Science, 1995,

pp. 374–382.

[72] T. Kondo, M. Inoue, and K. Nakai, “Application of autonomous decentralized system to the steel production computer control,” in 3rd International Workshop on Future Trends of Distributed Computing Systems, 1992, pp. 419–423.

[73] G. Buttazzo, Hard Real-Time Computing Systems - Predictable Scheduling Algo-

rithms and Applications. Kluwer Academic Publishers, 1997.

[74] “ARM Software Development Toolkit Version 2.11: User Guide,” Advanced RISC Machines Ltd., May 1997.

[75] S. H. Nawab et al., “Approximate Signal Processing,” in Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, ser. 1/2, vol. 15, Jan. 1997, pp. 177–200.

[76] “Microsoft Windows CE,” http://www.microsoft.com/windows/embedded/ce/.

[77] “The Palm OS Platform,” http://www.palmos.com/platform/architecture.html.

[78] A. S. Tanenbaum, Modern Operating Systems. Prentice Hall, Feb. 2001.

[79] “VIS Speeds New Media Processing,” in IEEE Micro, ser. 4, vol. 16, Aug. 1996, pp. 10–20.

[80] M. D. Jennings and T. M. Conte, “Subword Extensions for Video Processing on

Mobile Systems,” in IEEE Concurrency, July-Sept. 1998, pp. 13–16.

[81] “Intel Pentium III SIMD Extensions,”

http://developer.intel.com/vtune/cbts/simd.htm.

[82] “Solution Engine,”

http://semiconductor.hitachi.com/tools/solution-engine.html.


[83] M. P. D. Sheet, “Dynamically Adjustable, Synchronous Step-Down Controller for

Notebook CPUs,” http://pdfserv.maxim-ic.com/arpdf/MAX1717.pdf.

[84] A. C. W. Heinzelman and H. Balakrishnan, “Energy Efficient Routing Protocols for

Wireless Microsensor Networks,” in Proceedings of the 33rd Hawaii International

Conference on System Sciences (HICSS 00), Jan. 2000.

[85] Q. Qiu and M. Pedram, “Dynamic Power Management Based on Continuous-Time Markov Decision Processes,” in Proceedings of the Design Automation Conference (DAC 99), New Orleans, 1999, pp. 555–561.

[86] G. Wei and M. Horowitz, “A Low Power Switching Power Supply for Self-Clocked

Systems,” in Proceedings of International Symposium on Low Power Electronics

and Design, 1996, pp. 313–318.

[87] C. L. Liu and J. W. Layland, “Scheduling Algorithms for Multiprogramming in a Hard Real-Time Environment,” in Journal of the ACM, ser. 1, vol. 20, 1973, pp. 46–61.

[88] M. Satyanarayanan and D. Narayanan, “Multi-fidelity algorithms for interactive mobile applications,” in 3rd International Workshop on Discrete Algorithms and Methods for Mobile Computing and Communications (DIAL M99), 1999, pp. 1–6.

[89] A. Salkintzis, C. Chamzas, and C. Koukourlis, “An energy saving protocol for mo-

bile data networks,” in International Conference on Advances in Communications

and Control (COMCON 5), Jun. 1995, pp. 26–30.

[90] R. Kravets and P. Krishnan, “Power management techniques for mobile commu-

nication,” in Proceedings of MOBICOM (1998), 1998, pp. 157–168.

[91] I. Chlamtac, C. Petrioli, and J. Redi, “Energy-conserving access protocols for

identification networks,” in IEEE/ACM Transactions on Networking, vol. 7(1),

Feb. 1999, pp. 51–59.

[92] S. Singh, M. Woo, and C. Raghavendra, “Power-aware routing in mobile ad hoc networks,” in Proceedings of MOBICOM (1998), pp. 181–190.

[93] Y. Bai and C. Lai, “A bitmap scaling and rotation design for SH1 low power

CPU,” in 2nd International Workshop on Modeling, Analysis and Simulation of

Wireless and Mobile Systems (1999), 1999, pp. 101–106.

[94] S. Codecs, “G.71x and G.72x,” http://www.compression-links.info/G.711-G.72x.

[95] “MPEG Pointers and Resources,” http://www.mpeg.org/.


[96] T. Xanthopoulos and A. Chandrakasan, “A Low-Power DCT Core Using Adaptive

Bit-width and Arithmetic Activity Exploiting Signal Correlations and Quantiza-

tions,” in Proceedings of the Symposium on VLSI Circuits, Jun. 1999.

[97] JHMI, “Genome Sequencing,” www.bis.med.jhmi.edu.

[98] “The NCBI Bacteria genomes database,”

ftp://ftp.ncbi.nih.gov/genomes/Bacteria/.

[99] NCBI, “Genome Sequencing,” http://www.ncbi.nlm.nih.gov/.

[100] GDB, “Genome Sequencing,” www.gdb.org.

[101] SEQ, “Genome Sequencing,” www.sequenceanalyses.org.

[102] oupjournal, “Genome Databases,” www.nar.oupjournal.org.

[103] “AMD-3DNOW! Technology Manual-Instruction Set Architecture Specification,”

http://www.amd.com/K6/k6docs/.

[104] “TMS320C54x DSP Function Library,”

http://www.ti.com/sc/docs/products/dsp/c5000/.

[105] N. Z. Azeemi, “Power Aware Framework for Dense Matrix Operations in Multimedia Processors,” in Proceedings of the IEEE International Multitopic Conference 2005, Karachi, Pakistan, Dec. 2005, pp. 157–168.

[106] T. Baeck, Evolutionary Algorithms in Theory and Practice. Oxford University

Press, 1996.

[107] N. Z. Azeemi, “A Multiobjective Evolutionary Approach for Constrained Joint Source Code Optimization,” in Proceedings of the ISCA 19th International Conference on Computer Application in Industry, Las Vegas, USA, Nov. 2006, pp. 175–180.

[108] ——, “Handling Architecture-Application Dynamic Behavior in Set-top Box Applications,” in Proceedings of the IEEE International Conference on Information and Automation 2006, Colombo, Sri Lanka, Dec. 2006, pp. 195–200.

[109] “Architecture-Aware Hierarchical Probabilistic Source Optimization,” in Proceedings of the ISCA 19th International Conference on Parallel and Distributed Computing Systems, San Francisco, USA, Sep. 2006.

[110] K. Hwang, Advanced Computer Architecture. McGraw-Hill, 2001.

[111] T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama, “Data Mining with

optimized two-dimensional association rules,” in ACM Transactions on Database

Systems (TODS), ser. 2, vol. 26, June 2001.


[112] A. A. Nanavati, K. P. Chitrapura, S. Joshi, and R. Krishnapuram, “Association

Rule Mining: Mining generalised disjunctive association rules,” in Proceedings of

the tenth international conference on Information and knowledge management

CIKM ’01, Oct. 2001.

[113] P.-N. Tan, “Selecting the Right Interestingness Measure for Association Patterns,” in ACM SIGKDD 02, Alberta, Canada.

[114] C. Brandolese, W. Fornaciari, F. Salice, and D. Sciuto, “Source-Level Execution Time Estimation of C Programs,” in International Symposium on Hardware/Software Co-Design, 2001, pp. 98–104.

[115] P. Puschner and C. Koza, “Calculating the maximum execution time of real-time

programs,” Journal of Real-Time Systems, vol. 1, no. 2, pp. 159–176, September

1989.

[116] “Software Testing,” http://hissa.ncsl.nist.gov/swassurance/strtest.html.

[117] “The GNU Project,” http://www.gnu.org/.

[118] “Intel StrongARM SA-1110 Linecard,”

http://developer.intel.com/design/strong/linecard/sa-1110.
