A Light Weighted Semi-Automatically I/O-Tuning Solution for Engineering Applications


Institut für Höchstleistungsrechnen

Xuan Wang

A LIGHT WEIGHTED SEMI-AUTOMATICALLY I/O-TUNING SOLUTION FOR ENGINEERING APPLICATIONS

FORSCHUNGS- UND ENTWICKLUNGSBERICHTE

ISSN 0941 - 4665 Dezember 2017 HLRS-18


A LIGHT WEIGHTED SEMI-AUTOMATICALLY I/O-TUNING SOLUTION FOR ENGINEERING APPLICATIONS

Höchstleistungsrechenzentrum Universität Stuttgart
Prof. Dr.-Ing. Dr. h.c. Dr. h.c. Prof. E.h. Michael M. Resch
Nobelstrasse 19 - 70569 Stuttgart
Institut für Höchstleistungsrechnen

A dissertation accepted by the Faculty of Energy-, Process- and Bio-Engineering of the Universität Stuttgart in fulfillment of the requirements for the degree of Doktor-Ingenieur (Dr.-Ing.)

submitted by

Xuan Wang
from Yunnan, China

Main referee: Prof. Dr.-Ing. Dr. h.c. Dr. h.c. Prof. E.h. Michael M. Resch
Co-referee: Prof. Dr. Edgar Gabriel
Date of submission: 19 June 2017
Date of oral examination: 18 December 2017
CR classification: I.3.2, I.6.6

ISSN 0941 - 4665 Dezember 2017 HLRS-18

D93



Acknowledgements

This work could not have been accomplished without the support of my family, colleagues and friends.

First of all, I would like to thank my supervisor, Prof. Dr. Michael Resch at the High Performance Computing Center Stuttgart (HLRS), for all his guidance and instructive comments, especially at the beginning of my work. I am grateful to my advisor Dr. Thomas Bönisch for his patience and constructive guidance during my PhD study. Besides the plenty of time and energy that he devoted to my dissertation, he also gave me a lot of invaluable advice and encouragement.

I gratefully acknowledge the funding provided by the Federal Ministry of Education and Research (BMBF) for the project Scalable I/O for Extreme Performance and the funding provided by the EU's Horizon 2020 Research and Innovation Program for the project Partnership for Advanced Computing in Europe.

I appreciate the help and support of all colleagues at HLRS. My special gratitude goes to Manuela Wossough for her help with the correction of my dissertation and for the valuable feedback. I am also thankful to Björn Schembera and Florian Seybold for their helpful advice and inspiring discussions. I am also grateful to Dr. Qiaoyan Ye and Bo Shen from the Fraunhofer Institute for Manufacturing Engineering and Automation (IPA) for providing their engineering use cases to present my research.

Last but absolutely not least, I want to express my sincere thanks to my family. I want to thank my parents for their continuous support and encouragement. Especially, I want to thank my wife Luyi Chen for her patience, support and help in keeping my life in balance.



Zusammenfassung

Today's engineering applications running on supercomputing platforms generate ever more diverse data and require large storage systems as well as extremely high data transfer rates to store their data. To achieve high data transfer rates (referred to in the following as I/O performance), computer scientists together with supercomputer manufacturers have developed many innovative solutions. However, transferring this knowledge and these solutions to engineers and scientists constitutes one of the largest barriers. Since engineers and scientists are mostly specialists only in their own fields, they are often not able to optimize the I/O performance of their applications. Moreover, I/O optimization is not the scientists' priority, which can lead to degraded I/O performance. Although computing centers such as HLRS offer various training courses to convey the required computer science knowledge and the available optimization options, their effect is unfortunately very limited. To overcome this barrier, a semi-automatic I/O-tuning solution (SAIO) for engineering applications was developed within this work.

SAIO, a lightweight and intelligent framework, is designed to be compatible with as many engineering applications as possible, scalable to large engineering applications, easy to use for engineers and scientists with little knowledge of parallel I/O, and portable across multiple HPC platforms. SAIO is built on top of the MPI-IO library and is compatible with MPI-IO based high-level I/O libraries such as parallel HDF5 and parallel NetCDF, as well as with commercial and open source software such as Ansys Fluent, the WRF Model, and others. Furthermore, SAIO follows the current MPI standard, which makes it portable across many HPC platforms and scalable. Implemented as a dynamic library and loaded dynamically, SAIO requires neither recompilation nor changes to the application's source code. Engineers and scientists merely need to add a few export directives to their job submission scripts to run their jobs more efficiently. In addition, an automated SAIO training utility keeps the optimal configurations up to date without any manual intervention by the users.

Evaluating SAIO with the widely used I/O benchmark IOR showed improvements of over 700% for MPI and HDF5 read operations and over 600% for MPI and HDF5 write operations. Moreover, SAIO's very small run-time instrumentation and finalization overheads meet the requirements for use in a production environment. Two computational fluid dynamics (CFD) applications were chosen as use cases. SAIO successfully accelerated one of their data processing steps by about 184%, saving 4,634 core hours of computing time for a single run; to achieve this, SAIO consumed only 83 additional core hours to find the optimal configurations. The other use case is not an I/O-intensive application and uses Ansys Fluent, a widely used CFD simulation package. Nevertheless, SAIO shortened its I/O operations by about 23.6% for independent HDF5 I/O and about 30.0% for collective HDF5 I/O. Besides these two use cases, SAIO was also successfully tested with the WRF Model.

SAIO's intuitive, JSON-like formatted log and configuration files are easy to understand. This enables engineers and scientists to analyze and accelerate parallel I/O operations themselves. The two engineering use cases not only demonstrate SAIO's optimization results, but also provide a guideline for using SAIO for I/O analysis and optimization in a production environment.



Abstract

Today's engineering applications running on high performance computing (HPC) platforms generate more and more diverse data simultaneously and require large storage systems as well as extremely high data transfer rates to store their data. To achieve a high data transfer rate (I/O performance), computer scientists together with HPC manufacturers have developed a lot of innovative solutions. However, transferring the knowledge of these solutions to engineers and scientists has become one of the largest barriers. Since engineers and scientists are experts in their own professional areas, they might not be capable of tuning their applications to the optimal level. Sometimes they might even degrade the I/O performance by mistake. The basic training courses provided by computing centers like HLRS are not sufficient to transfer the required know-how. In order to overcome this barrier, I have developed a semi-automatic I/O-tuning solution (SAIO) for engineering applications.

SAIO, a lightweight and intelligent framework, is designed to be compatible with as many engineering applications as possible, scalable with large engineering applications, usable for engineers and scientists with little knowledge of parallel I/O, and portable across multiple HPC platforms. Building upon the MPI-IO library allows SAIO to be compatible with MPI-IO based high-level I/O libraries, such as parallel HDF5 and parallel NetCDF, as well as with proprietary and open source software like Ansys Fluent, the WRF Model, etc. In addition, SAIO follows the current MPI standard, which makes it portable across many HPC platforms and scalable. SAIO, which is implemented as a dynamic library and loaded dynamically, does not require recompiling or changing the application's source code. By simply adding several export directives to their job submission scripts, engineers and scientists are able to run their jobs more efficiently. Furthermore, an automated SAIO training utility keeps the optimal configurations up to date, without any manual effort by the user.

Evaluating SAIO with the popular I/O benchmark IOR has shown improvements of over 700% for MPI and HDF5 read operations as well as over 600% for MPI and HDF5 write operations. Moreover, SAIO's extremely low run-time instrumentation and finalization overhead also fulfills the requirements for deploying it in a production environment. Two computational fluid dynamics (CFD) applications have been chosen as use cases. For the first application, SAIO successfully accelerated one of its data processing steps by about 184%, saving 4,634 core hours per run in exchange for 83 extra core hours spent finding the optimal configurations. The second one is not an I/O-heavy application and uses Ansys Fluent, a widely used proprietary CFD simulation package. Nevertheless, SAIO shortened the time consumed by its I/O requests by about 23.6% for independent HDF5 I/O and 30.0% for collective HDF5 I/O. Besides these two use cases, SAIO also went successfully through tests with the WRF Model.



SAIO's intuitive JSON-like formatted log and configuration files are easy to understand. With these files, engineers and scientists can analyze parallel I/O requests and accelerate their applications by themselves. The two engineering use cases have not only presented SAIO's optimization results, but also provided a guideline for applying SAIO for I/O analysis and optimization in a production environment.



Contents

Acknowledgements

Zusammenfassung

Abstract

1 Introduction and Motivation
   1.1 Introduction
       1.1.1 User Applications
       1.1.2 I/O Libraries
       1.1.3 Distributed Parallel File Systems
       1.1.4 Distributed Data Storage Systems
   1.2 Motivation
       1.2.1 Problem Description
       1.2.2 Existing Solutions
             I/O Auto-Tuning Solutions
             I/O Tracing Mechanisms
       1.2.3 A Light Weighted Approach
   1.3 Organization of the Dissertation

2 State of the Art
   2.1 Distributed Parallel File Systems
       2.1.1 Lustre
       2.1.2 IBM Spectrum Scale - GPFS
       2.1.3 Hadoop Distributed File System - HDFS
       2.1.4 Summary
   2.2 Parallel I/O Algorithms
       2.2.1 Parallel I/O Types
       2.2.2 Data Sieving
       2.2.3 Two-Phase I/O
   2.3 Message Passing Interface (MPI)
       2.3.1 MPI Standard
             MPI-IO
             MPI File Hints/Info
             MPI File View
       2.3.2 MPI Implementations
             MPICH
             Open MPI
       2.3.3 MPI-IO Libraries
             ROMIO
             OMPIO
   2.4 High-Level Scientific Data Libraries
       2.4.1 Hierarchical Data Format (HDF)
       2.4.2 Network Common Data Form (NetCDF)
       2.4.3 Adaptable I/O System (ADIOS)

3 Semi-Automatically I/O-Tuning Framework (SAIO)
   3.1 SAIO Design Requirements
       3.1.1 Following MPI Standard
       3.1.2 Running Transparently
       3.1.3 Producing Little Overhead
       3.1.4 Optimizing Automatically
   3.2 SAIO Software Stack
   3.3 SAIO Architecture
       3.3.1 SAIO Running Modes
       3.3.2 Core Module: I/O Tracer & Optimizer
             I/O Tracer
             I/O Optimizer
       3.3.3 Learning Module
       3.3.4 Machine Learning
             Training Phase
             Learning Phase
   3.4 SAIO Implementation
       3.4.1 Introduction
       3.4.2 Influence Factors of I/O Performance
             Number of MPI Processes for I/O Operations
             Data Transfer Size
             MPI-IO Subroutine
             MPI info
       3.4.3 Definition of SAIO Files
             SAIO Log File
             SAIO Configuration File
             SAIO Configuration Index File
       3.4.4 MPI and PMPI Wrapper
       3.4.5 I/O Tracing and Optimizing
             I/O Tracing
             I/O Optimizing
       3.4.6 SAIO Learning Module
       3.4.7 SAIO Training Utility
       3.4.8 SAIO Statistic Utility
       3.4.9 SAIO Software Compatibility
   3.5 How to Use SAIO

4 Evaluations
   4.1 Evaluation Setups
       4.1.1 System Specifications
       4.1.2 Software Configurations
       4.1.3 I/O Configurations' Searching Scope
   4.2 Evaluation Results
       4.2.1 SAIO - Training Process
       4.2.2 SAIO - Capability
             Accelerating MPI Applications
             Accelerating Untrained MPI Applications
             Accelerating HDF5 Applications
             Real-Time Accelerating MPI Applications
       4.2.3 SAIO - Overhead
             Process Overhead
             Run-Time Instrumentation Overhead
             Finalize Overhead
             Memory Overhead
       4.2.4 SAIO - Scalability
       4.2.5 SAIO - Portability
   4.3 Collective Buffering or not? Lessons Learned
   4.4 Conclusion of Evaluations

5 Engineering Use Cases
   5.1 Introduction
   5.2 Engineering Use Case - CFD: HDF5, Fortran
       5.2.1 Analyzing Application
       5.2.2 Applying SAIO Training Utility
       5.2.3 Optimization and Results
       5.2.4 Conclusion
   5.3 Engineering Use Case - CFD: ANSYS Fluent
       5.3.1 Analyzing Application
       5.3.2 Applying SAIO Training Utility
       5.3.3 Optimization and Results
       5.3.4 Conclusion

6 Conclusion and Future Work
   6.1 Conclusion
   6.2 Future Work

A Code Segments

B Used SAIO Files

Bibliography



List of Figures

1.1 Typical I/O Stack of an HPC System
1.2 Computing Hours Usage Ratios of Different Professional Areas at HLRS in 2016
1.3 I/O Simulation Results by Applying Different MPI Hints for Writing 8 KB and 4,000,000 B Data Transfer Sizes
1.4 I/O Simulation Results by Applying Different MPI Hints for Reading and Writing 80 MB Data Transfer Size
2.1 A Simplified Illustration of Lustre File System Components
2.2 Lustre File System Striping Mechanism
2.3 A Simplified Illustration of GPFS Components
2.4 Hadoop Distributed File System Components and Read/Write Process Examples
2.5 I/O Request with and without Data Sieving
2.6 An Example of Reading Process without Two-Phase I/O
2.7 An Example of Reading Process with Two-Phase I/O
2.8 An Example of MPI File View
2.9 The Abstracted ROMIO Architecture
2.10 The Abstracted Architecture of OMPIO Frameworks and Components
2.11 Parallel HDF5 Application I/O Stack
2.12 Parallel NetCDF Application I/O Stack
2.13 Parallel ADIOS Application I/O Stack
3.1 SAIO Abstract Software Stack
3.2 SAIO Architecture
3.3 SAIO Tracing Process
3.4 SAIO Optimizing Process
3.5 SAIO Learning Process
3.6 SAIO Training Process - Two Pools in Red Font are Variables for SAIO Training and Learning Processes
3.7 SAIO MPI and PMPI Wrapper for User Applications
3.8 SAIO MPI Wrapper for MPI_Init() Flow Chart
3.9 SAIO Tracing Process Flow Chart
3.10 SAIO Recording Operation Details Flow Chart - Two Processes with Red Font will be Presented in Figure 3.13
3.11 SAIO Optimizing Flow Chart
3.12 SAIO Getting Configuration Flow Chart
3.13 SAIO Real-Time Optimization Based on Predefined Frequency Flow Chart
3.14 Workflow of SAIO Learning Module - the Two Processes with Red Font will be Presented in Figure 3.15
3.15 SAIO Creating a Default Optimal Configuration File (0.conf) and a Configuration Index File (index.conf)
3.16 SAIO Training Utility Flow Chart - The Process with Red Font was Presented in Figure 3.14
4.1 I/O Performance Impact of Different Number of OSTs
4.2 Performance Impact when Two I/O Benchmarks Run Simultaneously
4.3 Default Setups with 1 OST vs. SAIO Optimization
4.4 Default Setups with 4 OST vs. SAIO Optimizing for MPI Write Benchmarks
4.5 Default Setups with 4 OST vs. SAIO Optimizing for MPI Read Benchmarks
4.6 Using SAIO Default Configuration 0.conf vs. Using Configuration Index File to Assign Predefined Configurations to MPI Write Benchmarks
4.7 Default Setups with 4 OST vs. SAIO Optimizing for HDF5 Write Benchmarks
4.8 Default Setups with 4 OST vs. SAIO Optimizing for HDF5 Read Benchmarks
4.9 Evaluation Results of SAIO Real-Time Optimization
4.10 Overhead Test of Different SAIO Modes (SIZE: size only; OPTON: optimizing only; TRON: tracing only; OPTTR: optimizing and tracing) as well as Darshan for MPI-IO - Setting the Same MPI info Objects for All Test Cases
4.11 Overhead Test of Different SAIO Modes (SIZE: size only; OPTHDF5: optimizing coll/optimizing hdf5) as well as Darshan for Parallel HDF5 - Setting the Same MPI info Objects for All Test Cases
4.12 Overhead Test of Different SAIO Modes as well as Darshan when Accessing the SAIO Configuration Index File
4.13 SAIO Finalize Overhead of Tracing Only and Optimizing Only Modes (Each Reads and Writes Once)
4.14 SAIO Finalize Overhead of Tracing Only and Optimizing Only Modes (Each Reads and Writes 500 Times)
4.15 SAIO Finalize Overhead of Tracing Only Mode for Multiple Reading and Writing Operations on 24 MPI Processes
4.16 Reading and Writing 8 KB per Process Using Different Setups of Collective Buffering
4.17 Reading and Writing Data with Different Setups of Collective Buffering on Different Number of Processes
4.18 Reading and Writing 32 MB per Process Using Different Setups of Collective Buffering (Small Jobs)
4.19 Reading and Writing 32 MB per Process Using Different Setups of Collective Buffering (Big Jobs)
5.1 Optimizing Results of Running APE4sources Process Once with Different Configurations on 1200 Processes
5.2 Writing Files with File-per-Process Pattern on 7200 Processes (Stripe Size is 4 MB)
5.3 Optimization Results of Running Part of Production Process on 240 Processes
5.4 Optimizing Results of Running Part of Production Process lactec_1v64 on 1200 Processes (Only Read & Write)
5.5 Estimated Results of Running Entire Production Process lactec_1v64 on 1200 Processes (Only Read & Write)



List of Tables

1.1 A Small List of Popular Distributed Parallel File Systems
2.1 Examples of GPFS File Access Hints and Directives
2.2 A Part of ROMIO Supported MPI Hints
3.1 Seven SAIO Key Components
3.2 SAIO Software Compatibility
4.1 Technical Details of Hazel Hen and Lustre File System
4.2 Configurations and Further Information about Two Simultaneously Running I/O Benchmarks
4.3 Configurations' Searching Scope for Training Process
4.4 Resources Consumed of the SAIO Training Process
4.5 Found Optimal Configurations for Reading and Writing 40,000,000 Bytes (Data Transfer Size) per Process
4.6 Generated Configuration Index File after Training Process
4.7 A Ranked Consumption Statistic of Different Applications (≤ 2400 PEs) at HLRS in 2016
4.8 Overhead Results on 1 MPI Process: Other Overhead includes the MPI_INIT and MPI_FINALIZE instrumentation overhead (initializing software, writing log files, finalizing software etc.)
4.9 Process Overhead on Multiple MPI Processes
4.10 Size of Log Files Generated by Darshan and SAIO
5.1 Tracing Results of APE4sources Production Process
5.2 Configurations' Searching Scope for Training Process APE4sources
5.3 Found Optimal Configurations after Training Process APE4sources
5.4 Optimizing Results of Running Part of Production Process on 240 Processes (Operations' Duration in Seconds)
5.5 Data Size Summary of Reading/Writing Different Data Formats for Process lactec_1v64
5.6 A List of Configurations' Searching Scope for Writing Operations of Process lactec_1v64
5.7 A List of Found Optimal Writing Configurations for Process lactec_1v64 (saio_file_type Definition in Listing A.3)
5.8 Optimizing Results of Running Part of Production Process lactec_1v64 on 1200 Processes (Duration in Seconds)
5.9 Estimated Optimizing Results of Running Production Process lactec_1v64 on 1200 Processes (Duration in Seconds)



Listings

3.1 Two Records of SAIO Log File (1200.saio) from Training the CFD Application's Process APE4sources in Section 5.2
3.2 Generated SAIO Configuration File (1200.conf) from Training the CFD Application's Process APE4sources in Section 5.2
3.3 SAIO Configuration Index File (index.conf) of Evaluations in Section 4.2.2
3.4 Code Segment for Recording Read Duration
4.1 Pseudo Code of Overhead Evaluation MPI Program
4.2 Code Segment of SAIO Finalize Overhead Evaluation
5.1 44 Different Data Transfer Sizes (Byte) of Reading Operations from Process lactec_1v64
5.2 18 Different Data Transfer Sizes (Byte) of Writing Operations from Process lactec_1v64
A.1 SAIO Data Structures
A.2 SAIO Error Codes
A.3 SAIO File Type Definitions
A.4 Code Segment for MPI_Init() Wrapper
A.5 Example of Shell Script for Using SAIO
B.1 SAIO Traced Data Transfer Size List of a WRF Online Tutorial Process
B.2 Data Transfer Size List of a WRF Online Tutorial Process for Training Utility
B.3 Configurations' Searching Scope for a WRF Online Tutorial Process
B.4 Generated SAIO Configuration File (for Writing) from Training the WRF Online Tutorial Process



List of Abbreviations

ADIO Abstract-Device Interface for I/O
ADIOS Adaptable I/O System
APE Acoustic Perturbation Equation
API Application Programming Interface
CAE Computer-Aided Engineering
CFD Computational Fluid Dynamics
CPU Central Processing Unit
CSV Comma-Separated Values
DSA Lenovo Distributed Storage Architecture
DSS Lenovo Distributed Storage Solution
ext extended file system
FAT File Allocation Table
FLOPS FLoating point OPerations per Second
GCC GNU Compiler Collection
GCS Gauss Centre for Supercomputing
GPFS General Parallel File System
GPU Graphics Processing Unit
GUI Graphical User Interface
IBM International Business Machines Corporation
HDF5 Hierarchical Data Format 5
HDFS Hadoop Distributed File System
HFS Hierarchical File System
HLRS Höchstleistungsrechenzentrum Stuttgart
HLRS High-Performance Computing Center Stuttgart
HPC High-Performance Computing
HPSS High Performance Storage System
I/O Input and Output
IEEE Institute of Electrical and Electronics Engineers
IOR Interleaved-Or-Random
JSON JavaScript Object Notation
LES Large-Eddy Simulation
LNet Lustre Network
MCA Modular Component Architecture
MDS MetaData Server
MDT MetaData Target
MPI Message Passing Interface
NetCDF Network Common Data Form
NSD Network Shared Disk
NTFS New Technology File System
NVMe Non-Volatile Memory express
OSS Object Storage Server
OST Object Storage Target
PBS Portable Batch System
PCIe Peripheral Component Interconnect express
PE Processing Element
POSIX IEEE Portable Operating System Interface for UniX
RAID Redundant Array of Inexpensive Disks
SAIO Semi-Automatically I/O-tuning framework
SAN Storage Area Network
SSD Solid-State Drives
TCP Transmission Control Protocol
WRF Weather Research and Forecasting
XML eXtensible Markup Language



Constants

Byte        1 B = 1 Byte
Kilobyte    1 KB = 2^10 Bytes
Megabyte    1 MB = 2^20 Bytes
Gigabyte    1 GB = 2^30 Bytes
Terabyte    1 TB = 2^40 Bytes
Petabyte    1 PB = 2^50 Bytes
Exabyte     1 EB = 2^60 Bytes
kiloFLOPS   kFLOPS = 10^3 FLOPS
megaFLOPS   MFLOPS = 10^6 FLOPS
gigaFLOPS   GFLOPS = 10^9 FLOPS
teraFLOPS   TFLOPS = 10^12 FLOPS
petaFLOPS   PFLOPS = 10^15 FLOPS
exaFLOPS    EFLOPS = 10^18 FLOPS
zettaFLOPS  ZFLOPS = 10^21 FLOPS
yottaFLOPS  YFLOPS = 10^24 FLOPS


Chapter 1

Introduction and Motivation

1.1 Introduction

A high-performance computing (HPC) system includes not only hundreds of thousands of powerful CPUs, many-core processors and GPU accelerators delivering hardly imaginable computing power, but also extremely high speed network technology connecting thousands of compute nodes. Massive data generated by user applications need to be stored in system-wide accessible file systems, which usually stand outside of the HPC systems. These externally accessible file systems are connected to the HPC systems via high performance networks such as InfiniBand[1], Fibre Channel, (10 Gigabit) Ethernet and so on. Since user applications read input data into the memory of the compute nodes, execute, generate output data, and then store them in the file systems, the file systems are not counted in rating the HPC system's performance. It is the input and output (I/O)[2] performance that impacts the efficiency of the entire HPC system. In addition, some applications write multiple checkpoint files at run time, so that they can continue to run in case of interrupts or application crashes. The blocking I/O requests that some of them use hold all processes until they finish reading/writing data. Therefore, understanding the I/O requests of user applications, the parallel I/O stack, and even the specification of the storage systems helps computing centers to increase their efficiency.

Figure 1.1 presents a typical I/O stack of an HPC system, through which an I/O request normally goes: user application –> high-level I/O library (optional) –> Message Passing Interface I/O (MPI-IO[3]) library (optional) –> Portable Operating System Interface (POSIX) I/O[4] –> parallel file system –> storage system. Within each layer of the I/O stack, computer scientists have developed various algorithms, I/O libraries, and file/data management software to approach the theoretical hardware bandwidth limit. Hardware manufacturers have invented high performance hardware, like Solid-State Drives (SSD), and technologies, such as Redundant Array of Inexpensive Disks (RAID) and Non-Volatile Memory express (NVMe), to break through the bandwidth limits.

1 http://en.wikipedia.org/wiki/Manycore_processor
2 http://www.nvidia.com/object/tesla-supercomputing-solutions.html
3 http://www.mellanox.com/pdf/whitepapers/IB_Intro_WP_190.pdf
4 http://mpi-forum.org/
5 http://standards.ieee.org/develop/wg/POSIX.html

[Figure: I/O stack layers from top to bottom - User Applications; High-Level I/O Libraries; MPI-IO Library; POSIX-I/O; Distributed Parallel File Systems; Distributed Data Storage Systems.]

FIGURE 1.1: Typical I/O Stack of an HPC System.

1.1.1 User Applications

Normally, user applications can be classified into three types: self-developed, proprietary and open source software.

In self-developed software, engineers can choose any I/O layer to access the parallel file systems according to their research requirements. Sometimes this type of software is developed on one HPC platform but is supposed to run on multiple HPC platforms. Engineers do not always have enough time and knowledge to optimize their codes to achieve maximal performance on different HPC platforms. Therefore, only the default setups are implemented in most cases, while the task of tuning I/O performance is left to system administrators.

Proprietary software is developed for different application fields like Computer-Aided Engineering (CAE). It packs the I/O module into executable files and usually provides both serial and parallel I/O functions. For example, Ansys Fluent uses the (parallel) Hierarchical Data Format version 5 (HDF5)[5][6][7] high-level I/O library, the MPI-IO library as well as POSIX I/O. Users can choose either one of them to read/write data. However, the software merely concentrates on solving the professional problems and pays little attention to I/O performance, hence there are very few instructions about these I/O modules.

Open source software, whose source code is free to download, is regarded as a substitute for proprietary software. For example, the Weather Research and Forecasting (WRF) Model is one of the most popular software packages for climate research. It supports the (parallel) Network Common Data Form (NetCDF[8][9][10][11]) and (parallel) HDF5 high-level I/O libraries, as well as other I/O libraries. Any programmer developing a high performance I/O module can contribute it to the WRF Model. As a result, climate scientists might have little motivation to tune its I/O performance themselves after getting their research results.

6 http://www.ansys.com/Products/Fluids/ANSYS-Fluent
7 http://www.wrf-model.org/index.php


In order to help scientists understand the importance of I/O performance and tune their applications, HLRS holds different training courses for different applications. Unfortunately, only general optimization information is provided, and the application users have to acquire deep know-how to understand the various code optimizations, which is not easy even for a computer scientist. Some user applications are supposed to run for decades, while the HPC system in a computing center changes every 3 to 5 years. It is a big challenge to adapt these applications to new HPC systems with modern technologies.

On the other hand, it is impossible for the computer scientists or the project advisors to cover all aspects of code optimization for all active projects at a computing center. For example, there are plenty of projects in different professional areas at HLRS, around 100 active federal projects plus many others from industry and academia. Due to the lack of a computer science background, the HLRS project advisors are not able to master the necessary knowledge of I/O optimization for all active projects. One solution is to hire an expert who is familiar with the user applications, the different I/O libraries, the distributed parallel file systems and the currently running HPC system. Moreover, this expert would have to cooperate with end users and work with over a hundred projects. It remains a big question whether this complicated task can be accomplished by one person or even one team, let alone the personnel and operating costs.

[Pie chart: CFD 64.02%, Physics 21.62%, Climate Research 4.26%, Chemistry 3.68%, Bioinformatics 0.95%, Electrical Engineering 0.75%, Computer Science 0.47%, Others 4.25%.]

FIGURE 1.2: Computing Hours Usage Ratios of Different Professional Areas at HLRS in 2016

8 http://www.hlrs.de/training/


To find out which professional areas have the largest potential to benefit from an I/O optimization, I investigated the computing time consumption in 2016 at HLRS (Figure 1.2[12]). Among all professional areas, Computational Fluid Dynamics (CFD) consumed over 60% of the annual computational capability at HLRS. Besides, climate research also caught my eye, since these applications generate massive data to simulate the dynamic change of the climate. These data are not reproducible and need to be archived in the High Performance Storage System (HPSS) soon.

1.1.2 I/O Libraries

In the I/O libraries, computer scientists have developed many I/O algorithms for various data types and underlying parallel file systems. As shown in Figure 1.1, the lowest layer is the POSIX standard specified by the Institute of Electrical and Electronics Engineers (IEEE) Computer Society. Both optimization and parallelism of POSIX-I/O are not easy to achieve owing to its complexity. The MPI-IO library, built upon POSIX-I/O, was hence introduced to specify parallel I/O under the MPI standard. Computer scientists have designed and implemented MPI-IO libraries integrating different I/O algorithms, such as data sieving[13] in ROMIO[14][15] and the automatic selection of proper I/O algorithms in OMPIO[16]. Built upon the MPI-IO library, there are a number of high-level I/O libraries, such as parallel HDF5[17], parallel NetCDF and ADIOS.

Although great progress has been made in I/O libraries, it remains a barrier to investigate how to choose a proper I/O algorithm and whether that algorithm is compatible with the underlying parallel file system. Despite the plenty of I/O tuning options offered by computer scientists, application users still need to make an effort to acquire the tuning skills and/or modify their application codes accordingly. I wondered whether it is possible to tune I/O operations between the MPI-IO and high-level I/O library layers transparently, so that such a solution could make use of the MPI-IO standard and improve applications that use high-level I/O libraries.
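As a minimal illustration of how a high-level library delegates to the MPI-IO layer, the following C sketch opens an HDF5 file collectively through the MPI-IO driver. It is only an illustration of the stack shown in Figure 1.1 and assumes a parallel HDF5 installation built against MPI; the file name is arbitrary and error handling is omitted. The MPI info object attached to the file access property list is the natural place where file system hints could be injected without touching the rest of the application.

    #include <mpi.h>
    #include <hdf5.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* An MPI info object: file system hints (e.g. Lustre striping)
           could be attached here without changing any HDF5 call. */
        MPI_Info info;
        MPI_Info_create(&info);

        /* File access property list that routes HDF5 I/O through MPI-IO. */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, info);

        /* Collective file creation; dataset reads/writes would follow here. */
        hid_t file = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        H5Fclose(file);
        H5Pclose(fapl);
        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }

Because every HDF5 call above eventually reaches MPI-IO, a layer sitting between the two can observe and adjust such requests without the application being aware of it.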

1.1.3 Distributed Parallel File Systems

File systems are developed to manage the user data stored in storage systems. Different operating systems support different file systems: Linux usually uses the extended file system (ext*) family for its local drives; macOS uses the Apple File System (APFS) to replace its default Hierarchical File System (HFS) Plus file system; Microsoft Windows uses the File Allocation Table (FAT) or the New Technology File System (NTFS). The so-called "distributed file systems" or "network file systems" are developed based on network protocols for multiple operating systems to access data concurrently. Distributed parallel file systems stripe data over multiple network-connected servers to achieve higher I/O performance. All network storage systems are connected by high performance network connections such as InfiniBand.

9 http://en.wikipedia.org/wiki/Computational_fluid_dynamics
10 http://www.hlrs.de/en/systems/hpss-data-management/
11 http://www.ieee.org/index.html
12 http://www.mcs.anl.gov/projects/romio/
13 http://trac.mcs.anl.gov/projects/parallel-netcdf
14 http://www.olcf.ornl.gov/center-projects/adios/

Universities, research institutions, IT companies and open source communities are all dedicated to designing and implementing distributed parallel file systems to manage data and the underlying storage systems efficiently. Table 1.1 lists a very small part of the most popular distributed parallel file systems on the market: Lustre, Spectrum Scale / GPFS[18], BeeGFS and HDFS[19]. Besides managing and controlling the user data, distributed parallel file systems also provide different optimization possibilities for I/O requests.

File Systems              Owner              Operating System
Lustre                    OpenSFS & EOFS     Linux
Spectrum Scale / GPFS     IBM                AIX / Linux / Windows
BeeGFS                    Fraunhofer ITWM    Linux
HDFS                      Apache             Cross-platform

TABLE 1.1: A Small List of Popular Distributed Parallel File Systems

Some computing centers use different types of distributed parallel file systems for their HPC systems. As long as end users apply for a certain HPC system, they are usually not concerned with the underlying file systems. Some basic knowledge of the underlying file system is given to end users through instructions or training courses. However, it is still difficult for them to accelerate their I/O requests or to identify, and thus avoid, unsuitable configurations.

System administrators can apply an optimal default setup for the underlying file system. Nevertheless, it is unfortunately a reluctant compromise for most applications. End users could still use other configurations that accidentally lead to slower I/O performance. For example, an end user might have learned that higher I/O performance can be achieved by striping files over as many Lustre Object Storage Targets (OSTs) as possible, which is correct in theory if and only if there is no other concurrently running I/O request. In practice, the job size and the data transfer size also have significant influences on I/O performance (Section 4.1.3). Such wrong configurations are impossible to detect unless the users take a deep look under the surface. Hence, this is another barrier to improving I/O performance.

15 http://www.lustre.org/
16 http://www-03.ibm.com/systems/storage/spectrum/scale/
17 http://www.beegfs.io/content/
18 http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
19 More details about Lustre file system OSTs are given in Section 2.1.1.


1.1.4 Distributed Data Storage Systems

Distributed data storage systems form the bottom layer of the I/O stack (Figure 1.1). They are transparent to the application users, closely tied to the hardware, and determine the theoretical limit of I/O performance. Although the distributed parallel file system layer (software) and the distributed data storage system layer (hardware) are independent of each other, hardware manufacturers and storage system vendors develop file-system-specific storage systems to approach the hardware limits.

HPC manufacturers develop storage systems based on file system technologies for better performance when they deploy the storage systems together with their own HPC systems. For example, NEC offers flexible storage infrastructures: GxFS (a GPFS-based storage appliance) and LxFS (a Lustre-based storage appliance). Cray provides its scale-out Lustre storage system, the Sonexion series, while IBM's different storage systems offer different solutions for different ranges of applications in both industrial and academic areas. The storage system vendors develop complete distributed data storage system solutions for popular distributed parallel file systems as well. Lenovo, for instance, releases two scalable software-defined Distributed Storage Solutions (DSS) for IBM Spectrum Scale (DSS-G) and SUSE Enterprise Storage (DSS-C), and two Distributed Storage Architectures (DSA) for Intel Lustre (DSA-L) and SUSE Enterprise Storage/Red Hat Ceph Storage (DSA-C).

1.2 Motivation

1.2.1 Problem Description

Every I/O optimization achievement made by computer scientists on any layer of the I/O stack could be published as a research paper or dissertation. However, it is almost impossible for scientists or engineers from other professional areas to understand and apply the proper technologies for their applications. Usually they only apply the default setups, which ensures the portability of their applications. Nevertheless, more and more users have started to monitor their resource consumption. Among the different optimization potentials, I/O is one of the parts inquired about most.

20 http://www.nec.com/
21 http://www.gxfs.info/gxfs_flyer.pdf
22 http://de.nec.com/de_DE/emea/products/hpc/lxfs_high_performance_storage/
23 http://www.cray.com/
24 http://www.cray.com/products/storage/sonexion
25 http://www-03.ibm.com/systems/storage/
26 http://www.lenovo.com/
27 http://www.suse.com/de-de/products/suse-enterprise-storage/
28 http://www.redhat.com/en/technologies/storage/ceph
29 http://insidehpc.com/2017/04/lenovo-hpc-strategy-update/


Some users managed to accelerate their applications. Others, in a bigger proportion, were either unlucky or exhausted the system resources.

In order to compare a successful optimization to an inefficient one, I used the Interleaved Or Random (IOR) benchmark (Section 4.1.2) to simulate the I/O requests for reading and writing three different data transfer sizes with 1200 CPU core processes on Hazel Hen (Cray XC40) and the Lustre file system at HLRS (Section 4.1.1). The I/O simulation ran with different MPI hints, changing the Lustre stripe_count and stripe_size and enabling/disabling collective buffering for MPI collective read and write. The four I/O simulations were configured as follows:

• using MPI collective I/O operations

• accessing a single shared file

• using the same data transfer size on each process in each scenario

• using MPI hints to control the Lustre striping setups

Considering the characteristics of Lustre file systems (Section 2.1.1), a rule of thumb is to stripe files over more OSTs and to set reasonable stripe sizes (1 MB - 4 MB). The results of each scenario come from 20 runs and are presented in Figures 1.3 and 1.4 as four box plot diagrams.
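For reference, such hints reach the MPI-IO layer through an MPI_Info object. The C sketch below is not taken from the IOR sources; the function name and the per-process file layout are illustrative only. It sets the hint keys named above with the values of Setup 1 of the first scenario and performs a collective write to a single shared file.

    #include <mpi.h>

    /* Sketch: open a shared file with explicit Lustre striping and collective
       buffering hints, then let every process write one contiguous block
       collectively. Error handling is omitted. */
    static void write_shared_file(const char *path, const void *buf, int count)
    {
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "striping_factor", "16");       /* Lustre stripe_count      */
        MPI_Info_set(info, "striping_unit", "4194304");    /* Lustre stripe_size, 4 MB */
        MPI_Info_set(info, "romio_cb_write", "automatic"); /* collective buffering     */

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, path,
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

        /* Each rank writes `count` bytes at its own offset in the shared file. */
        MPI_Offset offset = (MPI_Offset)rank * count;
        MPI_File_write_at_all(fh, offset, buf, count, MPI_BYTE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Info_free(&info);
    }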

[Box plots of write bandwidth in MB/s.
1st Scenario - 1200 Processes Collectively Write a Single-Shared-File (8 KB Data Transfer Size): Setup 1: striping_factor=16, striping_unit=4MB, romio_cb_write=automatic; Setup 2: striping_factor=4, striping_unit=1MB, romio_cb_write=automatic.
2nd Scenario - 1200 Processes Collectively Write a Single-Shared-File (4,000,000 B Data Transfer Size): Setup 1: striping_factor=16, striping_unit=4000000, romio_cb_write=automatic; Setup 2: striping_factor=16, striping_unit=4MB, romio_cb_write=automatic.]

FIGURE 1.3: I/O Simulation Results by Applying Different MPI Hints for Writing 8 KB and 4,000,000 B Data Transfer Sizes

In the 1st scenario of Figure 1.3, each process wrote 8 KB of data into a single shared file (8 KB × 1200 = 9,600 KB). For this small data transfer size, the writing performance of striping over 4 OSTs with a 1 MB stripe size was about 35% better than that of striping over 16 OSTs with a 4 MB stripe size. In the 2nd scenario, "Setup 1" was actually one of my mistakes when I started to optimize I/O performance on Lustre file systems. I tried to write a single shared file (4,000,000 B × 1200 ≈ 4.47 GB) using 1200 processes (4,000,000 bytes per process). After learning the architecture of Lustre file systems, I decided to set the value of striping_unit to 4,000,000 bytes, because each process would then have accessed only one OST. However, the I/O performance was unexpectedly poor. Afterwards I found out that Lustre automatically sets the value of striping_unit to 65,536 bytes (64 KB) if it is not divisible by 65,536. When I changed striping_unit to 4,194,304 bytes (65,536 B × 64 = 4 MB), the writing performance increased by about 400%.

30 http://github.com/LLNL/ior
31 http://www.hlrs.de/systems/cray-xc40-hazel-hen/
32 The value of the MPI hint striping_factor sets the Lustre stripe_count, while the value of the MPI hint striping_unit sets the Lustre stripe_size. Collective buffering can be enabled/disabled by setting the MPI hints romio_cb_read and romio_cb_write.
33 Box plots illustrate groups of numerical data through their quartiles in descriptive statistics. They provide the maximum, median and minimum results, as well as the upper and lower quartiles. For more details please refer to [20].
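The 64 KB divisibility pitfall described in the 2nd scenario can be checked before a stripe size is handed to the MPI-IO layer. The small helper below is my own illustration (not part of SAIO or ROMIO); it rounds a requested stripe size up to the next multiple of 65,536 bytes.

    #include <stdio.h>

    /* Round a requested stripe size up to the next multiple of 64 KB,
       since unaligned values fall back to a 65,536-byte stripe size. */
    static long align_stripe_size(long requested)
    {
        const long unit = 65536; /* 64 KB */
        return ((requested + unit - 1) / unit) * unit;
    }

    int main(void)
    {
        printf("%ld\n", align_stripe_size(4000000)); /* 4,000,000 B -> 4,063,232 B (62 x 64 KB) */
        printf("%ld\n", align_stripe_size(4194304)); /* 4 MB = 64 x 65,536 B is already aligned  */
        return 0;
    }

In the scenario above I simply chose 4 MB (4,194,304 bytes), which is both aligned and within the recommended 1 MB - 4 MB stripe size range.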

The 3rd and 4th scenarios simulated writing and reading a large file (80 MB × 1200 = 93.75 GB) collectively (Figure 1.4). Disabling collective buffering, which is normally not recommended, achieved improvements of about 225% for writing and about 550% for reading. Identifying when to disable and enable collective buffering in such situations is also a challenge.

[Box plots of bandwidth in MB/s.
3rd Scenario - 1200 Processes Collectively Write a Single-Shared-File (80 MB Data Transfer Size): Setup 1: striping_factor=16, striping_unit=4MB, romio_cb_write=automatic; Setup 2: striping_factor=16, striping_unit=4MB, romio_cb_write=disable.
4th Scenario - 1200 Processes Collectively Read a Single-Shared-File (80 MB Data Transfer Size): Setup 1: romio_cb_read=automatic; Setup 2: romio_cb_read=disable.]

FIGURE 1.4: I/O Simulation Results by Applying Different MPI Hints for Reading and Writing 80 MB Data Transfer Size

The problems illustrated in these four scenarios are just the tip of the iceberg. Uncovering and investigating the overall I/O path of user applications still needs the cooperation of computer scientists and application users, especially when the data generated by one process is the source data of other processes. While investigating and profiling parallel I/O requests on Lustre file systems, I found that files generated using optimal configurations usually led to better reading performance. That means optimizing writing operations can also potentially improve reading performance.

From Figures 1.3 and 1.4, I see a lot of potential to improve the efficiency of HPC systems by avoiding such unsuccessful tuning attempts. In order to maximize an HPC system's efficiency, the following three requirements should be considered:

• Improving the application I/O performance transparently

• Avoiding configurations that lead to poor I/O performance

• Offering system administrators suggestions for the system default setup


1.2.2 Existing Solutions

To solve the previously mentioned problems and to fulfill the three requirements, I did some research and tried to find a solution, which should include the following two functions:

• optimizing I/O requests: A lot of innovative research has been done by talented computer scientists. But unfortunately, some solutions cannot be used for general engineering applications and some cannot run with production processes because of their high overhead.

• providing system administrators with statistical information: Among the I/O profiling tools, none can tell system administrators the optimal system default setups without considerable effort spent investigating the tracing results.

I/O Auto-Tuning Solutions

Scalable I/O for extreme performance (SIOX)[21][22], developed by HLRS, ZIH34, UHH35, DKRZ36 and IBM37, monitors a running system in real time, uses a database to store I/O-related information and eventually optimizes future I/O operations. SIOX keeps an I/O tracing thread (the SIOX daemon) running on each compute node, queries the I/O database for suitable I/O access patterns, uses a machine learning mechanism to keep the I/O database updated and thereby achieves real-time parallel I/O optimization. The current system load on the compute node can be monitored and considered as one factor for selecting I/O access patterns. However, the MPI instrumentation produces an overhead of about 3.0 to 4.5 seconds[22], which is unfortunately too high for a production environment.

Pattern-driven parallel I/O tuning for HDF5 applications[23], developed by Behzad et al. at the University of Illinois at Urbana-Champaign, automatically optimizes the I/O performance of HDF5 applications across platforms and applications. The framework traces the high-level I/O accesses using Recorder[24] and then analyzes their patterns with H5Analyze[23]. Based on these patterns and on historical tuning parameters, the framework selects the best-performing configurations from an XML file at run-time (H5Tuner[25]). If no historical parameters are available, it initiates model-based training to acquire efficient configurations with a genetic algorithm[26] (H5Evolve[25]). Evaluating this solution raised several concerns. First, the framework is built upon the HDF5 I/O library and is only compatible with HDF5 applications. Second, although the search process has been reduced from 12 hours (via the genetic algorithm) to 2 hours (via empirical performance models)[27], it still consumes too many computing resources.

34 http://tu-dresden.de/zih
35 http://www.uni-hamburg.de/
36 http://www.dkrz.de/
37 http://www.ibm.com/de-de/


Finally, the overhead of invoking H5Tuner, which is essential to know for a production environment, is unfortunately not reported.

An auto-tuning I/O framework for the Cray XT5 system[28], designed by You et al., uses a mathematical model to describe parallel I/O activities and supports an I/O auto-tuning infrastructure for HPC systems. The entire system is transformed into and simulated by a mathematical model. Through the auto-tuning process, the optimal parameters, such as the values of the Lustre stripe_count and stripe_size, are applied to the real applications. This innovative framework requires users to be familiar with the characteristics of their application's I/O operations and with the I/O simulations. In addition, whether this mathematical model works on other HPC systems has not been tested.

Based on the description of an application's I/O requests and the system configuration, Chen et al. have developed an optimization engine with a parallel I/O library for multidimensional arrays in the Panda project[29]. The optimization engine uses a rule-based algorithm as well as a randomized search-based algorithm to select optimal parameter settings for I/O requests. However, engineering applications read and write not only multidimensional arrays but other data formats as well, so a more universal solution would be very useful.

I/O Tracing Mechanisms

Darshan38, a parallel I/O characterization tool, is designed to capture an accurate picture of I/O behavior, including properties such as patterns of access within files, with the minimum possible overhead[30][31][32]. It instruments POSIX, MPI-IO, parallel NetCDF and HDF5 functions and collects the I/O characterization from the various layers of the I/O stack[33]. Since Darshan version 3.1.039, a new mmap-based logging mechanism has been integrated. Together with the new darshan-merge utility40, Darshan ensures that the tracing results are still available even if the application crashes or runs out of resources (e.g. wall time). To minimize its overhead, Darshan provides a post-processing utility to generate useful reports. System administrators need to implement or select proper I/O simulations, record the I/O configurations and learn to interpret the Darshan reports in order to find the optimal configurations for the running system. System administrators could expect a more intuitive solution with less irrelevant tracing information.

Behzad et al. have proposed an approach for automatically generating I/O kernels of HPC applications from tracing results[34].

38 http://www.mcs.anl.gov/research/projects/darshan/
39 http://www.mcs.anl.gov/research/projects/darshan/2016/09/30/new-darshan-3-1-0-release-now-available/
40 http://www.mcs.anl.gov/research/projects/darshan/docs/darshan3-util.html


This framework is built upon the HDF5 I/O library and consists of three components: Recorder[24], Trace Merger and Code Generator. The first stage traces the details of the I/O operations of each MPI process and generates n log files, where n is the number of MPI processes used. In the second stage, a merging algorithm parses all n log files and creates a single trace file as the foundation for the next step. Finally, based on this single trace file, the I/O kernel of the application is generated automatically. This framework works only for HDF5 applications, while MPI-IO is used by many other engineering applications as well. A solution based on the MPI-IO library could therefore be more widely applicable.

1.2.3 A Light Weighted Approach

After investigating these research areas, I could not find suitable software that solves the problem and fulfills all three requirements (Section 1.2.1). Therefore, I am going to design and implement an I/O auto-tuning framework for the MPI-IO library, which can be widely used and supports parallel HDF5 as well as parallel NetCDF applications.41 The framework acts as a knowledge bridge between application users and system administrators. It finds the most suitable configurations for each I/O request and eventually applies the best one automatically at run-time. This intelligent framework, designed and implemented for different environments, is dedicated to searching for optimal configurations without any interaction from application users. It offers the following four abilities:

• Compatibility: This intelligent framework should be compatible with as many engineering applications as possible. Designing it on top of the MPI-IO library ensures compatibility not only with open source MPI, HDF5 or NetCDF applications, but also with plenty of proprietary software (e.g. Ansys Fluent, SIMULIA Abaqus42 etc.) that uses the MPI-, HDF5- or NetCDF-I/O libraries.

• Scalability: As a light weighted system running in production environments, its overhead (in both time and resource consumption) must be acceptable. To solve large engineering problems, engineering applications usually scale out to hundreds of thousands of compute nodes. The capability to run with large-scale applications is non-trivial, because the overhead must not grow accordingly.

• Usability: To encourage more scientists and engineers to use this intelligent framework, it should not require additional skills or knowledge and must be easy to use. Engineers and scientists remain focused on their own simulations and pay little attention to I/O performance. Most proprietary applications are shipped as executables and do not provide detailed optimization instructions for their I/O requests. Running transparently alongside the applications will make the framework appealing to more engineers and scientists.

41 In the rest of this dissertation, HDF5 and NetCDF imply parallel HDF5 and parallel NetCDF, if not mentioned explicitly.

42 http://www.3ds.com/products-services/simulia/products/abaqus/


• Portability: The framework must not be designed for only one HPC platform. An HPC center often operates more than one HPC system, and these are updated or upgraded regularly. Software that suits only one platform will soon be out of date; therefore the framework should be able to run on multiple platforms.

1.3 Organization of the Dissertation

This dissertation is organized as follows: Chapter 1 illustrates the common parallel I/O stack in an HPC system environment and gives a brief overview of computing resource consumption at HLRS in 2016. In addition, current software and hardware solutions are briefly reviewed, leading to my light weighted approach. Chapter 2 introduces today's software technologies relevant to my research work in detail and investigates their possibilities for accelerating parallel I/O operations. Chapter 3 presents the concept, architecture and implementation of my light weighted and intelligent solution, the Semi-Automatically I/O-Tuning framework (SAIO), for engineering applications. Chapter 4 presents the evaluation results obtained with the IOR benchmark, namely the improvements for MPI and HDF5 applications, as well as evaluations of SAIO's overhead, scalability and portability. Chapter 5 uses two engineering use cases to demonstrate SAIO's usability and its optimization results in a production environment. Last but not least, Chapter 6 sums up my work and outlines future work to extend and improve SAIO.


Chapter 2

State of the Art

2.1 Distributed Parallel File Systems

2.1.1 Lustre

The Lustre file system is an open-source, parallel file system supporting many requirements of leadership-class HPC simulation environments1. It is an object-based file system composed of three components: Metadata Servers (MDSs), Object Storage Servers (OSSs) and clients[35]. Lustre clients are installed on the compute nodes or I/O nodes of an HPC system, which are connected with the MDSs and OSSs via high-speed networks such as InfiniBand. Figure 2.1 illustrates a simplified Lustre component architecture[35].

FIGURE 2.1: A Simplified Illustration of Lustre File System Components (Lustre clients in the HPC system are connected through an interconnect to the MDSs with their MDTs and to the OSSs with their OSTs)

1 http://www.lustre.org/


Each MDS manages one (up to Lustre software release 2.3) or multiple (since Lustre software release 2.4) Metadata Targets (MDTs), which store metadata information such as file names, paths, permissions etc. The OSSs provide file I/O services and manage Object Storage Targets (OSTs), where the application data are stored. Both MDTs and OSTs can be built from a single disk or a disk RAID to increase capacity and I/O performance. The OSTs act like multiple disks connected to the OSSs. Users can decide how many OSTs they want to stripe their files over. This approach enables concurrent accesses to multiple OSTs and eventually accelerates the I/O requests. Besides the number of OSTs, users can also set the stripe size, which indicates how many bytes are stored on one OST before moving on to the next OST or the next stripe. Different settings of these two factors lead to huge differences in I/O performance. One goal of my work is to find the optimal combination for each I/O request and to set it at run-time.

FIGURE 2.2: Lustre File System Striping Mechanism (three Lustre clients writing through LNet to the OSTs OST0–OST5 managed by OSS0 and OSS1)

There are two tuning parameters that can be set at run-time: stripe_count (the number of OSTs) and stripe_size (the size of one stripe on each OST). Figure 2.2 illustrates the Lustre striping mechanism, assuming three Lustre clients are writing files into the OSTs managed by two OSSs. The application on the compute node of LustreClient0 generates 8 MB of data and sends it through the Lustre Network (LNet) with the following predefined setup: stripe_count=2 and stripe_size=4MB. OSS0 is assigned to accomplish this task and starts two I/O service threads. These two threads allocate a 4 MB stripe block on each OST (OST0 and OST1) and then write the data into these two blocks concurrently. Meanwhile, another application also generates 8 MB of data with its two processes. Each process needs to write 4 MB of data through one of the two Lustre clients LustreClient1 and LustreClient2. Unlike the first application, this one writes its file with a different setup: stripe_count=4 and stripe_size=2MB. The task is distributed to OSS0 and OSS1.


OSS0 starts a third I/O service thread connecting to OST2, while OSS1 starts three I/O service threads connecting to OST3, OST4 and OST5. All six I/O service threads write data to their target OSTs simultaneously, and no Lustre client has to wait until an I/O service thread becomes available. As a result, the first application (using LustreClient0) could take twice as long as the other one (using LustreClient1 and LustreClient2) to accomplish its write request, as it accesses 2 OSTs instead of 4. The reading process is analogous. Because of this Lustre file system behavior, the reading performance also depends on the way files were created and striped.

In addition to these two run-time tunable parameters, system administrators can

• also set the number of service threads on MDS and OSS to allow more concurrent I/O requests,

• change the parameters of LNet to define the transmitting/receiving buffers' sizes, the three router buffers' sizes and the policy for delivering events and messages to the upper layers,

• choose a suitable policy to handle the events and messages from LNet etc.

These options require administrative privileges and will not be discussed in my PhD work.

2.1.2 IBM Spectrum Scale - GPFS

IBM's General Parallel File System (GPFS), a high-performance distributed parallel file system, has been renamed Spectrum Scale.2 Unlike Lustre, it supports IBM AIX, Red Hat Linux, SUSE Linux, Microsoft Windows and IBM z Systems.3 Figure 2.3[18] presents a simplified GPFS component architecture. GPFS clients are installed on the I/O or compute nodes and connected with the shared disks via a switching fabric. Through a Storage Area Network (SAN) in the switching fabric, each GPFS client has the same access to all Network Shared Disks (NSDs).

Files in GPFS are striped over all NSDs and divided into multiple blocks. The block-size, which is defined when the file system is created, ranges between 16 KB and 16 MB.4 Large files are striped and stored in blocks, while small files are stored in so-called sub-blocks. The size of a sub-block is fixed at 1/32 of the defined block-size, which is also the smallest allocation of a single file. The block-size cannot be changed after the file system has been established. Therefore, system administrators have to either deploy multiple GPFS file systems with different block-sizes, or search for an acceptable compromise that maximizes the throughput with one block-size in one GPFS file system. Additionally, GPFS introduces another mechanism, the pagepool, for tuning its I/O performance.

2 http://www-03.ibm.com/systems/storage/spectrum/scale/
3 http://www-03.ibm.com/systems/storage/spectrum/scale/specifications.html
4 http://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/File%20System%20Planning


FIGURE 2.3: A Simplified Illustration of GPFS Components (GPFS clients in the HPC system are connected through a switching fabric to the network shared disks)

The pagepool is an allocated part of physical memory that caches data as well as metadata. It supplies memory for buffering operations like prefetch (read) and write-behind. Its size can differ considerably depending on the type of node; for example, the pagepool can be configured as 8 GB on NSD servers, 4 GB on login nodes and 1 GB on compute nodes. However, both the block-size and the pagepool require root authority and cannot be changed by users.

GPFS also provides several I/O optimization mechanisms: recognizing I/O access patterns[18], enabling GPFS data shipping[36] for small data accesses, introducing GPFS byte-range locking[37] to maximize concurrent accesses, and offering GPFS programming interfaces such as the gpfs_fcntl() subroutine for file access hints and directives. Among these options, the subroutine gpfs_fcntl() offers application developers an interface to control their file accesses. Some of its file access hints and directives do not need root authority and can be tuned at run-time. Table 2.1 presents a part of the data structures accepted by gpfs_fcntl()5. The subroutine requires users to gain in-depth knowledge about GPFS and its optimization mechanisms in order to set the parameters properly.

File Access Hints              File Access Directives
gpfsAccessRange_t              gpfsCancelHints_t
gpfsFreeRange_t                gpfsDataShipMap_t
gpfsMultipleAccessRange_t      gpfsDataShipStart_t
gpfsClearFileCache_t           gpfsDataShipStop_t

TABLE 2.1: Examples of GPFS File Access Hints and Directives

5 http://www.ibm.com/support/knowledgecenter/SSFKCN_4.1.0.4/com.ibm.cluster.gpfs.v4r104.gpfs100.doc/bl1adm_gpfs_fcntl.htm


2.1.3 Hadoop Distributed File System - HDFS

The Apache Hadoop[38][39] framework is open source software for reliable, scalable and distributed computing6. Its distributed file system, HDFS, is designed to store very large data sets reliably and to stream those data sets at high bandwidth to user applications[40]. HDFS is built upon commodity servers communicating with each other through Transmission Control Protocol (TCP)-based protocols. Similar to Lustre file systems, HDFS stores metadata on a dedicated server, called the NameNode, while application data are distributed over other servers called DataNodes. A data protection mechanism like RAID is not necessary, because data are replicated and distributed to multiple DataNodes. In a Hadoop framework, the commodity servers have local disks providing storage space for HDFS and are also the compute nodes of the MapReduce[41] process. Hence, applications running in a Hadoop framework can read/write data locally, instead of accessing a network-connected file system.

Figure 2.4[40] illustrates an abstracted HDFS architecture and the I/O requests issued by HDFS clients. The NameNode in HDFS works as a management node and is in charge of many administrative tasks:

• storing metadata

• monitoring the status of DataNodes via their heartbeats

• sending instructions to DataNodes within the acknowledgment

• creating file system snapshots

• being either BackupNode or CheckpointNode

• assigning DataNodes to HDFS clients’ requests

The NameNode is connected to the top-level switch, the root switch, which offers the same high-speed access from all DataNodes. Instead of connecting to the root switch directly, DataNodes are grouped and connected to their own group switch, the rack switch. This architecture avoids potential overload of the root switch by large data transmissions and accelerates intra-rack data transmission among DataNodes.

The main task of a DataNode is to store application data. When a DataNode is (re-)starting up, it first connects to the NameNode to request or verify its ID. After the successful handshake, the DataNode sends heartbeats to the NameNode every 3 seconds and receives the corresponding acknowledgments. These acknowledgments include instructions from the NameNode, such as replicating blocks to other DataNodes, removing local block replicas, sending an immediate block report etc.

6 http://hadoop.apache.org/


FIGURE 2.4: Hadoop Distributed File System Components and Read/Write Process Examples (HDFS clients, the NameNode and DataNodes grouped into Rack0 and Rack1 behind a root switch; each I/O request of an HDFS client needs to access the metadata on the NameNode)

To better understand the reading and writing processes in HDFS, we need to keep in mind that a DataNode is, at the same time, a compute node processing MapReduce operations. Data can be accessed either locally or remotely by HDFS clients.

The local reading process is quite simple. The HDFS client asks the NameNode for the location of the data, which happens to be the same node as the compute node. The application then reads the data locally without occupying any further network bandwidth. HDFSClient1 and HDFSClient2 in Figure 2.4 give an example of remote reading processes, assuming the two clients are located neither within the network area of Rack0 nor of Rack1. HDFSClient1 obtains the closest location of the requested data, DataNode2, and starts to read. At the same time, the NameNode receives another reading request from HDFSClient2 for the same data. DataNode10, which stores a replica of the same file, is assigned to HDFSClient2, which avoids occupying too much network bandwidth within the Rack0 area and ensures a relatively high data transmission rate.

Because of the three-replica policy, the writing process is a little more complicated. Applications request a free place to write data through the HDFS client thread. Besides the DataNode where the HDFS client is located, two other DataNodes in different rack areas are appointed to store the data. If the number of replicas is configured to be more than three, the remaining DataNodes are randomly assigned. The HDFS replica placement policy is: no more than one replica is placed at one node and no more than two replicas are placed in the same rack[40]. When HDFSClient0 in Figure 2.4 is assigned DataNode12, DataNode0 and DataNode1 according to the replica placement policy, a data pipeline is created in the following sequence: DataNode12, DataNode1 and then DataNode0. A "setup" control signal is sent through the pipeline by HDFSClient0.


After receiving the acknowledgment from DataNode0, HDFSClient0 starts sending data packets into the pipeline using a non-blocking policy. For each data packet received by DataNode0, a corresponding acknowledgment is sent back to HDFSClient0 through the pipeline. After HDFSClient0 has received all the acknowledgments, a "close" control signal is sent out through the same pipeline. The acknowledgment of this "close" control signal indicates the end of the writing process and the visibility of the data to other HDFS clients.

The characteristics of HDFS allow the concurrent reading performance to be maximized, and its file replica distribution policy improves network bandwidth utilization. The key to improving its I/O performance is the policy that decides where to store the data and how to distribute the MapReduce processes, so that applications perform as many "local" data accesses as possible. Optimizing this policy will not only improve I/O performance, but also distribute the MapReduce application more efficiently, increasing the efficiency of the entire Hadoop cluster.

2.1.4 Summary

Lustre and GPFS are two of the most widely used distributed parallel file systems on the Top500 list7, while HDFS is being deployed in more and more commercial clusters. Companies like PayPal8 have integrated the Lustre file system into their Hadoop cluster for real-time fraud detection9 and saved over 700 million dollars in fraudulent transactions that they would not have detected previously10. The solution was implemented with the Intel Enterprise Edition for Lustre Software11. Cray has released Urika-GX12, the first agile analytics platform, for Big Data13. It supports Hadoop applications by combining 35 TB of PCIe SSD on-node memory with traditional HPC compute nodes14. In the near future, more and more systems and applications will rely on both HPC systems and the Hadoop framework. Hadoop applications running on parallel file systems such as Lustre and GPFS can thus be made more efficient by improving I/O performance.

7 http://www.top500.org/
8 http://www.paypal.com
9 http://www.hpctoday.com/viewpoints/bringing-lustre-relevance-to-the-enterprise/
10 http://www.intel.com/content/www/us/en/lustre/intel-lustre-big-data-wp.html
11 http://www.intel.com/content/www/us/en/lustre/intel-enterprise-edition-for-lustre-software.html
12 http://www.cray.com/products/analytics/urika-gx
13 http://en.wikipedia.org/wiki/Big_data
14 http://www.cray.com/products/analytics/urika-gx?tab=technology


2.2 Parallel I/O Algorithms

2.2.1 Parallel I/O Types

A software developer can choose either blocking or non-blocking I/O. As the name implies, blocking I/O blocks all processes until the data have been read or written. While the I/O operations are in progress, the CPUs stay idle. A typical blocking I/O request is reading input data, such as configuration or parameter files, before the application runs. Non-blocking I/O, in contrast, proceeds independently and does not block other operations from running. Writing checkpoint files during a simulation can, for example, be implemented with non-blocking I/O requests. Applications using non-blocking I/O are more complex to design and implement, since developers have to take process synchronization and data consistency into consideration.

Given that applications can access any part of one file or of multiple files, there are two further I/O type definitions: contiguous and non-contiguous I/O. When a process reads or writes a contiguous block of data, only the start address and the data transfer size have to be provided. Non-contiguous I/O needs more information, namely the offsets indicating the locations of the data blocks. In most cases, these data blocks are not aligned and can be very small. Each time the application requests a small data block, it has to establish a connection to the file system, seek to the data position and access the data.

The above-mentioned parallel I/O types cannot be changed after the software is implemented or the process has started. To simplify the implementation of the different parallel I/O types, computer scientists have drafted standards like MPI-IO and POSIX I/O, and have developed various high-level I/O libraries (HDF5, NetCDF, ADIOS) to support different applications (Section 2.4).

To accelerate non-contiguous I/O operations as well as small data accesses, the concept of collective I/O was developed. In the next two sections, two collective I/O algorithms, data sieving and two-phase I/O, are introduced.

2.2.2 Data Sieving

Data sieving is a parallel I/O algorithm for handling non-contiguous I/O requests. It was introduced and implemented in ROMIO[42], an implementation of the MPI-IO library (Section 2.3.3). The basic idea is to allocate a piece of local memory for caching the entire file or a rather large part of it (much larger than the data transfer size of each non-contiguous I/O request). The application then only needs to establish a connection to the file system twice: one connection for reading a large chunk of data from the file system, and another one for writing data back to the file system if necessary. All further I/O operations are served from local memory, which requires almost no seek time compared to accessing the file system.


This minimizes the network load by accessing local memory instead of the file system, even though more data than needed may be read.

FIGURE 2.5: I/O Request with and without Data Sieving (without data sieving the application accesses the file system for each small chunk; with data sieving a large contiguous chunk is first read into local memory)

Figure 2.5[42] illustrates an example of the same I/O request with (top) and without (bottom) data sieving. Six small data chunks (darker stripes) are to be read or written. The standard I/O access (bottom) establishes a connection with the file system six times. With data sieving (top), a large and contiguous chunk of data is read into a temporary buffer in local memory, whose size can be set at run-time. Compared to accessing the file system, reading and writing in local memory is far quicker. For reading operations, the allocated memory can simply be released afterwards, while writing operations need one more step: putting the changed data back to the file system. The writing operation actually uses a read-modify-write I/O pattern and needs a locking mechanism to prevent the same data chunk from being overwritten by other processes at the same time.

The drawbacks of data sieving are A) reading a large chunk of data into memory and B) the locking mechanism that blocks other processes. Experience shows that the read buffer size can be increased up to the point where the time spent reading extra, unused data exceeds the cost of multiple accesses to the file system. The write buffer size, on the other hand, should be set as small as necessary to avoid blocking too many processes[42]. In [43], Yin Lu et al. have designed a new data sieving approach, named Performance Model Directed data sieving or PMD data sieving for short, which dynamically determines A) when it is beneficial to perform data sieving and B) how to perform data sieving if it is beneficial.
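In ROMIO, data sieving can be influenced through MPI hints. The following minimal sketch is only an illustration under assumed values; the hint names are those listed later in Table 2.2, and whether a given setting pays off depends on the access pattern and file system:

#include <mpi.h>

/* Illustrative ROMIO data-sieving hints: keep sieving for reads, avoid the
 * read-modify-write locking for writes, and enlarge the independent read buffer. */
static void set_data_sieving_hints(MPI_Info info)
{
    MPI_Info_set(info, "romio_ds_read",  "enable");
    MPI_Info_set(info, "romio_ds_write", "disable");
    MPI_Info_set(info, "ind_rd_buffer_size", "4194304");  /* 4 MB sieving buffer */
}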

2.2.3 Two-Phase I/O

Two-phase I/O was proposed in [44] for accessing distributed arrays stored in files and is a client-side collective I/O algorithm.


It was evaluated and shown to work well for array distributions in [44] and [45]. The idea is to divide the entire I/O process into two phases: an I/O phase and a communication phase. For a reading process, each process first reads a rather large and contiguous data chunk into local memory, including data required by other processes. In the second phase, the processes communicate with each other to redistribute the data chunks, so that each process obtains its desired data from local memory. A writing process is the reverse of the reading process. An extended two-phase I/O algorithm[46] is integrated into ROMIO and achieves more efficient I/O access by[46]:

• dynamically partitioning the I/O workload among processes according to the access requests

• combining several I/O requests into fewer, larger-granularity requests

• reordering requests so that the file is accessed in the proper sequence

• eliminating simultaneous I/O requests for the same data

Figures 2.6 and 2.7[42] present two examples of the same reading process without and with the two-phase I/O algorithm. Each process (P0, P1 and P2) of an application wants to read small chunks of data. In the reading process without two-phase I/O (Figure 2.6), each process accesses the file system, seeks to the position of its data and then reads the data into its local memory. In the example, P0, P1 and P2 each have to access the file system 3 times (3 × 3 = 9 accesses in total). The two-phase I/O algorithm reduces the number of file system accesses to 3 (Figure 2.7). In the first phase, each process reads a contiguous chunk of data, larger than its own data chunks, into its temporary buffer. This contiguous chunk also includes the data chunks of other processes. In the second phase, the processes communicate with each other and exchange the data chunks, so that the user buffer of each process finally holds its desired data.

FIGURE 2.6: An Example of a Reading Process without Two-Phase I/O (each process P0, P1 and P2 reads its own scattered data chunks directly from the file into its user buffer)

The algorithm uses a temporary buffer to accelerate I/O operations, especially for small data accesses. However, when a user application scales out, the time spent on communication and data exchange among a large number of processes can become very high. Therefore, disabling the temporary buffer for large-scale applications as well as for large data accesses can shorten the processing time. In Section 4.3, I investigate when to disable the temporary buffer in ROMIO.
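In ROMIO, this temporary (collective) buffer is controlled by MPI hints. The following sketch is only illustrative; the threshold for "large contiguous access" and the buffer size are assumptions that have to be validated by measurement, as done in Section 4.3:

#include <mpi.h>

/* Illustrative collective-buffering hints: "disable" bypasses the two-phase
 * temporary buffer, which can pay off for large contiguous accesses. */
static void set_collective_buffering_hints(MPI_Info info, int large_contiguous_access)
{
    if (large_contiguous_access) {
        MPI_Info_set(info, "romio_cb_write", "disable");
        MPI_Info_set(info, "romio_cb_read",  "disable");
    } else {
        MPI_Info_set(info, "romio_cb_write", "enable");
        MPI_Info_set(info, "cb_buffer_size", "16777216");  /* 16 MB collective buffer */
    }
}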


FIGURE 2.7: An Example of a Reading Process with Two-Phase I/O (in the read phase each process reads one contiguous chunk into its temporary buffer; in the communication phase the processes exchange chunks so that each user buffer holds the desired data)

2.3 Message Passing Interface (MPI)

2.3.1 MPI Standard

MPI is neither an implementation nor a piece of software, but a message-passing library interface specification. It primarily addresses the message-passing parallel programming model, in which data is moved from the address space of one process to that of another process through cooperative operations on each process[3]. The goal of MPI, simply stated, is to develop a widely used standard for writing message-passing programs[3]. The communication between two MPI processes is abstract and does not require any explicit connection or exchange of network addresses. The complexity of inter-process communication is thus concealed from application users and developers, and MPI applications can easily be migrated between different HPC platforms without modifying the source code. A concrete MPI implementation offers interfaces for various programming languages such as C, C++ and Fortran, and aims at a practical, portable, efficient and flexible implementation on various architectures[3].

In April 1992, the basic features essential to a standard message-passing interface were discussed at a workshop on Standards for Message-Passing in a Distributed Memory Environment[3]. In May 1994, the first version of the Message-Passing Interface standard was released as MPI-1[3]; version 2.0 followed in July 1997 as MPI-2, which introduced and standardized parallel I/O as MPI-IO[3]. The latest released version has been MPI-3.1 since June 2015, and the next generation, MPI-4.0, is still in progress (status: June 2017). The MPI Forum lists several MPI implementations that fulfill the MPI-3.1 standard15.

15 http://mpi-forum.org/mpi31-impl-status-Jun16.pdf
16 http://www.mpich.org/
17 http://www.open-mpi.org/
18 http://www.cray.com/


Examples include MPICH16, Open MPI17, Cray MPI18, Intel MPI19, IBM Spectrum MPI20, IBM Platform MPI21 and so on.

MPI-IO

MPI-IO has been part of the MPI standard since version 2.0 and aims at portability as well as optimization of parallel I/O, which cannot be achieved with the POSIX interface[3]. Besides point-to-point communication, MPI defines so-called collective communication, which, with appropriate algorithms, leads to a considerable performance improvement for parallel I/O operations. With the development of the MPI standard, MPI-IO provides a high-level interface for applying parallel I/O algorithms (Section 2.2), controlling the file layout on the file system, logically partitioning file data among processes, issuing collective and asynchronous/non-blocking data accesses (Section 2.2.1) etc. As the name implies, MPI collective data access requires all participating MPI processes to issue the same data access. Opening and closing files are collective routines, as all MPI processes within the same MPI group22 must open and close the same file with the same access mode. As for read/write routines, the collective operations (MPI_FILE_XXX_ALL) may perform better than their independent counterparts (MPI_FILE_XXX), as global data accesses have significant potential for automatic optimization[3]. MPI supports and defines blocking and non-blocking I/O routines[3]:

• A blocking I/O call will not return until the I/O request is completed

• A non-blocking I/O call initiates an I/O operation, but does not wait for it to complete

Application programmers can use non-blocking I/O routines to continue computations while the data is being transferred. However, the completion of a non-blocking I/O routine has to be verified, for example by calling MPI_TEST or an equivalent function. MPI names the non-blocking I/O routines MPI_FILE_IXXX, where the I stands for immediate[3].
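The following minimal sketch contrasts the collective and the non-blocking ("immediate") forms on an already opened file handle; offsets and buffer sizes are illustrative, and a real application would use one form or the other, not both on the same region:

#include <mpi.h>

/* Contrast a collective write with a non-blocking write on file handle fh. */
static void write_block(MPI_File fh, int rank, const double *buf, int count)
{
    MPI_Offset offset = (MPI_Offset)rank * count * sizeof(double);
    MPI_Request req;

    /* Collective write: every process of the communicator participates. */
    MPI_File_write_at_all(fh, offset, buf, count, MPI_DOUBLE, MPI_STATUS_IGNORE);

    /* Non-blocking variant: start the write, overlap computation, then wait. */
    MPI_File_iwrite_at(fh, offset, buf, count, MPI_DOUBLE, &req);
    /* ... computation could proceed here ... */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}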

MPI File Hints/Info

An MPI info object stores an unordered set of (key, value) string pairs and is passed as the info parameter to MPI subroutines[3]. The MPI standard defines various info objects, such as communicator info, window info and file info (the latter also called MPI file hints), through which users can provide information for optimization.

19 http://software.intel.com/en-us/intel-mpi-library
20 http://www-03.ibm.com/systems/spectrum-computing/products/mpi/index.html
21 http://publibfp.dhe.ibm.com/epubs/pdf/c2753190.pdf
22 For more details about the MPI group concept, please refer to Chapter 6 in book [3].


The MPI file info passes file access information from user applications to the MPI-IO libraries or even to the underlying distributed parallel file systems, so that the parallel I/O performance can be improved.

The MPI file info is specified on a per-file basis[3] and affects file manipulation as well as data access operations. MPI defines the following subroutines that interpret file info: MPI_FILE_OPEN, MPI_FILE_DELETE, MPI_FILE_SET_VIEW and MPI_FILE_SET_INFO. Apart from the MPI reserved file hints23, MPI implementations and MPI-IO libraries can define their own file hints. For example, ROMIO defines romio_cb_read and romio_cb_write to enable or disable collective buffering for the MPI collective reading/writing subroutines. In addition, it passes the MPI reserved file hints striping_factor (= stripe_count in Lustre) and striping_unit (= stripe_size in Lustre) on to the underlying Lustre file system, where they affect the Lustre file striping configuration (Section 2.1.1).
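Because an MPI-IO library may silently adjust or ignore a requested hint, it can be useful to read back the info object of an open file. A minimal sketch (the key name is one of the reserved hints above; the printed value depends on the implementation and file system):

#include <mpi.h>
#include <stdio.h>

/* Print the striping_factor hint that actually took effect for an open file. */
static void print_striping_factor(MPI_File fh)
{
    MPI_Info info;
    char value[MPI_MAX_INFO_VAL + 1];
    int flag;

    MPI_File_get_info(fh, &info);                 /* returns a copy of the file's hints */
    MPI_Info_get(info, "striping_factor", MPI_MAX_INFO_VAL, value, &flag);
    if (flag)
        printf("striping_factor = %s\n", value);
    MPI_Info_free(&info);
}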

MPI File View

MPI defines the file view to logically partition the data among processes. A file view describes how the file data appear to each MPI process and is defined by a displacement, an etype and a filetype. Figure 2.8 illustrates the file views of three MPI processes[3]. Starting at the displacement, the MPI processes access their own data blocks as predefined by the filetype. With the help of this mechanism, each MPI process reads/writes its data independently and communication among MPI processes is eliminated. In [47], setting a proper MPI file view for non-contiguous I/O operations was shown to speed up MPI collective I/O operations significantly.
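As a hedged sketch of the mechanism illustrated in Figure 2.8, the following code gives each of three processes a strided file view built from a vector filetype; the block and stride lengths are illustrative only:

#include <mpi.h>

/* Give each process a strided view: blocks of 4 doubles, repeated every
 * 4*nprocs doubles, shifted by the rank; all sizes are illustrative. */
static void set_strided_view(MPI_File fh, int rank, int nprocs)
{
    MPI_Datatype filetype;
    MPI_Offset disp = (MPI_Offset)rank * 4 * sizeof(double);  /* per-rank displacement */

    MPI_Type_vector(/*count=*/1024, /*blocklength=*/4,
                    /*stride=*/4 * nprocs, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_set_view(fh, disp, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

    MPI_Type_free(&filetype);
}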

FIGURE 2.8: An Example of an MPI File View (starting at the displacement, the etype-based filetypes of P0, P1 and P2 tile the file)

23 For more details about the MPI reserved file hints, please refer to page 502 in book [3].


2.3.2 MPI Implementations

The MPI Forum defines and publishes the MPI specifications, which provide guidelines for developers of MPI implementations. Some MPI implementations are developed by research institutes (like MPICH) or by a community of research institutes and industrial companies (like Open MPI), while others are developed and released by HPC manufacturers to better support their own HPC systems. In the following two sections, two widely used MPI implementations and their derivatives are briefly introduced.

MPICH

Alongside the standardization of MPI in 1992, MPICH was implemented to quickly expose problems that the specification might pose for developers and to provide early experimenters with an opportunity to try ideas proposed for MPI before they became fixed[48][49]. MPICH is designed to achieve portability without sacrificing high performance. Besides maximizing the amount of shared code, it is structured so that it can be ported easily to a new platform and then gradually tuned for that platform by replacing parts of the shared code with platform-specific code[48]. Nowadays, MPICH is one of the most successful MPI implementations. Collaborations with supercomputer manufacturers and software vendors24 have produced platform-tuned derivatives such as Microsoft MPI (MS-MPI)25, Cray MPI, IBM Platform MPI, Intel MPI, MVAPICH26[50] and so on.

Open MPI

Open MPI is designed to be scalable and fault tolerant, and to provide high performance in a variety of HPC environments[51][52][53]. It provides a high-performance implementation of the MPI standard across a variety of platforms through the Modular Component Architecture (MCA), which allows users to customize their MPI installation for their hardware at run-time[53]. A high MCA overhead is avoided by supporting neither inter-process object communication nor cross-language interfaces[53]. Components are opened and loaded on demand, so that the interfaces are called locally by the MPI process. Because of these two characteristics, the overhead of MCA was analyzed and estimated to be insignificant[53]. Furthermore, Open MPI achieves performance comparable to, and even competitive with, other MPI implementations on the market. Derivatives of Open MPI include IBM Spectrum MPI, Sun HPC ClusterTools 7+[54], bullx MPI27 and so on.

24 http://www.mpich.org/about/collaborators/
25 http://msdn.microsoft.com/library/bb524831.aspx
26 http://mvapich.cse.ohio-state.edu/
27 http://bull.com/wp-content/uploads/2016/08/f-bull_scsuite-en7_web.pdf


2.3.3 MPI-IO Libraries

The two above-mentioned MPI implementations integrate different MPI-IO libraries, which are designed and implemented with different architectures. ROMIO, a portable MPI-IO library, has been integrated into most of the MPI implementations on the market, while OMPIO is a specialized MPI-IO library only for Open MPI and its derivatives. The following two sections introduce these two MPI-IO libraries.

ROMIO

ROMIO[15][55] is one of the most widely used MPI-IO libraries. It consists of A) a large portion of portable code and B) a small portion of code optimized for specific file systems and machines. To overcome the performance limitations of standard Unix I/O and the portability limitations of POSIX I/O, ROMIO designed and implemented a component named Abstract-Device Interface for I/O (ADIO). Various parallel I/O APIs for standard UNIX and POSIX as well as for specific file systems are implemented in ADIO. In order to maximize the portability of user applications, ROMIO can recognize the underlying file system by calling the different file systems' stat functions. Figure 2.9[15] illustrates the abstracted architecture of ROMIO28.

FIGURE 2.9: The Abstracted ROMIO Architecture (a portable implementation of the MPI-IO APIs on top of special ADIO implementations for UFS, Cray, IBM, NFS, GPFS, XFS, PVFS2 and Lustre)

Two parallel I/O algorithms, data sieving (Section 2.2.2) and two-phase I/O (Section 2.2.3), are integrated to achieve higher performance for small data accesses as well as non-contiguous I/O requests. Another optimization possibility for MPI-IO is passing MPI hints (Section 2.3.1) to ROMIO. Table 2.2 lists a selection of ROMIO-supported MPI hints that significantly impact parallel I/O performance29.

28 http://press3.mcs.anl.gov/romio/
29 See Chapter 4 for evaluation results when using different MPI hints.


for Data Sieving         for Collective I/O     Others
romio_ds_read            romio_cb_read          striping_factor
romio_ds_write           romio_cb_write         striping_unit
ind_rd_buffer_size       cb_nodes               direct_io
ind_wr_buffer_size       cb_buffer_size         ...
                         cb_config_list

TABLE 2.2: A Part of the ROMIO-Supported MPI Hints

OMPIO

OMPIO, a newer MPI-IO library based on the MCA architecture of Open MPI, was published at the EuroMPI conference in 2011[16] and is designed to coexist with ROMIO. Compared to ROMIO, OMPIO presents the following two main advantages[16]:

• the usage of different frameworks allows a more fine-grained separation of functionality than the approach used in ROMIO,

• and OMPIO introduces the ability to make non-file-system-specific module selections that do not require any modification of the end-user application.

OMPIO inherits the characteristics of Open MPI, so that the proper components or modules are compiled and chosen at run-time depending on the running process. As shown in Figure 2.10[16], ROMIO and OMPIO are two coexisting but independent parallel I/O libraries in Open MPI. Users can choose either of these two MPI-IO libraries, although OMPIO has been the default I/O library since the Open MPI 2.x release. The four frameworks (fs, fcoll, fbtl and sharedfp) in OMPIO are independent, but can support each other for better I/O performance.

FIGURE 2.10: The Abstracted Architecture of the OMPIO Frameworks and Components (the I/O framework contains OMPIO and ROMIO; OMPIO's frameworks and components are fbtl with POSIX and PVFS2, fcoll with dynamic-segment, static-segment and two-phase, fs with Lustre and PVFS2, and sharedfp with flock and addproc)

The fs framework contains components for different parallel file systems.


Just like the selection logic in other Open MPI frameworks, the fs framework lists and opens its underlying components when the MPI_INIT subroutine is called. Depending on the underlying file system, a suitable component of the fs framework is initialized when an application calls an MPI subroutine such as MPI_FILE_OPEN, MPI_FILE_CLOSE or MPI_FILE_DELETE. File-system-specific information and status can be translated and applied within the component, so that the I/O operations run more efficiently.

The fbtl framework provides the abstraction for all individual read and write operations[16]. Besides the standard POSIX I/O semantics, the fbtl framework integrates the native read/write operations of the PVFS2 file system. Taking advantage of the file system's native I/O operations can improve its I/O performance.

The fcoll framework provides interfaces for collective I/O operations[16]. In contrast to the fs framework, the fcoll framework triggers its selection logic not upon opening a file, but every time the file view is set[16]. Within the fcoll framework, different parallel I/O algorithms are implemented and integrated as components in order to achieve better I/O performance[56].

The shared file pointer, a data access feature defined in MPI-IO, is jointly maintained by a communicator group of MPI processes[3]. MPI subroutines following the pattern MPI_FILE_XXX_SHARED provide an I/O optimization possibility by applying shared file pointers. Different algorithms for shared file pointer operations in MPI-IO have been integrated into the sharedfp framework as components[57].

2.4 High-Level Scientific Data Libraries

The MPI-IO libraries are usually only accessible to scientists and engineers with in-depth programming skills. Moreover, efficiently mapping a numerical scheme to a computer program is still a big challenge, which many scientists have tried to overcome. One practical and effective way is to design data formats that come with a proper I/O library. In the next sections, three scientific data formats and their I/O libraries, which are built upon the MPI-IO library, are introduced.

2.4.1 Hierarchical Data Format (HDF)

The first version of HDF was originally released in 1988, while the 5th version was released in 199830 and became one of the most widely used scientific data formats. HDF5 is not only a data format, but also a data model and software library for storing and managing data[6].

30 http://www.hdfgroup.org/hdf-group-history/


It is designed to organize, store, discover, access, analyze, share and preserve diverse, complex data in continuously evolving heterogeneous computing and storage environments[6]. In the past decades, HDF5 itself and HDF5-related software31 have spread widely in many industrial32 and scientific33 fields.

The HDF5 data model consists of two primary components: group and dataset objects. A group object is similar to a directory in a file system; it contains a collection of named links to other objects in an HDF5 file[6]. A dataset object stores the application data as multidimensional arrays, which make scientific data easy to understand and exchange. The elements of an HDF5 dataset can be stored in three ways: contiguous, chunked and compact[7]. In order to manipulate and access the objects of an HDF5 file, an I/O library interface in C was implemented; APIs for many other programming languages, including Fortran, C++, Java and Python, are provided as well.

Using the MPI programming interface, applications can access HDF5 files "independently" or "collectively"[6]. The HDF Group has implemented the parallel HDF5 interface for parallel access to HDF5 files stored on distributed parallel file systems. Figure 2.11 presents the I/O stack of parallel HDF5 applications. Since parallel HDF5 uses the MPI-IO library, the I/O requests of parallel HDF5 applications can be tuned by passing MPI info objects.
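As a hedged sketch of this tuning path (file name and the choice of collective transfer are illustrative), an MPI info object can be attached to the file access property list, from where parallel HDF5 passes it down to MPI-IO:

#include <hdf5.h>
#include <mpi.h>

/* Open an HDF5 file for parallel access and request collective data transfer;
 * the MPI info object carries file hints (e.g. Lustre striping) down to MPI-IO. */
static hid_t open_parallel_file(const char *name, MPI_Info info, hid_t *dxpl_out)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, info);        /* use the MPI-IO driver */

    hid_t file = H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    H5Pclose(fapl);

    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);        /* collective dataset writes */
    *dxpl_out = dxpl;
    return file;
}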

FIGURE 2.11: Parallel HDF5 Application I/O Stack (parallel HDF5 applications → parallel HDF5 → MPI-IO → distributed parallel file systems → distributed data storage systems)

2.4.2 Network Common Data Form (NetCDF)

NetCDF was introduced for scientific data access in 1990[10]. It is a set of software libraries and self-describing, machine-independent data formats that support the creation, access and sharing of array-oriented scientific data34. The climate and weather domains have long-established NetCDF-based workflows (using NetCDF datasets for archiving, analysis and data exchange)[9]. After decades of development, numerous open source and licensed (proprietary) software packages exist for manipulating or displaying NetCDF data35. Similar to the HDF5 data model, a NetCDF file can be divided into two parts: a header and the multidimensional array data. The first part, the file header, stores metadata like data types, array dimensions, attributes and so on, while the second part stores the application data.

31 http://support.hdfgroup.org/products/hdf5_tools/SWSummarybyType.htm
32 http://www.hdfgroup.org/our-industries/
33 http://www.hdfgroup.org/scientific-fields/
34 http://www.unidata.ucar.edu/software/netcdf/
35 http://www.unidata.ucar.edu/software/netcdf/software.html


The I/O process of the original NetCDF was serial and needed a master process to carry out the program's I/O requests, which has turned into a bottleneck on today's modern HPC platforms.

Jianwei Li et al. designed and developed parallel NetCDF in 2003[11]. It is built on top of the MPI-IO library (Figure 2.12) and takes advantage of MPI's "independent" and "collective" I/O optimizations. Since the release of NetCDF version 4.0 in 2008, its data model has been a restricted subset of the HDF5 data model; hence, NetCDF-4 files can be read and written by the HDF5 library, version 1.8 or later.
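A hedged sketch of this path, assuming the PnetCDF library (pnetcdf.h); file, dimension and variable names are illustrative, and error handling is omitted. The MPI info object again carries the file hints down to MPI-IO:

#include <pnetcdf.h>
#include <mpi.h>

/* Create a NetCDF file in parallel and define one variable. */
static int create_parallel_netcdf(MPI_Info info, int *ncid_out, int *varid_out)
{
    int ncid, dimid, varid;

    ncmpi_create(MPI_COMM_WORLD, "result.nc", NC_CLOBBER | NC_64BIT_DATA,
                 info, &ncid);
    ncmpi_def_dim(ncid, "x", 1200 * 1024, &dimid);
    ncmpi_def_var(ncid, "pressure", NC_DOUBLE, 1, &dimid, &varid);
    ncmpi_enddef(ncid);                      /* leave define mode */

    *ncid_out  = ncid;
    *varid_out = varid;
    return 0;
}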

FIGURE 2.12: Parallel NetCDF Application I/O Stack (parallel NetCDF applications → parallel NetCDF → MPI-IO → distributed parallel file systems → distributed data storage systems)

2.4.3 Adaptable I/O System (ADIOS)

ADIOS36 was initially designed to support data management for the visualization of fusion simulations, such as the Gyrokinetic Toroidal Code (GTC)37, at Oak Ridge National Laboratory (ORNL38)[58]. It achieves efficient execution of scientific applications on a variety of HPC resources by addressing the following three important issues[59]:

• end users should be able to select the most efficient I/O methods for their codes, with minimal effort in terms of code updates or alterations

• such performance-driven choices should not prevent data from being stored in the desired file formats, since those are crucial for later data analysis

• it is important to have efficient ways of identifying and selecting certain data for analysis, to help end users cope with the flood of data produced by high-end codes

FIGURE 2.13: Parallel ADIOS Application I/O Stack (ADIOS applications → ADIOS → MPI-IO → distributed parallel file systems → distributed data storage systems)

36 http://www.olcf.ornl.gov/center-projects/adios/
37 http://phoenix.ps.uci.edu/gtc_group/
38 http://www.olcf.ornl.gov/


Similar to HDF5 and NetCDF, ADIOS supports serial as well as parallel I/O operations and is built upon the MPI-IO library (Figure 2.13). ADIOS defines an external XML file for the description and configuration of the application's I/O. Through this XML file, end users can change their application's I/O methods without updating or recompiling the source code. This mechanism is implemented by invoking a single API for all I/O methods together with the new BP file format.39 To maximize its software compatibility and to exchange data with other applications, ADIOS offers utilities to convert the BP file format to HDF5 and NetCDF at small cost[59].

39 http://users.nccs.gov/pnorbert/ADIOS-DevManual-1.6.0.pdf


Chapter 3

Semi-Automatically I/O-Tuning Framework (SAIO)

3.1 SAIO Design Requirements

In Section 1.2.3, a light weighted approach to transparently accelerate the I/O requests of engineering applications was proposed. The approach should be A) compatible with as many engineering applications as possible, B) scalable to large-scale engineering applications, C) usable by engineers and scientists with little knowledge of parallel I/O, and D) portable across multiple HPC platforms. To fulfill these four abilities, SAIO needs to:

• follow the current MPI standard,

• run transparently to the users,

• produce acceptable little overhead,

• and improve I/O performance automatically.

3.1.1 Following MPI Standard

The main benefits of establishing a message-passing standard are portability and ease of use[3]. Hence, SAIO can inherit these advantages by following the MPI standard and using messages for communication within HPC systems. In addition, the excellent scalability of MPI implementations ensures that SAIO can handle large-scale engineering applications as well. Running SAIO across multiple HPC platforms requires an implementation without any platform- or operating-system-dependent library. All these design requirements are met as long as the MPI standard is followed.


3.1.2 Running Transparently

One of the non-functional requirements of SAIO is its transparency to the applications and its intuitiveness for users who are not I/O experts. Firstly, SAIO should not force scientists and engineers to recompile or change source code, since they may have little programming expertise or no access to the source code of proprietary software. Secondly, SAIO should be loadable by simply adding a few directives to their job submission scripts.

Initially, SAIO is implemented as a dynamic C library and loaded by setting LD_PRELOAD (Section 3.5). After its experimental phase, it will be available as an application module loaded via Environment Modules1.

3.1.3 Producing Little Overhead

As a solution for (engineering) applications in a production environment, SAIO is not supposed to disturb the running application. Therefore, its overhead (both memory and time consumption) must remain as small as possible.

The log file writing process should only occur when the MPI_FINALIZE subroutine is invoked. Just like other I/O tracing software, SAIO allocates memory to store the tracing results. Since memory is limited and the allocated memory must not crash the applications, two methods that trace applications with a limited memory allocation were analyzed (a minimal sketch of the chosen flush-out variant follows the list):

• ring-buffer: As soon as the number of tracing results reaches the predefined limit, SAIO overwrites the previous records from the beginning until the application finalizes. The advantages are the negligible overhead of overwriting the records in memory and never exhausting the memory. However, the tracing results are not always complete if the limit is exceeded. The tracing limit is also hard to determine because of the variety of applications' I/O behaviors.

• flush out: As soon as the number of tracing results reaches the predefined limit, SAIO flushes its allocated memory and writes the tracing results into log files. This ensures that all I/O requests are recorded and the predefined limit remains relatively small, so that only a limited memory space is occupied. Nevertheless, the writing process itself brings in extra overhead by accessing the file system and blocks the running process while writing log files.
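A minimal sketch of the flush-out variant chosen later (Section 3.4.5) is shown below. It is illustrative only: the record layout and the names saio_record_t, trace_buffer, flush_records and add_record are assumptions and not SAIO's actual internals (the real data structure is given in Listing A.1); only the record limit corresponds to the MAX_NO_SAIO_RECORDS variable introduced in Section 3.4.5.

/* Illustrative flush-out tracing buffer; names and record layout are
   assumptions, only the record limit follows Section 3.4.5. */
#include <stdio.h>

#define MAX_NO_SAIO_RECORDS 65535            /* predefined tracing limit */

typedef struct {
    double time_stamp, duration, bandwidth;
    long   bytes;
} saio_record_t;

static saio_record_t trace_buffer[MAX_NO_SAIO_RECORDS];
static int           n_records = 0;

static void flush_records(FILE *log)         /* write all records, then reset */
{
    for (int i = 0; i < n_records; i++)
        fprintf(log, "{\"bytes\":%ld,\"duration\":%f,\"bandwidth\":%f}\n",
                trace_buffer[i].bytes, trace_buffer[i].duration,
                trace_buffer[i].bandwidth);
    n_records = 0;
}

static void add_record(FILE *log, saio_record_t r)
{
    if (n_records == MAX_NO_SAIO_RECORDS)    /* limit reached: flush out */
        flush_records(log);
    trace_buffer[n_records++] = r;           /* store the new tracing result */
}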

Another point worth mentioning is that (engineering) applications in an HPC environment are usually capable of scaling out to hundreds of thousands of compute nodes. As a light weighted I/O tuning software, SAIO must be able to scale out accordingly while keeping its overhead from growing.

1http://modules.sourceforge.net/


3.1.4 Optimizing Automatically

Along with the large number of applications running on the HPC system, the file system's status, such as capacity, used space, load and so on, keeps changing. Moreover, the data transfer sizes of applications differ as well. The previously found optimal configurations could thus be out of date and no longer optimal. SAIO offers a training utility (Section 3.4.7) for scientists and engineers to find the latest optimal configurations manually. Unfortunately, such manual processes might become a barrier, because knowledge of how to start the SAIO training process is needed. To relieve the application users, the SAIO training process, which keeps the optimal configurations up to date, is automated.

Automating the training process with a machine learning concept makes SAIO intelligent. In his book [60], Thomas M. Mitchell defines machine learning as: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. In SAIO, the tasks T are the I/O requests, the performance measure P is the I/O bandwidth or resource consumption, and the experiences E are the historical running results with different configurations. In other words, SAIO learns from history and improves future I/O requests. More details about the machine learning concept in SAIO follow in Section 3.3.4.

3.2 SAIO Software Stack

As shown in Figure 3.1, the SAIO core module (marked with gray stripes) surrounds the MPI-IO library like a wrapper. As soon as user applications call MPI-IO subroutines, SAIO is triggered to apply optimal configurations from the Configuration pool before executing the I/O operation. After executing I/O operations, SAIO records the I/O related information to the Log file pool. As introduced earlier (Section 2.4), high-level I/O libraries, such as parallel HDF5, parallel NetCDF and ADIOS, are all built upon the MPI-IO library and are therefore compatible with SAIO as well.

FIGURE 3.1: SAIO Abstract Software Stack (Application → High-Level I/O Library → MPI-IO → Distributed Parallel File Systems → Distributed Storage Systems, with the SAIO core wrapping MPI-IO; a Configuration pool, a Log file pool, a Learning module and a Statistic utility producing CSV files complete the stack)


The SAIO Learning module stands outside the I/O stack and works like an optimizing engine: it reads log files (consuming fuel) to extract optimal configurations for approaching a higher I/O performance (producing power). Its job is to read and parse the generated log files, and to use different (statistical) strategies to dig out the optimal configurations. The Learning module can either be a standalone process or a coprocess of the two SAIO core components. Working as a standalone process, it reads log files from the file system and then writes configuration files back to the file system. As a coprocess, the Learning module reserves one thread to read the tracing results from memory and then updates the configurations. However, this coprocess working mode would take up extra computing resources along with the applications and eventually lower the system efficiency. It is therefore not implemented in the prototype yet.

The third component of SAIO is a Statistic utility, which supports system administrators in analyzing I/O requests and locating a system-wide default setup. To develop a fully functional and powerful decision support system, the implementation of the Statistic utility is divided into two steps (the prototype implements the first one):

• First step: The Statistic utility parses the log files and generates CSV files containing all log information. System administrators can use their own tools to analyze these results and then choose a system-wide default setup based on their professional experience.

• Second step: The Statistic utility will integrate different statistical algorithms to analyze I/O requests, and third-party software to present the results.

3.3 SAIO Architecture

The SAIO framework, as shown in Figure 3.2, is built up from two functional modules: A.) the Core module for run-time I/O tracing and optimizing, and B.) the Learning module for parsing log data and generating optimal configurations. SAIO traces and stores the I/O requests' information into log files to feed the configuration generator, which extracts optimal configurations based on historical running logs. The optimal configurations are used for future I/O requests and help the applications to run faster. The Configuration pool and Log file pool are usually located in a shared workspace that user applications can access easily. These two SAIO modules support each other as a team to realize a semi-automatic I/O tuning mechanism on top of the MPI-IO library.


FIGURE 3.2: SAIO Architecture (the Core module traces MPI-IO requests and their performance related results, including MPI Info objects, and applies configurations from the Configuration pool during MPI initialize/open/read/write/close/finalize; the Learning module generates configurations from the log files in the Log file pool)

3.3.1 SAIO Running Modes

To provide more flexibility to end users and eliminate unnecessary overhead at run-time, five SAIO running modes are designed (a selection sketch follows the list):

• SAIO_MODE_SIZEONLY: As implied by the name, this mode only records the data transfer size of each I/O operation and requires no extra inter-process communication at run-time, to keep the tracing overhead minimized. These records provide the basic data transfer size information for the SAIO training process (Figure 3.6).

• SAIO_MODE_OPTONLY: The "optimizing only" mode allows SAIO to set optimal configurations for all file manipulation and access operations transparently. The tracing component will not be activated under this mode, so that no memory will be allocated for tracing I/O operations.

• SAIO_MODE_TRONLY: The "tracing only" mode records the I/O related information and provides users as well as system administrators a clear view for profiling the applications' I/O behavior. The log files generated in this mode are the sources from which the SAIO Learning module searches for optimal configurations.

• SAIO_MODE_OPTTR: This mode starts the two SAIO core components together. It sets the optimal configurations and records their impact on I/O requests into log files. This mode produces the highest overhead among the five running modes. A so-called real-time optimizing process has been developed using this running mode.

• SAIO_MODE_OPTCOLL: This mode is similar to the "optimizing only" mode but only covers the MPI collective I/O operations. It is compatible with parallel HDF5 and parallel NetCDF applications.
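The following sketch illustrates how such a running mode could be selected at start-up. The enum values mirror the five mode names above, but the environment variable name SAIO_MODE, the default mode and the parsing code are assumptions made for illustration only; the actual SAIO environment variables are exported as shown in Listing A.5.

/* Illustrative mode selection; SAIO_MODE and the default are hypothetical. */
#include <stdlib.h>
#include <string.h>

typedef enum {
    SAIO_MODE_SIZEONLY,
    SAIO_MODE_OPTONLY,
    SAIO_MODE_TRONLY,
    SAIO_MODE_OPTTR,
    SAIO_MODE_OPTCOLL
} saio_mode_t;

static saio_mode_t saio_get_mode(void)
{
    const char *m = getenv("SAIO_MODE");               /* hypothetical variable */
    if (m == NULL)
        return SAIO_MODE_OPTONLY;                      /* assumed default */
    if (strcmp(m, "SAIO_MODE_SIZEONLY") == 0) return SAIO_MODE_SIZEONLY;
    if (strcmp(m, "SAIO_MODE_TRONLY")   == 0) return SAIO_MODE_TRONLY;
    if (strcmp(m, "SAIO_MODE_OPTTR")    == 0) return SAIO_MODE_OPTTR;
    if (strcmp(m, "SAIO_MODE_OPTCOLL")  == 0) return SAIO_MODE_OPTCOLL;
    return SAIO_MODE_OPTONLY;
}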


3.3.2 Core Module: I/O Tracer & Optimizer

As shown in Figure 3.2, there are two components in the SAIO core module: I/O tracing and I/O optimizing. The MPI_INIT subroutine initializes the tracing and optimizing components, while the first MPI_FILE_OPEN subroutine of the entire application requires SAIO to access the Configuration pool. The I/O tracing component records all performance related information of file manipulation and access subroutines. As soon as the MPI_FINALIZE subroutine is invoked, it informs SAIO to write out the log file and finalizes the tracing component. Meanwhile, the I/O optimizing component gets the optimal configurations from the Configuration pool. These optimal configurations, including MPI info objects, are passed to the application twice. The first one is an application related configuration, which is for the most expensive saio_file_type group (Section 3.4.2) among all I/O requests of an application, while the second one is a data transfer size related configuration. Like the tracing component, the optimizing component is closed when the application finalizes.

I/O Tracer

In a parallel computation environment, absolute synchronization among processes is impossible. Therefore, the I/O operation running on each process starts and finishes differently. The first challenge when designing the I/O tracer is which tracing mechanism to choose. Two tracing policies were considered for the SAIO tracing component: "all processes" and "one process" tracing.

The "all processes" tracing policy (used by Darshan) needs each I/O tracing pro-cess to allocate a piece of its local memory space storing the tracing results. Whencalling MPI_FINALIZE, all processes flush the memory and write the results intoone log file. Its advantages are minimized inter-processes’ communications as wellas the precise and detailed tracing results. However, it needs to reserve local mem-ory spaces on all I/O processes, and write a large amount of log data. For some largescaling applications, the size of log data will be multiple gigabytes. In addition, apost-processing task for tracing results is necessary. Otherwise, a pre-processingprocedure for Learning module is required. Both of them are too expensive.

On the other hand, the "one process" tracing policy requires one process, usually the rank 0 MPI process, to be in charge of the entire tracing task. Three of its main advantages are:

• memory is only occupied on one process,

• its tracing overhead does not grow when applications scale out widely,

• and no pre- or post-processing program is required.


Nevertheless, its disadvantages are unavoidable. As the file access duration of each I/O process is not the same, it is necessary to check the file access duration on all MPI processes and choose the longest one, which indicates the effective duration of an I/O operation. This policy synchronizes the MPI processes by calling the MPI_ALLREDUCE subroutine, which issues inter-process communication. However, according to the MPI standard [3], the two MPI subroutines MPI_FILE_OPEN and MPI_FILE_CLOSE are collective I/O operations and synchronize the I/O processes implicitly. Therefore, some extra synchronization between these two collective operations does not produce too much overhead, except for applications with many asynchronous operations between the two above-mentioned MPI subroutines. In that case, the SAIO_MODE_SIZEONLY running mode still works, as no extra synchronization is required.

SAIO is designed with the "one process" tracing policy and uses the rank 0 MPI process to trace and store the tracing results2. Figure 3.3 shows its tracing process:

• Step 1: MPI_INIT triggers the initialization of SAIO. The initialization process does not allocate any memory space until the MPI_FILE_OPEN function is invoked. Meanwhile, SAIO stays inactive.

• Step 2: Based on the MPI file access mode (read-only, write-only, read-write) when an application opens a file, SAIO allocates the memory space on the rank 0 MPI process for recording the "read" and "write" operations separately.

• Step 3: For the monitored MPI data access operations ("read" and "write"), SAIO records their I/O related information, such as starting time stamps, data transfer sizes, I/O durations, bandwidths, MPI info objects and so on. All the information is stored in the local memory of the rank 0 MPI process. SAIO then idles again and consumes no computing resources.

• Step 4: The records are written into a categorized log file (Section 3.4.3) by the rank 0 MPI process when the MPI_FINALIZE subroutine is invoked. This writing design avoids extra I/O requests at run-time and minimizes the SAIO tracing overhead.

FIGURE 3.3: SAIO Tracing Process (MPI_INIT: SAIO initializes; MPI_FILE_OPEN: SAIO allocates memory; MPI_FILE_READ/WRITE: SAIO records I/O-related information; MPI_FINALIZE: SAIO writes log files)

2The overhead comparison of these two different tracing strategies for SAIO and Darshan will be presented in Section 4.2.3


I/O Optimizer

The I/O optimizer, the other role of the SAIO core module, improves I/O performance by transparently applying the optimal configurations. While investigating the factors that affect the I/O performance, I realized that two factors are not changeable at run-time: the number of MPI processes (MPI rank size) and the data transfer size (Section 3.4.2). Therefore, I designed a two-level category structure for the SAIO configuration pool: the first level uses the number of MPI-IO processes to name the configuration files, and the second level groups different data transfer sizes into multiple saio_file_types (Section 3.4.3 and Listing A.3). Among the available I/O tuning options/configurations, the MPI info is one of the most intuitive and widely used tuning options. Therefore, I will use MPI info objects to explain the SAIO optimization mechanism, although SAIO is capable of integrating other tuning options through extensions. In the following sections, the (optimal) configurations usually refer to the MPI info objects, if not explicitly mentioned otherwise.

While designing the SAIO optimizing process, I had to overcome a problem with getting the data transfer sizes. The number of bytes to read/write on each MPI process (the data transfer size) is unknown until the MPI read/write subroutines are invoked [3]. However, some MPI info objects, such as striping_factor as well as striping_unit for Lustre file systems (Section 2.1.1 and 2.3.3), impact the I/O performance only when a new file is created by calling the MPI_FILE_OPEN subroutine. Imagine the following scenario: SAIO needs to get a configuration for creating a new file, but the information about the data transfer sizes is unknown. Choosing an optimal configuration is almost impossible at this moment. To let SAIO get the optimal configuration before knowing the data transfer sizes, I have designed a two-stage optimization process (Figure 3.4; a minimal code sketch follows the two stages):

• Stage 1: SAIO accesses the application related configuration pool, fetches a proper configuration and then passes it to the MPI_FILE_OPEN subroutine. This application related configuration is the optimal configuration of the most expensive group of data transfer sizes within an application. Its definition will be easier to understand after learning about the SAIO configuration file structure (Section 3.4.3).

• Stage 2: Before the MPI reading/writing operations, SAIO takes the optimal data transfer size related configuration and passes it to the MPI reading/writing subroutines. The data transfer size related configuration contains the MPI info objects that impact the I/O performance of reading/writing existing files. Examples of these MPI info objects are cb_nodes, romio_cb_read, romio_cb_write etc.
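The two stages can be sketched with plain MPI-IO calls as follows. The helper names lookup_app_conf and lookup_size_conf as well as the concrete hint values are assumptions for illustration; only the hint keys and the two-stage order follow the description above.

/* Two-stage hint application, illustrative only; helper names and values are
   assumptions, the hint keys and the order follow the SAIO design. */
#include <mpi.h>

static MPI_Info lookup_app_conf(void)                    /* stage 1 configuration */
{
    MPI_Info info;
    MPI_Info_create(&info);
    /* striping hints only take effect when the file is created */
    MPI_Info_set(info, "striping_factor", "20");
    MPI_Info_set(info, "striping_unit", "4194304");
    return info;
}

static MPI_Info lookup_size_conf(long bytes_per_process) /* stage 2 configuration */
{
    MPI_Info info;
    MPI_Info_create(&info);
    /* collective buffering hints can still be changed for existing files */
    MPI_Info_set(info, "romio_cb_write",
                 bytes_per_process > 1048576 ? "enable" : "disable");
    MPI_Info_set(info, "cb_nodes", "8");
    return info;
}

void write_with_two_stage_tuning(MPI_Comm comm, const char *path,
                                 void *buf, int count, MPI_Datatype type)
{
    MPI_File fh;
    int elem_size;
    MPI_Type_size(type, &elem_size);

    /* Stage 1: pass the application related hints to MPI_FILE_OPEN. */
    MPI_Info open_info = lookup_app_conf();
    MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY, open_info, &fh);
    MPI_Info_free(&open_info);

    /* Stage 2: once the data transfer size is known, update the hints. */
    MPI_Info rw_info = lookup_size_conf((long)count * elem_size);
    MPI_File_set_info(fh, rw_info);
    MPI_Info_free(&rw_info);

    MPI_File_write_all(fh, buf, count, type, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}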

The locations of these two configuration pools need to be accessible from all running MPI processes. To minimize the overhead of reading configuration files, the two configuration pools are physically stored in one SAIO configuration file but logically separated (Section 3.4.3).

FIGURE 3.4: SAIO Optimizing Process (MPI_FILE_OPEN: SAIO gets the application related configuration; MPI_FILE_READ/WRITE: SAIO gets the data transfer size related configuration)

3.3.3 Learning Module

The Learning module runs independently of the SAIO core module (Figure 3.2). Its task is to use different strategies to find the optimal configurations from log files and to generate a configuration index file for selecting proper configuration files (Figure 3.5). The Learning module will be designed with a plug-in interface to integrate different statistical algorithms. A simple algorithm, which extracts the MPI info objects of the fastest I/O operation for each saio_file_type, was implemented in the current prototype (Section 3.4.6). Additionally, the Learning module is in charge of generating a configuration index file, which offers a guideline for optimizing untrained applications. In Section 3.4.3, the SAIO log, configuration and configuration index files will be presented in detail.

FIGURE 3.5: SAIO Learning Process (SAIO reads and parses log files, generates configuration files, generates the default configuration file and generates the configuration index file)

3.3.4 Machine Learning

The cooperation of the SAIO Core module and the SAIO Learning module realizes a semi-automatic I/O-tuning solution for engineering applications. To extract the optimal configurations from various I/O requests, large amounts of I/O log files are needed to feed the SAIO Learning module. It would consume too many computational resources to test all possible configurations with real engineering applications; therefore, I have designed a training utility to simulate different I/O types, which will include


• self-implemented MPI programs (available in prototype),

• instrumented parallel I/O benchmark software (unavailable yet)

• and I/O kernels of end user applications3 (unavailable yet).

Using these I/O simulation programs, the SAIO Core module and Learning module can build a basic knowledge base of optimal configurations as well as an initial configuration index file (called the training phase). When scientists and engineers use SAIO to accelerate their applications, the SAIO tracing component will extend the log file pool and provide more up-to-date tracing results. Regularly running the SAIO Learning module and/or the SAIO training utility keeps the configuration pool and the configuration index file up to date (called the learning phase).

Training Phase

The goal of the training phase is to find the optimal "application" and "data transfer size" related configurations with the above mentioned I/O simulations. Figure 3.6 illustrates the SAIO training process as two steps. In the first step, the SAIO tracing component traces the I/O simulation and then generates log files, which include the I/O bandwidths achieved by applying various configurations. In the second step, the SAIO Learning module parses the log files in the Log file pool and then stores the generated optimal configurations in the Configuration pool.

FIGURE 3.6: SAIO Training Process (Step 1: Tracing, Step 2: Learning) - Two Pools in Red Font are Variables for the SAIO Training and Learning Processes

3If the simulation and science aspects of the program can be removed, leaving only the representative data structures, the resulting "I/O kernel" becomes a valuable resource for exploring all available tuning options [9].


The training process needs two input files (Figure 3.6): one contains a list of testing configurations and the other a list of the application's data transfer sizes. The configuration list is created by system administrators or someone who knows about the MPI-IO library, while the list of data transfer sizes is generated by SAIO running in the SAIO_MODE_SIZEONLY mode with the applications. Using these two input files, the SAIO training utility tries every combination of the testing configurations and the data transfer sizes. After the I/O tracing step completes, the generated log file is stored in the Log file pool and fed into the Learning module (learning step). The generated configuration files are used to accelerate the applications.

Because these two lists are not created fully automatically, SAIO is defined as a semi-automatic I/O tuning framework.

Learning Phase

The task of the learning phase is to update the Configuration pool along with the changing applications and systems, since HPC systems are normally confronted with very changeable system load conditions and growing usage of the underlying file system. Sometimes the optimal configurations become out of date when the I/O requests of applications have changed or the applications run on a different number of MPI processes. Sometimes users achieve a better I/O performance with some "brand new" configurations (configurations untested in SAIO training processes), which are recorded by SAIO. These changes lead to modifications of the following four factors:

• training configurations list: Normally, this list is created by system administrators or I/O experts. If the configuration of the file system has changed, for instance after a file system extension, system administrators can generate a new list. Regularly running the SAIO training process keeps the Configuration pool up to date.

• optimal configurations: Some users have a solid programming background. They develop their own applications with their specific I/O access patterns. With their new configurations, they can push the I/O performance up to another, higher level. If these new configurations are traced by SAIO, regularly running the SAIO Learning module will discover and store them in the Configuration pool.

• data transfer sizes list: Some applications read different source files and then write the target files accordingly. Under these circumstances, their data transfer sizes change and are recorded in the Log file pool. There are two options to extend the Configuration pool: the first one is running the SAIO Learning module, and the second one is running the SAIO training process.


• number of processes: Changing the source data could also result in running applications on a different number of processes. In this case, a proper optimal configuration file is applied according to the configuration index file (Section 3.4.3). Similar to changing the data transfer sizes list, both the SAIO Learning module and the SAIO training process can extend the Configuration pool.

The training phase refers in particular to running the SAIO training process for the first time to create the initial SAIO configuration files for applications. The learning phase refers to the period in which SAIO adjusts or extends the Configuration pool along with changes to the four factors mentioned above. This Configuration pool adjustment process can be carried out by either the SAIO Learning module or the SAIO training process. The SAIO Learning module only adds new optimal configurations without updating the existing ones, while the SAIO training process updates the entire Configuration pool including the configuration index file. Automating this process makes SAIO an intelligent I/O tuning solution.

3.4 SAIO Implementation

3.4.1 Introduction

SAIO is implemented in C and compiled with the GNU Compiler Collection (GCC)4 from version 4.8.4 as well as the Intel Compiler5. As a wrapper of the MPI-IO library, it is compatible with parallel HDF5 and parallel NetCDF applications in C & Fortran. To maximize its portability, SAIO was implemented using as few system dependent functions and/or libraries as possible. Table 3.1 lists the seven key components of SAIO and their positions in the SAIO source code directory.

saio/src/core                             saio/src/util
MPI wrapper (for C applications)          SAIO learning module
PMPI wrapper (for Fortran applications)   SAIO training utility
MPI-IO tracing component                  SAIO statistic utility
MPI-IO optimizing component

TABLE 3.1: Seven SAIO Key Components

3.4.2 Influence Factors of I/O Performance

Which information is necessary for SAIO to evaluate the I/O performance? Which factors affect the I/O performance and should be recorded? Which options can be tuned for a better I/O performance? The answers to these three questions determine the content of the SAIO log files as well as the SAIO configuration files. The following four influence factors have been considered in the prototype.

4http://gcc.gnu.org/
5http://software.intel.com/en-us/intel-compilers

Number of MPI Processes for I/O Operations

Engineers and scientists use different numbers of processes to run their simulations. However, these numbers are usually not equal to the MPI rank6 size for MPI-IO requests, because applications can use one process, a subset of processes or all processes to execute the I/O operations, which results in the following three different I/O patterns:

• file-per-process: each process accesses one file (the number of files = the number of I/O processes)

• multiple shared files: I/O processes access several shared files (the number of files < the number of I/O processes)

• one single shared file: all I/O processes access one shared file (the number of files = 1)

Different I/O patterns and different numbers of I/O processes can influence the I/O performance significantly. However, they are not changeable at run-time; hence, this factor is used as the file name of the SAIO log and configuration files (Section 3.4.3).

Data Transfer Size

The data transfer size, another factor that is unchangeable at run-time, indicates how many bytes of data are read/written by each MPI-IO process. There are different optimal configurations for different data transfer sizes. However, as it would be too expensive to find the optimal configuration for every data transfer size, a group index using saio_file_type has been defined to indicate a group of data transfer sizes that have similar optimal configurations (Listing A.3). In the prototype, 61 data transfer size groups are defined from 1 B to 1,024,000,000 B (976.56 MB), and any data transfer size larger than 976.56 MB belongs to the 62nd group. Since saio_file_type is merely a group index in the SAIO log files, where the data transfer sizes are also stored, SAIO remains compatible with the existing log files if the group definition is changed.
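The mapping from a per-process transfer size to a group index can be sketched as below. This is purely illustrative: the actual 61 group boundaries are defined in Listing A.3 and are not reproduced here, so the geometric bucketing in this sketch is an assumption.

/* Purely illustrative bucketing; the real group boundaries are in Listing A.3. */
static int saio_file_type_of(long bytes)
{
    const long max_size = 1024000000L;     /* 976.56 MB, end of group 61 */
    if (bytes > max_size)
        return 62;                         /* 62nd group: everything larger */

    int  group = 1;
    long limit = 1;                        /* hypothetical geometric boundaries */
    while (limit < bytes && group < 61) {
        limit *= 2;
        group++;
    }
    return group;
}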

6A group is an ordered set of process identifiers (henceforth processes); processes are implementation-dependent objects. Each process in a group is associated with an integer rank. Ranks are contiguous and start from zero. [3]


MPI-IO Subroutine

The MPI-IO subroutine used in user applications cannot be changed at run-time either. Recording the name of the MPI-IO subroutine in log files provides valuable information for analyzing the application's I/O kernel and helps users to identify the bottleneck of their I/O requests. Since different MPI-IO subroutines (blocking vs. non-blocking, collective vs. non-collective) are implemented differently, their tuning strategies are also different.

MPI info

The MPI info is one of the most intuitive run-time I/O tuning options for diverse MPI-IO libraries. According to the MPI standard, different MPI implementations or MPI-IO libraries can define different MPI info objects besides the reserved MPI file hints7. The implemented prototype searches for the optimal combinations of MPI info objects and uses them as configurations to accelerate applications' I/O requests.

3.4.3 Definition of SAIO Files

SAIO Log File

SAIO log files store the tracing results of MPI-IO subroutines. Rather than an I/O tracing tool, SAIO is mainly an I/O auto-tuning framework. Its log files support tuning the I/O requests of engineering applications. Therefore, the effort to generate and process SAIO log files should be as small as possible.

Besides the previously mentioned influence factors, SAIO records the time stamp8 when the MPI-IO operations are invoked. It provides a timeline for analyzing applications' I/O requests, and can be used by the Learning module and the Statistic utility to limit their search range when parsing the SAIO log files. To evaluate the I/O performance, SAIO logs the aggregated data transfer size of all I/O processes (value of bytes) and the processing duration of the slowest process in seconds (value of duration). Based on these two values, the bandwidth (MB/s) is also calculated and stored.

SAIO log files are designed in a JavaScript Object Notation (JSON)9-like format, which is lightweight, readable and extendable. Listing 3.1 presents some tracing results in the SAIO log file 1200.saio, whose file name indicates the number of I/O processes. Two data transfer sizes, 1,037,504 (1,245,004,800 ÷ 1,200) and 96 (115,200 ÷ 1,200) bytes, are assigned to the 35th and the 1st saio_file_type group. It takes about 0.21 seconds for the MPI_File_write_all subroutine to finish writing all 1.16 GB (1,245,004,800 B) of data with the recorded MPI info objects (Line 2). This example log file shows that the I/O performance of collectively writing 1.16 GB of data striped over 8 OSTs with 1 MB stripe_size reaches about 5693.77 MB/s (Line 2), while a writing bandwidth of 14460.24 MB/s is achieved over 20 OSTs with 4 MB stripe_size (Line 5).

7MPI reserves some potentially useful file hints which can be found on page 502 in the MPI standard book [3]
8The SAIO time stamp is a Unix epoch, i.e. the number of seconds since 1st Jan. 1970.
9http://www.json.org/

1 ...
2 {"saio_file_type":35,"time_stamp":1497693385.447081,"op":"MPI_File_write_all","bytes":1245004800,"duration":0.208531,"bandwidth":5693.773574,"mpi_info":[{"romio_cb_write":"enable"},{"striping_factor":"8"},{"striping_unit":"1048576"}]}
3 {"saio_file_type":1,"time_stamp":1497693385.465031,"op":"MPI_File_write_all","bytes":115200,"duration":0.005382,"bandwidth":20.412864,"mpi_info":[{"romio_cb_write":"enable"},{"striping_factor":"8"},{"striping_unit":"1048576"}]}
4 ...
5 {"saio_file_type":35,"time_stamp":1497694057.994992,"op":"MPI_File_write_all","bytes":1245004800,"duration":0.082110,"bandwidth":14460.237983,"mpi_info":[{"romio_cb_write":"enable"},{"striping_factor":"20"},{"striping_unit":"4194304"}]}
6 {"saio_file_type":1,"time_stamp":1497694058.017633,"op":"MPI_File_write_all","bytes":115200,"duration":0.005805,"bandwidth":18.925579,"mpi_info":[{"romio_cb_write":"enable"},{"striping_factor":"20"},{"striping_unit":"4194304"}]}
7 ...

LISTING 3.1: Two Records of SAIO Log File (1200.saio) from Training the CFD Application's Process APE4sources in Section 5.2

SAIO Configuration File

SAIO configuration files save the optimal configurations generated by the SAIO Learning module. Similar to SAIO log files, they are also named after the number of MPI-IO processes and formatted in a JSON-like structure. The content of a SAIO configuration file is a subset of the corresponding SAIO log file. Since the SAIO optimizing component keeps the chosen SAIO configuration file in the local memory of every MPI-IO process at run-time, the configuration files are designed to store as little as possible but as much as necessary optimization information. When changing the configuration file manually, users can see the optimizing effects of new configurations without recompiling the application.

Listing 3.2 presents the SAIO configuration file (1200.conf) generated from the above mentioned example SAIO log file (1200.saio). The SAIO Learning module compares all of the tracing results in the SAIO log file, extracts the MPI info of the fastest writing operation, and then generates a SAIO configuration file accordingly. In each configuration file there is a default configuration assigned to the saio_file_type=0 group. It is a copy of the configuration of the file type group that has the maximal total time consumption during the SAIO training process. For example, in Listing 3.2 the default configuration is a copy of the 35th file type group's configuration.

1 {"saio_file_type":35,"mpi_info":[{"romio_cb_write":"enable"},

{"striping_factor":"20"},{"striping_unit":"4194304"}]}

2 {"saio_file_type":0,"mpi_info":[{"romio_cb_write":"enable"},{

"striping_factor":"20"},{"striping_unit":"4194304"}]}

3 {"saio_file_type":1,"mpi_info":[{"romio_cb_write":"enable"},{

"striping_factor":"8"},{"striping_unit":"1048576"}]}

LISTING 3.2: Generated SAIO Configuration File (1200.conf) from Training the CFD Application's Process APE4sources in Section 5.2

SAIO Configuration Index File

What is the SAIO configuration index file? To answer this question, an explanation of the SAIO Configuration pool's structure is necessary. It contains three types of configuration files: one mandatory default configuration file (0.conf), one optional configuration index file (index.conf) and multiple optional configuration files (#_PROCESSES.conf). The default configuration file is a copy of the configuration file with the maximal #_PROCESSES (the largest trained job size). However, this default configuration file could lead to poor I/O performance for some untrained job sizes, because the optimal configurations are very different, even when the same application runs on different numbers of processes (Section 4.3). To optimize the untrained job sizes, the SAIO configuration index file is designed to provide a guideline for selecting a proper configuration file from the SAIO configuration pool, instead of constantly selecting the default one.

Listing 3.3 shows the content of index.conf used in Section 4.2.2. Untrained job sizes are assigned the configuration file of the next larger trained job size. For example, when an application running with 360 MPI-IO processes starts, the configuration file named 480.conf will be selected. The configuration file selection logic can be altered by simply changing the values of min and max. Section 3.4.6 describes the generation process of the index file, while Section 4.2.2 evaluates the benefits of using this index file.

1 {"min":1,"max":24,"conf":24}

2 {"min":25,"max":120,"conf":120}

3 {"min":121,"max":240,"conf":240}

4 {"min":241,"max":480,"conf":480}

5 {"min":481,"max":1200,"conf":1200}

6 {"min":1201,"max":1536,"conf":1536}

7 {"min":1537,"max":2400,"conf":2400}

8 {"min":2401,"max":999999999,"conf":0}

LISTING 3.3: SAIO Configuration Index File (index.conf) of the Evaluations in Section 4.2.2
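The selection logic described above can be sketched as follows, assuming the entries of Listing 3.3 have already been parsed. The type saio_index_entry_t and the function select_conf_file are illustrative names, not SAIO's actual API.

/* Illustrative index lookup; the entries are taken from Listing 3.3. */
#include <stdio.h>

typedef struct {
    int min;    /* smallest job size covered by this entry           */
    int max;    /* largest job size covered by this entry            */
    int conf;   /* configuration file to use, 0 means default 0.conf */
} saio_index_entry_t;

static const saio_index_entry_t index_conf[] = {
    {1, 24, 24}, {25, 120, 120}, {121, 240, 240}, {241, 480, 480},
    {481, 1200, 1200}, {1201, 1536, 1536}, {1537, 2400, 2400},
    {2401, 999999999, 0}
};

/* Return the configuration file name for a given number of MPI-IO processes. */
static void select_conf_file(int mpi_rank_size, char *name, size_t len)
{
    size_t n = sizeof(index_conf) / sizeof(index_conf[0]);
    for (size_t i = 0; i < n; i++) {
        if (mpi_rank_size >= index_conf[i].min && mpi_rank_size <= index_conf[i].max) {
            snprintf(name, len, "%d.conf", index_conf[i].conf);
            return;
        }
    }
    snprintf(name, len, "0.conf");          /* fall back to the default file */
}

int main(void)
{
    char name[32];
    select_conf_file(360, name, sizeof(name));
    printf("360 MPI-IO processes -> %s\n", name);   /* prints 480.conf */
    return 0;
}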

3.4.4 MPI and PMPI Wrapper

The two most widely used MPI implementations, MPICH and Open MPI, are implemented in C. Both of them redirect the MPI subroutines in Fortran to the profiling MPI (PMPI) interface in C. Therefore, I have implemented an MPI wrapper, which is compiled as the shared library libsaio.so for C applications, and a PMPI wrapper (libpsaio.so) for Fortran applications (Figure 3.7). Both wrappers are implemented with the dynamic symbol facilities10 of the POSIX specification and can be dynamically loaded by setting the system environment variable LD_PRELOAD. Since the two wrappers are similar, the MPI wrapper is used as an example to introduce the implementation of SAIO in the following sections.

FIGURE 3.7: SAIO MPI and PMPI Wrapper for User Applications (C applications enter through the MPI wrapper, Fortran applications through the PMPI wrapper; both feed the SAIO I/O tracing and optimizing components in front of the MPI library in C)

10http://pubs.opengroup.org/onlinepubs/9699919799/


Figure 3.8 shows the workflow of wrapping the MPI_Init() subroutine, and Listing A.4 presents its code segment in C. The dynamic link function obtains the address of a symbol pointing to the profiling MPI subroutine PMPI_Init(). The same input parameters used in the user application are passed to PMPI_Init() directly. After successfully executing the PMPI_Init() subroutine, SAIO starts initializing its two components according to the SAIO running mode. The implementation of the MPI wrapper for other MPI or PMPI subroutines is analogous.
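A minimal sketch of such a wrapper is given below. It only illustrates the mechanism; the real SAIO code is shown in Listing A.4, and the commented-out initialization calls are placeholders, not SAIO's actual function names.

/* Illustrative MPI_Init wrapper; the SAIO initialization calls are placeholders. */
#define _GNU_SOURCE
#include <mpi.h>
#include <dlfcn.h>
#include <stddef.h>

int MPI_Init(int *argc, char ***argv)
{
    /* Obtain the address of the profiling entry point at run-time. */
    int (*real_pmpi_init)(int *, char ***) =
        (int (*)(int *, char ***)) dlsym(RTLD_NEXT, "PMPI_Init");
    if (real_pmpi_init == NULL)
        return MPI_ERR_OTHER;

    /* Pass the user's input parameters to PMPI_Init() unchanged. */
    int ret = real_pmpi_init(argc, argv);
    if (ret != MPI_SUCCESS)
        return ret;

    /* Initialize the SAIO tracing and/or optimizing components according to
       the selected running mode (placeholders, see Figure 3.8):            */
    /* saio_init_tracing();    */
    /* saio_init_optimizing(); */

    return MPI_SUCCESS;
}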

FIGURE 3.8: SAIO MPI Wrapper for MPI_Init() Flow Chart (find the address of PMPI_Init(), return an error if it does not return MPI_SUCCESS, otherwise initialize the SAIO tracing and/or optimizing components depending on the running mode)

SAIO has instrumented MPI and PMPI subroutines for opening, closing, reading and writing files and for setting file views in mpi_wrapper.c, pmpi_wrapper.c as well as fh_mpi_wrapper.c (for HDF5 applications in Fortran).

3.4.5 I/O Tracing and Optimizing

I/O Tracing

The SAIO tracing component is implemented with the "one process" tracing policy. The SAIO process instance on the rank 0 MPI process is in charge of collecting tracing results and storing them into its allocated local memory. Meanwhile, the SAIO process instances on the other MPI processes release their resources and stay idle after contributing the duration of their I/O operations. As a result, all other MPI processes keep running while the rank 0 MPI process saves the tracing results into its local memory.

Figure 3.9 illustrates the workflow of the SAIO I/O tracing process: the wrapper of MPI_INIT starts initializing the SAIO tracing component, which allocates the local memory space on the rank 0 MPI process to store tracing results. After tracing the MPI reading/writing operations, the SAIO tracing component saves the I/O related information, such as the operations' durations, data transfer sizes, operation bandwidths, names of MPI-IO subroutines, MPI info objects and so on, into the allocated memory (refer to Listing 3.1 for a concrete log file example and Listing A.1 for its data structure in memory). As soon as the application calls the MPI_FINALIZE subroutine, the rank 0 MPI process writes all tracing results into the log file pool and finalizes the SAIO tracing component.

FIGURE 3.9: SAIO Tracing Process Flow Chart (the wrappers initialize the SAIO tracing component at MPI_INIT, record read- and write-related information on the rank 0 process during MPI_FILE_READ/WRITE, and write the log file and finalize the tracing component at MPI_FINALIZE)

In a parallel computing environment, each process runs a small part of the application code and takes a different amount of time to execute it, even though all processes execute the same program. This difference in execution time is even more obvious for parallel I/O operations, since files are stored in a file system outside the HPC system. The connection between an HPC system and a file system is unfortunately limited; therefore, the time consumption of I/O operations on the rank 0 MPI process cannot represent the time consumption of I/O operations on all MPI processes. Thanks to the MPI_ALLREDUCE subroutine, an MPI collective operation collecting particular information from all MPI processes, a valid duration of an I/O operation can be obtained (Listing 3.4). After obtaining the duration, the rank 0 MPI process starts collecting the other tracing results, while the other MPI processes continue to run the next instructions.

time_stamp_1 = PMPI_Wtime();
ret = __real_PMPI_File_read_all(fh, buf, count, datatype, status);
time_stamp_2 = PMPI_Wtime();
double read_time = time_stamp_2 - time_stamp_1, longest_read_time;
PMPI_Allreduce(&read_time, &longest_read_time, 1, MPI_DOUBLE, MPI_MAX, mpi_comm);
......
if (rank == MASTER_RANK) {
    ret = saio_trace_mpi_read(mpi_comm, info, count, datatype, mpi_rank_size,
                              "MPI_File_read_all", longest_read_time,
                              &read_real_time_opt_file_type);
}

LISTING 3.4: Code Segment for Recording Read Duration

Using the MPI_ALLREDUCE subroutine to trace MPI-IO operations can be a hidden problem for non-blocking and independent MPI-IO operations, especially when the application scales out11 widely. For collective MPI-IO operations, this is not a problem, since they synchronize the MPI processes implicitly. The overhead evaluation results in Section 4.2.3 indicate that its effect is very limited and acceptable in a production environment.

Figure 3.10 describes the workflow of SAIO recording the details of MPI-IO operations. As SAIO stores the tracing results in two categories (read and write), the tracing workflow is divided into two paths as well. In both paths, the mpi_rank_size is used to name the SAIO log files, while the previously mentioned I/O related information, including the MPI info objects, is collected and stored in the local memory of the rank 0 MPI process.

FIGURE 3.10: SAIO Recording Operation Details Flow Chart - the Two Processes with Red Font will be Presented in Figure 3.13

Some engineering applications can issue hundreds of thousands of I/O operations. If SAIO tried to record them all, the tracing results would exhaust the local memory on the rank 0 MPI process and crash the running process. To prevent this worst case, a variable, MAX_NO_SAIO_RECORDS, has been defined to restrict the amount of tracing results and the maximal memory occupation on the rank 0 MPI process. Even so, the problem of how to handle the situation in which the number of I/O operations exceeds the predefined MAX_NO_SAIO_RECORDS had to be solved. Two possible solutions, ring-buffer and flush out, were discussed in Section 3.1.3.

11scale out or scale horizontally means using more compute nodes to run applications. scale up or scale vertically means replacing the old resources on compute nodes with more powerful resources, like CPU and memory, so that applications run faster.


After evaluating the SAIO finalizing overhead in Section 4.2.3, I found that writing 65,535 reading and 65,535 writing records plus freeing the allocated memory took less than 5 seconds. Hence, the flush out solution has been implemented.

I/O Optimizing

The SAIO optimizing component needs to set the optimal MPI info objects on all running MPI processes; therefore, every MPI process has to allocate a small piece of its local memory to store the optimal MPI info objects.

FIGURE 3.11: SAIO Optimizing Flow Chart (SAIO on all processes stores the application and data size related configurations in local memory at MPI_FILE_OPEN, applies them to the open and read/write operations, and finalizes the optimizing component at MPI_FINALIZE)

Figure 3.11 illustrates the SAIO optimizing workflow. The wrapper of the MPI_INIT subroutine initializes the SAIO optimizing component, which prepares the SAIO optimizing environment variables. Calling the MPI_FILE_OPEN subroutine triggers SAIO to read the application and data size related configurations from the SAIO configuration files. These configurations are stored in the local memory of each MPI process. The SAIO optimizing instances on all MPI processes pass the application related configuration to the MPI_FILE_OPEN subroutine (the stage 1 optimization). Just before reading or writing the data, SAIO knows the data transfer size. The SAIO optimizing component gets the data size related configuration from memory and sets it via the MPI_FILE_SET_INFO subroutine (the stage 2 optimization). Similar to the SAIO tracing component, the SAIO optimizing component's finalizing process starts as soon as the MPI subroutine MPI_FINALIZE is invoked. It closes the SAIO optimizing component and frees the allocated memory on all MPI processes.

In the SAIO configuration file pool, there are three kinds of configuration files (Sections 3.4.3 and 3.4.6). Figure 3.12 shows the workflow of the SAIO optimizing component accessing them:

• #_PROCESSES.conf: These configuration files are the first search targets of the SAIO optimizing component. The file name indicates the number of MPI processes used for the I/O requests. The SAIO optimizing component gets the value of mpi_rank_size and searches for the configuration file accordingly. If this configuration file is found, all MPI processes read and store its content into their local memories. Otherwise, SAIO tries to open the configuration index file index.conf.

• index.conf: The configuration index file provides a guideline for untrained process counts to get a proper configuration file instead of the default one. If the index.conf file is found, the SAIO optimizing component gets the proper SAIO configuration file name from this index file and saves the configurations into all MPI processes' local memories. Otherwise, the optimizing component has to use the default configurations in 0.conf.

• 0.conf: The default configuration file is a copy of the SAIO configuration file with the maximal number of MPI processes in the configuration pool. Its content is used if neither of the previously mentioned files exists.

FIGURE 3.12: SAIO Getting Configuration Flow Chart (look for #_PROCESSES.conf first, then for index.conf, and otherwise fall back to the default optimal configurations in 0.conf)

The selected configurations are stored as the data structure saio_conf_t (Listing A.1) in each MPI process's memory until the end of the application. When the application related configuration is passed to the MPI_FILE_OPEN subroutine, the data transfer size is unknown. In a Lustre file system environment, the stripe_count and stripe_size must be set when calling the MPI_FILE_OPEN subroutine to create new files (Section 3.3.2). However, the striping setup of the application related configuration might not be the best one for every MPI-IO operation, since the data transfer sizes are unknown.

To solve this problem, the SAIO real-time optimization process has been developed (bold font in Figure 3.13). Its basic idea is to optimize for the longest I/O operation among the last N I/O operations (N=1 by default). As shown in Figure 3.13, this process requires the cooperation of both the SAIO optimizing and tracing components running in the SAIO_MODE_OPTTR mode. The OPT frequency is set via the predefined variable SAIO_REAL_TIME_OPT_FREQUENCY (Listing A.5). If the OPT frequency is set to 1, the real-time optimization process will not start and the default application related configuration is always used for all MPI_FILE_OPEN subroutines. Otherwise, it controls how many previous I/O operations to compare. After recording the MPI_FILE_WRITE subroutine, SAIO checks whether it is time to compare the tracing results. If yes, it traverses the last N records, picks the record of the longest I/O operation, broadcasts its saio_file_type value (index) to all MPI processes and then sets N=1. Otherwise, it executes N=N+1 and continues with the next instruction. When calling the next MPI_FILE_OPEN subroutine, SAIO uses the optimal configuration of this longest I/O operation and hopes that it fits the next I/O operation better than the default application related configuration.
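The bookkeeping of this real-time optimization process can be sketched as follows. The names record_t and real_time_optimize are illustrative, and the sketch assumes that all processes call the function with the same counter and OPT frequency (which holds for collective operations); only rank 0 owns the tracing records, in line with the one-process tracing policy.

/* Illustrative real-time optimization bookkeeping; names are assumptions. */
#include <mpi.h>

typedef struct {
    int    saio_file_type;    /* group index of the data transfer size   */
    double duration;          /* longest duration over all MPI processes */
} record_t;

/* Returns the group index to use for the next MPI_FILE_OPEN, or -1 if the
   default application related configuration should be kept. */
static int real_time_optimize(const record_t *records, int *n,
                              int opt_frequency, MPI_Comm comm)
{
    if (opt_frequency == 1)                 /* feature disabled */
        return -1;

    if (*n < opt_frequency) {               /* not yet time to compare */
        (*n)++;
        return -1;
    }

    int rank, index = 0;
    MPI_Comm_rank(comm, &rank);
    if (rank == 0) {                        /* only rank 0 holds the records */
        int slowest = 0;
        for (int i = 1; i < *n; i++)
            if (records[i].duration > records[slowest].duration)
                slowest = i;
        index = records[slowest].saio_file_type;
    }
    /* Broadcast the group index of the longest I/O operation to all processes. */
    MPI_Bcast(&index, 1, MPI_INT, 0, comm);
    *n = 1;
    return index;
}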

FIGURE 3.13: SAIO Real-Time Optimization Based on Predefined Frequency Flow Chart

Similar to the writing process example presented above, the same real-time optimization process is implemented for reading operations as well.


3.4.6 SAIO Learning Module

The SAIO Learning module is one of the SAIO core components that realize its semi-automatic I/O tuning feature. Its task is to find the optimal configuration for each saio_file_type from the tracing results of applications and SAIO training processes. These optimal configurations are stored in the SAIO configuration files. After gathering large amounts of tracing results from various scientific and engineering applications, system administrators can use this module to create a global SAIO configuration pool.

FIGURE 3.14: Workflow of SAIO Learning Module (for each log file: read every record, get the MPI info objects of each file type with the maximal bandwidth, and write the file type and MPI info objects to a configuration file; finally generate the default configuration file and the configuration index file) - the Two Processes with Red Font will be Presented in Figure 3.15

Figure 3.14 presents the workflow of the SAIO Learning module. It reads all log files in an assigned SAIO log file pool one by one. For all records in a log file, SAIO compares their bandwidths and gets the MPI info objects of each saio_file_type with the maximal bandwidth. After parsing all records in one log file, a SAIO configuration file is created accordingly. Afterwards it searches for the next log file. When all log files in the SAIO log file pool have been processed, the SAIO Learning module continues to create the default configuration file (0.conf) as well as the configuration index file (index.conf).
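The simple strategy of keeping, per saio_file_type, the MPI info objects of the record with the highest bandwidth can be sketched as follows. The type log_record_t, the array best[] and the constant MAX_FILE_TYPES are illustrative names only, and the caller is assumed to zero-initialize best[].

/* Illustrative per-file-type selection of the fastest configuration. */
#define MAX_FILE_TYPES 63                  /* groups 0..62, see Section 3.4.2 */

typedef struct {
    int    saio_file_type;
    double bandwidth;                      /* MB/s, as stored in the log file */
    char   mpi_info[256];                  /* serialized MPI info objects     */
} log_record_t;

/* Scan all parsed records of one log file and remember, per file type,
   the MPI info objects of the fastest I/O operation; best[] must be
   zero-initialized by the caller. */
static void select_optimal(const log_record_t *records, int n,
                           log_record_t best[MAX_FILE_TYPES])
{
    for (int i = 0; i < n; i++) {
        int t = records[i].saio_file_type;
        if (t < 0 || t >= MAX_FILE_TYPES)
            continue;                      /* ignore malformed records */
        if (records[i].bandwidth > best[t].bandwidth)
            best[t] = records[i];          /* new fastest configuration */
    }
    /* best[] is then written out as <mpi_rank_size>.conf (cf. Listing 3.2). */
}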

FIGURE 3.15: SAIO Creating a Default Optimal Configuration File (0.conf) and a Configuration Index File (index.conf) (read the file names of all N configuration files into an integer array conf[], sort conf[] in ascending order, write one index entry per file, end with a catch-all entry with max 999,999,999 pointing to 0.conf, and copy the configuration file conf[N] to 0.conf)

The SAIO configuration index file (Section 3.4.3) has been designed to support dynamically selecting optimal configuration files. The default selection logic is to assign the configuration file of the next larger job size to the incoming application if an exactly suitable configuration file is not available. Its generation process is presented in Figure 3.15. The advantages of using this index file have been evaluated in Section 4.2.2.

3.4.7 SAIO Training Utility

The SAIO training utility is an automated process combining the SAIO Core module and the SAIO Learning module. Regularly running the SAIO training process keeps the SAIO Configuration pool up to date (learning phase). Figure 3.16 illustrates the workflow of the SAIO training utility. Firstly, it reads the log file generated by applications as well as the configuration searching scope file created by an I/O expert. The results of the Cartesian product12 of these two files are used, so that the SAIO training process can launch the self-implemented MPI program to test all listed configurations and data transfer sizes. The tracing log file then contains the I/O performance information obtained by applying the different configurations. At last, the SAIO Learning module reads this tracing log file and generates a configuration file.
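The core of the training utility, testing the Cartesian product of both input lists, can be sketched as follows. The function run_io_test and the example list contents are placeholders for illustration; in SAIO the tests are executed by the self-implemented MPI program and traced into the log file pool.

/* Illustrative training loop over the Cartesian product of both input lists. */
#include <stdio.h>

typedef struct {
    const char *striping_factor, *striping_unit, *romio_cb_write;
} conf_t;

static void run_io_test(const conf_t *conf, long bytes_per_process)
{
    /* Placeholder: would launch the MPI test program with these hints and let
       the SAIO tracing component write the result into the log file pool. */
    printf("test %ld B with striping_factor=%s striping_unit=%s cb_write=%s\n",
           bytes_per_process, conf->striping_factor,
           conf->striping_unit, conf->romio_cb_write);
}

int main(void)
{
    /* Hypothetical contents of the two input files from Figure 3.16. */
    conf_t confs[] = { {"8",  "1048576", "enable"},
                       {"20", "4194304", "enable"} };
    long sizes[]   = { 96, 1037504 };

    for (size_t i = 0; i < sizeof(confs) / sizeof(confs[0]); i++)
        for (size_t j = 0; j < sizeof(sizes) / sizeof(sizes[0]); j++)
            run_io_test(&confs[i], sizes[j]);   /* every configuration with every size */
    return 0;
}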

FIGURE 3.16: SAIO Training Utility Flow Chart (generate the lists of data transfer sizes from the log files and the configuration list from the searching scope file, test all data transfer sizes with all configurations, let SAIO generate log files, and let the SAIO learning module generate the configuration file) - the Process with Red Font was Presented in Figure 3.14

3.4.8 SAIO Statistic Utility

A simple statistic utility has been implemented in the prototype to support system administrators or engineers in profiling the MPI-IO operations. It reads the SAIO log files and generates CSV formatted files accordingly. System administrators or engineers can use their preferred tools, such as Microsoft Office13, Open Office14 etc., to analyze the I/O behavior and the impacts of different configurations. In the next version of SAIO, I plan to implement a complete solution like darshan-util15 or a graphical user interface (GUI) to support I/O analysis.

¹² http://en.wikipedia.org/wiki/Cartesian_product
¹³ http://en.wikipedia.org/wiki/Microsoft_Office
¹⁴ http://www.openoffice.org/
¹⁵ http://www.mcs.anl.gov/research/projects/darshan/docs/darshan-util.html
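Such a conversion needs little more than re-emitting the traced values with a separator. The snippet below is purely illustrative; the whitespace-separated field layout it assumes for a SAIO log record (operation, transfer size, bandwidth) is an assumption, not the actual log format.

#include <stdio.h>

/* Convert an assumed whitespace-separated log file into a CSV file with
 * one line per traced I/O operation. */
static int log_to_csv(const char *log_path, const char *csv_path)
{
    FILE *in  = fopen(log_path, "r");
    FILE *out = fopen(csv_path, "w");
    char op[16];
    long size;
    double bw;

    if (!in || !out) {
        if (in)  fclose(in);
        if (out) fclose(out);
        return -1;
    }
    fprintf(out, "operation,transfer_size,bandwidth_mb_s\n");
    while (fscanf(in, "%15s %ld %lf", op, &size, &bw) == 3)
        fprintf(out, "%s,%ld,%.2f\n", op, size, bw);
    fclose(in);
    fclose(out);
    return 0;
}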


3.4.9 SAIO Software Compatibility

SAIO has been tested with different MPI, parallel HDF5 and parallel NetCDF applications in both C and Fortran codes. It follows the current MPI standard, and all SAIO running modes support MPI applications. The parallel HDF5 I/O library uses all processes to read/write the real data, but uses a subset of MPI processes to read/write the HDF5 format metadata. As MPI_ALLREDUCE and MPI_FILE_SET_INFO are collective operations used for the SAIO I/O tracing and optimizing components, three SAIO running modes are not compatible with parallel HDF5 applications (Table 3.2). In order to optimize parallel HDF5 and parallel NetCDF applications, the SAIO_MODE_OPTCOLL and SAIO_MODE_SIZEONLY running modes are implemented. Table 3.2 lists the compatibility of SAIO as of June 2017. More software will be tested in the near future.

SAIO_MODE              MPI (C & Fortran)  HDF5 (C)  NetCDF (C)  Fluent (C)  WRF Model (Fortran)
SIZE_ONLY              ✓                  ✓         ✓           ✓           ✓
OPTIMIZING_ONLY        ✓                  ✗         ✗           ✗           ✗
TRACING_ONLY           ✓                  ✗         ✗           ✗           ✗
OPTIMIZING & TRACING   ✓                  ✗         ✗           ✗           ✗
OPTIMIZING_COLL        ✓                  ✓         ✓           ✓           ✓

TABLE 3.2: SAIO Software Compatibility
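The restriction behind the ✗ entries can be illustrated with a small, purely hypothetical sketch (not the actual SAIO code): if a tracing wrapper adds a collective operation such as MPI_ALLREDUCE over MPI_COMM_WORLD to an I/O call, and parallel HDF5 issues that call from only a few ranks (e.g. for its metadata), the collective call is never completed by the remaining ranks and the application hangs.

#include <mpi.h>

/* Illustrative interposed wrapper: adding a collective call to an I/O routine
 * breaks applications in which only a subset of ranks performs this routine,
 * as parallel HDF5 does for its format metadata. */
int MPI_File_write_at(MPI_File fh, MPI_Offset offset, const void *buf,
                      int count, MPI_Datatype datatype, MPI_Status *status)
{
    long local = (long)count, global = 0;

    /* Collective over MPI_COMM_WORLD: hangs if not all ranks enter here. */
    MPI_Allreduce(&local, &global, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

    return PMPI_File_write_at(fh, offset, buf, count, datatype, status);
}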

3.5 How to Use SAIO

One of SAIO's key features is its ease of use. It was implemented in such a way that engineers and scientists do not have to change or recompile their source codes to profit from SAIO. The prototype was implemented as a dynamic C library and uses the POSIX export directive to set the SAIO environment variables. Listing A.5 shows how to export the SAIO predefined environment variables.
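This transparency rests on the MPI profiling interface: a preloaded or dynamically linked library that defines the MPI-IO symbols intercepts the application's calls and forwards them to the real implementation through the PMPI entry points, so no source change or recompilation is required. The following is a minimal sketch of that mechanism under the assumption that the tuned hints are injected in MPI_FILE_OPEN; load_tuned_info() is a hypothetical helper, not part of SAIO's actual interface.

#include <mpi.h>

/* Hypothetical helper: build an MPI_Info object with the tuned hints
 * (striping_factor, striping_unit, romio_cb_read, romio_cb_write)
 * selected from the configuration pool for the current job size. */
extern MPI_Info load_tuned_info(MPI_Comm comm, const char *filename);

/* Interposed MPI_File_open: inject the tuned hints and forward the call
 * to the underlying MPI library via the PMPI profiling interface. */
int MPI_File_open(MPI_Comm comm, const char *filename, int amode,
                  MPI_Info info, MPI_File *fh)
{
    MPI_Info tuned = load_tuned_info(comm, filename);

    if (tuned != MPI_INFO_NULL)
        info = tuned;    /* the user's own hints could also be merged here */

    return PMPI_File_open(comm, filename, amode, info, fh);
}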


Chapter 4

Evaluations

4.1 Evaluation Setups

4.1.1 System Specifications

All evaluations were made on Hazel Hen (Cray XC40) with an InfiniBand connected Lustre file system at HLRS (High Performance Computing Center Stuttgart). With a peak performance of 7.42 PFLOPS, Hazel Hen is one of the most powerful HPC systems in the world (position 17 of the Top500 list in June 2017¹). Table 4.1² shows the technical details about Hazel Hen and the experimental Lustre file system. On each compute node, there are two Intel Xeon E5-2680 v3 CPUs³ with in total 24 CPU cores and 128 GB shared memory installed. The Lustre file system is deployed on a Cray Sonexion⁴ scale-out Lustre storage system with 7 MDTs and 54 OSTs. The theoretical peak bandwidth of each Lustre OST is 3.75 GB/s, which leads to an aggregated peak bandwidth of 202.5 GB/s (3.75 GB/s × 54) on the experimental Lustre file system.

Architecture          Hardware               File System  Storage        Bandwidth
Cray XC40             Intel Xeon E5-2680 v3  Lustre       Cray Sonexion  3.75 GB/s
Cray Aries Network                           7 MDTs       2000           per OST
7712 Compute nodes                           54 OSTs
90 Service nodes

TABLE 4.1: Technical Details of Hazel Hen and Lustre File System

4.1.2 Software Configurations

Different compilers (GNU and Intel), programming environments, MPI implementations, (parallel) HDF5 libraries and (parallel) NetCDF libraries installed on Hazel Hen were used.⁵ Since Hazel Hen disables dynamic linking by default, the environment variable CRAYPE_LINK_TYPE has to be set to dynamic in the job scripts, so that the SAIO dynamic library will be loaded. Evaluations were performed with IOR version 3.0.1, a widely used parallel file system I/O benchmark. IOR is developed for measuring parallel file system I/O performance through different interfaces and access patterns. It supports the POSIX, MPI-IO and HDF5 I/O libraries and uses MPI for its process synchronization.⁶ Besides setting dozens of I/O access pattern options through an IOR script file, users can also set MPI info objects when running MPI and HDF5 benchmarks. The evaluation results came from IOR's reports, if not mentioned explicitly.

¹ http://www.top500.org/list/2017/06/?page=1
² http://wickie.hlrs.de/platforms/index.php/CRAY_XC40_Hardware_and_Architecture
³ http://ark.intel.com/products/81908/Intel-Xeon-Processor-E5-2680-v3-30M-Cache-2_50-GHz
⁴ http://www.cray.com/products/storage/sonexion

4.1.3 I/O Configurations’ Searching Scope

The evaluations ran on different numbers of processes to simulate the parallel I/O requests of real engineering applications. I chose to evaluate SAIO on 1, 5, 10, 20, 50, 64 and 100 Hazel Hen compute nodes. There are 24 CPU cores on each Hazel Hen compute node. By researching the characteristics of Lustre file systems and the ROMIO MPI-IO library, I have learned that the following MPI info objects impact parallel I/O performance significantly (a minimal example of setting them follows the list):

• romio_cb_read: enabling/disabling the collective buffering on reading operations when collective I/O operations are used.

• romio_cb_write: enabling/disabling the collective buffering on writing operations when collective I/O operations are used.

• striping_factor: specifying the number of Lustre OSTs (stripe_count) over which new files are striped.

• striping_unit: specifying the size (in bytes) of each Lustre file system OST stripe (stripe_size) used for new files. In the current default collective I/O algorithm of Cray MPI on Hazel Hen, the value of cb_buffer_size, the collective buffer size, equals the value of striping_unit.
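These hints reach the MPI-IO layer through an MPI info object. As a minimal illustration (the values are examples taken from the searching scope, not recommended defaults):

#include <mpi.h>

/* Build an info object carrying the four hints investigated in this work. */
static MPI_Info make_hints(void)
{
    MPI_Info info;

    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_read",   "disable");
    MPI_Info_set(info, "romio_cb_write",  "automatic");
    MPI_Info_set(info, "striping_factor", "16");        /* number of Lustre OSTs */
    MPI_Info_set(info, "striping_unit",   "8388608");   /* 8 MB stripe size      */
    return info;
}

Such an info object is passed to MPI_FILE_OPEN (or MPI_FILE_SET_INFO) before a file is created, which is where SAIO injects its optimal configurations automatically.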

How to choose the values of the previously mentioned MPI info objects to test, especially the Lustre striping configurations, was the first challenge for the SAIO training process. According to the Lustre user manual⁷, a better I/O performance can be achieved by striping files over multiple OSTs and selecting a proper value for the stripe size. To illustrate this point, I ran tests with IOR using a 16 MB stripe size and different numbers of OSTs (Figure 4.1):

⁵ For details please refer to the website of the HLRS specific system Wiki and Cray XC40
⁶ http://github.com/LLNL/ior/blob/master/doc/USER_GUIDE
⁷ http://doc.lustre.org/lustre_manual.xhtml


• For the small job running on 240 processes, the number of OSTs has hardly any impact on the writing operations. Its reading performance stops increasing after 20 OSTs.

• For the middle-sized jobs (1200 and 2400 processes), the I/O performance does not keep increasing indefinitely with the number of OSTs. Figure 4.1 shows that the writing bandwidths stop rising when 1200PEW stripes over more than 20 OSTs and 2400PEW stripes over more than 36 OSTs. However, their reading bandwidths (1200PER and 2400PER) keep increasing when the files are striped over more OSTs.

• For the large job (7200 processes), the I/O performance keeps rising with the increasing number of OSTs.

[Figure: bandwidth in GB/s over the number of Lustre OSTs (striping_factor, 4 to 52) for collectively reading/writing a single shared file with 500 MB transfer size; curves for 240, 1200, 2400 and 7200 PEs (read and write) and the system limit.]

FIGURE 4.1: I/O Performance Impact of Different Number of OSTs

Interestingly, the writing bandwidths for 7200PE from 16 to 32 OSTs are slightly better than the reading bandwidths. According to the Lustre manual, the writing calls from Lustre clients are sent asynchronously and the back-end storage can aggregate these writes efficiently, while the reading calls from clients may come in a different order and require a lot of seeking to be read from disk, which noticeably hampers the read throughput[61].

Increasing the number of OSTs for large jobs also improves the I/O performance accordingly, but the resources (Lustre OSTs and network connections) are limited. Therefore, I investigated the influence of multiple applications concurrently accessing a large number of OSTs. Figure 4.2 presents the scenario of two concurrently running benchmarks. The first I/O benchmark (purple) ran on 7200 processes and collectively wrote 500 MB data per process to a single shared file (striped over 8 OSTs) 10 times, while the second one (green) ran on 24000 processes and collectively wrote 500 MB data per process to a different file (striped over 40 OSTs) 10 times as well. Although these two benchmarks did not use up all 54 OSTs of the experimental Lustre file system, their I/O performance dropped enormously. The points ("+" and "×") in Figure 4.2 indicate the measured bandwidth of each writing operation. The first 5 writing operations of Benchmark1 (7200PE) finished within 400 seconds, while the second 5 took almost 1000 seconds. As for Benchmark2 (24000PE), its first 5 writing operations took about 1000 seconds and the second 5 finished within 600 seconds. Had these two benchmarks not run simultaneously, Benchmark1 could have finished about 600 seconds (= 1000−400) earlier, saving 1200 core hours (= 7200 × 600 ÷ 3600) of computing time, while Benchmark2 could have finished about 400 seconds (= 1000−600) earlier, saving roughly 2,667 core hours (= 24000 × 400 ÷ 3600).

[Figure: measured bandwidth in GB/s over a timeline of 0–2000 seconds for the two concurrently running benchmarks writing 500 MB per process; Benchmark1: 7200PE on 8 OSTs, Benchmark2: 24000PE on 40 OSTs.]

FIGURE 4.2: Performance Impact when Two I/O Benchmarks Run Simultaneously

The average bandwidths of these two benchmarks running separately and concurrently are listed in Table 4.2. The results show that the I/O performance of Benchmark1 dropped far more steeply than the I/O performance of Benchmark2 (64.55% vs. 20.49% decrease). In other words, concurrent access affects small jobs much more severely than large jobs.

PEs    Lustre Striping Setup  avg. BW sep.  avg. BW con.  Decrease
7200   8 OSTs & 16 MB         19.41 GB/s    6.88 GB/s     64.55%
24000  40 OSTs & 16 MB        82.83 GB/s    65.86 GB/s    20.49%

TABLE 4.2: Configurations and Further Information about Two Simultaneously Running I/O Benchmarks

To avoid such worst cases, the maximal striping_factor for the SAIO training process was set to 16. The values of striping_unit were selected from 1,048,576 (1 MB) to 16,777,216 (16 MB) in powers of two, for four reasons[61]:

• firstly, the values of striping_unit in the Lustre file system are between 65,536 (64 KB) and 4,294,967,296 (4 GB);

• secondly, 65,536 (64 KB) is the smallest value of striping_unit;

• thirdly, large values of striping_unit may result in longer lock hold times;

• and lastly, values will be automatically set to 65,536 (64 KB) if they are not divisible by 65,536.

As for the collective I/O operations, automatic is the default value in Cray MPI. Based on its run-time heuristics, Cray MPI decides whether to disable or enable the collective buffering. However, the default alignment algorithm on Hazel Hen favors enabling the collective buffering⁸.

Name                 Value                                Quantity of Values
number of processes  24; 120; 240; 480; 1200; 1536; 2400  7
striping_factor      4; 6; 8; 10; 12; 16                  6
striping_unit        1048576 – 16777216                   5
romio_cb_read        automatic; disable; enable           3
romio_cb_write       automatic; disable; enable           3

TABLE 4.3: Configurations' Searching Scope for Training Process

4.2 Evaluation Results

4.2.1 SAIO - Training Process

The data transfer sizes were chosen from 100 bytes to 200,000,000 bytes (61 different sizes). The number of MPI info object combinations listed in Table 4.3 is 630 (7 × 3 × 6 × 5) for write and 21 (7 × 3) for read. Running the prototypical SAIO training process once for this very large search space consumed a lot of computing resources (Table 4.4). However, the computing resource consumption for training specific applications was very small compared to the core hours that SAIO saved (Chapter 5).

OPs    No. of Files       Total Size  Max. Size  Wall Time   Core Hours
Read   61 × 21 = 1,281    4.37 TB     1.75 TB    ≈ 5 hours   ≈ 4,780
Write  61 × 630 = 38,430  392.90 TB   0.44 TB    ≈ 35 hours  ≈ 40,920

TABLE 4.4: Resources Consumed of the SAIO Training Process

⁸ http://docs.cray.com/cgi-bin/craydoc.cgi?mode=Show;q=2490;f=books/S-2490-40/html-S-2490-40/chapter-sc4rx058-brbethke-paralleliowithmpi.html#section-8ik9vp6x-oswald


The self-implemented MPI program for the SAIO training process deleted the training files after closing them. Therefore, the evaluation of the SAIO training process for writing operations only occupied 0.44 TB of storage space, for the largest file created by 2,400 processes, although it created 38,430 files and wrote 392.90 TB of data. On the other hand, searching for the optimal configurations of the reading operations required more storage space (1.75 TB) to store the source files. The SAIO training process for each job size (number of processes) created 61 files with the generated optimal writing configurations, and then read them with the three different values of romio_cb_read (automatic, disable and enable). Although the SAIO training process consumed about 45,700 core hours of computing resources and took about 40 hours, it managed to identify 371 optimal configurations.

4.2.2 SAIO - Capability

After the SAIO training process, there are 53 different configuration sets for the data transfer sizes from 100 to 200,000,000 bytes in each generated SAIO configuration file. In this section, the training results were evaluated using the IOR benchmark and a 32 MB data transfer size, which was not trained explicitly. The optimal configurations for the 40,000,000-byte (≈ 38.15 MB) data transfer size (Table 4.5) were automatically selected for the 32 MB data transfer size, since they were grouped under the same saio_file_type. The bandwidth results in this section came from 50 IOR running samples.

No. PEs  romio_cb_read  romio_cb_write  striping_factor  striping_unit
24       disable        automatic       4                1048576
120      disable        automatic       10               4194304
240      disable        automatic       16               4194304
480      disable        automatic       16               8388608
1200     disable        disable         16               8388608
1536     disable        disable         16               16777216
2400     disable        disable         16               16777216

TABLE 4.5: Found Optimal Configurations for Reading and Writing 40,000,000 Bytes (Data Transfer Size) per Process
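The grouping itself can be pictured as a bucket lookup over the trained sizes. The sketch below only illustrates the idea of mapping an arbitrary transfer size to the next larger trained size; the listed sizes are a hypothetical excerpt and the real grouping rules of SAIO (Section 3.4) may differ.

#include <stddef.h>

/* Hypothetical excerpt of trained data transfer sizes in ascending order. */
static const long trained_sizes[] = { 20000000L, 40000000L, 80000000L };

/* Map a transfer size to the index (saio_file_type) of the next larger
 * trained size; the last bucket catches everything above it. */
static size_t file_type_of(long transfer_size)
{
    size_t n = sizeof(trained_sizes) / sizeof(trained_sizes[0]);

    for (size_t i = 0; i < n; i++)
        if (transfer_size <= trained_sizes[i])
            return i;
    return n - 1;
}

With such buckets, a 32 MB request (33,554,432 bytes) falls into the same group as the 40,000,000-byte size and therefore inherits that size's optimal configuration, which is what happened in this evaluation.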

Accelerating MPI Applications

The default Lustre striping configuration on the experimental file system was striping_factor=1 and striping_unit=1048576. With this setup, the bottleneck (≈ 750 MB/s) of one Lustre OST for collective writing operations was easily reached (Figure 4.3). In addition, this default setup will be changed after the experimental phase. Therefore, I changed it to striping_factor=4 and striping_unit=1048576 for further evaluations, which was the common default setup of the other Lustre file systems at HLRS.

Page 91: a light weighted semi-automatically i/o-tuning solution for engineering applications

4.2. Evaluation Results 65

[Figure: two bar charts comparing the collective write bandwidth (MB/s) of a single shared file with 32 MB data transfer size; left: 24 processes, default 1-OST setup (striping_factor=1, striping_unit=1MB, romio_cb_write=automatic) vs. optimized (striping_factor=4, striping_unit=1MB, romio_cb_write=automatic); right: 120 processes, default 1-OST setup vs. optimized (striping_factor=10, striping_unit=4MB, romio_cb_write=automatic).]

FIGURE 4.3: Default Setups with 1 OST vs. SAIO Optimization

Figure 4.4 presents the achievement of SAIO when collectively writing a single shared file with 32 MB data transfer size. The improvement kept growing when the benchmarks scaled out, even though the number of Lustre OSTs did not change beyond 240PE. Increasing the number of Lustre OSTs and the size of striping_unit brought about 63% (120PE), 150% (240PE) and 160% (480PE) speedups. For the other three larger jobs, the writing performance was increased by disabling the collective buffering. SAIO achieved about 4× (1200PE), 4.5× (1536PE) and 6× (2400PE) improvements.

[Figure: six bar charts (120, 240, 480, 1200, 1536 and 2400 processes) comparing the collective write bandwidth (MB/s) of a single shared file with 32 MB data transfer size for the default 4-OST setup (striping_factor=4, striping_unit=1MB, romio_cb_write=automatic) against the SAIO-optimized configurations of Table 4.5.]

FIGURE 4.4: Default Setups with 4 OST vs. SAIO Optimizing for MPI Write Benchmarks

Page 92: a light weighted semi-automatically i/o-tuning solution for engineering applications

66 Chapter 4. Evaluations

Based on these results, I presume that disabling the collective buffering, and thereby avoiding the inter-process communication of collective I/O algorithms such as two-phase I/O (Section 2.2.3), could improve the I/O performance. To verify this hypothesis, extra experiments were made to investigate the collective buffering in the collective I/O algorithm of Cray MPI; their results are presented in Section 4.3.

As for the MPI collective read operations, I investigated the three different values of romio_cb_read: enable, disable and automatic. Besides this MPI info object, the striping setup of the files in the Lustre file system can impact the read performance as well (Figure 4.1). Therefore, I evaluated the reading performance of files created with both the default (4 Lustre OSTs) and the optimal writing configurations (Table 4.5). Figure 4.5 illustrates the reading performance when applying different configurations:

• 4OSTs_W: files created by striping over 4 Lustre OSTs and 1MB stripe size.

• OPT_W: files created by applying the optimal configurations in Table 4.5.

• Default_R: reading files using default setup (romio_cb_read=automatic).

• OPT_R: reading files using the found optimal configurations in Table 4.5.

[Figure: six bar charts (120, 240, 480, 1200, 1536 and 2400 processes) comparing the collective read bandwidth (MB/s) of a single shared file with 32 MB data transfer size for the four combinations 4OSTs_W/Default_R, 4OSTs_W/OPT_R, OPT_W/Default_R and OPT_W/OPT_R.]

FIGURE 4.5: Default Setups with 4 OST vs. SAIO Optimizing for MPI Read Benchmarks


Figure 4.5 shows that disabling the collective buffering for reading operations and striping files over more Lustre OSTs improve the I/O performance enormously. Compared to applying all system default setups ("4OSTs_W, Default_R" vs. "OPT_W, OPT_R") to the I/O requests, SAIO achieved about 4.1× (120PE), 4.8× (240PE), 6.5× (480PE), 8.2× (1200PE), 7.5× (1536PE) and 7.7× (2400PE) improvements in reading performance.

Accelerating Untrained MPI Applications

Besides the 8 generated SAIO configuration files, the SAIO training process also created the SAIO configuration index file (Table 4.6). The default configuration file, 0.conf, was created as a copy of 2400.conf. When an untrained MPI application (running on N processes) cannot find its own configuration file in the SAIO configuration pool, SAIO searches the index file and gets a proper one (Minimum ≤ N ≤ Maximum).

Minimum (PEs)  Maximum (PEs)  Configuration File
1              24             24.conf
25             120            120.conf
121            240            240.conf
241            480            480.conf
481            1200           1200.conf
1201           1536           1536.conf
1537           2400           2400.conf
2401           999999999      0.conf

TABLE 4.6: Generated Configuration Index File after Training Process

To evaluate the advantages of using the SAIO configuration index file, I needed to choose reasonable untrained numbers of processes instead of random ones. Therefore, I investigated the statistics of consumed computing hours on Hazel Hen in 2016. The core hours were grouped based on the number of process elements (PEs) used by the applications.

Ranking  Core Hours  No. of PEs  Trained?  Configuration File
6        50,977,208  1200        ✓         1200.conf
7        50,898,360  1536        ✓         1536.conf
10       39,655,360  2064        ✗         2400.conf
11       37,451,067  2400        ✓         2400.conf
13       27,067,676  792         ✗         1200.conf
18       19,453,082  2304        ✗         2400.conf
19       15,934,024  288         ✗         480.conf
20       15,897,105  1920        ✗         2400.conf
89       3,192,702   360         ✗         480.conf

TABLE 4.7: A Ranked Consumption Statistic of Different Applications (≤ 2400 PEs) at HLRS in 2016


Table 4.7 lists the consumption of the groups with ≤ 2400 PEs that ranked among the top 20 most used PE counts, plus the group 360PE. Among these 9 groups, 1200PE, 1536PE and 2400PE were trained. According to the SAIO configuration index file (Table 4.6), each untrained group would be assigned a proper configuration file (the last column of Table 4.7). From this ranked list, I chose 2064PE, 792PE, 288PE and 360PE to test the SAIO configuration index file.

The writing operations of the four chosen groups were evaluated with and without the SAIO configuration index file (Table 4.6):

• WITHOUT INDEX: The SAIO configuration index file is not available andSAIO gets the default optimal configuration from 0.conf.

• WITH INDEX: The SAIO configuration index file is accessible and a SAIOconfiguration file is assigned.

The evaluation results in Figure 4.6 show that the SAIO configuration index file helped the two benchmarks running on 288 and 360 processes. However, for the other two benchmarks running on 792 and 2,064 processes, the SAIO configuration index file neither improved nor degraded the I/O performance.

[Figure: four bar charts (288, 360, 792 and 2064 processes) comparing the collective write bandwidth (MB/s) of a single shared file with 32 MB data transfer size without the index file (configurations from 0.conf: striping_factor=16, striping_unit=16MB, romio_cb_write=disable) and with the index file (the assigned configuration file).]

FIGURE 4.6: Using SAIO Default Configuration 0.conf vs. Using Configuration Index File to Assign Predefined Configurations to MPI Write Benchmarks

Accelerating HDF5 Applications

The HDF group has implemented the parallel HDF5 API to support parallel HDF5 applications. It is one of the high-level I/O libraries standing upon the MPI-IO library and calling MPI-IO subroutines. I researched the writing behavior of parallel HDF5 and found out that all MPI processes are used to write the real data and only a subset of the MPI processes (usually rank 0−5) to write the metadata of the HDF5 format. Therefore, SAIO could only impact the bandwidth of writing the real data. The evaluations (Figure 4.7) used the IOR benchmark and the SAIO_MODE_OPTCOLL running mode to test whether the optimal configurations from an MPI training process (Table 4.5) also worked for parallel HDF5 applications. Compared to the MPI-IO performance of the Default-4OSTs setup (≈ 3000 MB/s) shown in Figure 4.4, HDF5 could only achieve about 1200 MB/s writing bandwidth with the Default-4OSTs setup. Although SAIO could not make HDF5 applications run at the same performance level as MPI applications, improvements of 7% (120PE), 14% (240PE), 13% (480PE), 9× (1200PE), 10× (1536PE) and 15× (2400PE) were achieved.
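For reference, this is how an MPI info object with such hints reaches the MPI-IO layer from a parallel HDF5 code. The snippet is a generic parallel HDF5 usage example, not SAIO-specific code; with SAIO the info object is injected transparently instead.

#include <hdf5.h>
#include <mpi.h>

/* Create an HDF5 file for parallel access, forwarding MPI-IO hints. */
static hid_t create_parallel_file(const char *name, MPI_Comm comm, MPI_Info info)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);   /* file access property list */
    hid_t file;

    H5Pset_fapl_mpio(fapl, comm, info);        /* select the MPI-IO driver  */
    file = H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    H5Pclose(fapl);
    return file;
}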

[Figure: six bar charts (120, 240, 480, 1200, 1536 and 2400 processes) comparing the collective HDF5 write bandwidth (MB/s) of a single shared file with 32 MB data transfer size for the default 4-OST setup against the SAIO-optimized configurations of Table 4.5.]

FIGURE 4.7: Default Setups with 4 OST vs. SAIO Optimizing for HDF5 Write Benchmarks

The evaluations of HDF5 read operations were similar to the evaluations of MPIread. Figure 4.8 illustrates the reading performance by applying different setups:

• 4OSTs_W: files created by striping over 4 Lustre OSTs with 1 MB stripe size (striping_factor=4, striping_unit=1048576).

• OPT_W: files created by applying the optimal configurations in Table 4.5.

• Default_R: reading files using the default setup (romio_cb_read=automatic).

• OPT_R: reading files applying the found optimal configurations in Table 4.5 (romio_cb_read=disable).

Figure 4.8 shows that striping files over more Lustre OSTs and disabling the collective buffering could speed up the reading operations. Compared to applying all system default setups ("4OSTs_W, Default_R" vs. "OPT_W, OPT_R") to the I/O requests, SAIO achieved about 3.9× (120PE), 3.8× (240PE), 9× (480PE), 6.6× (1200PE), 7× (1536PE) and 7× (2400PE) improvements in reading performance.

[Figure: six bar charts (120, 240, 480, 1200, 1536 and 2400 processes) comparing the collective HDF5 read bandwidth (MB/s) of a single shared file with 32 MB data transfer size for the four combinations 4OSTs_W/Default_R, 4OSTs_W/OPT_R, OPT_W/Default_R and OPT_W/OPT_R.]

FIGURE 4.8: Default Setups with 4 OST vs. SAIO Optimizing for HDF5 Read Benchmarks

Real-Time Accelerating MPI Applications

In Section 3.4.5, I designed a real-time optimization for MPI applications (Figure 3.13). Unlike the previous evaluations, which only read/wrote one data transfer size, this optimizing method needs to be triggered by reading/writing different data transfer sizes. Instead of creating multiple random data sizes, I chose to test the SAIO tracing results from a WRF online tutorial process⁹. The WRF model is a next-generation mesoscale numerical weather prediction system designed for both atmospheric research and operational forecasting needs[62]. Its model coupling API provides a uniform and package-independent interface between WRF and external packages for I/O operations (e.g. parallel NetCDF and parallel HDF5) as well as data formats. Using the SAIO_MODE_SIZEONLY running mode, I generated a list of data transfer sizes (Listing B.1).

Running the SAIO training process on 1,200 processes took about 40 minutes (800 core hours) to generate the SAIO configuration file (1200.conf in Listing B.4). As this real-time optimization process was not compatible with the parallel HDF5 or parallel NetCDF I/O library, the same self-implemented MPI program (I/O simulator) from the SAIO training utility was used to simulate the parallel NetCDF I/O requests of WRF:

• Default: striping_factor=4, striping_unit=1048576 and romio_cb_write=automatic

• FRQ-#: applying the optimal configurations in Listing B.4 and setting SAIO_REAL_TIME_OPT_FREQUENCY=#

[Figure: bar chart of the time in seconds needed to write 188 files (414 GB) on 1200 processes; Default: 178 s, FRQ-1: 71 s, FRQ-2: 47 s, FRQ-3: 53 s, FRQ-4: 51 s, FRQ-5: 61 s, FRQ-6: 70 s, FRQ-7: 65 s, FRQ-8: 75 s, FRQ-9: 76 s.]

FIGURE 4.9: Evaluation Results of SAIO Real-Time Optimization

Figure 4.9 presents the optimization results using different optimizing frequency numbers (Section 3.4.5). Without any optimization (Default bar), the simulator took 178 seconds to finish writing all data in Listing B.1, while the normal SAIO optimization (FRQ-1 bar) shortened the time consumption to 71 seconds. The best result, 47 seconds, was achieved with the SAIO real-time optimization process by setting SAIO_REAL_TIME_OPT_FREQUENCY=2 (FRQ-2 bar).

⁹ http://www2.mmm.ucar.edu/wrf/OnLineTutorial/index.htm


To find out why FRQ-2 shortened the time consumption from 71 to 47 seconds, I checked the list of data transfer sizes (Listing B.1) and the SAIO configuration file (Listing B.4). The application-related optimal configuration (group saio_file_type=0) was equal to the one for group saio_file_type=1, because this file type spent the most time during the SAIO training process. However, there was another optimal configuration for the first 12 larger data transfer sizes to be written. In this SAIO real-time optimization evaluation, the sooner SAIO got the right optimal configurations, the less time the I/O simulator needed to finish writing all data.
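A heavily simplified picture of such a frequency-triggered switch is sketched below. It only assumes that SAIO counts how often a file type other than the active one occurs and adopts that type's optimal hints once the counter reaches the value of SAIO_REAL_TIME_OPT_FREQUENCY; the actual mechanism of Section 3.4.5 may differ in detail.

#define SAIO_MAX_FILE_TYPES 64

/* Hypothetical per-file-type state for the real-time optimization. */
static int seen[SAIO_MAX_FILE_TYPES];  /* occurrences of each file type       */
static int active_type = 0;            /* file type whose hints are in effect */

/* Called before a write of the given file type: returns 1 when the hints
 * should be switched (e.g. via MPI_FILE_SET_INFO), 0 otherwise. */
static int maybe_switch(int type, int frequency)
{
    if (type == active_type)
        return 0;                      /* already using the matching hints */
    if (++seen[type] >= frequency) {
        active_type = type;
        seen[type]  = 0;
        return 1;
    }
    return 0;
}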

4.2.3 SAIO - Overhead

For a lightweight I/O tuning solution for engineering applications, it is essential to produce as little overhead as possible. Measuring the overhead of parallel I/O requests is difficult, because the concurrent file system accesses and inter-process communications are variable and unpredictable. In order to measure the SAIO overhead as exactly as possible, three experiments were designed:

• process overhead: using the self-implemented MPI program (Listing 4.1) to measure the overhead of the instrumented MPI-IO subroutines as well as of the entire process (from initialization to finalization).

• run-time instrumentation overhead: using IOR to measure the time consumption of running one "open → write → close" cycle.

• finalize overhead: using the low-level time measurement (Listing 4.2) to measure the overhead of the SAIO shutdown process, including writing the log files.

Running the three experiments on one MPI process as a single-threaded application aimed to capture the exact overhead by eliminating inter-process communication. Scaling the overhead measurements out to multiple thousands of MPI processes was used to measure the overhead in a production environment. Additionally, Darshan was measured as a reference to compare the overhead of two different MPI-IO tracing policies, "one process" and "all processes" tracing.

Process Overhead

In the testing MPI program (Listing 4.1), the overhead for open, write, read and the entire MPI process was measured. Although it calls MPI_FILE_OPEN() twice, only the first call needs to access the configuration pool and was therefore the one measured. Moreover, the results in seconds (Table 4.8 and Table 4.9) are the arithmetic means of 50 running samples applying the Lustre striping setup striping_factor=1 and striping_unit=1048576.


// Get process start time stamp;
MPI_INIT();
MPI_INFO_CREATE();
// Get MPI file open start time stamp;
MPI_FILE_OPEN();
// Get MPI file open end time stamp;
// Calculate the time consumption for opening file;
// Get MPI file write start time stamp;
MPI_FILE_WRITE_ALL();   // Each process writes 200 MB data
// Get MPI file write end time stamp;
// Calculate the time consumption for writing file;
MPI_FILE_CLOSE();
MPI_INFO_CREATE();
MPI_FILE_OPEN();
// Get MPI file read start time stamp;
MPI_FILE_READ_ALL();    // Each process reads 200 MB data
// Get MPI file read end time stamp;
// Calculate the time consumption for reading file;
MPI_FILE_CLOSE();
MPI_FINALIZE();
// Get process end time stamp;
// Calculate the time consumption of the whole process;

LISTING 4.1: Pseudo Codes of Overhead Evaluation MPI Program

As the results in Table 4.8 show, the process overhead of the SAIO tracing component was similar to Darshan's. Because the SAIO configuration file is read by the first call of the MPI_FILE_OPEN subroutine, SAIO running in SAIO_MODE_OPTTR mode spent about 0.022 seconds more than Darshan did. In a parallel computing environment, when applications scale out, the costs of interconnect communication must be taken into account; therefore I evaluated the process overhead of the test program running on multiple processes.

                  Pure MPI   w. Darshan  w. TRONLY  w. OPTTR
MPI Open          0.0193337  0.0203754   0.0196572  0.0423951
Open Overhead     -          0.0010417   0.0003235  0.0230614
MPI Write         0.1986837  0.2034453   0.2014300  0.2028708
Write Overhead    -          0.0047616   0.0027463  0.0041871
MPI Read          0.1009597  0.1019149   0.1020364  0.1023974
Read Overhead     -          0.0009552   0.0010767  0.0014377
MPI Process       0.5769140  0.6059839   0.6045661  0.6303406
Process Overhead  -          0.0290699   0.0276521  0.0534266
Other Overhead    -          0.0223114   0.0235056  0.0247404

TABLE 4.8: Overhead Results on 1 MPI Process: Other Overhead includes the MPI_INIT and MPI_FINALIZE instrumentation overhead (initializing software, writing log files, finalizing software etc.)

The evaluation results on multiple processes (Table 4.9) caught my attention. For concurrently running on 24 MPI processes, the results looked normal, but the extra overhead of reading the SAIO configuration file in the SAIO_MODE_OPTTR running mode disappeared. It even seemed that both Darshan and SAIO accelerated the test process when it scaled out. However, both of them need to write their tracing results into a file system, which costs extra time. After reviewing the test process and the MPI standard [3], I suspected the following reasons:

• network interference: Some simultaneously running applications occupiednetwork bandwidth while the Pure MPI processes were running.

• unpredictable file system load: The file system was less busy while the time consumption for Darshan and SAIO was measured.

• unsuitable statistical method: The arithmetic mean is not suitable, as the distribution of the samples can strongly affect the results.

                   Pure MPI     w. Darshan   w. TRONLY    w. OPTTR
24 MPI Processes   5.53313277   5.58966704   5.57347765   5.58963091
24PE Overhead      -            0.05653427   0.04034488   0.05649814
120 MPI Processes  10.58832613  10.41491607  9.75389447   10.1279822
120PE Overhead     -            -0.17341006  -0.83443166  -0.46034393
240 MPI Processes  11.21902398  11.43519842  10.77165548  11.14874655
240PE Overhead     -            0.21617444   -0.4473685   -0.07027743

TABLE 4.9: Process Overhead on Multiple MPI Processes

Run-Time Instrumentation Overhead

In this section, the SAIO run-time instrumentation overhead of all five SAIO runningmodes (Section 3.3.1) and Darshan was evaluated with the following setups:

• One process: The IOR benchmark was used to create a 1 GB file with the Lustre striping setup striping_factor=1 and striping_unit=1048576. The three SAIO optimization modes (OPTON, OPTTR and OPTHDF5) read the SAIO configuration file, which stored the above-mentioned Lustre striping setup as MPI info objects.

• Multiple processes: The IOR benchmark ran on multiple processes and collectively wrote a single shared file (32 MB data transfer size). The striping setups were chosen from the training results in Table 4.5 in Section 4.2.1. Just as when using SAIO in a production environment, the three SAIO optimization modes read the SAIO configuration files automatically. When the evaluations ran in the other modes (MPI, Darshan, SIZE and TRON), the corresponding MPI info objects were passed to the MPI-IO library manually via the IOR environment variables.


Six instances of the IOR benchmark were started to investigate the SAIO run-time instrumentation overhead with MPI applications: one without instrumentation, one with Darshan instrumentation and the other four with SAIO instrumentation in different SAIO running modes. Each IOR benchmark instance created 100 new files and reported 100 samples of total I/O time, which was the duration of one "open –> write –> close" cycle in seconds. The results are presented as box plot diagrams in Figure 4.10. While the benchmarks scaled out from 1 process to 1200 processes, the SAIO tracing component brought similar overhead as Darshan did. As for the SAIO optimizing component, reading the SAIO configuration files did not take more time when the benchmarks scaled out.

[Figure: four box plot diagrams (1, 24, 240 and 1200 processes) of the total I/O time in seconds for the MPI run-time instrumentation overhead of the modes MPI, Darshan, SIZE, OPTON, TRON and OPTTR.]

FIGURE 4.10: Overhead Test of Different SAIO Modes (SIZE: size only; OPTON: optimizing only; TRON: tracing only; OPTTR: optimizing and tracing) as well as Darshan for MPI-IO - Setting the Same MPI info Objects for All Test Cases

As for the applications using a parallel HDF5 I/O library, they not only read/write data with MPI collective I/O operations, but also use a subgroup of MPI processes to read/write the HDF5 format metadata. Because of this special I/O behavior, the SAIO run-time instrumentation overhead for parallel HDF5 applications was evaluated separately. Four IOR benchmark instances were started: one without instrumentation, one with Darshan instrumentation and the other two with SAIO instrumentation in two SAIO running modes. The evaluation process was analogous to the previous one for MPI applications. Figure 4.11 presents the evaluation results, which are similar to the evaluation results of the MPI applications. As it shows, the SAIO run-time instrumentation overhead for parallel HDF5 applications was comparable to Darshan's. When the benchmarks scaled out, the SAIO run-time instrumentation overhead did not grow.


[Figure: four box plot diagrams (1, 24, 240 and 1200 processes) of the total I/O time in seconds for the HDF5 run-time instrumentation overhead of the modes HDF5, Darshan, SIZE and OPTHDF5.]

FIGURE 4.11: Overhead Test of Different SAIO Modes (SIZE: size only; OPTHDF5: optimizing coll/optimizing hdf5) as well as Darshan for parallel HDF5 - Setting the Same MPI info Objects for All Test Cases

The overhead of the SAIO optimizing component in the last two evaluations was measured in the best case, hitting the exact SAIO configuration files. But sometimes the exact SAIO configuration file is not available. Under these circumstances, the SAIO optimizing component has to read the SAIO configuration index file (Section 3.4.3).

To investigate the impact of accessing the SAIO configuration index file, the same evaluation process was launched on 360 MPI processes. Compared to the previous evaluations, this time the SAIO optimizing component had to access two files, one SAIO configuration index file and one SAIO configuration file, to get the configurations. The evaluation results shown in Figure 4.12 demonstrate that accessing the SAIO configuration index file resulted in negligible overhead.

[Figure: two box plot diagrams of the total I/O time in seconds on 360 processes; left: MPI run-time instrumentation overhead (MPI, Darshan, SIZE, OPTON, TRON, OPTTR), right: HDF5 run-time instrumentation overhead (HDF5, Darshan, SIZE, OPTHDF5).]

FIGURE 4.12: Overhead Test of Different SAIO Modes as well as Darshan when Accessing the SAIO Configuration Index File

The previous evaluation results show that the SAIO run-time instrumentation works as well as Darshan's. The extra synchronization of MPI processes through the MPI_ALLREDUCE subroutine did not slow down the benchmarks, even when the synchronization happened among more than a thousand MPI processes. The overhead of SAIO's "one process" tracing policy in run-time instrumentation proved to be as good as that of Darshan's "all processes" tracing policy for MPI collective I/O operations and parallel HDF5 applications.

Finalize Overhead

SAIO is designed in such a way that it does not write any log file until the application calls the MPI_FINALIZE subroutine. This design avoids extra file system accesses at run-time, but brings more overhead while finalizing the SAIO tracing component. The finalizing process of SAIO needs to:

• use the rank 0 MPI process to write the tracing results into the SAIO log filepool (all tracing modes),

• release the allocated memory for tracing results on the rank 0 MPI process (alltracing modes),

• and release the allocated memory for optimal configurations on all MPI pro-cesses (all optimizing modes).

Unfortunately, these three finalizing operations cannot be measured by IOR. Therefore I implemented the low-level measurement shown in Listing 4.2 and ran the benchmarks on different numbers of MPI processes (from 1 process to 7,200 processes) to investigate the SAIO finalizing overhead.

double time_stamp_1, time_stamp_2;
time_stamp_1 = PMPI_Wtime();
// Finalizing process starts
if (saio_mode_int == SAIO_MODE_TRONLY || saio_mode_int == SAIO_MODE_OPTTR ||
    saio_mode_int == SAIO_MODE_SIZEONLY) {
    ret = saio_trace_finalize(rank);
}
if (saio_mode_int == SAIO_MODE_OPTONLY || saio_mode_int == SAIO_MODE_OPTTR ||
    saio_mode_int == SAIO_MODE_OPTCOLL) {
    ret = saio_opt_finalize();
}
// Finalizing process ends
time_stamp_2 = PMPI_Wtime();
double finalize_time = time_stamp_2 - time_stamp_1;
double longest_finalize_time;
// Get the longest finalizing duration from all processes
PMPI_Allreduce(&finalize_time, &longest_finalize_time, 1,
               MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);

LISTING 4.2: Code Segment of SAIO Finalize Overhead Evaluation


The SAIO_MODE_TRONLY running mode was used to evaluate the overhead of writing the SAIO log files and releasing the allocated memory for tracing results on the rank 0 MPI process, while the SAIO_MODE_OPTONLY running mode was applied to measure the overhead of releasing the allocated memory for optimal configurations on all MPI processes. According to the SAIO design concept, the overhead of writing SAIO log files depends on the number of recorded I/O operations and should not become higher when the application scales out, because only the rank 0 MPI process is in charge of writing the tracing results into the SAIO log file pool. On the other hand, the overhead of releasing memory for SAIO optimal configurations might increase when the application scales out, because the amount of allocated memory for SAIO optimal configurations increases as well.

In the first evaluation, the IOR benchmark read and wrote one file using different numbers of MPI processes. The results in Figure 4.13 stem from submitting the same IOR benchmark job 40 times. The finalizing overhead for the SAIO_MODE_OPTONLY running mode (the left side of Figure 4.13), which was caused by releasing allocated memory on all MPI processes, increased from 1 process (1PE) to 240 processes (240PE), but stabilized from 240 processes (240PE) to 7,200 processes (7200PE). When SAIO ran in the SAIO_MODE_TRONLY mode (the right side of Figure 4.13), the rank 0 MPI process wrote the SAIO log file and released its local memory. The overhead for writing the two 0.8 KB SAIO log files (read and write once) was between 0.01 and 0.014 seconds for all job sizes, and barely changed.

[Figure: two diagrams (1PE to 7200PE) of the SAIO finalize overhead in seconds; left: optimizing only mode (read + write once, axis range up to 3 × 10⁻⁵ s), right: tracing only mode (read + write once, axis range 0.008–0.018 s).]

FIGURE 4.13: SAIO Finalize Overhead of Tracing Only and Optimizing Only Modes (Each Reads and Writes Once)

In the second evaluation, I researched the SAIO finalizing overhead after executing 500 reading and writing operations on 1,200, 2,400 and 7,200 MPI processes. The finalizing overhead for SAIO_MODE_OPTONLY (the left side of Figure 4.14) increased from 1,200 (1200PE) to 7,200 (7200PE) processes. However, the overhead ranged from 0.000020 to 0.000024 seconds, which was the same as shown in Figure 4.13. This indicates that the SAIO finalizing overhead for releasing the allocated memory on all MPI processes was negligible, even when the job size and the number of I/O operations increased. When SAIO ran in the SAIO_MODE_TRONLY mode (the right side of Figure 4.14), its finalizing overhead from 1200PE to 7200PE barely changed. But compared to Figure 4.13, the overhead for writing two 402 KB SAIO log files was about 5 times as much as the overhead for writing two 0.8 KB SAIO log files.

[Figure: two diagrams (1200PE, 2400PE, 7200PE) of the SAIO finalize overhead in seconds after 500 read and 500 write operations; left: optimizing only mode (about 0.000020–0.000024 s), right: tracing only mode (about 0.042–0.06 s).]

FIGURE 4.14: SAIO Finalize Overhead of Tracing Only and Optimizing Only Modes (Each Reads and Writes 500 Times)

The results of the last evaluation led to a third evaluation, which measured the SAIO finalizing overhead when recording different numbers of I/O operations. Since the overhead of writing SAIO log files did not get higher when the applications scaled out, running the IOR benchmarks on 24 MPI processes was chosen. The following numbers of I/O operations were evaluated: 1, 100, 500, 1000, 5000, 10000, 20000, 40000, 60000 and 65535. Figure 4.15 shows a linear growth when the number of I/O operations increases. After recording the default maximal number of I/O operations (65535 each for read and write), the rank 0 MPI process spent less than 5 seconds to write the two SAIO log files (53.13 MB each for read and write) and release its allocated memory for tracing results.

[Figure: overhead in seconds for writing the log files over the number of operations for each read/write (1 to 65535), showing a linear growth up to about 5 seconds.]

FIGURE 4.15: SAIO Finalize Overhead of Tracing Only Mode for Multiple Reading and Writing Operations on 24 MPI Processes


Memory Overhead

One task of the SAIO tracing component is tracing the parallel I/O operations to provide large amounts of I/O related information for the SAIO Learning module and the SAIO Statistic utility. It is essential to keep the run-time memory occupation low. The sizes of the generated SAIO log files and configuration files give an indication of the memory consumption of SAIO; therefore, the investigation and comparison of the memory overhead of SAIO and Darshan were based on their log file sizes.

Darshan has integrated a compression library that compresses the tracing results before writing them to the file system. To view the content of a Darshan log file, users need to use the Darshan utility program to extract the readable results. SAIO, however, does not compress its log files, because the tracing results are appended to the existing log file at run-time. To compare the log file sizes of SAIO and Darshan, I used GNU Tar¹⁰ to compress the SAIO log files. The results in Table 4.10 were generated by using IOR benchmarks to read and write 500 times on different numbers of MPI processes. The log file size of SAIO increased very slowly when the benchmarks scaled out, while the log file size of Darshan increased faster.

No. PEs  SAIO Uncompressed  SAIO Compressed  Darshan Compressed
24       821,820 Bytes      17,772 Bytes     82,358 Bytes
1200     822,927 Bytes      21,690 Bytes     93,832 Bytes
7200     823,499 Bytes      23,776 Bytes     100,154 Bytes

TABLE 4.10: Size of Log Files Generated by Darshan and SAIO

To prevent memory overflow while tracing large amounts of I/O operations, SAIO is designed to write its log files once the maximal number of recordable I/O operations is exceeded. The default value of this maximal number is 65,536, which corresponds to tracing 65,536 reading and 65,536 writing operations. On Hazel Hen and the mounted Lustre file system, each record needs about 850 bytes of space. The tracing results of the above-mentioned default setup occupy at most 106.26 MB (53.13 MB each for read and write) of memory on the rank 0 MPI process.
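This upper bound follows directly from the record size: 65,536 records × 850 bytes = 55,705,600 bytes ≈ 53.13 MB per operation type (with 1 MB = 1,048,576 bytes), hence 2 × 53.13 MB ≈ 106.26 MB for reading and writing together.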

In addition to the memory occupation of the SAIO tracing component, the SAIO optimizing component also needs to allocate memory for the optimal configurations. The size of a SAIO configuration file varies based on the content of the MPI info objects, which are implemented by the underlying MPI-IO library. On Hazel Hen, the size of one SAIO configuration file generated by the SAIO training process in Section 4.2.1 was about 38 KB.

¹⁰ http://www.gnu.org/software/tar/


4.2.4 SAIO - Scalability

During the performance evaluation phase (Sections 4.2.2 and 4.2.3), SAIO was evaluated with IOR benchmarks from 1 to 7200 MPI processes. To test its scalability, SAIO successfully ran with IOR benchmarks on 500 (12000PEs), 1000 (24000PEs), 2000 (48000PEs) and 3000 (72000PEs) Hazel Hen compute nodes.

4.2.5 SAIO - Portability

SAIO was tested on the NEC LX Cluster at HLRS (called Laki¹¹) and worked well with both Open MPI and Intel MPI. Further research on mapping the found optimal configurations from Hazel Hen to Laki is part of my future work. As for the NEC vector supercomputer SX-ACE¹², it does not support dynamic libraries. Therefore, the current prototype is not compatible with SX-ACE yet. However, the NEC MPI implementation on SX-ACE is a derivative of MPICH, which is compatible with SAIO. Implementing and providing SAIO as a static library is also part of my future work.

4.3 Collective Buffering or not? Lessons Learned

The evaluation results of the SAIO training process showed that disabling the collective buffering accelerated large-scale I/O requests that read/write large amounts of data collectively. I investigated the two-phase I/O algorithm (Section 2.2.3) again and realized that it aims at optimizing I/O requests with small data transfer sizes, which was also confirmed by the SAIO configuration files:

• small data transfer size: Better performance was achieved by "enabling" (enable) or "letting Cray MPI decide to enable or disable" (automatic) the collective buffering.

• large data transfer size: Better performance was achieved by "disabling" (disable) the collective buffering when the applications scaled out.

According to the default setup of the collective I/O algorithm used by Cray MPI, the collective buffering size is equal to the Lustre stripe size¹³; therefore, I use the Lustre stripe size in place of the collective buffering size to describe the evaluations in the rest of this section. The following evaluations used IOR to read/write different sizes of data with different Lustre striping setups (collective buffering setups) on different numbers of MPI processes. With the help of SAIO and IOR, I was able to investigate the impact of the collective buffering on MPI-IO requests.

11 http://www.hlrs.de/systems/nec-cluster-laki-laki2/
12 http://www.hlrs.de/systems/nec-sx-ace/
13 http://docs.cray.com/cgi-bin/craydoc.cgi?mode=View;id=S-2490-40


Figure 4.16 presents the results of collectively reading (left diagram) and writing (right diagram) a single shared file with an 8 KB data transfer size. For the collective reading operations on the evaluated job sizes, both explicitly enabling and disabling the collective buffering worked better than automatic. Moreover, disabling the collective buffering gained a little more when the benchmarks scaled out, because no inter-process communication among MPI processes is required (Section 2.2.3). The evaluation of the collective writing operations, however, showed the opposite: collective buffering improved the collective writing performance enormously at all scales, although the inter-process communication costs could be larger when the benchmarks scaled out.

[Plot: bandwidth in MB/s over 120 to 2,400 process elements with a 1 MB stripe size on 4 Lustre OSTs; left panel "Collectively Read a Single Shared File (8 KB Data Transfer Size)", right panel "Collectively Write a Single Shared File (8 KB Data Transfer Size)"; curves for automatic, disable and enable.]

FIGURE 4.16: Reading and Writing 8 KB per Process Using Different Setups of Collective Buffering

The optimal configurations for the 32 MB data transfer size shown in Table 4.5 implied that disabling the collective buffering led to better performance when the applications scaled beyond 1,200 MPI processes. But why? And what would happen if the collective buffering size were decreased or increased? With this question, I investigated the impact of collective buffering when reading/writing 32 MB of data per process with different striping setups on different numbers of processes. Files were created by striping over 16 OSTs with 4 MB, 8 MB, 16 MB, 32 MB and 64 MB stripe sizes.

The results in Figure 4.17 show that disabling the collective buffering accelerated the reading operations with all tested striping setups, because the I/O nodes could read data directly from the file system instead of transferring the data into their temporary buffers. As for the writing operations, collective buffering helped to improve the writing performance for small jobs, but it limited the writing performance for larger jobs. When the collective buffering size (Lustre stripe size) was increased, the collective writing performance (automatic and enable) of larger jobs improved as well. However, the bandwidth bottleneck of using the collective buffering still limited the writing performance. Therefore, larger jobs with larger data transfer sizes should disable the collective buffering.

Increasing the collective buffering size improved the collective writing operations, but which size should be chosen for a specific application? SAIO can answer this question quickly through empirical experiments. However, it is still worth analyzing the impact of different collective buffering sizes (Lustre stripe sizes) on I/O operations running on different numbers of MPI processes.


[Plot: bandwidth in MB/s over 120 to 2,400 process elements for collectively reading (left column) and collectively writing (right column) a single shared file with 32 MB data transfer size; one row of panels per stripe size (4 MB, 8 MB, 16 MB, 32 MB and 64 MB) on 16 Lustre OSTs; curves for automatic, disable and enable.]

FIGURE 4.17: Reading and Writing Data with Different Setups of Collective Buffering on Different Numbers of Processes



Firstly, let us take a look at the reading operations of the small jobs (left side of Figure 4.18). Whether the collective buffering was enabled or not, the best collective reading performance was achieved when reading files stored with a 32 MB stripe size, since each process accessed only one OST to read its 32 MB of data. If the data were stored with a stripe size < 32 MB, each process would have to access at least two OSTs. For data created with a stripe size > 32 MB, each Lustre OST would be accessed by more than one process. As for the writing operations (right side of Figure 4.18), small jobs performed better when the applications used collective buffering. For the benchmarks running on 120 processes, the buffer size had hardly any impact on the performance. For the benchmarks running on 240 processes, the best writing performance was reached when the collective buffering size was set to 8 MB. The benchmarks running on 480 processes reached the best writing performance with a collective buffering size of 32 MB. When the collective buffering was disabled, the writing performance was hardly influenced by the different Lustre stripe sizes.
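To make the OST-counting argument above concrete, the following sketch (assuming plain round-robin Lustre striping and stripe-aligned requests; it is not taken from SAIO) counts how many distinct OSTs a single contiguous 32 MB request touches for a given stripe size and stripe count:

#include <stdio.h>

/* Minimal sketch: number of distinct OSTs touched by one contiguous request
 * under round-robin striping. */
static long osts_touched(long offset, long length, long stripe_size, long stripe_count)
{
    long first = offset / stripe_size;                /* first stripe index */
    long last  = (offset + length - 1) / stripe_size; /* last stripe index  */
    long stripes = last - first + 1;
    return stripes < stripe_count ? stripes : stripe_count;
}

int main(void)
{
    const long MB = 1024 * 1024;
    /* a process reads 32 MB from a file striped over 16 OSTs */
    printf("32 MB stripes: %ld OST(s)\n", osts_touched(0, 32 * MB, 32 * MB, 16)); /* 1 */
    printf(" 4 MB stripes: %ld OST(s)\n", osts_touched(0, 32 * MB,  4 * MB, 16)); /* 8 */
    return 0;
}

With a 32 MB stripe size each aligned 32 MB request stays on a single OST, while a 4 MB stripe size spreads the same request over eight of the 16 OSTs.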

[Plot: bandwidth in MB/s over the collective buffering size (Lustre stripe size, 4 MB to 64 MB) on 16 Lustre OSTs for collectively reading (left) and writing (right) a single shared file with 32 MB data transfer size; one row of panels per job size (120, 240 and 480 processes); curves for automatic, disable and enable.]

FIGURE 4.18: Reading and Writing 32 MB per Process Using Different Setups of Collective Buffering (Small Jobs)

The reading operations of the big jobs (left side of Figure 4.19), just like the small jobs' reading operations, reached their best performance when the files were stored with a 32 MB stripe size. However, the evaluation results of the writing operations (right side of Figure 4.19) were different.


First of all, disabling the collective buffering made the writing operations run much faster, although larger collective buffering sizes helped to accelerate the writing operations as well (enable). Two possible reasons why enabling the collective buffering slowed down the writing operations with large data transfer sizes on large jobs could be (Section 2.2.3):

• The data transfer size was large enough to fill the collective temp buffer of one I/O node.

• Sending data to the collective temp buffer of other I/O nodes would take longer than directly writing to the underlying file system.

[Plot: bandwidth in MB/s over the collective buffering size (Lustre stripe size, 4 MB to 64 MB) on 16 Lustre OSTs for collectively reading (left) and writing (right) a single shared file with 32 MB data transfer size; one row of panels per job size (1200, 1536 and 2400 processes); curves for automatic, disable and enable.]

FIGURE 4.19: Reading and Writing 32 MB per Process Using Different Setups of Collective Buffering (Big Jobs)

Putting the evaluation results of the reading and writing operations together, it can be confusing which Lustre stripe size should be selected for this 32 MB data transfer size example: the 16 MB that favors the writing operations or the 32 MB that favors the reading operations? It depends on the application. If an application writes checkpoints very often, or its generated files are not the input data of other applications, it is better to optimize the writing operations (and choose 16 MB) instead of the reading operations.


If the application's results are going to be post-processed, for example visualized or read as input data by other applications multiple times, it is better to optimize the reading operations (and choose 32 MB).

4.4 Conclusion of Evaluations

SAIO's compatibility, usability, scalability and portability were proved and evaluated with the IOR benchmark. The empirical optimizing concept makes SAIO a good solution for engineering applications. SAIO does not know much about the applied parallel I/O algorithms, but it tries different possibilities in the training process and finds the optimal ones.

While researching the collective buffering mechanism, I noticed that the best configurations for some jobs (especially the small jobs running on 120, 240 and 480 processes) were not the same as the learned results in Table 4.5. There could be three reasons:

• The configuration searching scope of the Lustre stripe_size was set from 1 MB to 16 MB, not 64 MB. A larger stripe_size will be considered in future training processes.

• The optimal configurations in Table 4.5 were generated by writing a data transfer size of 40,000,000 bytes, not exactly 32 MB. This was done on purpose to test whether the found optimal configurations also work for different data transfer sizes within the same group. In practice, the data transfer sizes for the SAIO training process would be the same as those of the running applications, so the optimizing effect would be even more accurate.

• The SAIO training process as well as the benchmarks in Section 4.2.2 were run on weekdays, while the benchmarks to study collective buffering in Section 4.3 were run during the Easter vacation (2017). Therefore, the different workloads of the file system as well as of the entire HPC system could impact the I/O performance. In addition, concurrent reading/writing of Lustre OSTs (Figure 4.2) could have happened during weekdays. Running the SAIO training process occasionally will keep the optimal configuration pool up to date (learning phase). Additionally, deploying SAIO for more applications could prevent users from misusing too many file system resources such as Lustre OSTs.

Since evaluating the SAIO training process and its capability on over 1,000 Hazel Hen compute nodes would consume too many computing resources and decrease the HPC system efficiency, SAIO will be tested and deployed for large-scale projects in the near future.


Chapter 5

Engineering Use Cases

5.1 Introduction

Each year, Hazel Hen at HLRS alone provides around 1.5 billion core hours of computing resources (excluding regular maintenance) for various engineering applications. These applications typically use proprietary or open source CAE software such as ANSYS Fluent, OpenFOAM1 and SIMULIA2 to tackle engineering problems. Some of these engineering applications need standalone data-processing programs to handle their massive volumes of data. In this chapter, I will try to accelerate two real engineering applications in a production environment with SAIO. The first application, project GCS-JEAN, which received 250 million core hours on Hazel Hen in the 17th Large-Scale Call of GCS3, studies turbulence and focuses on creating quieter, safer, more fuel-efficient jet engines[63]. The second application, project DropImp, uses ANSYS Fluent for a numerical study of paint drops impacting onto dry solid surfaces[64].

5.2 Engineering Use Case - CFD: HDF5, Fortran

The prediction and reduction of noise generated by turbulent flows has become one of the major tasks of today's aircraft development and is also one of the key goals of European aircraft policy[63]. This research project, from the Institute of Aerodynamics of RWTH Aachen University4, uses CFD to simulate the flow and the acoustic field of an axial fan as well as of a helicopter engine jet. Its purpose is to research and develop quieter and more efficient axial fans with the help of HPC simulation technology. In the first phase, Large-Eddy Simulations (LES)5 are performed to determine the acoustic sources.

1 http://www.openfoam.com/
2 http://www.3ds.com/products-services/simulia/
3 http://www.gauss-centre.eu/SharedDocs/Downloads/GAUSS-CENTRE/EN/Newsroom/2017/PR_2017_05_17th_GCS_LS_Call.pdf
4 http://www.aia.rwth-aachen.de/
5 http://en.wikipedia.org/wiki/Large_eddy_simulation


In the second phase, the acoustic field in the near and far field is determined by solving the Acoustic Perturbation Equations (APE)[65] on a mesh[63].

The next sections show how to analyze the I/O requests (Section 5.2.1), describe how to use SAIO (Section 5.2.2), and present the accomplished optimization results (Section 5.2.3).

5.2.1 Analyzing Application

A self-programmed data-processing application (in Fortran) uses the parallel HDF5 I/O library to process the LES results. The application uses a hybrid parallel programming model that combines MPI and OpenMP, while the I/O requests use the parallel HDF5 API. The data processing can be divided into two steps:

• step 1: reading three large source files, which are 3.36 GB, 5.63 GB and 23.63 GB, and creating three large target files, which are 4.50 GB, 3.38 GB and 11.25 GB.

• step 2: 500 iterations, each of which reads a 5.63 GB file and generates a 6.75 GB file (reading 500 source files and creating 500 target files).

Running the application once involves reading about 2.78 TB and writing about 3.31 TB of data in all. Users have been complaining about the poor I/O performance, especially for this data-processing application.

After consulting the project users, I found out that they had never set any MPI info object for this application, but used the default setup, striping_factor=1, striping_unit=1048576, and let Cray MPI decide whether to enable or disable collective buffering for reading and writing operations. Since they could only provide the sizes of the files to be read and written, it was impossible to determine the data transfer sizes without profiling the I/O requests. I chose the configuration that was optimal in most cases of the evaluation's training process, striping_factor=16, striping_unit=4194304 with collective buffering disabled for both reading and writing operations, and instructed the users to set these MPI info objects. According to their feedback, the optimized configuration shortened the execution time of a test process by about 26.8%. As a result, they became interested in deploying SAIO for this data-processing application.

5.2.2 Applying SAIO Training Utility

First of all, SAIO needs the data transfer size of each MPI process to find the optimal configurations for this concrete data-processing application.


However, the project member initially failed to find this information due to a lack of programming knowledge. Afterwards, he managed to obtain it through an example of a Portable Batch System (PBS) script6 that runs the application with SAIO in SAIO_MODE_SIZEONLY mode (Section 3.3.1). The two generated SAIO log files are summarized in Table 5.1.

Operation   No. OPs   Max. Tran. Size   Min. Tran. Size   Total Tran. Size
Read        77,582    1,037,504 B       8 B               2.89 TB
Write       6,023     1,037,504 B       96 B              3.42 TB

TABLE 5.1: Tracing Results of APE4sources Production Process

The SAIO tracing component generated two log files, one of 60.3 MB for the 77,582 reading operations and the other of 4.71 MB for the 6,023 writing operations. With the help of the SAIO statistic utility (Section 3.4.8), the I/O requests of this application turned out to be uncomplicated. It used MPI collective I/O operations only for the largest data transfer size (1,037,504 B ≈ 0.99 MB) and MPI independent I/O operations for the other, smaller data transfer sizes. These results comply with the analysis of the parallel HDF5 I/O behavior in Section 4.2.2: MPI collective I/O subroutines for reading/writing the real data and MPI independent I/O subroutines for its metadata. The traced total data transfer size of both operations (≈ 2.89 TB + 3.42 TB = 6.31 TB) is about 0.22 TB larger than the total file size (≈ 2.78 TB + 3.31 TB = 6.09 TB).

According to the tracing log files, the SAIO training utility extracted 8 different data transfer sizes for read and 2 data transfer sizes for write (Table 5.2). The MPI info objects were chosen based on the experience from the evaluations in Section 4.2.1. The maximal striping_factor was set to 20, considering the I/O performance and the competition for Lustre OST resources (Figures 4.1 and 4.2)7. 32 MB was chosen as the maximal stripe size because of the evaluations in Section 4.3.

Name                      Value                                    Quantity
number of processes       1200                                     1
Read transfer size (B)    8; 16; 80; 328; 352; 512; 544; 1037504   8
Write transfer size (B)   96; 1037504                              2
romio_cb_read             automatic; disable; enable               3
romio_cb_write            automatic; disable; enable               3
striping_factor           8; 12; 16; 20                            4
striping_unit             1048576 - 33554432                       6

TABLE 5.2: Configurations' Searching Scope for Training Process APE4sources
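Conceptually, the training utility benchmarks each write transfer size against the Cartesian product of the hint values in Table 5.2 (cf. SAIO_ERR_CONF_CARTESIAN_PRODUCT in Listing A.2). The following stand-in sketch (not SAIO's actual code) enumerates such a product; the six striping_unit values between 1048576 and 33554432 are assumed here to be the powers of two from 1 MB to 32 MB.

#include <stdio.h>

/* Simplified stand-in: enumerate the Cartesian product of the
 * write-related hint values from Table 5.2. */
int main(void)
{
    const char *cb_write[] = { "automatic", "disable", "enable" };
    const char *factor[]   = { "8", "12", "16", "20" };
    const char *unit[]     = { "1048576", "2097152", "4194304",
                               "8388608", "16777216", "33554432" };
    int n = 0;

    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 4; j++)
            for (int k = 0; k < 6; k++) {
                /* one training run is benchmarked per combination */
                printf("romio_cb_write=%s striping_factor=%s striping_unit=%s\n",
                       cb_write[i], factor[j], unit[k]);
                n++;
            }
    printf("%d combinations per write transfer size\n", n); /* 72 */
    return 0;
}

With the two write transfer sizes from Table 5.2 this gives 2 x 72 = 144 write configurations, which matches the 144 write files generated by the training process described below.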

Since no training process was developed with the parallel HDF5 I/O library, I used the self-implemented MPI application to search for the optimal configuration set.

6 http://www.pbspro.org/
7 Figure 4.1 shows that more OSTs will not bring more improvements for applications running on 1,200 processes.


Executing the SAIO training process took about 250 seconds (about 83 core hours) and generated 152 files for the read (8 files) and write (2 × 3 × 4 × 6 = 144 files) operations. Table 5.3 lists the optimal configurations found by the SAIO training utility.

saio_file_type   cb_read   cb_write   striping_factor   striping_unit
0                disable   enable     20                4194304
1                disable   enable     8                 1048576
2                disable   -          -                 -
4                disable   -          -                 -
35               disable   enable     20                4194304

TABLE 5.3: Found Optimal Configurations after Training Process APE4sources

5.2.3 Optimization and Results

As indicated by the tracing log files, the number of reading operations (77,582) was almost 13 times the number of writing operations (6,023). I therefore had reason to believe that the bottleneck of the process was reading the source files. I tested the following four setups and obtained the optimization results shown in Figure 5.1:

• Original (1 OST / 1 MB Read & 1 OST / 1 MB Write) - Reading source files with (striping_factor=1, striping_unit=1048576) and writing files with (striping_factor=1, striping_unit=1048576): This setup has been the system default since the project started. It took about 5.95 hours for 50 Hazel Hen compute nodes to finish.

• SAIO-1 (1 OST / 1 MB Read & 16 OSTs / 4 MB Write) - Reading source files with (striping_factor=1, striping_unit=1048576) and writing files with (striping_factor=16, striping_unit=4194304): The setup for the writing operations is the one I suggested to the user for his test process, while the setup for the reading operations is romio_cb_read=automatic. It took about 5.97 hours for 50 Hazel Hen compute nodes to finish.

• SAIO-2 (16 OSTs / 4 MB Read & 16 OSTs / 4 MB Write) - Reading and writing files with (striping_factor=16, striping_unit=4194304): To verify my speculation about the bottleneck, I reallocated all source files, striping them over 16 OSTs with a 4 MB stripe size. SAIO ran in SAIO_MODE_OPTCOLL mode, tested the combinations with romio_cb_read=disable listed in Table 5.3, and set the most used optimal configuration (striping_factor=16, striping_unit=4194304). It took about 2.23 hours for 50 Hazel Hen compute nodes to finish.

• SAIO-3 (16 OSTs / 4 MB Read & 20 OSTs / 8 MB Write) - Reading source files with (striping_factor=16, striping_unit=4194304) and writing files with (striping_factor=20, striping_unit=4194304): Similar to SAIO-2, SAIO-3 ran in SAIO_MODE_OPTCOLL mode to apply the optimal configurations listed in Table 5.3, which resulted from its training outcomes. It took about 2.10 hours for 50 Hazel Hen compute nodes to finish.

The optimization effects are represented as resource consumption, namely core hours, for the APE4sources process in Figure 5.1. The Original setup took 7,155 core hours to complete processing all files. Only writing the result files striped over 16 OSTs (SAIO-1) did not bring any improvement, since the bottleneck was the reading process. The other two setups, SAIO-2 and SAIO-3, on the contrary, accelerated the process enormously. SAIO-2, writing target files striped over 16 OSTs and disabling the reading collective buffer, took 2,677 core hours, only 37.4% of the Original consumption, to process the same files. Moreover, SAIO-3 approached the speed limit even further by writing target files striped over 20 OSTs and disabling the reading collective buffer, as found in the previous training results. For this very process, SAIO managed to accelerate APE4sources by about 184%.

[Plot: "GCS-JEAN: APE4sources Production Process on 1200 Processes", core hours for 77,582 reads (2.78 TB) and 6,023 writes (3.31 TB) - Original: 7,155; SAIO-1: 7,169; SAIO-2: 2,677; SAIO-3: 2,521.]

FIGURE 5.1: Optimizing Results of Running APE4sources Process Once with Different Configurations on 1200 Processes
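The percentages quoted above follow directly from the core-hour totals in Figure 5.1:

2,677 / 7,155 ≈ 37.4% (SAIO-2 relative to Original), and 7,155 / 2,521 ≈ 2.84, i.e. SAIO-3 accelerates APE4sources by about 184%.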

Since it is not recommended to use extra resources to redistribute the source files over more OSTs, I asked the project users for more information about the I/O path of their simulations. Many of the file-generating processes read/write their data independently. The following two processes handle even more files than APE4sources:

• h5der: Using 484 compute nodes (12 CPU cores per node, 5,808 CPU cores in all) to read 5,802 HDF5 files (unknown size) and write 5,802 HDF5 files (≈ 6 MB/file, 34 GB of data in total) independently. This process finished within 65 seconds with a file-per-process I/O pattern and does not call any MPI-IO subroutine.

• repartition: Using 484 compute nodes (24 CPU cores per node, 11,616 CPU cores in all) to read 5,802 HDF5 files (unknown size) and write 5,802 HDF5 files (≈ 192 MB/file, 1,087.88 GB of data in total) independently. This process finished within 84 seconds with a file-per-process I/O pattern and does not call any MPI-IO subroutine.

Since these two processes (h5der and repartition) do not use the MPI-IO library, SAIO cannot trace or optimize them. However, Lustre allows users to set a directory's striping setup through commands, and new files created in that directory inherit the striping setup. Therefore, the files generated by these two processes can be striped over multiple Lustre OSTs. I used IOR with a file-per-process I/O pattern running on 7,200 processes to simulate h5der and repartition, so as to analyze the performance impact of striping their result files over multiple OSTs. Figure 5.2 shows that the performance of the file-per-process I/O pattern is affected by multiple OSTs, especially for small data transfer sizes (like writing HDF5 metadata). The writing bandwidth for these small data transfer sizes drops far more steeply than for larger ones.

[Plot: "File per Process Writing to Different Number of OSTs (7200 Processes)", bandwidth in MB/s over 1 to 20 Lustre OSTs; left panel for 64 KB, 128 KB, 256 KB, 512 KB and 1 MB data transfer sizes, right panel for 6 MB, 16 MB, 32 MB, 64 MB and 192 MB data transfer sizes.]

FIGURE 5.2: Writing Files with File-per-Process Pattern on 7200 Processes (Stripe Size is 4 MB)

Coming back to the two processes mentioned above, process repartition (writing 192 MB files across 20 OSTs) should hardly be affected, while process h5der (writing 6 MB files across 20 OSTs) might last twice as long as when writing to 1 OST. Although the analysis is still in progress, SAIO has already helped the engineers to identify the bottleneck and offered optimization suggestions.

5.2.4 Conclusion

SAIO proved to be not only an I/O analysis tool or experimental software; it is also capable of accelerating engineering applications in a production environment.


In this use case, I used the SAIO training utility, implemented with the MPI-IO library in C, and successfully found the optimal configurations for APE4sources, a parallel HDF5 application written in Fortran. Setting these optimal configurations saves about 4,634 core hours of computing resources every time the engineers run APE4sources. Creating files with the setups suggested by SAIO would also improve the reading performance of subsequent processes.

5.3 Engineering Use Case - CFD: ANSYS Fluent

ANSYS, one of the most popular CAE software developers, provides various solutions to scientists and engineers. In November 2016, ANSYS, HLRS and Cray successfully scaled ANSYS Fluent, one of its powerful CFD software tools, to 172,032 compute cores running at 82% efficiency on Hazel Hen8. An engineering project from the Fraunhofer Institute for Manufacturing Engineering and Automation (IPA)9 performs numerical simulations with ANSYS Fluent based on the finite-volume approach to investigate air entrapment in paint drops impacting onto dry solid surfaces[64]. Using a grid size of 80 million cells, they determined that running the simulation on 1,200 and 2,400 processes achieves the best computing performance[64]. However, reading case files, writing checkpoint files and writing result files still consumes too much time. The software's I/O performance has hardly received any attention. The users reported that they had to reserve one extra hour of computing time for a ten-hour process to write three checkpoint files and one result file (about 33 GB per file).

This use case runs on Hazel Hen, where the WS8 Lustre file system is mounted. Section 5.3.1 investigates the I/O behavior of ANSYS Fluent with a small production example as well as the users' simulations, followed by the results of the SAIO training process and of the optimization in Sections 5.3.2 and 5.3.3. Section 5.3.4 explains the shortcomings of SAIO for ANSYS Fluent and sums up the use case.

5.3.1 Analyzing Application

ANSYS Fluent integrates three different I/O modules. The first one is called serial I/O (DAT), which needs a host node/process to handle all I/O operations. For a reading process, the host node/process reads data from the file system and distributes them to the other processes through the interconnect network of the HPC system. For a writing process, all other processes send their data to the host node/process, which is responsible for writing the data to the file system. The second module is parallel I/O (PDAT) using the MPI-IO library; it lets all processes read/write data simultaneously (since version 12). The third one is also parallel I/O, but uses the parallel HDF5 I/O library (since version 16).

8 http://www.investors.ansys.com/press-releases/2016/15-Nov-16-121021132
9 http://www.ipa.fraunhofer.de/de/Kompetenzen/beschichtungssystem–und-lackiertechnik.html


It supports reading/writing HDF5 files both independently and collectively. Compared to the serial I/O module, the two parallel I/O modules should improve the I/O performance and can potentially eliminate a lot of inter-process communication for sending data.

Users can activate the parallel MPI I/O either by reading/writing .pdat files through the Fluent GUI or by answering yes to the text command file/write-pdat. For the parallel HDF5 I/O, users need to answer yes to the text command file hdf-files and to set the independent or collective I/O mode accordingly.

To verify the compatibility of SAIO with ANSYS Fluent as well as with the user's engineering application, I tested one of the user's production processes with the different I/O modules (the same striping setup, striping_factor=16 and striping_unit=4194304, was applied by the SAIO instrumentation). Since the different data formats have different file sizes, bandwidth is no longer the best characteristic to represent the differences. Therefore, I used the time consumption in seconds to demonstrate the results of processing the following three data formats, which can be found in Figure 5.3 and Table 5.4:

• DAT (case & data file): reading 7,157 MB - writing 7,157 MB

• PDAT (case & parallel data file): reading 7,157 MB - writing 5,287 MB

• HDF5 (case & HDF5 data file): reading 10,382 MB - writing 10,382 MB

[Plot: "DropImp: Test Run of a Production Process on 240 Processes (WS8)", read and write duration in seconds - DAT: 148, PDAT: 111, SAIO PDAT: 97, Ind. HDF5: 98, SAIO Ind. HDF5: 69, Coll. HDF5: 48, SAIO Coll. HDF5: 40; annotation: SAIO shortened PDAT by 14 seconds / 12.6%, Ind. HDF5 by 29 seconds / 29.6% and Coll. HDF5 by 8 seconds / 16.7%.]

FIGURE 5.3: Optimization Results of Running Part of a Production Process on 240 Processes

Comparing the three data formats, the HDF5 I/O module worked better than the other two I/O modules, despite its larger file size. The SAIO instrumentation proved suitable for the parallel I/O modules, PDAT and HDF5.


OPs     DAT   PDAT   SAIO PDAT   Ind. HDF5   SAIO Ind. HDF5   Coll. HDF5   SAIO Coll. HDF5
Write   103   60     53          80          50               22           16
Read    45    51     44          18          19               26           24
W + R   148   111    97          98          69               48           40

TABLE 5.4: Optimizing Results of Running Part of a Production Process on 240 Processes (Operations' Duration in Seconds)

I also noticed that the computing time for the solve iterations barely changed, although the inter-process communication should have been eliminated by the two parallel I/O modules. It seems that ANSYS Fluent uses non-blocking I/O for its serial I/O module: while the host node/process is reading/writing data, it distributes/collects the data to/from the other processes at the same time. Nevertheless, the project users are recommended to abandon the legacy .dat/.pdat formats and switch to the parallel HDF5 I/O module.

After SAIO was proved compatible with ANSYS Fluent, another, larger production process, lactec_1v64, came into focus. This process executes 20,000 time steps of solve. During the first 15,000 time steps, the process writes a checkpoint file every 500 time steps. After every 1,000 time steps, users are supposed to check whether everything went well and submit the next job, which reads the results of the previous job as input files. Each job executing 1,000 time steps needs to reserve 50 compute nodes (1,200 processes) and 4 hours of wall time10. As a result, there are 15 jobs reserving 60 hours of wall time, with 15× reading and 30× writing operations. For the last 5,000 time steps, instead of checkpoint files, the users want to store the interim results every 50 time steps. These interim results help the users to capture more data and to calculate the final result. Because of the poor I/O performance, the users have to submit 10 jobs for the last 5,000 time steps, each job reserving 50 compute nodes and 4 hours of wall time to run 500 time steps. Under these circumstances, 10 jobs reserving 40 hours of wall time execute 10× reading and 100× writing operations. Similar to the previous test process, the HDF5 file size is much larger than that of DAT. Table 5.5 presents a summary of the process's I/O requests. Using the HDF5 format, process lactec_1v64 requires 642 GB more storage space than using the DAT format.

Format   1× Read    1× Write   25× Reads   130× Writes   100× Interim Files
DAT      13.98 GB   13.98 GB   349.50 GB   1,817.40 GB   1,398 GB
HDF5     20.40 GB   20.40 GB   510.00 GB   2,652.00 GB   2,040 GB
∆Size    6.42 GB    6.42 GB    160.50 GB   834.60 GB     642 GB

TABLE 5.5: Data Size Summary of Reading/Writing Different Data Formats for Process lactec_1v64

As shown in Table 5.5, the number of writing operations (130) is over 5 times the number of reading operations (25). Moreover, these 130 writing operations of the entire process cannot be reduced, while the number of reading operations can be decreased by submitting longer jobs.

10 http://en.wikipedia.org/wiki/Wall-clock_time


Instead of 25 jobs with 4 hours of wall time, users can submit 5 jobs with 20 hours of wall time and correspondingly more time steps of solve. This change would result in 130 writing operations and (only) 5 reading operations. Since the source files of each reading operation are the result files of the previous job, optimizing the writing operations will also lead to better reading performance (Section 4.2.2).

5.3.2 Applying SAIO Training Utility

ANSYS Fluent is proprietary software and the I/O requests are encapsulated within its I/O modules. Therefore, I could only analyze its I/O behavior through the SAIO tracing results. Running ANSYS Fluent with its independent and its collective HDF5 I/O module generates the same data transfer sizes (Listing 5.1 for read and Listing 5.2 for write).

8, 9, 32, 38, 48, 53, 72, 80, 82, 87, 104, 146, 152, 168, 176, 184, 232, 235, 240,
248, 254, 256, 272, 283, 328, 352, 440, 488, 512, 520, 544, 552, 576, 720, 864,
54508, 82122, 109016, 132688, 164244, 325088, 327048, 331392, 524288,

LISTING 5.1: 44 Different Data Transfer Sizes (Byte) of Reading Operations from Process lactec_1v64

8, 26, 32, 35, 48, 96, 235, 254, 283, 864, 54508, 82122, 109016, 164244, 327048,
328488, 331392, 656976,

LISTING 5.2: 18 Different Data Transfer Sizes (Byte) of Writing Operations from Process lactec_1v64

As analyzed previously and considering the number of I/O operations, the optimization focuses on the bottleneck of this use case, the writing operations. Table 5.6 lists the configurations' search scope for the writing operations. Although the input and output files, case and data file, are relatively large (Table 5.5), the process's data transfer sizes of up to 656,976 B ≈ 0.63 MB (Listing 5.2) are comparatively small. Therefore, 16 MB was set as the maximal stripe size. Additionally, at most 20 Lustre OSTs are used, since more OSTs do not bring much more improvement for applications running on 1,200 processes (Figure 4.1).

Name                      Value                        Quantity
number of processes       1200                         1
Write transfer size (B)   see Listing 5.2              18
romio_cb_write            automatic; disable; enable   3
striping_factor           8; 10; 12; 16; 20            5
striping_unit             1048576 - 16777216           5

TABLE 5.6: A List of Configurations' Searching Scope for Writing Operations of Process lactec_1v64

Each time the training process ran, it created 1,350 files (18 × 3 × 5 × 5 hint combinations, see Table 5.6) in the WS8 Lustre file system and spent about 16.15 minutes on 50 Hazel Hen compute nodes (≈ 323 core hours).


To eliminate possible interference during the training process11, I ran the same training process twice (≈ 646 core hours in all). For the few reading operations, I disabled their collective buffering and did not spend additional computing time on training them. After the SAIO training process, a list of optimal configurations was generated (Table 5.7). As the default configuration (saio_file_type=0) indicates, the generated files were distributed over 20 OSTs, which was expected to accelerate the reading performance of the next process.

saio_file_type   romio_cb_write   striping_factor   striping_unit
0                enable           20                8388608
1                enable           8                 2097152
2                enable           8                 8388608
6                enable           12                4194304
21               enable           16                2097152
24               enable           20                2097152
26               automatic        20                2097152
28               enable           16                8388608
31               enable           20                8388608

TABLE 5.7: A List of Found Optimal Writing Configurations for Process lactec_1v64 (saio_file_type Definition in Listing A.3)

5.3.3 Optimization and Results

To evaluate the training results, I configured the process to execute 500 time steps of solve without writing a checkpoint file for the following five test setups12:

• DAT: Serial I/O module without the MPI-IO library, as the basis of the evaluation for the optimization

• Ind. HDF5: Parallel HDF5 module with independent I/O operations and the system default setup13

• SAIO Ind. HDF5: Parallel HDF5 module with independent I/O operations and SAIO optimization

• Coll. HDF5: Parallel HDF5 module with collective I/O operations and the system default setup

• SAIO Coll. HDF5: Parallel HDF5 module with collective I/O operations and SAIO optimization

As the evaluation results in Figure 5.4 and Table 5.8 show, the new HDF5 I/O module without optimization already shortened the I/O requests from 486 seconds to 322 seconds with the independent HDF5 module and to 190 seconds with the collective HDF5 module.

11 There could be some other application accessing the file system, which impacts the tracing results.
12 Due to the compatibility of the software, PDAT optimization has not been taken into consideration.
13 The WS8 file system's default setup is 4 OSTs and 1 MB stripe size.


Based on that, SAIO cut down another 86 seconds for independent HDF5 and 57 seconds for collective HDF5. Since independent HDF5 did not use any collective I/O algorithm, disabling collective buffering had no impact on the independent reading operations; the 6 seconds difference was caused by the different file distributions (4 OSTs vs. 20 OSTs). On the other hand, disabling the collective buffering for collective HDF5 reading operations on source files distributed over 20 OSTs took 22 seconds less, a shortening of 22.9%. As for the writing operations, striping files over more OSTs reduced the processing time by 80 seconds (33.8%) for independent HDF5 and by 35 seconds (37.2%) for collective HDF5.

Operation    DAT   Ind. HDF5   SAIO Ind. HDF5   Coll. HDF5   SAIO Coll. HDF5
1× Write     327   237         157              94           59
1× Read      159   85          79               96           74
1× (W + R)   486   322         236              190          133

TABLE 5.8: Optimizing Results of Running Part of Production Process lactec_1v64 on 1200 Processes (Duration in Seconds)

As shown in Figure 5.4, the evaluation results of running 500 time steps indicate that SAIO realizes a 23.6% (independent) and 30.0% (collective) time saving for one reading and one writing operation.

[Plot: "DropImp: Another Production Process on 1200 Processes (WS8)", duration in seconds for one read and one write - DAT: 486, Ind. HDF5: 322, SAIO Ind. HDF5: 236, Coll. HDF5: 190, SAIO Coll. HDF5: 133; annotation: SAIO shortened Ind. HDF5 by 76 seconds / 23.6% and Coll. HDF5 by 57 seconds / 30.0%.]

FIGURE 5.4: Optimizing Results of Running Part of Production Process lactec_1v64 on 1200 Processes (Only Read & Write)

Because of the huge amount of work and the extremely heavy resource usage of executing the 20,000 time steps of solve five times, I estimated the resource consumption of the process's I/O requests based on the application analysis results (25× reads and 130× writes) and on the evaluation results in Table 5.8.


Table 5.9 presents the estimated duration in seconds of the I/O requests of the entire production process. Figure 5.5 shows the estimated total core-hour consumption. The I/O requests of the original DAT process would take almost 13 hours; SAIO could shorten the entire process's I/O requests by about 2.93 hours for independent HDF5 I/O and by about 1.42 hours for collective HDF5 I/O.

Operation     DAT      Ind. HDF5   SAIO Ind. HDF5   Coll. HDF5   SAIO Coll. HDF5
130× Write    42,510   30,810      20,410           12,220       7,670
25× Read      3,975    2,125       1,975            2,400        1,850
ALL (W + R)   46,485   32,935      22,385           14,620       9,520

TABLE 5.9: Estimated Optimizing Results of Running Production Process lactec_1v64 on 1200 Processes (Duration in Seconds)
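The entries in Table 5.9 are consistent with simply scaling the single-operation durations from Table 5.8 by the operation counts; for the DAT column, for example:

130 × 327 s = 42,510 s, 25 × 159 s = 3,975 s, and 42,510 s + 3,975 s = 46,485 s ≈ 12.9 hours,

which is the "almost 13 hours" quoted above for the original DAT process.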

[Plot: "DropImp: Estimation of Entire Resource Consumption Reads/Writes (WS8)", estimated core hours for 25 reads and 130 writes - DAT: 15,495; Ind. HDF5: 10,978; SAIO Ind. HDF5: 7,461; Coll. HDF5: 4,873; SAIO Coll. HDF5: 3,174; annotation: SAIO could save 3,517 core hours / 32.0% for Ind. HDF5 and 1,699 core hours / 34.9% for Coll. HDF5.]

FIGURE 5.5: Estimated Results of Running Entire Production Process lactec_1v64 on 1200 Processes (Only Read & Write)

5.3.4 Conclusion

In this use case, SAIO proved to be a functioning I/O analysis and tuning software for the proprietary engineering software ANSYS Fluent. This use case is not an I/O-heavy application like the previous one, since its I/O requests consume only about 8.1% (DAT), 5.4% (Ind. HDF5) and 3.2% (Coll. HDF5) of the computing time14. However, with the same amount of computing resources, SAIO lets engineers obtain a more accurate result by creating intermediate data every 35 time steps.

14 The test process executing 500 time steps spent about 6,000 seconds and consumed about 2,000 core hours. Among them, the I/O requests took 486 (DAT), 322 (Ind. HDF5) and 190 (Coll. HDF5) seconds.


Developing a dedicated SAIO training process for ANSYS Fluent based on the tracing results will find even more accurate optimal configurations and attract more engineers to use SAIO for their CFD simulations.


Chapter 6

Conclusion and Future Work

6.1 Conclusion

In this dissertation, I have presented and evaluated SAIO, a semi-automatic I/O tuning solution for engineering applications. SAIO is implemented on top of the MPI-IO library, which is widely used in today's HPC systems. Following the current MPI standard makes SAIO compatible with MPI-based scientific and engineering applications and, at the same time, portable to different HPC platforms deploying different MPI implementations. Its design concept, which is transparent to application users, is approachable for engineers with little or no knowledge of parallel I/O and has already attracted them to deploy SAIO for their simulations. Evaluating SAIO with the widely used I/O benchmark IOR and the I/O tracing software Darshan has proved its compatibility, scalability and portability. Optimizing two engineering applications from different professional areas shows the usability of SAIO.

SAIO provides less detailed I/O tracing information than Darshan does. Therefore, engineers or scientists can easily understand the tracing results without any post-processing utility. Its intuitive log files present enough I/O tuning information for users with little knowledge of parallel I/O to carry out the analysis and optimization. Unlike SIOX, SAIO uses simple text files in a JSON-like format, which eliminates the high latency of establishing database connections and offers easily understandable tracing and tuning information. By changing the SAIO configuration files manually, engineers can try different I/O setups for their applications without changing or recompiling the source code, and thus accelerate the processes by themselves.

Being built upon the MPI-IO library enlarges the usage scope of SAIO. It supports not only MPI applications, but also parallel HDF5 and parallel NetCDF applications as well as many proprietary and open source software packages that access the MPI-IO library. The two-staged optimization, with the first stage for the file open operation and the second stage for the read/write operations, realizes a dynamic and fully automated I/O tuning mechanism.


This optimizing concept simplifies the I/O tuning process and offers a better solution than directly setting MPI info objects in scripts or source code, especially when the application process stays unchanged while the input data are replaced.

While designing and implementing SAIO, I tried to make it cover as many applications and HPC platforms as possible. The training results (the found optimal configurations) may only be suitable for one HPC platform, but thanks to the semi-automated SAIO training utility, searching for new optimal configurations on another HPC platform is simple and fast. With the help of the SAIO statistic utility and the log files, scientists and engineers can analyze the I/O requests of their applications. The two engineering use cases have not only presented successful tuning results in a production environment, but also provide a guideline for potential users to analyze and optimize their own applications.

6.2 Future Work

Besides the two engineering use cases, SAIO has been tested with the WRF model, one of the most popular open source software packages for climate research. The reasons why I chose WRF model applications to optimize are as follows: Firstly, climate research codes are mostly I/O-heavy applications. Secondly, the WRF implementation uses a hybrid parallel programming model combining MPI and OpenMP. Its model coupling API provides a uniform and package-independent interface between WRF and external packages for parallel I/O and data formatting. This software architecture makes the I/O optimization easier and allows computer scientists with little knowledge about climate research to improve the I/O performance. Its I/O kernel is easy to extract and could be integrated into the SAIO training utility. After successfully testing SAIO with the WRF model using an online tutorial (Appendix B), using SAIO to optimize a climate research project is now in progress.

Since training with the I/O kernel of an application generates the most suitable optimal configurations, I am considering an easy way to reconstruct an application's I/O kernel from log files without looking deeply into the application. Kim et al. have proposed an automatic code instrumentation technique to collect detailed statistics of the I/O stack[66]. Their I/O tracing tool uses a configuration file to specify the tracing targets, such as the I/O stack, specific I/O or file system subroutines and the storage system status. With a proper configuration, it can trace the workflow of I/O calls across the different layers of an I/O stack and then present a clear I/O path for studying the I/O requests of end users' applications. This automated tracing mechanism does not require users to change their application source code. Instead of developing a brand new solution, I am going to study this innovative tracing tool and try to make it support SAIO.

Compared to the improvements that SAIO achieves, its training process does not consume many computing resources. However, it would be even more efficient if SAIO were able to map the optimal configurations from one HPC system to another.


When SAIO has collected enough tracing information from different HPC systems, I will try to find the connections and design a mapping mechanism to transfer optimal configurations among multiple HPC systems. In addition, designing machine learning algorithms and using deep learning to predict I/O performance will be a big challenge and also the focus of my future research.


Appendix A

Code Segments

extern int errno;

// data structure of saio_mpi_info_t
typedef struct saio_mpi_info_t {
    char name[MPI_MAX_INFO_KEY + 1];
    char value[MPI_MAX_INFO_VAL + 1];
} saio_mpi_info_t;

// data structure of saio_mpi_info_conf_t
typedef struct saio_mpi_info_conf_t {
    int saio_mpi_info_size;
    char *mpi_info_str;
    saio_mpi_info_t *saio_mpi_infos[MAX_SAIO_MPI_INFO_SIZE];
} saio_mpi_info_conf_t;

// data structure of saio_conf_t
typedef struct saio_conf_t {
    double bandwidth[MAX_SAIO_FILE_TYPES];
    double duration[MAX_SAIO_FILE_TYPES];
    double time_stamp[MAX_SAIO_FILE_TYPES];
    saio_mpi_info_conf_t *saio_mpi_info_confs[MAX_SAIO_FILE_TYPES];
} saio_conf_t;

// data structure of saio_record_t
typedef struct saio_record_t {
    double op_time_stamp;
    char *op_name;
    long long int bytes;
    double op_duration;
    double bandwidth;
    int saio_mpi_info_size;
    saio_mpi_info_t *saio_mpi_infos[MAX_SAIO_MPI_INFO_SIZE];
    int saio_file_type;
} saio_record_t;

// data structure of saio_log_t
typedef struct saio_log_t {
    int mpi_rank_size;
    int saio_records_size;
    saio_record_t *saio_records[MAX_NO_SAIO_RECORDS];
} saio_log_t;

LISTING A.1: SAIO Data Structures

/*
 * Error classes and codes
 */
#define SAIO_SUCCESS 0
// SAIO trace error codes
#define SAIO_ERR_TRACE_INIT 100
#define SAIO_ERR_TRACE_FINALIZE 101
#define SAIO_ERR_TRACE_OPEN 102
#define SAIO_ERR_TRACE_READ 103
#define SAIO_ERR_TRACE_WRITE 104
// SAIO record error codes
#define SAIO_ERR_RECORD_INIT 200
#define SAIO_ERR_RECORD_FINALIZE 201
#define SAIO_ERR_RECORD_OP 202
// SAIO log file error codes
#define SAIO_ERR_GENERATE_LOG_FILES 300
#define SAIO_ERR_READ_LOG_FILE 301
#define SAIO_ERR_PARSE_LOG 302
// SAIO configuration error codes
#define SAIO_ERR_SET_MPI_INFO 400
#define SAIO_ERR_CREATE_CONF 401
#define SAIO_ERR_READ_CONF 402
// SAIO optimizing error codes
#define SAIO_ERR_OPT_INIT 500
#define SAIO_ERR_OPT_FINALIZE 501
// SAIO training error codes
#define SAIO_ERR_TRAINING_FILE_SIZE_CSV 600
#define SAIO_ERR_PARSE_TRAINING_CONF_FILE 601
#define SAIO_ERR_READ_TRAINING_CONF_FILE 602
#define SAIO_ERR_GEN_CONF_FROM_TRAINING_CONF_FILE 603
#define SAIO_ERR_CONF_CARTESIAN_PRODUCT 604
#define SAIO_ERR_CREATE_TRAINING_CONF_FILE 605
// SAIO parse JSON error codes
#define SAIO_ERR_GET_JSON_VALUE 700
#define SAIO_ERR_GET_JSON_STRING_VALUE 701
#define SAIO_ERR_GET_JSON_NUMBER_VALUE 702
#define SAIO_ERR_GET_JSON_ARRAY_VALUE 703
// SAIO tools error codes
#define SAIO_ERR_TOOL_STATISTIC 800
#define SAIO_ERR_TOOL_STATISTIC_GEN_CSV 801
// SAIO other error codes
#define SAIO_ERR_FILE_TYPE 900
#define SAIO_ERR_PARSE_JSON_MPI_INFO 901
#define SAIO_ERR_OTHERS 999

LISTING A.2: SAIO Error Codes

/*
 * Defining different file types
 */
#define SAIO_FILE_TYPE_DEFAULT 0   // For unknown file size
#define SAIO_FILE_TYPE_1  1   // 0 bytes <= file_size per MPI process < 256 bytes
#define SAIO_FILE_TYPE_2  2   // 256 bytes <= file_size per MPI process < 384 bytes
#define SAIO_FILE_TYPE_3  3   // 384 bytes <= file_size per MPI process < 512 bytes
#define SAIO_FILE_TYPE_4  4   // 512 bytes <= file_size per MPI process < 640 bytes
#define SAIO_FILE_TYPE_5  5   // 640 bytes <= file_size per MPI process < 768 bytes
#define SAIO_FILE_TYPE_6  6   // 768 bytes <= file_size per MPI process < 896 bytes
#define SAIO_FILE_TYPE_7  7   // 896 bytes <= file_size per MPI process < 1024 bytes
#define SAIO_FILE_TYPE_8  8   // 1024 bytes <= file_size per MPI process < 2048 bytes
#define SAIO_FILE_TYPE_9  9   // 2048 bytes <= file_size per MPI process < 3072 bytes
#define SAIO_FILE_TYPE_10 10  // 3072 bytes <= file_size per MPI process < 4096 bytes
#define SAIO_FILE_TYPE_11 11  // 4096 bytes <= file_size per MPI process < 5120 bytes
#define SAIO_FILE_TYPE_12 12  // 5120 bytes <= file_size per MPI process < 6144 bytes
#define SAIO_FILE_TYPE_13 13  // 6144 bytes <= file_size per MPI process < 7168 bytes
#define SAIO_FILE_TYPE_14 14  // 7168 bytes <= file_size per MPI process < 8192 bytes
#define SAIO_FILE_TYPE_15 15  // 8192 bytes <= file_size per MPI process < 9216 bytes
#define SAIO_FILE_TYPE_16 16  // 9216 bytes <= file_size per MPI process < 10240 bytes
#define SAIO_FILE_TYPE_17 17  // 10240 bytes <= file_size per MPI process < 20480 bytes
#define SAIO_FILE_TYPE_18 18  // 20480 bytes <= file_size per MPI process < 30720 bytes
#define SAIO_FILE_TYPE_19 19  // 30720 bytes <= file_size per MPI process < 40960 bytes
#define SAIO_FILE_TYPE_20 20  // 40960 bytes <= file_size per MPI process < 51200 bytes
#define SAIO_FILE_TYPE_21 21  // 51200 bytes <= file_size per MPI process < 61440 bytes
#define SAIO_FILE_TYPE_22 22  // 61440 bytes <= file_size per MPI process < 71680 bytes
#define SAIO_FILE_TYPE_23 23  // 71680 bytes <= file_size per MPI process < 81920 bytes
#define SAIO_FILE_TYPE_24 24  // 81920 bytes <= file_size per MPI process < 92160 bytes
#define SAIO_FILE_TYPE_25 25  // 92160 bytes <= file_size per MPI process < 102400 bytes
#define SAIO_FILE_TYPE_26 26  // 102400 bytes <= file_size per MPI process < 204800 bytes
#define SAIO_FILE_TYPE_27 27  // 204800 bytes <= file_size per MPI process < 307200 bytes
#define SAIO_FILE_TYPE_28 28  // 307200 bytes <= file_size per MPI process < 409600 bytes
#define SAIO_FILE_TYPE_29 29  // 409600 bytes <= file_size per MPI process < 512000 bytes
#define SAIO_FILE_TYPE_30 30  // 512000 bytes <= file_size per MPI process < 614400 bytes
#define SAIO_FILE_TYPE_31 31  // 614400 bytes <= file_size per MPI process < 716800 bytes
#define SAIO_FILE_TYPE_32 32  // 716800 bytes <= file_size per MPI process < 819200 bytes
#define SAIO_FILE_TYPE_33 33  // 819200 bytes <= file_size per MPI process < 921600 bytes
#define SAIO_FILE_TYPE_34 34  // 921600 bytes <= file_size per MPI process < 1024000 bytes
#define SAIO_FILE_TYPE_35 35  // 1024000 bytes <= file_size per MPI process < 2048000 bytes
#define SAIO_FILE_TYPE_36 36  // 2048000 bytes <= file_size per MPI process < 3072000 bytes
#define SAIO_FILE_TYPE_37 37  // 3072000 bytes <= file_size per MPI process < 4096000 bytes
#define SAIO_FILE_TYPE_38 38  // 4096000 bytes <= file_size per MPI process < 5120000 bytes
#define SAIO_FILE_TYPE_39 39  // 5120000 bytes <= file_size per MPI process < 6144000 bytes
#define SAIO_FILE_TYPE_40 40  // 6144000 bytes <= file_size per MPI process < 7168000 bytes
#define SAIO_FILE_TYPE_41 41  // 7168000 bytes <= file_size per MPI process < 8192000 bytes
#define SAIO_FILE_TYPE_42 42  // 8192000 bytes <= file_size per MPI process < 9216000 bytes
#define SAIO_FILE_TYPE_43 43  // 9216000 bytes <= file_size per MPI process < 10240000 bytes
#define SAIO_FILE_TYPE_44 44  // 10240000 bytes <= file_size per MPI process < 20480000 bytes
#define SAIO_FILE_TYPE_45 45  // 20480000 bytes <= file_size per MPI process < 30720000 bytes
#define SAIO_FILE_TYPE_46 46  // 30720000 bytes <= file_size per MPI process < 40960000 bytes
#define SAIO_FILE_TYPE_47 47  // 40960000 bytes <= file_size per MPI process < 51200000 bytes
#define SAIO_FILE_TYPE_48 48  // 51200000 bytes <= file_size per MPI process < 61440000 bytes
#define SAIO_FILE_TYPE_49 49  // 61440000 bytes <= file_size per MPI process < 71680000 bytes
#define SAIO_FILE_TYPE_50 50  // 71680000 bytes <= file_size per MPI process < 81920000 bytes
#define SAIO_FILE_TYPE_51 51  // 81920000 bytes <= file_size per MPI process < 92160000 bytes
#define SAIO_FILE_TYPE_52 52  // 92160000 bytes <= file_size per MPI process < 102400000 bytes
#define SAIO_FILE_TYPE_53 53  // 102400000 bytes <= file_size per MPI process < 204800000 bytes
#define SAIO_FILE_TYPE_54 54  // 204800000 bytes <= file_size per MPI process < 307200000 bytes
#define SAIO_FILE_TYPE_55 55  // 307200000 bytes <= file_size per MPI process < 409600000 bytes
#define SAIO_FILE_TYPE_56 56  // 409600000 bytes <= file_size per MPI process < 512000000 bytes
#define SAIO_FILE_TYPE_57 57  // 512000000 bytes <= file_size per MPI process < 614400000 bytes
#define SAIO_FILE_TYPE_58 58  // 614400000 bytes <= file_size per MPI process < 716800000 bytes
#define SAIO_FILE_TYPE_59 59  // 716800000 bytes <= file_size per MPI process < 819200000 bytes
#define SAIO_FILE_TYPE_60 60  // 819200000 bytes <= file_size per MPI process < 921600000 bytes
#define SAIO_FILE_TYPE_61 61  // 921600000 bytes <= file_size per MPI process < 1024000000 bytes
#define SAIO_FILE_TYPE_62 62  // 1024000000 bytes <= file_size per MPI process

LISTING A.3: SAIO File Type Definitions
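The intervals above follow a regular pattern: below 1024 bytes the buckets are 128 bytes wide (after an initial 256-byte bucket), and from 1024 bytes onward each scale of 1024·10^k bytes is split into nine buckets before the scale grows by a factor of ten. A hypothetical lookup routine that exploits this regularity is sketched below; it illustrates the bucketing scheme and is not the classification code used inside SAIO.

// Hypothetical helper: map a per-process file size in bytes to the
// SAIO file type index defined in Listing A.3.
static int saio_file_type_from_size(long long size)
{
    if (size < 0)
        return SAIO_FILE_TYPE_DEFAULT;              // unknown file size
    if (size < 1024)                                // types 1..7
        return (size < 256) ? 1 : 2 + (int)((size - 256) / 128);

    long long base = 1024;                          // lower bound of current scale
    int type = 8;                                   // first type of current scale
    for (int scale = 0; scale < 6; scale++) {       // nine buckets per scale
        if (size < 10 * base)
            return type + (int)(size / base) - 1;
        base *= 10;
        type += 9;
    }
    return SAIO_FILE_TYPE_62;                       // >= 1024000000 bytes
}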

#define _GNU_SOURCE
#include <dlfcn.h>
// Pointers of real PMPI common functions
int (*__real_PMPI_Init)(int *argc, char ***argv) = NULL;

// Wrappers of related MPI common functions
int MPI_Init(int *argc, char ***argv) {
    int ret;
    // real MPI function call
    __real_PMPI_Init = dlsym(RTLD_NEXT, "PMPI_Init");
    ret = __real_PMPI_Init(argc, argv);
    if (ret != MPI_SUCCESS) {
        fprintf(stderr, "PMPI function call with MPI error number: %d\n", ret);
        return ret;
    } else {
        ret = saio_set_mode(getenv("SAIO_MODE"), &saio_mode_int);
        if (saio_mode_int == SAIO_MODE_TRONLY
            || saio_mode_int == SAIO_MODE_OPTTR
            || saio_mode_int == SAIO_MODE_SIZEONLY) {
            ret = saio_trace_init();
        }
        if (saio_mode_int == SAIO_MODE_OPTONLY
            || saio_mode_int == SAIO_MODE_OPTTR
            || saio_mode_int == SAIO_MODE_HDF5) {
            ret = saio_opt_init();
        }
    }
    return ret;
}

LISTING A.4: Code Segment for MPI_Init() Wrapper
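The same interposition pattern, resolving the PMPI entry point with dlsym(RTLD_NEXT, ...) and calling it from a wrapper of the same name, applies to the MPI-IO routines that SAIO tunes. The sketch below shows the idea for MPI_File_open; saio_get_info() is a placeholder for the routine that would select the tuned MPI_Info object and is not an actual SAIO symbol.

#define _GNU_SOURCE
#include <dlfcn.h>
#include <mpi.h>

// Illustrative sketch only: intercept MPI_File_open and substitute tuned hints.
static int (*__real_PMPI_File_open)(MPI_Comm, const char *, int,
                                    MPI_Info, MPI_File *) = NULL;

int MPI_File_open(MPI_Comm comm, const char *filename, int amode,
                  MPI_Info info, MPI_File *fh)
{
    __real_PMPI_File_open = dlsym(RTLD_NEXT, "PMPI_File_open");
    // saio_get_info() is hypothetical: it stands for the routine that returns
    // the MPI_Info object selected by the optimizer for the current file type.
    MPI_Info tuned = saio_get_info(info);
    return __real_PMPI_File_open(comm, filename, amode, tuned, fh);
}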

export SAIO_PATH=PATH_OF_YOUR_SAIO_INSTALLATION
######## EITHER ONE OF THESE exports ########
# FOR MPI, parallel HDF5, parallel NetCDF APPLICATIONS in C CODE
export LD_PRELOAD=$SAIO_PATH/lib/libsaio.so

# FOR MPI APPLICATIONS in FORTRAN CODE
export LD_PRELOAD=$SAIO_PATH/lib/libpsaio.so

# FOR parallel HDF5, parallel NetCDF APPLICATIONS in FORTRAN CODE
export LD_PRELOAD=$SAIO_PATH/lib/libfhsaio.so
######## EITHER ONE OF THESE exports ########

######## EITHER ONE OF THESE exports ########
# GET DATA ACCESS SIZE FOR MPI, parallel HDF5, parallel NetCDF APPLICATIONS
# WITHOUT INTERCOMMUNICATION BETWEEN PROCESSES
export SAIO_MODE=size_only

# OPTIMIZING MPI APPLICATIONS
export SAIO_MODE=optimizing_only

# TRACING MPI APPLICATIONS
export SAIO_MODE=tracing_only

# OPTIMIZING AND TRACING MPI APPLICATIONS
export SAIO_MODE=optimizing_and_tracing

# OPTIMIZING parallel HDF5, parallel NetCDF APPLICATIONS
export SAIO_MODE=optimizing_hdf5

# OPTIMIZING COLLECTIVE I/O OPERATIONS
export SAIO_MODE=optimizing_coll
######## EITHER ONE OF THESE exports ########

export SAIO_REAL_TIME_OPT_FREQUENCY=1
export SAIO_CONFIG_DIR=$SAIO_PATH/configs/
export SAIO_LOG_DIR=$SAIO_PATH/logs/
######## FOR TRAINING UTILITY ########
export SAIO_TRAIN_DIR=$SAIO_PATH/resources/saio_train/
######## FOR TRAINING UTILITY ########
mpirun -n NUMBER_OF_MPI_PROCESSES ./YOUR_APPLICATION

LISTING A.5: Example of Shell Script for Using SAIO


Appendix B

Used SAIO Files

33554432,33554432,33554432,33554432,33554432,33554432,33554432,
33554432,33554432,33554432,33554432,20176,19,4,19,19,5800,11020,
5800,11020,5800,11020,5800,11020,6000,11400,6000,11400,6000,11400,
6000,11400,5800,11020,5800,11020,200,380,200,380,5800,11020,5800,
11020,5800,11020,5800,11020,5800,11020,5800,11020,200,380,200,380,
19,4,19,19,5800,11020,5800,11020,5800,11020,5800,11020,6000,11400,
6000,11400,6000,11400,6000,11400,5800,11020,5800,11020,200,380,
200,380,5800,11020,5800,11020,5800,11020,5800,11020,5800,11020,
5800,11020,200,380,200,380,19,4,19,19,5800,11020,5800,11020,5800,
11020,5800,11020,6000,11400,6000,11400,6000,11400,6000,11400,5800,
11020,5800,11020,200,380,200,380,5800,11020,5800,11020,5800,11020,
5800,11020,5800,11020,5800,11020,200,380,200,380,19,4,19,19,5800,
11020,5800,11020,5800,11020,5800,11020,6000,11400,6000,11400,6000,
11400,6000,11400,5800,11020,5800,11020,200,380,200,380,5800,11020,
5800,11020,5800,11020,5800,11020,5800,11020,5800,11020,200,380,
200,380,

LISTING B.1: SAIO Traced Data Transfer Size List of a WRF Online Tutorial Process

4,19,200,380,5800,6000,11020,11400,20176,33554432,

LISTING B.2: Data Transfer Size List of a WRF Online Tutorial Process for Training Utility
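Listing B.2 is the sorted list of distinct values occurring in the trace of Listing B.1. A small stand-alone sketch of this reduction step is given below; it is illustrative only (not SAIO code) and reads the traced list from standard input.

#include <stdio.h>
#include <stdlib.h>

// Compare two long long values for qsort.
static int cmp_ll(const void *a, const void *b)
{
    long long x = *(const long long *)a, y = *(const long long *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    static long long sizes[4096];
    long long v;
    int n = 0;
    while (n < 4096 && scanf("%lld,", &v) == 1)   // read the traced list
        sizes[n++] = v;
    qsort(sizes, n, sizeof(sizes[0]), cmp_ll);
    for (int i = 0; i < n; i++)                   // print each distinct value once
        if (i == 0 || sizes[i] != sizes[i - 1])
            printf("%lld,", sizes[i]);
    printf("\n");
    return 0;
}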

romio_cb_write:
    automatic, enable, disable
striping_factor:
    4, 8, 12, 16, 20
striping_unit:
    262144, 524288, 1048576, 2097152, 4194304, 8388608, 16777216

LISTING B.3: Configurations’ Searching Scope for a WRF Online Tutorial Process
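The training utility presumably enumerates the Cartesian product of these value lists (cf. SAIO_ERR_CONF_CARTESIAN_PRODUCT in Listing A.2), which here yields 3 × 5 × 7 = 105 candidate hint combinations per data transfer size. A minimal sketch of that enumeration follows; it is illustrative only and not the SAIO training code.

#include <stdio.h>

int main(void)
{
    // Value lists taken from Listing B.3.
    const char *cb_write[] = { "automatic", "enable", "disable" };
    const char *factor[]   = { "4", "8", "12", "16", "20" };
    const char *unit[]     = { "262144", "524288", "1048576", "2097152",
                               "4194304", "8388608", "16777216" };
    // Cartesian product: 3 * 5 * 7 = 105 candidate configurations.
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 5; j++)
            for (int k = 0; k < 7; k++)
                printf("{\"romio_cb_write\":\"%s\",\"striping_factor\":\"%s\","
                       "\"striping_unit\":\"%s\"}\n",
                       cb_write[i], factor[j], unit[k]);
    return 0;
}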

1 {"saio_file_type":0,"mpi_info":[{"romio_cb_write":"automatic"

},{"cb_nodes":"4"},{"striping_factor":"4"},{"striping_unit

":"8388608"}]}

Page 140: a light weighted semi-automatically i/o-tuning solution for engineering applications

114 Appendix B. Used SAIO Files

2 {"saio_file_type":1,"mpi_info":[{"romio_cb_write":"automatic"

},{"cb_nodes":"4"},{"striping_factor":"4"},{"striping_unit

":"8388608"}]}

3 {"saio_file_type":2,"mpi_info":[{"romio_cb_write":"automatic"

},{"cb_nodes":"12"},{"striping_factor":"12"},{"

striping_unit":"262144"}]}

4 {"saio_file_type":12,"mpi_info":[{"romio_cb_write":"enable"},

{"cb_nodes":"20"},{"striping_factor":"20"},{"striping_unit

":"262144"}]}

5 {"saio_file_type":17,"mpi_info":[{"romio_cb_write":"enable"},

{"cb_nodes":"20"},{"striping_factor":"20"},{"striping_unit

":"524288"}]}

6 {"saio_file_type":46,"mpi_info":[{"romio_cb_write":"disable"}

,{"cb_nodes":"16"},{"striping_factor":"16"},{"

striping_unit":"16777216"}]}

LISTING B.4: Generated SAIO Configuration File (for Writing) from Training the WRF Online Tutorial Process

