
FRIEDRICH-ALEXANDER-UNIVERSITÄT ERLANGEN-NÜRNBERG
TECHNISCHE FAKULTÄT • DEPARTMENT INFORMATIK

Lehrstuhl für Informatik 10 (Systemsimulation)

Generation of LLVM Compute Kernels for High Performance Simulations

Jan Hönig

Masterarbeit (Master's Thesis)


Supervisor: PD Dr. Köstler
Advisor: Martin Bauer, M.Sc.(hons)
Working period: 11.01.2017 – 11.07.2017


Erklärung (Declaration):

I affirm that I wrote this thesis without outside help and without the use of sources other than those indicated, and that the thesis has not been presented in the same or a similar form to any other examination authority and accepted by it as part of an examination. All passages adopted verbatim or in substance from other works are marked as such.

The University of Erlangen-Nürnberg, represented by the Lehrstuhl für Systemsimulation (Informatik 10), is granted a non-exclusive, free-of-charge right of use, unrestricted in time and place, to the results of this Master's thesis, including any industrial property rights and copyrights, for purposes of research and teaching.

Erlangen, July 11, 2017 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Jan Hönig


Abstract

The main part of a computer simulation is the kernel, which determines the progress of the simulation. Accordingly, the performance of such a kernel is crucial, and it is advisable to write the kernel such that its computational performance is high. This is, in any case, a challenging task. There are numerous ways to speed up a kernel: it can be optimized with parallelization, SIMD instructions, better arithmetic intensity, cache-aware algorithms, and many more techniques. However, many of these optimizations are closely tied to a specialized kernel, i.e. one built for a fixed set of parameters. A specialized kernel performs better, but only for certain cases. This leads to a high number of different implementations in order to cover all possible cases of the kernel. For some parameters the set of options is simply too large to cover. In order to obtain such a highly specialized kernel, it has to be tailor-made.

This thesis presents the generation of a kernel from a mathematical description in the newly developed framework called pystencils. The target language of the generation process is the LLVM intermediate representation. LLVM provides a sound compiler toolchain, including optimizations of the intermediate representation and backends for various computer architectures. On the way towards the LLVM code, the mathematical description undergoes multiple transformations. The central representation of the kernel is the abstract syntax tree, a structure for the representation of source code used in most compilers. This compilation is carried out during the run-time of a Python program, and the compiled kernel is used directly.

The second part of this thesis is an exemplary application of such a mathematical description. The phase field method is used to demonstrate the framework. The phase field method is a technique for simulating the processes of an interface during a phase transition. In this thesis two different phase field models are presented and generated. The performance of the generated kernels is subsequently compared with a Python implementation that uses NumPy as its basis.


Zusammenfassung

The main component of a simulation is the kernel, which essentially determines the course of a simulation. The performance of such a kernel is therefore decisive, and it is advisable to implement the kernel so that its computational performance is as high as possible. This is not an easy task, since there are numerous ways to improve and thereby speed up a kernel: it can, for example, be optimized through parallelization, SIMD instructions, better arithmetic intensity, or cache-aware algorithms. However, many optimizations are bound to specialized kernels, i.e. kernels with a fixed set of parameters. This leads to a large number of different implementations of what is essentially the same kernel, in order to cover the numerous possibilities. Some parameters cannot be covered at all, because they have far too many possible realizations. To obtain such a highly specialized kernel, it has to be adapted exactly to the case at hand.

This thesis presents the generation of a kernel from a mathematical description in the newly developed framework pystencils. The target language of the generation process is the LLVM intermediate representation. LLVM provides a well-founded compiler toolchain, with optimizations of the intermediate representation and a backend infrastructure supporting various computer architectures. On the way to the LLVM code, the mathematical description undergoes several transformations. The central representation of the kernel is the abstract syntax tree, a structure for representing source code that is used in many compilers. The translation from the mathematical description to the LLVM intermediate representation and then to machine code happens during the run-time of a Python program, and the compiled result is used directly in that program.

The second part of this thesis is an exemplary application of such a mathematical description. The phase field method is used for demonstration purposes. The phase field method is a technique for simulating processes at the interfaces during a phase transition. Two different phase field models are presented and generated in this thesis. The performance of the generated kernels is subsequently compared with an implementation written in Python that uses NumPy as its basis.


Contents

Abstract
Zusammenfassung
Contents
Acronyms
Nomenclature

1 Introduction
  1.1 Motivation
  1.2 Structure of this Thesis
  1.3 LLVM
    1.3.1 LLVM-IR
  1.4 Python
    1.4.1 SymPy
    1.4.2 llvmlite
  1.5 AST
  1.6 Phase field method
  1.7 Framework - pystencils

2 AST fabrication
  2.1 SymPy extensions
  2.2 AST composition
    2.2.1 Jacobi method
    2.2.2 Kernel definition
    2.2.3 AST generation

3 Generation of LLVM
  3.1 LLVM specific AST transformations
    3.1.1 Transferring away from SymPy nodes
    3.1.2 Type Conversion
  3.2 LLVM generation
    3.2.1 SymPy's Printing System
    3.2.2 Extension of llvmlite
    3.2.3 LLVM generation & JIT compilation
    3.2.4 LLVM optimizations
    3.2.5 Utilization of compiled functions

4 Application: Phase Field Method
  4.1 Examples of phase field models
    4.1.1 Spinodal decomposition of binary alloy
    4.1.2 Dendritic solidification
  4.2 Results
    4.2.1 Spinodal decomposition
    4.2.2 Dendritic solidification

5 Conclusion & Future Work
  5.1 Conclusion
  5.2 Future Work
    5.2.1 Extension of the supported scope
    5.2.2 Optimizations

Acknowledgements
Bibliography
List of Listings
List of Figures


Acronyms

AST Abstract Syntax Tree
BLAS Basic Linear Algebra Subprograms
CFG Control Flow Graph
CPU Central Processing Unit
DAG Directed Acyclic Graph
FDM Finite Difference Method
FFI Foreign Function Interface
FFT Fast Fourier Transform
GPU Graphics Processing Unit
HPC High Performance Computing
IR Intermediate Representation
JIT Just In Time
MRO Method Resolution Order
SIMD Single Instruction Multiple Data
SSA Static Single Assignment Form


Nomenclature

evtk Exports data to binary VTK files for visualization and data analysis.
LLVM A collection of modular and reusable compiler and toolchain technologies.
llvmlite A lightweight LLVM Python binding for writing JIT compilers.
NumPy The fundamental package for scientific computing with Python.
ParaView A multi-platform data analysis and visualization application.
Polly A high-level loop and data-locality optimizer and optimization infrastructure for LLVM.
pystencils A framework for the generation of stencil kernels in LLVM and C/C++.
SymPy A Python library for symbolic mathematics.


1 Introduction

1.1 Motivation

Computer simulation has accompanied the rapid growth of the computer itself since its beginning. It is a useful tool in many fields of science, since it gives tremendous insight into the model being simulated. Simulation is mainly used when an experiment on the real system is either impossible or impractical; further uses include validation and prototyping. For realistic continuous simulations it is desirable to have a model which is as detailed as possible, in order to reproduce the physical system faithfully. A lack of detail may result in an imprecise or completely wrong simulation.

For any large simulation, i.e. any simulation which is not on an atomic level, there is no upper bound for the resolution. The limiting factor is the memory and computational power of the available computers. Because of that, computer simulation is commonly tightly connected to High Performance Computing (HPC).

Since the lack of computational power is the limiting factor, it is highly necessary to program a well performing simulation. Slow code reduces the performance even further, resulting in an even lower resolution or in higher hardware requirements and time consumption. However, writing a fast simulation is not by any means simple. Since performance is the main objective, it is common to optimize1 the simulation in order to reduce the time spent simulating or to reduce the hardware requirements. In most simulations only a part of the code, called the kernel, is responsible for the majority of the execution time. Hence optimizing the kernel is essential for any large scale simulation.

The kernel usually implements a mathematical model, such as a physical equation describing the system which has to be simulated. Thus for every application a special kernel is needed in order to cover the particular physical equation. The kernel might also be specialized for some parameters, which enables further optimizations. However, this practice results in multiple kernels, each of which covers a different set of parameters. Although each kernel of the whole set is unique, their basis is nonetheless the same, resulting in massive code duplication. This complicates and slows down the whole development and, primarily, the maintainability.

Yet the optimization possibilities which are enabled exactly through the employment of more specific kernels are essential. In order to speed up the code, it is desirable to program even more specific kernels, which utilize the given parameters for further improvements. Such parameters might be ordinary, like constants or the size of an array, more complex, e.g. the discretization strategy, or even as extensive as an equation. However, it is impossible to write a specialized kernel for every combination of given parameters, and, as mentioned earlier, it slows down the development and worsens the maintainability of the program.

A possible solution is meta-programming, a programming technique in which a program has the ability to treat another program as its data. The program can read, analyze, transform or generate other programs. This technique has a very wide range of different subclasses. So instead of implementing each and every individual kernel, the kernels are generated. Of course, this means there has to be a program which can generate the code. Implementing such a tool means building a special compiler, which translates a source language into a target language. The source language is either an existing programming language or some kind of new descriptive language. The target language can again be an existing programming language like C/C++, or a low-level language like LLVM IR or assembly. The target language is then further processed by other compilers and tools, resulting in the desired machine code. Stencil compilers like Pochoir [1] and PATUS [2] have the same goal; however, neither of them compiles into LLVM, and they use a C-like source language.

The main focus of this thesis lies in automatic programming and program synthesis. The goal is to generate LLVM code from a mathematical description of a kernel in SymPy. This goal is demonstrated with the generation of phase field models. The generation process is implemented in a framework called pystencils.

1 Code optimization is usually just an improvement for given goals. Truly optimal code is unreachable.

1.2 Structure of this Thesis

This thesis is structured into 5 chapters. This chapter contains a brief introduction to LLVM, SymPy, llvmlite, the Abstract Syntax Tree (AST) and the phase field method, and shows the structure of pystencils, the framework of this thesis. After the introduction, a more detailed insight into the AST and the expression tree of SymPy is given, followed by the extensions needed for the generation of simulation kernels. In the third chapter, the special transformations towards LLVM, together with the generation of the LLVM code itself, are presented. The generation process is then completed with the compilation, loading and usage of the just created kernel. The fourth chapter focuses on the application, the phase field method. This method is briefly introduced, followed by the implementation and generation of two distinct phase field models. The application chapter is completed with a comparison of the generated kernels with a Python implementation.

1.3 LLVM

LLVM [3] itself is a collection of projects built around the LLVM core. The name LLVM was originally an acronym for Low Level Virtual Machine; however, because of the project's growth, the LLVM project changed its objective, and today LLVM is the official name of the project. LLVM is widely used in commercial, open-source and academic projects. The core itself provides a target-independent optimizer and machine code generator, which are built around the LLVM Intermediate Representation (IR). Furthermore, LLVM consists of projects like a C/C++ compiler (Clang), a debugger (LLDB), a C++ standard library, an OpenMP runtime and many more.

Altogether, LLVM provides reusable compiler and toolchain technologies. The use of LLVM allows a faster development of a high-level language. A compiler for a language is usually separated into a frontend1 and a backend, with the Intermediate Representation as their common data structure.

Figure 1.1: LLVM toolchain

Targeting the LLVM-IR as the frontend's Intermediate Representation for the high-level language allows the utilization of LLVM's optimizer and backend. This allows the development to focus on the frontend itself. Similarly, it is possible to extend the backend if a processor architecture is not yet supported. This might be the case for new CPU or GPU architectures. Figure 1.1 demonstrates this for exemplary frontends and backends.

In the case of this thesis, whose main objective is to generate LLVM-IR code from a mathematical description, the development of some kind of frontend is unavoidable. LLVM itself was selected because of the performance of its optimizations, which can compete with other compilers like GCC, because of its stable and low-level IR, which lends itself well to generation, and because of the overall good support for diverse backends for machine code generation.

1.3.1 LLVM-IR

When generating Intermediate Representation code, it is important to know the special features of the IR. First of all, the LLVM-IR has three different forms: it can be an in-memory compiler IR, an on-disk bitcode representation for efficient storage, and a human-readable assembly language representation. The IR itself is a low-level programming language, similar to assembly language. However, it is still abstract and therefore machine independent. Furthermore, each instruction is in Static Single Assignment Form (SSA). This means that every variable is assigned exactly once. In terms of an AST, which will be explained later in detail, it means that the definition of a variable dominates all its uses. Limiting the instructions to SSA enables many of the optimizations done by LLVM.

1 A compiler's frontend, as described here, is composed of multiple steps and other IRs, for example the AST, which will be discussed later.

%0 = add i32 1, 1
%1 = add i32 1, %0

Listing 1.1: add-Instruction

The example in Listing 1.1 shows a simple add instruction, which sums up two operands. It also defines the type of the operation to be a 32-bit integer. Notice that another assignment to the variable "%0" would be syntactically correct, but it would not be well formed, because it contradicts the SSA.

In order to allow control flow, an SSA-based language requires a special instruction, called Φ. In the case of a simple if-then-else statement, a Φ instruction is essential at the merge point of the two branches, i.e. after the if-then-else statement.

Figure 1.2: Conditional. (a) Conditional flow diagram; (b) flow diagram with LLVM-IR

Figure 1.2a shows the conditional flow diagram for an if-then-else statement, which is valid in many languages. The Φ instruction (or node) in LLVM makes sure that the variable "i" has the right value after the if-then-else statement. The Φ node selects a value depending on which predecessor of the current block was executed. A block represents a section of code which has only one entry (at the beginning) and one exit (at the end). Afterwards the selected value is stored into the register or variable "%i.0". However, such an instruction is not available on any real computer architecture, and thus the Φ instruction is completely virtual. It is the task of the backend to remove the Φ nodes as it transforms the code from the SSA form of the LLVM-IR into machine code, removing the SSA on the way.
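As an illustration, the merge point of Figure 1.2 could look roughly as follows in the textual LLVM-IR. This is a hand-written sketch in the spirit of the figure (the block labels and values are invented), not output of the framework:

entry:
  br i1 %cond, label %then, label %else
then:                                         ; i = 1 on this path
  br label %merge
else:                                         ; i = 2 on this path
  br label %merge
merge:                                        ; value chosen by predecessor block
  %i.0 = phi i32 [ 1, %then ], [ 2, %else ]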

This was only a very brief introduction to the IR of LLVM. For further information, please refer to the LLVM Language Reference Manual [4].

1.4 Python

The whole framework is implemented in Python. Python is the language of choice mainly because many modules (libraries) which are needed for the generation of LLVM code are available for it. C/C++ and other languages would be suitable as well; however, an implementation in Python tends to be faster to write, which allows quick prototyping and changing of the whole framework if necessary.

The main modules used are SymPy and llvmlite. SymPy is a library for symbolic mathematics, and thus it is very suitable for the mathematical description of simulation kernels. Llvmlite is a library which supports LLVM-IR generation and features Python bindings for the important functions in the LLVM core.

1.4.1 SymPy

As stated above, SymPy is a library for symbolic mathematics, completely written in Python. Its goal is to become a full-featured computer algebra system. It is an open source library under the BSD license and thus completely free. SymPy is also lightweight and supports interactive usage. There are many computer algebra systems available; however SymPy, since it is completely written in Python and is open source, can easily be extended for the needs of the framework. Because of this extensibility, SymPy was chosen as the mathematical description language for the framework.

The most important feature of SymPy is the expression tree. The mathematical description of the simulation kernel is composed of one or more expressions. SymPy's expressions allow most of the desired kernels to be expressed; SymPy will be extended for greater expressive capabilities. Figure 1.3 shows the expression 2x + x in its expression tree form. The expression tree closely resembles an AST. This is a great advantage, since it allows the framework to embed SymPy's expression tree into the AST. In fact it will represent the majority of the AST, until it is replaced.

Figure 1.3: SymPy expression

1.4.2 llvmlite

Since the whole framework is written in Python, and the focus lies on LLVM generation, a library for this particular generation is necessary. Llvmlite is also an open source library under the BSD license. It is mainly used by the Numba project, a Just In Time (JIT) compiler [5]. Llvmlite is the successor of another Python library, llvmpy. The big difference is that llvmlite's generation of LLVM-IR is implemented purely in Python.


Llvmpy had a huge implementation overhead from wrapping and exposing the necessary LLVM functions, which are implemented in C++, to Python. Llvmlite only makes a few functions available, like the optimization of LLVM-IR and the generation of machine code.

Having the majority reimplemented in Python has its downsides. First of all, the whole code is slower, which might cause problems in the future if the simulation kernel has to be generated during the simulation run, similar to a JIT execution. Second, not all features of LLVM are implemented; for example, the vector type is not supported. However, since the library is written in Python, it can be extended, as in the case of SymPy. Yet some extensions would be too extensive to implement, since they would require writing or rewriting large parts of the library. So although it is possible to build new functions and modules with llvmlite, editing an existing one is not feasible. The generation of code and the JIT behavior do not need such features.

Llvmlite is split into two main modules. The "IR layer" module is responsible for the generation of LLVM code; this is the part which is completely rewritten in Python. The "binding" module is in turn responsible for the optimization passes and compilation of LLVM. It also provides JIT capabilities. With llvmlite the generated code can be directly executed from Python.
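The interplay of the two layers can be seen in the following self-contained sketch, which closely follows the canonical usage pattern from llvmlite's documentation (it is not code from pystencils): the IR layer builds a function adding two doubles, and the binding layer JIT-compiles and executes it.

import llvmlite.ir as ir
import llvmlite.binding as llvm
from ctypes import CFUNCTYPE, c_double

# IR layer: build "double add(double, double)"
module = ir.Module(name="example")
fnty = ir.FunctionType(ir.DoubleType(), [ir.DoubleType(), ir.DoubleType()])
fn = ir.Function(module, fnty, name="add")
builder = ir.IRBuilder(fn.append_basic_block(name="entry"))
builder.ret(builder.fadd(fn.args[0], fn.args[1]))

# binding layer: parse, compile and execute the generated IR
llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()
target_machine = llvm.Target.from_default_triple().create_target_machine()
engine = llvm.create_mcjit_compiler(llvm.parse_assembly(str(module)), target_machine)
engine.finalize_object()

add = CFUNCTYPE(c_double, c_double, c_double)(engine.get_function_address("add"))
print(add(1.5, 2.5))   # 4.0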

1.5 AST

An Abstract Syntax Tree is a representation of the abstract syntactic structure of source code. The syntax is "abstract" because not every detail appears in the representation. A program undergoes multiple analysis phases before it is compiled into an AST. For a good comprehension of the necessity and value of the AST, it is important to know which steps inside a compiler are needed in order to create the AST and, subsequently, the IR and machine code.

First, the lexical analysis, called the lexer, traverses the program and creates a sequence of tokens. The tokens represent different statements, like a for-loop, an expression, et cetera. Variables and functions are also transformed into tokens, with an additional link to further information about the particular identifier. Comments and unneeded white space are ignored. The whole lexical analysis is not absolutely necessary; however, it makes the next step easier. In the theory of computation the lexer is a deterministic finite automaton. Therefore, implementing a lexer means defining a deterministic finite automaton.

After the lexical analysis, the sequence of tokens is parsed. This is called the syntax analysis. The parser creates a parse tree from the token sequence; this is not the AST yet. Usually the parse tree is defined by a grammar, which in turn is defined by the language itself. The grammar type is usually classified by the Chomsky hierarchy, and it largely determines the kind of parser required, i.e. its complexity. In general there are two types of parsing: top-down and bottom-up parsing. The parse tree contains nodes without semantic value, like punctuation and delimiters. In the AST such nodes are deleted, since their semantic value is represented by the tree itself. More modern parsers scan the token sequence, build the parse tree only implicitly during execution, and create the AST directly.

The last phase is the semantic analysis. It checks the program for semantic errors, concerning for example types, variable definitions and uses, and function definitions and calls. The semantic analysis also edits the AST and enhances it with further information such as annotations and properties. After this last step of analysis, the AST is completed and can be used onward.

The AST is another Intermediate Representation of a program. There are multiple transformations which can or have to be applied on the AST, and after every transformation an edited AST is produced. There are two different types: either the transformation is needed for the generation of IR code, so it is a necessary step towards the target, or the transformation is an optimization which is easier or more efficient to apply on the AST than on the IR.

The research around lexical and syntax analysis, i.e. lexers and parsers, has a long history. Thus for both there are tools which can generate the needed programs. A parser is mainly defined by the grammar of the language; a lexer can be defined with special functions or regular expressions. The most established generators of lexer and parser programs are lex and yacc [6, 7], which have many different implementations. Only the last analysis step, the semantic analysis, has no such tools. Usually the AST transformations are very specific to the given language, its grammar and its usage.

After this prearrangement the Intermediate Representation is generated1. As already mentioned, the IR code undergoes further optimizations. Afterwards the machine code is created, which can then be optimized as well. The last step of the compilation process is the linking of one or multiple compiled files into an executable.

1 Some compilers with memory limitations skip this step and generate machine code directly.
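To make the notion of an AST concrete, Python's built-in ast module can display such a tree for a small assignment. This is purely illustrative and independent of pystencils:

import ast

tree = ast.parse("y = 2*x + x")
print(ast.dump(tree, indent=2))   # nested Module/Assign/BinOp/Name nodes (indent needs Python 3.9+)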

1.6 Phase field method

The phase field method is a mathematical model for the simulation of problems with an interface, such as a phase transformation. It contrasts with the sharp interface model. The sharp interface model tracks the free boundaries between the different phases. This means that the interface has to be tracked throughout the whole simulation; it can be described as an evolving surface. The interface changes over time, which makes it particularly difficult to track computationally. However, tracking the interface is the most important task, since this is the region of interest and the place where the boundary conditions have to be applied.

The phase field method was introduced to replace the sharp interface model, in order to overcome this difficulty. Its diffuse interface is described by a field variable φ, also called the order parameter1. The phase field φ takes two distinct values, e.g. 0 in the solid and 1 in the liquid phase.

Figure 1.4: Interface Comparison. (a) Sharp interface; (b) diffuse interface. Both panels plot the order parameter against the distance.

Figure 1.4 compares the two approaches. In Figure 1.4a the sharp interface is located at the discontinuous jump between the solid and the liquid state. The interface in Figure 1.4b is drawn as a blue region. φ is a continuous variable and cannot have discontinuous jumps; the region is called the transition layer, and the exact surface is therefore blurred. The phase field is governed by a partial differential equation. Chapter 4 gives a more detailed overview of the phase field method.

The phase field method is a good example for the generation of simulation kernels, because each particular application comes with a different partial differential equation which needs to be simulated.

1 Not to be confused with the Φ instruction in LLVM from the previous section. The φ is a variable from the physics domain; the use of the same Greek letter for the instruction and the variable is coincidental.

1.7 Framework - pystencils

Pystencils is the name of the framework which was used, expanded and improved for this thesis. It is developed by the Chair for System Simulation at the University of Erlangen-Nuremberg. Its purpose is to translate the mathematical description from SymPy into a target language; currently C/C++, CUDA and LLVM are available. For development and debugging purposes it is possible to output the tree representation, which makes the graph description language DOT a further backend. This makes pystencils a domain-specific compiler with a main focus on stencil code.

Figure 1.5: pystencils toolchain

Figure 1.5 shows the toolchain of pystencils. The internal processes are explained in Chapter 2 and Chapter 3.


2 AST fabrication

This chapter describes the transition from a SymPy expression into a full AST. The extensions needed along the way are described as well.

2.1 SymPy extensions

As already mentioned, SymPy stores the mathematical description in a tree data structure1. This is taken advantage of: SymPy's expression tree is the basis of the whole AST. However, for the purposes of generating simulation kernels, the descriptive features of SymPy are not complete. Being a symbolic mathematics library, symbols are the core of SymPy. A symbol is interpreted as a scalar variable. It may also be a constant, or a parameter which is substituted later in the equation and is kept in symbol form for better readability.

Yet for the generation of code it is essential to know which type the symbols are going to have. This is necessary for the generation of both C/C++ and LLVM: the data type has to be known for every variable and operation, i.e. boolean, integer, double, pointer, etc. SymPy has types for numbers; it can differentiate between floats, integers, rationals and a few special types. However, the Symbol class does not carry any type. For this reason TypedSymbol extends the SymPy class structure. The main difference is that a TypedSymbol contains another object, the Type, as shown in Figure 2.1.

Figure 2.1: Extensions of SymPy - Type System

1 Although the data structure looks like a tree, it is in reality a Directed Acyclic Graph (DAG), which makes it more efficient.


The class Type is an abstract class. The valid classes for a type are BasicType in the case of a scalar, PointerType in the case of a pointer, and StructType in the case of a structure of types. The type system has further features: it can create types from strings, from SymPy types or from NumPy types, and it can transform a type into a C/C++ type and, most importantly, into an llvmlite type. The type system can also order types, which is important in the LLVM specific AST transformation for type casting, promotion and conversion, discussed in Chapter 3.

Although it was emphasized in the introduction that extending a Python module is not a hard task, with SymPy it is not trivial. SymPy's internal code is rather complex, especially in contrast to the usual Python programming style; it is complex because of optimizations for speed and memory consumption.

The next thing needed for the description of a simulation kernel is an array class, or, to be more precise, a class which knows about dimensions, memory layout and optionally the size of the array. Furthermore, array indexing has to be possible for modelling stencil-like kernels. Although SymPy has array, tensor and indexed classes, they would still need to be extended. Since the extension of an existing class is not possible and inheritance from those special classes would cause more maintenance problems, a new class was introduced. The Field class has, like the TypedSymbol, its own type. It has a shape, which contains spatial and index dimensions. It knows about the memory layout, i.e. the strides, and it can have a fixed size.

The most important part of the Field is its nested class Access. The class Access is yet again a Symbol, since it represents a scalar value of the Field, as shown in Figure 2.2. However, it is not a C-like access: an Access object does not represent a particular element, but stands for an offset in the spatial and index dimensions. This allows the definition of stencil-like kernels.

Figure 2.2: Extensions of SymPy - Field and Access
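As a small illustration of these extensions, a typed symbol and a field access might be created as follows. The import paths and the TypedSymbol signature are assumptions based on the class names used in this thesis and may differ between pystencils versions; Field.createFromNumpyArray is taken from Listing 2.1:

import numpy as np
from pystencils import Field, TypedSymbol

x = TypedSymbol("x", "double")                     # a SymPy Symbol that carries a Type
f = Field.createFromNumpyArray("f", np.zeros((10, 10)))
center, east = f[0, 0], f[0, 1]                    # Access objects: offsets, not concrete elements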


2.2 AST composition

With the extended SymPy tree structure, the next task is to create and complete a full AST, which is then used throughout the whole code generation. As explained in Section 1.5, the AST is a tree which represents the semantics of a program. The AST of pystencils consists of objects from SymPy and of its own objects.

Figure 2.3: Class diagram of AST

The class diagram in Figure 2.3 shows classes which can be part of the AST. For the sake of simplicity, most of the classes are not shown in the diagram. The Node class is an abstract class from which all of the AST nodes inherit. Another abstract class, the Expr class, is a super class of different expressions like addition or multiplication. The objects of Expr subclasses will be used in the LLVM specific AST transformations and will substitute the SymPy objects.

In the upcoming sections the AST will be composed. An example kernel accompanies the composition of the AST for a better understanding of the process. The weighted Jacobi method serves as a suitable example, which is easy enough to follow and yet goes through all the necessary steps.

2.2.1 Jacobi method

The Jacobi method is an iterative algorithm for obtaining the solution of a diagonally dominant system of linear equations:

\[
Ax = b \tag{2.1}
\]


where

\[
A = \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{nn}
\end{pmatrix}, \quad
x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}, \quad
b = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix} \tag{2.2}
\]

The matrix A can be decomposed into a diagonal component D and the remainder R:

\[
A = D + R, \quad \text{where} \quad
D = \begin{pmatrix}
a_{11} & 0 & \cdots & 0 \\
0 & a_{22} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & a_{nn}
\end{pmatrix}
\quad \text{and} \quad
R = \begin{pmatrix}
0 & a_{12} & \cdots & a_{1n} \\
a_{21} & 0 & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \cdots & 0
\end{pmatrix} \tag{2.3}
\]

One iterative step is then obtained with:

\[
x^{(k+1)} = D^{-1}\left(b - R x^{(k)}\right) \tag{2.4}
\]

where x^{(k)} is the kth approximation of x. This step is repeated either a predefined number of times, or until the residual r^{(k)} = b - A x^{(k)} is small enough.

The weighted Jacobi method uses an additional parameter ω, which relaxes the iteration:

\[
x^{(k+1)} = \omega D^{-1}\left(b - R x^{(k)}\right) + (1 - \omega)\, x^{(k)} \tag{2.5}
\]

The Laplace’s equation ∇2u = 0 will serve as an example, which is going to be solvedby the iterative algorithm. The equation is derivated with Finite Difference Method(FDM) on an Cartesian grid. The grid’s spacing is ∆x = ∆y = h = 1. The second-ordercentral difference for the space derivative at position ui in the x direction is defined as:[

∂2u

∂x2

]i

= ui+1 − 2ui + ui−1

(∆x)2 (2.6)

For a two-dimensional grid with ∆x = ∆y = h, the Laplace operator ∇² at point i, j becomes:

\[
[\nabla^2 u]_{i,j}
= \left[\frac{\partial^2 u}{\partial x^2}\right]_{i,j} + \left[\frac{\partial^2 u}{\partial y^2}\right]_{i,j}
= \frac{u_{i+1,j} - 2u_{i,j} + u_{i-1,j}}{(\Delta x)^2} + \frac{u_{i,j+1} - 2u_{i,j} + u_{i,j-1}}{(\Delta y)^2}
= \frac{u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1} - 4u_{i,j}}{h^2} \tag{2.7}
\]

which is the discretized Laplace’s equation.


In matrix representation this equation in FDM becomes:

\[
\underbrace{\frac{1}{h^2}
\begin{pmatrix}
-4 & 1      &        & 1      &        &        \\
 1 & -4     & 1      &        & 1      &        \\
   & \ddots & \ddots & \ddots &        & \ddots \\
   &        &        &        & 1      & -4
\end{pmatrix}}_{A}
\underbrace{\begin{pmatrix} u_{1,1} \\ u_{1,2} \\ \vdots \\ u_{2,1} \\ \vdots \\ u_{n,n} \end{pmatrix}}_{x}
= b \tag{2.8}
\]

This matrix is diagonally dominant and thus complies with the condition of the Jacobi method. Applying the Jacobi method, the equation can be rewritten as

\[
u_{i,j}^{(k+1)} = \frac{1}{4}\left(u_{i+1,j}^{(k)} + u_{i-1,j}^{(k)} + u_{i,j+1}^{(k)} + u_{i,j-1}^{(k)}\right) \tag{2.9}
\]

which in turn can be transformed into the weighted Jacobi method:

\[
u_{i,j}^{(k+1)} = \frac{\omega}{4}\left(u_{i+1,j}^{(k)} + u_{i-1,j}^{(k)} + u_{i,j+1}^{(k)} + u_{i,j-1}^{(k)}\right) + (1-\omega)\, u_{i,j}^{(k)} \tag{2.10}
\]

Equations (2.9) and (2.10) can be expressed as a stencil code. Stencil codes are a special class of iterative kernels: the kernel updates its array elements according to some fixed pattern. This pattern is called a stencil, in this case a 5-point stencil, as shown in Figure 2.4. In the case of the Jacobi method, the implementation needs two arrays, a destination and a source array. The red point is therefore a write access to the destination array and, in the case of the weighted Jacobi method, also a read access from the source array. The four blue points are read accesses from the source array. This is consistent with the kth iteration superscript in Equations (2.9) and (2.10).

Figure 2.4: 5-point stencil on a grid

However, the stencil does not cover the boundary, i.e. the grid points at each edge. Thus a boundary condition has to be introduced. For the sake of simplicity this example uses Dirichlet boundary conditions:


\[
\begin{aligned}
\text{North:} \quad & u_{n,j} = 1, \quad j = 1, \dots, n \\
\text{West:}  \quad & u_{i,1} = 1, \quad i = 1, \dots, n \\
\text{East:}  \quad & u_{i,n} = 0, \quad i = 1, \dots, n \\
\text{South:} \quad & u_{1,j} = 0, \quad j = 1, \dots, n
\end{aligned} \tag{2.11}
\]

Using a Dirichlet boundary means that the values at the boundary remain constant; thus it is only necessary to set them during the initialization of the arrays.

In Equation (2.10) the offsets are the stencil-defining statements. They are represented with the Field class, or, to be more precise, with the Access class, whose objects are created upon a __getitem__ or __call__ function call. In those calls only the offset is of interest, not the variable i, which is automatically iterated over; the size of the array is supplied either at compile-time or at run-time.
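For later comparison it is instructive to see how one sweep of Equation (2.10), with the Dirichlet boundary fixed at initialization, looks in plain NumPy. This is a sketch of a reference implementation, not part of pystencils:

import numpy as np

def weighted_jacobi_sweep(src, dst, omega):
    # update all interior points with the 5-point stencil of Equation (2.10);
    # the boundary values stay untouched (Dirichlet)
    dst[1:-1, 1:-1] = (omega / 4.0) * (src[2:, 1:-1] + src[:-2, 1:-1] +
                                       src[1:-1, 2:] + src[1:-1, :-2]) \
                      + (1.0 - omega) * src[1:-1, 1:-1]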

2.2.2 Kernel definition

With the mathematical description of the weighted Jacobi method finished, the kernel itself can be written with the framework1. As shown in Listing 2.1, first the arrays which will later be used in the simulation itself are created, followed by the creation of the Field objects for the kernel description.

srcArr = np.zeros((30, 20))
srcArr[0, :] = 1.0
srcArr[:, 0] = 1.0
dstArr = srcArr.copy()

srcField = Field.createFromNumpyArray('src', srcArr)
dstField = Field.createFromNumpyArray('dst', dstArr)

Listing 2.1: Creation of simulation arrays and of Field objects

Note that NumPy begins its indexing at 0 and not at 1, as was assumed in the mathematical description. During this construction the arrays are also initialized: their north and west edges are set to 1, everywhere else they are 0. The array size is 30 × 20, which is a very small simulation domain, but it is suitable for this example.

The Field objects are created directly from the information given by the NumPy arrays; thus the data type and the size are copied from the NumPy objects. With a fixed size, LLVM's backend might be able to achieve some optimization with loop transformations.

1 In the following code, some uninteresting segments like the import of modules and their initialization are omitted.


omega = sp.symbols("omega")
updateRule = sp.Eq(dstField[0, 0],
                   omega * (srcField[0, 1] + srcField[0, -1] +
                            srcField[1, 0] + srcField[-1, 0]) / 4
                   + (1. - omega) * srcField[0, 0])

Listing 2.2: Definition of the weighted Jacobi kernel

In Listing 2.2 the whole kernel is implemented. First the symbol ω is created, which is a normal SymPy Symbol. With no further information, untyped symbols will later be transformed into TypedSymbols of type double.

Afterwards the kernel can be reviewed, as in Equation (2.12). The offsets [−1, 0, 1] have been transformed into a more convenient notation, with the abbreviations E, N, S and W for east, north, south and west.

\[
dst_C = src_C \, (1.0 - \omega) + \frac{\omega}{4}\left(src_E + src_N + src_S + src_W\right) \tag{2.12}
\]

A more interesting structure for the generation is the expression tree that has now been created. In Figure 2.5 the whole expression is represented as an expression tree. The order of the mathematical operations is represented by the tree itself.

Figure 2.5: Expression tree of the weighted Jacobi kernel

The children of a commutative node, like the Add class, are completely interchangeable; they are usually stored inside one list. On the other hand, nodes which require an ordering, or have to differentiate between their children, like the Equality class, have to store each child node separately.


SymPy’s expression tree will be the basis of the AST. All needed information is alreadypresent.

2.2.3 AST generation

The AST is built around the expression tree which was produced from the kernel definition. From the user's perspective this is just one function call to createKernel, which handles the generation of the AST. This is shown in Listing 2.3.

ast = createKernel([updateRule])

Listing 2.3: Creation of a kernel from SymPy’s expression

The createKernel function consists of multiple steps. This section explains the generic steps; the LLVM specific steps are examined in the next chapter. The createKernel function can be adjusted with additional parameters: the general symbol type can be changed, and it is also possible to pass an extensive definition of the array traversal, including the number of ghost layer cells.

First of all, the untyped symbols of the equation are typed properly, and the SymPy Equation objects are transformed into SympyAssignment objects. The Equation class is a pure mathematical equation with a left and a right hand side. However, for the generation of any programming language it has to be clear what the left hand side represents: it might be a field access, a scalar assignment or a constant, and it might involve casting.

After the transition to SympyAssignment, createKernel continues with the packing of the expressions inside a Block. This is never directly visible in the graph. The Block object holds a list of expressions and makes a simpler AST structure possible.

The next step is the creation of the loops, LoopOverCoordinate, which make traversing the arrays possible, and of a function, KernelFunction. Figure 2.6 reveals this transformation. With a fixed-size array, the range of those loops is already known; arrays of variable size will later contribute size and stride parameters to the function's definition. Unless otherwise specified, a ghost layer is automatically introduced. Although the node of the KernelFunction is already present, the required information is not yet complete. The graph in Figure 2.6 is also no longer generated by SymPy's dotprint function, which is why the names of the nodes differ from the previous graphs; pystencils provides its own dotprint function, which is just another backend of the framework.

After the loop and function nodes are in place, the next step is to resolve the field accesses. It is necessary to replace the Access objects with array indexing. With the introduction of the LoopOverCoordinate, the AST now has an individual coordinate variable for each dimension, which can be referenced. This means that the Access objects, which only state the offsets within the array, can now be replaced with a real index depending on the variables which are looped over. Each access is replaced with the Indexed class, a SymPy class for the representation of a matrix element.


Figure 2.6: Extended AST with loops and function

In addition to this transformation, an optimization is performed. The pointer arithmetic at this step can easily be optimized by pulling the computation out of the first loop. This creates a set of pointers for every row accessed in the kernel, i.e. array accesses with a vertical offset1. In this example the row without offset is accessed three times, as shown in Figure 2.7 (the detailed nodes below the SympyAssignment level were omitted for display purposes). The offset within the row for each inner iteration is then −1, 0, 1. Thus a pointer is created which points to the beginning of the row, and in the inner loop this pointer is indexed with the inner loop variable and, if necessary, the offset.

Since the stencil of the weighted Jacobi method also uses the row above and the row below, two additional pointers are computed, which point at the beginnings of those rows. The two additional pointers are needed because the arrays can have additional memory at the end of a line. This optimization would not be strictly necessary; however, since the pointer arithmetic is done anyway, it is natural to include it.

The last step is yet again an optimization step. The pointers introduced in the inner loop form loop-invariant code2, which can therefore be moved outside the inner loop.

1 The extra pointers could also be created for every column, i.e. array accesses with a horizontal offset. This depends on the ordering of the multidimensional array.


Figure 2.7: Resolve field access

Figure 2.8: Move constants before loop

In Figure 2.8 the loop-invariant statements of the inner loop are moved into the outer loop. They cannot be moved further, since the statements depend on the loop control variable, in this case ctr_0.

Figure 2.8 shows the final general AST. This means that the AST contains everything necessary for generation. However, some backends, like the LLVM backend, need further transformations on the AST. Editing the AST is the simplest way of preparing the program data for generation.

2 Loop-invariant code consists of statements which do not change the semantics of a program if moved outside the loop. Not to be confused with the loop invariant as a noun, which is a property of a loop that is true before and after each iteration.


3 Generation of LLVM

For the generation of LLVM, further transformations are needed, namely type checking and casting. Being a low-level language, LLVM needs explicit type conversions. Whereas C/C++ can automatically cast variables and do pointer arithmetic, i.e. array accesses, in LLVM this has to be expressed directly in the code. Furthermore, LLVM needs specific types for arithmetic instructions, since floating point and integer instructions differ. This requires the Add and Mul nodes to have a type as well.

In this chapter the remaining transformations of the AST necessary for the LLVM generation are explained, followed by the generation of LLVM itself.

3.1 LLVM specific AST transformations

3.1.1 Transferring away from SymPy nodes

As stated earlier, SymPy's classes cannot be extended easily; however, it is necessary to annotate the individual nodes with further information. Furthermore, so far no traditional transformation of the AST has been done: until now only substitutions were used in order to alter the AST. Recursion was used as well, although it is especially challenging with SymPy because of the DAG structure of SymPy's expression tree. A traditional compiler uses the visitor pattern or recursion on an AST. For these reasons the information in the SymPy objects is transferred into much simpler, yet extensible objects. They lose all the functionality which the SymPy objects provided, since it is not needed anymore. The only thing that remains after this transformation is the data structure of the AST, i.e. the objects pointed to by the individual nodes are replaced. The AST is now a true tree. Figure 3.1 shows that no significant change was done to the AST itself. However, the previous graphs hid an advantage of SymPy: SymPy's expression tree was a DAG, whereas in the current representation the graph is a tree. It is not displayed as such for reasons of space and because the names of the different objects are identical1.

1 The objects could be differentiated from each other, e.g. by their memory locations; however, this would make the large graph even more unreadable.

3.1.2 Type Conversion

With simpler, extensible AST nodes it is now easier to add a type to every node and to adjust the AST itself. The pystencils framework follows C-style type conversion; it allows only implicit conversions, explicit ones are not yet supported.


Figure 3.1: DeSymPyied AST

The number of supported types is limited, and thus the set of types is smaller compared to a C-like language. There are the Integer, Float and Rational types, which form the BasicType; they also serve as the base type of the pointer class PointerType. The full type system was explained in Section 2.1 and shown in Figure 2.1.

The AST is traversed in a bottom-up manner. The leaves already have their type information because of the previous AST transformation: leaves are either of class TypedSymbol or of class Number, and all of them have fixed types. The interesting step happens at objects of the class Expr and its subclasses and of the class SympyAssignment; on those objects a type conversion might occur.

Figure 3.2: Conversion inside AST


A node whose children have different types inserts a conversion around each child which needs one. The conversion node therefore has its own type, i.e. the target type. Figure 3.2 shows a sample conversion inside a multiplication: because ω has the double type, the originally integer number −1 has to become a double as well, so a conversion of the number to the double type is inserted.
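The following self-contained sketch shows the idea of such a bottom-up pass. All class and type names here are hypothetical stand-ins, not the actual pystencils implementation:

from functools import reduce

class Node:
    def __init__(self, children=(), dtype=None):
        self.children, self.dtype = list(children), dtype

class Conversion(Node):                      # explicit cast node; dtype is the target type
    def __init__(self, child, dtype):
        super().__init__([child], dtype)

def common_type(a, b):                       # the "wider" type wins, as in C
    return "double" if "double" in (a, b) else "int"

def insert_casts(node):
    if not node.children:                    # leaf: TypedSymbol or Number, type is fixed
        return node.dtype
    child_types = [insert_casts(c) for c in node.children]
    target = reduce(common_type, child_types)
    for i, t in enumerate(child_types):
        if t != target:                      # wrap a mismatching child in a cast
            node.children[i] = Conversion(node.children[i], target)
    node.dtype = target
    return target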

Figure 3.3: Pointer arithmetic inside AST

Another kind of conversion is introduced by pointer arithmetic, since pointer arithmetic is a constituent of the AST creation. As in C, pointer arithmetic occurs only between one pointer and at least one integer; other types result in an error. In Figure 3.3 the pointer arithmetic is displayed: the integer part is inside the square brackets, whereas the pointer part is inside the parentheses.

The type conversion and the pointer arithmetic were the last steps before the generation itself. Now the AST is fully prepared for a straightforward LLVM generation.

3.2 LLVM generation

In the following, the final steps for the generation of LLVM are discussed. First the printing system is introduced, followed by the llvmlite extensions. Afterwards the compilation of the just generated code, its optimization and its usage are discussed. Since all those steps are done during the run-time of one Python program, pystencils can also be described as a JIT compiler.

3.2.1 SymPy's Printing System

The backend of pystencils closely resembles the printing system of SymPy; in fact, the base class of the individual backends is the generic printer class of SymPy. The printer1 receives the expression or AST as input data. The information of the given data is printed and, if necessary, the tree is traversed further. The concept of SymPy's printing system includes three steps:

1. Let the object print itself, if it knows how

2. Use the best fitting method defined in the printer

3. Fall-back to the emptyPrinter of the printer

Letting the object print itself is unnecessarily complicated. It would blow up the code of the AST, which was mainly designed as a data structure, so any additional functionality in the nodes is unwanted. In addition, the backends still use SymPy's classes inside the AST, and since extending SymPy's classes is not feasible, this option is ruled out. Therefore the second step of the printer is the one used: every printer has a printing function for every class it needs to print. With this approach all the code regarding a particular backend is in one place, providing a good overview and also a good separation of the individual printers. The last option, the fall-back function, is used only for raising an error: if a printer ever reaches the emptyPrinter, it means that a printing function is missing and thus the correct behavior of the printer (and of the generated code) cannot be assured.

The second step is even more powerful, since it follows Python's Method Resolution Order (MRO). The MRO defines the search path used to select the right method in the case of multiple inheritance. Thus not every class necessarily has to have its own printing function; if possible, it is sufficient to have a printing function for a parent class, which then covers the subclasses as well. Figure 3.4 shows the whole class diagram of the printing system.

Figure 3.4: Class diagram of the printing system

1 There are multiple printers: SymPy itself provides over twenty different printer classes for various backends. Pystencils accordingly has four printer classes, for C/C++, CUDA, LLVM and DOT.
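A minimal printer in this style might look as follows. The class and its printing functions are hypothetical, but the dispatch from _print to _print_Symbol or _print_Add via the class name and the MRO is provided by SymPy's Printer base class:

from sympy.printing.printer import Printer

class ExamplePrinter(Printer):
    # one printing function per class; a parent-class function covers subclasses
    def _print_Symbol(self, expr):
        return expr.name

    def _print_Add(self, expr):
        return "(" + " + ".join(self._print(arg) for arg in expr.args) + ")"

    def emptyPrinter(self, expr):
        # fall-back: reaching this point means a printing function is missing
        raise NotImplementedError("no printing function for %s" % type(expr))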

3.2.2 Extension of llvmlite

Llvmlite consists of two layers, the IR layer and the binding layer. Only the binding layer provides interaction with the LLVM library itself. The IR layer of llvmlite is completely rewritten in Python and is thus extendible with specific or missing features.

In order to provide easy loop generation and to manage the code complexity, a new Loop class was introduced. It utilizes Python's with statement and thus is only a class which uses, and does not extend, llvmlite. Its usage can be seen in Listing 3.1, where the recursive printing system, discussed in the previous section, can be observed as well: the _print function returns the output of the generation of the underlying nodes, so in the case of loop.start it is the initial value of the loop.

with Loop(self.builder, self._print(loop.start),
          self._print(loop.stop), self._print(loop.step),
          loop.loopCounterName, loop.loopCounterSymbol.name) as i:
    # operations with the loop counter i

Listing 3.1: Loop generation with the custom class Loop
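The thesis does not reproduce the Loop class itself. The following is a minimal sketch of how such a with-statement wrapper can be built on top of llvmlite's IRBuilder; the names and the simplified signature are hypothetical, and for brevity the body is always entered at least once (no pre-check of the loop condition):

import llvmlite.ir as ir

class Loop:
    def __init__(self, builder, start, stop, step, name="loop"):
        self.builder, self.start, self.stop, self.step = builder, start, stop, step
        self.name = name

    def __enter__(self):
        b = self.builder
        self.body = b.append_basic_block(self.name + ".body")
        self.end = b.append_basic_block(self.name + ".end")
        entry = b.block
        b.branch(self.body)                           # enter the loop body
        b.position_at_end(self.body)
        self.counter = b.phi(self.start.type, name=self.name + ".ctr")
        self.counter.add_incoming(self.start, entry)  # initial value on entry
        return self.counter                           # the loop counter, an SSA phi node

    def __exit__(self, *exc):
        b = self.builder
        nxt = b.add(self.counter, self.step)
        self.counter.add_incoming(nxt, b.block)       # value on the back edge
        b.cbranch(b.icmp_signed('<', nxt, self.stop), self.body, self.end)
        b.position_at_end(self.end)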

Vector operations are a missing feature of llvmlite. However, the vector type is inserted easily, since it is sufficient to specify this particular type and, mainly, its string representation in LLVM: for IR generation llvmlite only works with the string representation, which is a trade-off between speed and maintainability. In this case the vector type extension demonstrates how easily llvmlite can be extended. The class diagram is shown in Figure 3.5.

Figure 3.5: Vector type
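Since only the textual IR matters, such an extension essentially boils down to producing the right type string, e.g. "<4 x double>". A sketch is given below; subclassing ir.Type and overriding its string form is an assumption about llvmlite's internals, and newer llvmlite releases provide a vector type themselves:

import llvmlite.ir as ir

class VectorType(ir.Type):
    # only the LLVM string representation of the type is required
    def __init__(self, element, count):
        self.element, self.count = element, count

    def __str__(self):
        return "<%d x %s>" % (self.count, self.element)

print(VectorType(ir.DoubleType(), 4))   # <4 x double>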

3.2.3 LLVM generation & JIT compilation

With an LLVMPrinter the LLVM code is generated. This could be the final step: the generated program can be compiled with LLVM tools and linked together with a simulation written in any frontend which targets LLVM.¹ It is also possible to create a dynamic library from the generated code. The library can then be loaded into programming languages which support a Foreign Function Interface (FFI).

¹Apart from linking, for some programming languages it might be necessary to have a wrapper in order to use the generated code.


Figure 3.5: Vector type

There is yet another option. Since the generation already runs in Python, the code can be directly compiled and called, thus providing JIT behavior. Just-In-Time compilation is compilation done during the execution of a program, so the Python code which follows the generation and compilation can directly use the just created kernel. Pystencils has multiple ways of JIT compilation, depending on how much control of the process the user needs, and on how many functions there are to generate.

In the case of one function which has to be generated, compiled and loaded, pystencils provides one simple way to receive the pointer to the desired function, as shown in Listing 3.2. This hides all the necessary steps, making it a simple and fast way of kernel creation.

kernel = make_python_function(ast, {'omega': 2/3})

Listing 3.2: Generation, compilation and wrapping of one AST

The whole process can be split into three steps. First the LLVM code has to be generated with the LLVMPrinter. The LLVMPrinter needs a module and a builder to operate on.

LLVM programs are composed of modules. Each module consists of global variables, functions and symbol table entries. Modules can be combined with the LLVM linker, which merges the definitions of functions, global variables and symbol tables and resolves forward declarations. In the case of pystencils, which only operates on one or more functions, one module is fully sufficient.

Llvmlite's IRBuilder is responsible for the generation of LLVM-IR. It internally maintains the current block, which is operated on, together with a pointer to an instruction inside the block; new instructions are added at that point. The builder also references meta-data, which is used by some instructions. The builder can reposition and create new instructions, thus guiding the generation process. This whole procedure can be handled by the function in Listing 3.3.


llvm_code = generateLLVM(ast)

Listing 3.3: Generation of one AST
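For illustration, the module/builder setup that generateLLVM performs internally can be sketched with llvmlite's IR layer as follows; the kernel name and signature here are hypothetical:

from llvmlite import ir

# One module is sufficient for pystencils' one or more functions.
module = ir.Module(name="pystencils_module")

# Hypothetical signature: void kernel(double* data)
fnty = ir.FunctionType(ir.VoidType(), [ir.DoubleType().as_pointer()])
func = ir.Function(module, fnty, name="kernel")

# The builder is positioned inside the entry block; new instructions
# are inserted at its current position.
block = func.append_basic_block(name="entry")
builder = ir.IRBuilder(block)
builder.ret_void()

print(module)  # the textual LLVM-IR of the whole module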

The next step after the generation of LLVM-IR is its compilation. Llvmlite's second layer provides classes to interact with the LLVM library. Alternatively, LLVM's executables could be used, but this is cumbersome, and llvmlite solves exactly this issue. Pystencils provides, around llvmlite's binding module, a class which compiles and loads the generated LLVM code. First the LLVM code has to be parsed; afterwards LLVM's optimizations are possible. With the compilation the desired function can be loaded. All those steps are again summarized inside one function call, as shown in Listing 3.4. With the Jit object, however, it is possible to perform any modifications manually.

jit = compileLLVM(llvm_code.module)

Listing 3.4: Compilation of generated LLVM code
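The steps wrapped by compileLLVM can be sketched with llvmlite's binding layer as follows, assuming module is the IR-layer module from the previous sketch and the generated function is named "kernel":

import llvmlite.binding as llvm

llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()

# Parse the textual IR into a binding-layer module and verify it.
mod = llvm.parse_assembly(str(module))
mod.verify()

# JIT-compile for the host machine.
target_machine = llvm.Target.from_default_triple().create_target_machine()
engine = llvm.create_mcjit_compiler(mod, target_machine)
engine.finalize_object()

func_ptr = engine.get_function_address("kernel")  # raw address of the function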

The generation and compilation steps can yet again be summarized together, as shown in Listing 3.5.

jit = generate_and_jit(ast)

Listing 3.5: Generation and compilation of LLVM code

The make_python_function in Listing 3.2 adds a Python wrapper around the generated function. Parameters, like ω in the case of the weighted Jacobi method, can be passed either during the wrapper creation or later during its call; that is also the main purpose of the wrapper.
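One way such a wrapper can be realized is via ctypes on top of the raw function address. This is a sketch under the hypothetical signature from the sketches above, not pystencils' exact wrapping code:

import ctypes
import numpy as np

# void kernel(double* data); func_ptr comes from the previous sketch
kernel_ty = ctypes.CFUNCTYPE(None, ctypes.POINTER(ctypes.c_double))
kernel = kernel_ty(func_ptr)

arr = np.zeros(16)
kernel(arr.ctypes.data_as(ctypes.POINTER(ctypes.c_double)))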

3.2.4 LLVM optimizations

One of the reasons for using LLVM was its optimization capabilities, so it is natural to use these optimizations. LLVM offers a large number of different passes:

• Analysis Passes

• Transform Passes

• Utility Passes


The utility passes are mainly responsible for the visualization of various properties of the code. The Control Flow Graph (CFG) and the dominance tree are the most prominent examples.

The analysis passes are necessary to enable some transformations. Depending on the kind of the individual pass, it is either meant as an analysis tool which provides rich information for the user, or it might be used for further transformations: in order to transform a CFG, it has to be certain that the semantic meaning persists, i.e. that the transforming operation is legal. Analysis passes provide the necessary information on whether or not a transformation is legal.

The transform passes change the code itself. The goal is to optimize the program. There can be different optimization targets, like the speed of the program or its size. Even optimizations which take the cache of a Central Processing Unit (CPU) into account are possible. The biggest challenge of all those transformations is deciding when which transformation should be applied. This is crucial because some transformations might degrade the performance of the program but are necessary to enable other transformations at all; conversely, after some transformations others might not be possible any more. To add even more complexity, some transformations need to be executed multiple times.

Llvmlite supports some of the transform passes, but not all of them. The optimization passes are scheduled between the parsing of the generated LLVM code and the compilation. Only llvmlite's passes are used. This offers an opportunity for future work: utilizing the passes which are not covered by llvmlite.
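Scheduling llvmlite's standard passes on a parsed module can be sketched as follows, with mod being the parsed binding-layer module from the earlier sketch; the chosen optimization level is illustrative:

import llvmlite.binding as llvm

pmb = llvm.PassManagerBuilder()
pmb.opt_level = 3              # populate with an -O3-like pipeline

pm = llvm.ModulePassManager()
pmb.populate(pm)               # add llvmlite's available transform passes

pm.run(mod)                    # optimize the module in place, before the JIT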

3.2.5 Utilization of compiled functions

The simulation uses the NumPy arrays from Listing 2.1. Those arrays are strided views on memory and thus work directly with the just created function; Python's own built-in list data structure would not work at all. The kernel is run 1000 times, and after each iteration the arrays are swapped. NumPy ensures that this is just a swap of pointers; no copying takes place.

for i in range(1000):
    kernel(src=srcArr, dst=dstArr)
    srcArr, dstArr = dstArr, srcArr

Listing 3.6: Running the simulation

Last but not least, the output has to be visualized. The initial arrays in Listing 2.1 had a Dirichlet boundary of 1.0 at the north and west sides; at the east and south edges the Dirichlet boundary was 0.0. With Laplace's equation the gradient of the field has to be as smooth as possible, so the gradient will not have any sudden spikes. Instead, since the sides of the field are different, the field will smoothly transition from 1.0 in the north/west part to 0.0 in the south/east part.


Figure 3.6: Visualized output of the weighted Jacobi computation

Matplotlib, a Python library, is used for visualization purposes, since everything is already implemented in Python. In Figure 3.6 the smooth gradient is displayed. The dark red cells are 1.0, whereas the dark blue cells are 0.0. Using the jet color-map of matplotlib, the smoothness is indicated by the slow color transition from red over yellow and green into blue.

This is the last step, which closes the whole path. After the definition of the kernel and its generation, the generated code was compiled, loaded and executed. Finally the output of the whole procedure was shown.


4 Application: Phase Field Method

The phase field method, as explained in the introduction, is a mathematical model for the numerical simulation of two or more phases and their boundary surfaces. It is used for the determination of structures and of the path of the interface. Because of the diffuse nature of the interface, this is a particularly difficult task if solved with the sharp-interface method. Solidification processes are the most prominent example; however, the method is also applied to the formation of patterns between two fluids or to fracture dynamics. The simulation is governed by a partial differential equation for the evolution of the phase field φ.¹ The phase field φ is also called the order parameter, and it can sometimes describe physical properties like the concentration of a solute.

The phase field method and its various models have a long history. In the following, a very short introduction to the phase field method is given. For a more detailed introduction refer to [8], [9] or [10].

The goal of a simulation that undergoes a phase transformation is generally to minimize the total free energy. In order to do so, the equation of the total free energy with a dependence on the phase field variable φ is needed. Such an equation in its most basic (and incomplete) form is:

F(φ) = ∫ f(φ) dV    (4.1)

where F is the total free energy and f is the double-well free energy density. In Figure 4.1 such a double-well free energy is shown. It contains two phases, α and β, separated by a barrier of height W. The graph displays the equation:

f(φ) = Wφ²(1 − φ)²    (4.2)

However, with the minimization of this total free energy alone, the domain would have clusters of α and β, which is the desired behavior, but the interface between them would not be continuous. It is discontinuous because, in order to be minimal, φ is either at the α or at the β minimum. This is exactly what the sharp-interface method does, as shown in Figure 1.4a. With this behavior it would be necessary to track the interface, which is exactly what should not be done. Therefore the total energy F has to be extended in order to describe the correct interface behavior between α and β:

F(φ) = ∫ [ f(φ) + ε²|∇φ|² ] dV    (4.3)

¹In the literature there is no single nomenclature; various Greek letters are used for the phase field, so φ, Φ, or ϕ refer to the same phase field.


Figure 4.1: Free energy density f(φ): a double well with barrier height W between the equilibrium phases φα and φβ

The total free energy F, which is a functional, may look different depending on the application. The gradient term ensures the continuous transition at the interface between the two phases α and β; thus it enforces the behavior shown in Figure 1.4b. ε is the coefficient of the gradient, and the gradient term represents the small effects of the interaction of atoms. The minimum of F is therefore given by:

δF/δφ = 0  ⇒  ∂f/∂φ − 2ε²∇²φ = 0    (4.4)

which is a functional derivative with respect to φ.

Depending on the application, the total free energy F can look different. For instance, the free energy density could depend on the concentration of the solute c, or on the temperature T, as used in [9]:

F = ∫ [ f(φ, c, T) + (ε_c²/2)|∇c|² + (ε_φ²/2)|∇φ|² ] dV    (4.5)

This, of course, leads to a set of different equations; for example, the free energy density f(φ, c, T) becomes much more complex. This also depends on the number of different materials.

For time-dependent simulations, the simplest equation which guarantees the decrease of the total free energy with respect to time is:


∂φ/∂t = M ∇² ( ∂f/∂φ − 2ε²∇²φ )    (4.6)

where M is a positive mobility parameter, which is related to the solute diffusion and interface kinetic coefficients. Equation (4.6) is called the Cahn-Hilliard equation. Depending on F, the Cahn-Hilliard equation can also look like:

∂c/∂t = ∇ · [ M c(1 − c) ∇( ∂f/∂c − ε_c ∇²c ) ]    (4.7)

Another possible equation is:

∂φ/∂t = −M ( ∂f/∂φ − 2ε²∇²φ )    (4.8)

which is the Allen-Cahn equation.

The difference between Equation (4.6) and Equation (4.8) is that the Cahn-Hilliard equation is conserved, whereas the Allen-Cahn equation is non-conserved. The conservation of, for example, the concentration can be crucial; depending on the application, one of these equations is chosen.

An interesting aspect of the numerical simulation is the thickness of the interface. Depending on the chosen equation, the thickness l can take the form:

l = (2/W)^{1/2} ε    (4.9)

It is desirable to keep l as small as possible, because then the interface is clearly visible. Its value is a balance between two competing effects. On the one hand, the interface should be as sharp as possible in order to minimize the free energy, i.e. the volume of α or β should be as large as possible. On the other hand, the introduced gradient term makes the interface diffuse, in order to reduce the energy.

Depending on the application, a different model has to be constructed, with a different set of partial differential equations and parameters. The parameters define the details of the simulation, like coefficients, constants, or concentration values. In the following section two different models are presented.

4.1 Examples of phase field models

In the following, two different examples of phase field models are presented. The ideas for these examples were provided by [11]. The results are presented in Section 4.2.

4.1.1 Spinodal decomposition of binary alloy

Spinodal decomposition is a mechanism which describes the swift demixing of a mixture from one thermodynamic phase into two phases. For example, in the case of hot water and oil, the two fluids may blend at high temperatures, thus forming a single thermodynamic phase. If the mixture is cooled down suddenly, the phase separates into an oil-rich and a water-rich phase. Yet this only happens if the mixture is at a thermodynamic equilibrium which permits this coexistence.

The total free energy in this example takes the simpler form:

F = ∫ [ f(c) + (ε/2)(∇c)² ] dV    (4.10)

where ε is the gradient coefficient and f(c) is the chemical free energy density, represented as:

f(c) = Wc²(1 − c)²    (4.11)

which is the same double-well potential as shown in Figure 4.1. However, now the energy density depends on c and not on φ. The concentration c has the same equilibrium phases, at 0.0 and at 1.0.

This model uses the Cahn-Hilliard time evolution:

∂c/∂t = ∇² M ( δF/δc )    (4.12)

where M is the mobility. The functional derivative of F is:

δF/δc = µ(c) − ε∇²c    (4.13)

where µ in turn is the functional derivative of the chemical energy:

µ(c) = W ( 2c(1 − c)² − 2c²(1 − c) )    (4.14)

The thickness of the interface is then:

l = (ε/W)^{1/2}    (4.15)

These are the necessary equations for the simulation, and they need to be discretized. For the discretization the finite difference method is used, the same method as described in Section 2.2.1 for the discretization of the weighted Jacobi method. The simulation domain is a 2D grid of Nx × Ny grid points, where Δx and Δy are the spacings between the grid points. The concentration at the point (i, j) is then c_{ij}.

First of all, the explicit Euler scheme is chosen for the time evolution of the Cahn-Hilliard Equation (4.12):

(c^{n+1}_{ij} − c^n_{ij}) / Δt = ∇² M ( δF_{ij}/δc_{ij} )^n

c^{n+1}_{ij} = c^n_{ij} + Δt ∇² M ( δF_{ij}/δc_{ij} )^n    (4.16)


where Δt is the size of the time step; n is the current time step, whereas n+1 is the next one. The discretization of

( δF_{ij}/δc_{ij} )^n = µ(c^n_{ij}) − ε∇²c^n_{ij}    (4.17)

consists of two parts. The functional derivative of the chemical energy is discretized as:

µ(c^n_{ij}) = W ( 2c^n_{ij}(1 − c^n_{ij})² − 2(c^n_{ij})²(1 − c^n_{ij}) )    (4.18)

whereas the Laplacian is discretized as:

∇²c^n_{ij} = (c_{i+1,j} − 2c_{ij} + c_{i−1,j})/(Δx)² + (c_{i,j+1} − 2c_{ij} + c_{i,j−1})/(Δy)²    (4.19)

The Cahn-Hilliard equation applies a further Laplacian to δF_{ij}/δc_{ij} itself. For this reason the functional derivative has to be stored in an extra array before the update can be computed.
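The resulting two-pass structure can be sketched in plain NumPy (the names are hypothetical; pystencils generates the equivalent compiled kernels):

import numpy as np

def laplacian(a, dx, dy):
    # 5-point stencil of Eq. (4.19), evaluated on the interior cells
    return ((a[2:, 1:-1] - 2.0*a[1:-1, 1:-1] + a[:-2, 1:-1]) / dx**2 +
            (a[1:-1, 2:] - 2.0*a[1:-1, 1:-1] + a[1:-1, :-2]) / dy**2)

def cahn_hilliard_step(c, dF, dt, M, W, eps, dx, dy):
    ci = c[1:-1, 1:-1]
    # First pass: store the functional derivative (4.17) in the extra array.
    mu = W * (2.0*ci*(1.0 - ci)**2 - 2.0*ci**2*(1.0 - ci))
    dF[1:-1, 1:-1] = mu - eps * laplacian(c, dx, dy)
    # (The periodic boundary of dF has to be enforced at this point.)
    # Second pass: explicit Euler update (4.16) with the outer Laplacian.
    c[1:-1, 1:-1] = ci + dt * M * laplacian(dF, dx, dy)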

4.1.2 Dendritic solidification

A dendrite, which means "tree" in Greek, is a tree-like structure; the term is used in many fields to describe branching behavior. Dendrites are formed when a liquid is cooled sufficiently so that solidification can occur. This fractal pattern is mostly of interest in metals; however, the most prominent example is the freezing of water, i.e. the formation of snowflakes. The dendrites form because of the interfacial anisotropy, i.e. there are preferred directions of growth of the crystal. This is due to the surface energy at the solid-liquid interface, or due to the ease of attachment of atoms in different directions. The dendrites grow differently under different environmental conditions. The control of the growth is essential for many applications, especially in metallurgy. A different microstructure of a material causes different properties, e.g. ductility, hardness, toughness, resilience or stiffness. This is even more crucial if more than one material is present, i.e. in the case of alloys. The simulation of dendritic growth can therefore give important insights, which can then be used in the creation of a special material.

The following model is based on [12]. The model consists of two variables, the temperature T and the phase field variable φ. φ is either solid (1) or liquid (0) and thus non-conserved, since the amount can change, i.e. the liquid freezes. The total free energy is:

F(φ, m) = ∫ [ (1/2)ε²|∇φ|² + f(φ, m) ] dV    (4.20)

The first part of the term is already familiar from the previous examples. Again, it is the gradient energy, which is responsible for the continuity at the interfaces. However, in this model ε depends on the direction of the normal vector at the interface:


ε = ε̄σ(θ)    (4.21)

where σ(θ) represents the anisotropy:

σ(θ) = 1 + δ cos(j(θ − θ₀))    (4.22)

where j is the mode number of the anisotropy, δ is the strength of the anisotropy and θ₀ is the initial offset angle; these are constant parameters. The angle θ, however, is not constant and is defined by:

θ = tan⁻¹( (∂φ/∂y) / (∂φ/∂x) )    (4.23)

The second part of the total free energy is the double-well local free energy, but it is further influenced by a parameter m, which depends on the temperature:

m(T) = (α/π) tan⁻¹( γ(T_eq − T) )    (4.24)

The free energy density is shown in Figure 4.2 for various values of m. The equation of the free energy is:

f(φ, m) = (1/4)φ⁴ − (1/2 − (1/3)m)φ³ + (1/4 − (1/2)m)φ²    (4.25)

For a time evolution the functional derivative of Equation (4.20) is used:

τ ∂φ/∂t = −δF/δφ    (4.26)

This results in a large partial differential equation:

τ ∂φ/∂t = ∂/∂y ( ε (∂ε/∂θ) (∂φ/∂x) ) − ∂/∂x ( ε (∂ε/∂θ) (∂φ/∂y) ) + ∇·(ε²∇φ) + φ(1 − φ)(φ − 1/2 + m)    (4.27)

In addition to the Allen-Cahn equation, which controls the evolution of the phase field φ, a second equation for the regulation of the temperature T is needed:

∂T/∂t = ∇²T + κ ∂φ/∂t    (4.28)

where κ is the dimensionless latent heat, i.e. the heat required to convert a solid into a liquid or vapor, or a liquid into a vapor, without a change of temperature.

The discretization of the equations is long but straightforward. The Laplacian operator is yet again discretized by a 5-point stencil:

∇²φ^n_{ij} = (φ_{i+1,j} − 2φ_{ij} + φ_{i−1,j})/(Δx)² + (φ_{i,j+1} − 2φ_{ij} + φ_{i,j−1})/(Δy)²    (4.29)

whereas the gradients in the x and y directions are the centered differences:


Figure 4.2: Free energy density f(φ) for m from −0.40 to 0.40 (liquid at φ = 0, solid at φ = 1)

∂φ_{i,j}/∂x = (φ_{i+1,j} − φ_{i−1,j})/(2Δx)

∂φ_{i,j}/∂y = (φ_{i,j+1} − φ_{i,j−1})/(2Δy)    (4.30)

The time discretization is the explicit Euler scheme. This leads to the equation for the temperature development:

T^{n+1}_{i,j} = T^n_{i,j} + Δt [ (T^n_{i+1,j} − 2T^n_{ij} + T^n_{i−1,j})/(Δx)² + (T^n_{i,j+1} − 2T^n_{ij} + T^n_{i,j−1})/(Δy)² ] + κ(φ^{n+1}_{i,j} − φ^n_{i,j})    (4.31)

The equation which governs the phase field variable φ is discretized in a similar manner. The Laplacians and gradients are exactly the same as before, so they are not covered again. The derivative ∂ε_{ij}/∂θ is derived analytically:


θ_{ij} = atan2( ∂φ_{ij}/∂y, ∂φ_{ij}/∂x )

ε_{ij} = ε̄( 1 + δ cos(j(θ_{ij} − θ₀)) )

∂ε_{ij}/∂θ = −ε̄ j δ sin(j(θ_{ij} − θ₀))

m_{ij} = (α/π) atan( γ(T_eq − T_{ij}) )    (4.32)

where m, ε and ∂ε_{ij}/∂θ are then used in the equation of the time evolution. ∂ε_{ij}/∂θ is further shortened as ∂_θε_{ij}; ∂φ_{ij}/∂x and ∂φ_{ij}/∂y are shortened accordingly:

φ^{n+1}_{ij} = φ_{ij} + (Δt/τ) [ (ε_{i,j+1} ∂_θε_{i,j+1} ∂_xφ_{i,j+1} − ε_{i,j−1} ∂_θε_{i,j−1} ∂_xφ_{i,j−1}) / Δy
    − (ε_{i+1,j} ∂_θε_{i+1,j} ∂_yφ_{i+1,j} − ε_{i−1,j} ∂_θε_{i−1,j} ∂_yφ_{i−1,j}) / Δx
    + ε²_{ij} ∇²φ_{ij} + φ_{ij}(1 − φ_{ij})(φ_{ij} − 1/2 + m) ]    (4.33)

On the right side of this equation, all fields are at the n-th time step; this information was omitted for a better overview.

In Equation (4.31) there are two φ from two different time steps. This means that the old φ has to be preserved and not overwritten by the current time step; φ_old = φ^n is thus a simple copy operation.
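A time step can therefore be sketched as follows (the update functions are hypothetical placeholders for the generated kernels):

phi_old = phi.copy()            # preserve phi at time step n
phi = update_phi(phi, T)        # phi at n+1 according to Eq. (4.33)
T = update_T(T, phi, phi_old)   # Eq. (4.31) uses phi from both time steps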

4.2 Results

In the following, the results of the previously discussed examples are presented.

4.2.1 Spinodal decomposition

In order to simulate, a domain size is necessary. For this simulation the domain size is 64 × 64, which does not include the boundary. In this example the periodic boundary is used. The arrays therefore have an additional layer of cells around them, whose values equal the cells inside the domain on the opposing side:

c_{0,j} = c_{N,j},   c_{i,0} = c_{i,N},   c_{N+1,j} = c_{1,j},   c_{i,N+1} = c_{i,1}    (4.34)

where N is the size of the domain, e.g. 64. This periodic boundary has to be enforced after every step, as sketched below.
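Enforcing (4.34) on an array with one ghost layer can be sketched with NumPy slicing (a hypothetical helper, since the boundary handling is done in plain Python):

def apply_periodic(a):
    # Ghost cells copy the opposite interior slice, cf. Eq. (4.34).
    a[0, :]  = a[-2, :]   # c[0, j]   = c[N, j]
    a[-1, :] = a[1, :]    # c[N+1, j] = c[1, j]
    a[:, 0]  = a[:, -2]   # c[i, 0]   = c[i, N]
    a[:, -1] = a[:, 1]    # c[i, N+1] = c[i, 1]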

For the given fixed domain size, the arrays must be initialized. The array of concentration is initialized with a constant concentration and an additional noise between [−0.01, 0.01], as shown in Listing 4.1. The content of the array is visualized in Figure 4.3.

import random
import numpy as np

def init(Nx, Ny, c0):
    noise = 0.02
    con = np.zeros((Nx+2, Ny+2))  # +2 -> periodic boundaries
    for i in range(1, Nx+1):
        for j in range(1, Ny+1):
            con[i,j] = c0 + noise * (0.5 - random.random())
    return con

Listing 4.1: Initialization of the concentration array

Figure 4.3: Initial array content

The next step is to create the equations for the kernel generation. This is shown in Listing 4.2; the definitions of symbols and fields are omitted. The function laplacian_sympy creates a discretized Laplacian as defined in Equation (4.19) from any Field, which makes the following equations easier to define. The equations are defined in lines 4−9. In addition, the equations are already simplified with SymPy. For instance the free energy density, whose initial definition closely resembles the mathematical description:

W ( −2.0 src_C² (−src_C + 1) + 2.0 src_C (−src_C + 1)² )    (4.35)

becomes much simpler:

src_C W ( 4.0 src_C² − 6.0 src_C + 2.0 )    (4.36)

The equations and the kernel can be created with parameters which are then given during the call of the kernel. Another way is to substitute these parameters before the generation of the kernel. In lines 11 and 12 this substitution is performed. The runtime difference is discussed later.


1  def laplacian_sympy(field, dx, dy):
2      return ((field[1,0] - 2.0*field[0,0] + field[-1,0])/(dx*dx) +
               (field[0,1] - 2.0*field[0,0] + field[0,-1])/(dy*dy))
3
4  mu = (W * (2.0 * c[0,0] * (1-c[0,0])**2 - 2.0 * c[0,0]**2 * (1-c[0,0]))).simplify()
5  lap = laplacian_sympy(c, dx, dy)
6  derivation = mu - kappa * lap
7  deriv_eq = sp.Eq(dF[0,0], derivation)
8  lap_deriv = laplacian_sympy(dF, dx, dy)
9  c_eq = sp.Eq(c[0,0], M * lap_deriv * dt + c[0,0])
10
11 deriv_eq = deriv_eq.subs([(kappa, grad_coef), (dy, dy_val), (dx, dx_val),
                             (dt, dtime), (M, mobility), (W, W_val)])
12 c_eq = c_eq.subs([(kappa, grad_coef), (dy, dy_val), (dx, dx_val),
                     (dt, dtime), (M, mobility), (W, W_val)])

Listing 4.2: Definition of equations

The simulation is then carried out for 20000 time steps. At the beginning of each time step the periodic boundary is applied, which is still written only in Python. The simulation results are output via evtk, and the visualization itself is done with ParaView.

In Figure 4.5 the simulation progression is shown. As can be seen, the microstructure is very fine in the beginning and contains a high number of precipitates. During the simulation the number of precipitates decreases and their size increases. This very complex evolution, with many interfaces disappearing as their precipitates disappear, shows the convenience of the phase field method.

Figure 4.4: Speed performance in seconds (called: 11.6, generated: 5.5, NumPy: 2)

In addition to the simulation itself, a simple performance measurement was done. The speed was measured without any data output during the simulation. In Figure 4.4 there are three different bars. The first shows the performance of the generated kernel without the substitution of the parameters. As expected, it is considerably slower than the version with substituted parameters, which in this case is twice as fast.


The code was also implemented in NumPy, and NumPy exceeds the generated kernel. However, NumPy is not a pure Python library. In fact, NumPy uses the Basic Linear Algebra Subprograms (BLAS), a specification for low-level routines in linear algebra. BLAS implementations are high-performance code capable of Single Instruction Multiple Data (SIMD) instructions. Since the generated code does not use SIMD instructions, the performance gap is therefore reasonable.


Figure 4.5: Simulation output of spinodal decomposition at time steps 2000, 3000, 5000, 8000, 15000 and 20000


4.2.2 Dendritic solidification

The preparations for the simulation are very similar to the previous ones. The periodic boundary is used on the φ array as well as on the temperature array T, and is enforced after every step. The domain size is 300 × 300, without the boundary cells.

The T array is initialized with zero everywhere. The φ array, however, needs a nucleus in order to grow the crystal. For this reason the initial φ array has a little dot where φ is equal to 1.

The kernel definition follows exactly the same steps as in Section 4.2.1 and is therefore not examined again.

The simulation is carried out for 4000 steps. At the beginning of each time step the periodic boundary is applied. The simulation output is visualized with ParaView.

In Figure 4.7 the evolution of the φ field is shown. The "snowflake" evolves during the simulation, and some of the dendrites merge together. This would again be a difficult task with a sharp-interface model. The anisotropy parameter j mainly controls the type of the "snowflake". Here j is equal to 6, so the "snowflake" forms six main dendrites.

Figure 4.6: Speed performance in seconds (called: 119, generated: 45, NumPy: 60)

The performance evaluation was made for the kernel with and without the parameter substitution and for a pure NumPy kernel. In contrast to Section 4.2.1, in this case the generated kernel is the fastest, as shown in Figure 4.6. This is explainable by the larger number of parameters, which slows down the NumPy kernel too much. The kernel where the parameters are passed as call arguments is the slowest version.


Figure 4.7: Simulation output of φ of dendritic solidification at time steps 0, 500, 1250, 2000, 3000 and 4000


Figure 4.8: Simulation output of T of dendritic solidification at time steps 0, 500, 1250, 2000, 3000 and 4000


5 Conclusion & Future Work

This chapter briefly summarizes this thesis, followed by a short discussion of its effectiveness and responses to some issues which arose. In addition, some ideas for possible future work are presented.

5.1 Conclusion

In this thesis the generation of LLVM-IR with pystencils was presented. The transition from the mathematical description in SymPy into the AST was explained in Chapter 2. In Chapter 3 the whole process of LLVM generation from the AST description was described. Chapter 4 was an exemplary application utilizing the whole framework.

The pystencils framework works well. The generation is successful and thus the target of this thesis was achieved. However, it was disappointing that the generated kernel in Section 4.2.1 was not faster than a version in NumPy, although the kernel had the advantage of the substituted parameters. This indicates that the optimizations in pystencils are not yet sufficient. Although it was argued that the linear algebra library BLAS used by NumPy is very performant, it should be outperformed in the future. The LLVM generation does not yet support SIMD instructions or any form of parallelization.

The boundaries were another issue. The LLVM generation part of pystencils does not support them yet, and thus the boundary handling was done in pure Python.¹ This should not hurt the performance, since it seems that NumPy outperforms pystencils in exactly these simple kernels.

Last but not least, the generation with llvmlite can be challenging. Llvmlite was primarily designed for the needs of Numba [5], which means that some features are missing. Missing instructions or types can easily be added. However, the binding to LLVM does not cover the entire functionality required; for example, many of the optimizations are not accessible. Yet there is no other viable FFI to the LLVM library for Python. A possible new alternative might be llvmcpy [13].

Altogether the definition, transformation and generation are still successful. The issues mentioned above leave space for improvements in the future.

¹Strictly speaking, since NumPy is involved, it is not pure Python.


5.2 Future Work

As concluded, most of the issues regarding the generation of LLVM with pystencils are just missing features, which should be implemented in the future. They can be split into two groups: extension of the supported scope, and optimizations.

5.2.1 Extension of the supported scope

The generation possibilities of kernels should be expanded, i.e. the set of possible kernels should be increased. First of all, the LLVM generation needs better support for some features which the C-backend already offers. The handling of boundaries and of conditionals are much needed features.

Furthermore, the linking of other libraries compiled by LLVM should be supported, because it increases the usability of the whole framework. Access to existing libraries offers already implemented functionality; for example, implementations of linear algebra operations are usually highly optimized. Other functions like the Fast Fourier Transform (FFT) or sorting operations could be made available as well. On the other hand, the generated kernel should be made available to different kinds of programming languages, not just Python and C/C++.

5.2.2 Optimizations

There are great possibilities for optimizations in pystencils, on nearly every representation of the framework. They can be separated into three subgroups:

• SymPy optimizations

• AST optimizations

• LLVM optimizations

Although partitioned into three groups, many optimizations are executable in more than one group.

SymPy optimizations

SymPy's simplify method offers a powerful tool for the simplification of mathematical equations. Although the simplification works well from the mathematical perspective, in some cases it creates an equation that does not work well from a computational point of view. For example, an exponentiation with an exponent equal to 2 can be replaced by a multiplication of the base with itself. Even though the user writes the kernel as such, SymPy tenaciously changes it into an exponentiation. Therefore it would be advantageous to simplify for performance and not for mathematical convenience, as the sketch below illustrates.
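A minimal sketch of such a performance-oriented rewrite, turning small integer powers back into repeated multiplications before code generation (an illustrative helper, not part of pystencils):

import sympy as sp

x = sp.Symbol('x')
print(x * x)  # SymPy canonicalizes the product to x**2

def pow_to_mul(expr):
    # Replace x**n (integer n > 1) by an unevaluated product x*x*...*x.
    pows = [p for p in expr.atoms(sp.Pow)
            if p.exp.is_Integer and p.exp > 1]
    repl = {p: sp.Mul(*([p.base] * int(p.exp)), evaluate=False) for p in pows}
    return expr.xreplace(repl)

print(pow_to_mul(x**2 + 2*x**3))  # x*x + 2*x*x*x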


AST optimizations

There are huge possibilities in optimizing the AST, since the AST itself can be expanded in order to express new semantic meaning. For example, vectorization or parallelization could be achieved with AST transformations; both would need new node types in order to express the particular meaning. Optimizing on the AST should be the simplest way, because the AST possesses all available information about the kernel.

LLVM optimizations

Last but not least, optimizations on the generated LLVM have to be examined in more detail. There is not much space for optimization during the generation, because the generation of LLVM code is mainly the transition from the AST into the LLVM-IR. Having the IR, however, opens up optimization possibilities. First of all, there are many optimization passes available which could increase the performance and which are not yet applied. Moreover, since pystencils directly decides which instructions it chooses, it might turn out that an optimization pass created for pystencils' purposes would be advantageous.

Apart from LLVM's official optimization tool there are other possibilities for further analysis and optimization of the LLVM code. For example, Polly is another way of optimizing LLVM code.


Acknowledgements

I would like to thank Martin Bauer for his great support and his advising role in the creation of this thesis. Furthermore I would like to thank Claudia Tüns for her help in proofreading and Sebastian Eibl for his assistance with the printing. Last but not least I would like to thank my fiancée Laura for giving me motivation and enduring me during the writing of this thesis.


Bibliography

[1] Yuan Tang et al. "The Pochoir Stencil Compiler". In: Proceedings of the Twenty-third Annual ACM Symposium on Parallelism in Algorithms and Architectures. SPAA '11. San Jose, California, USA: ACM, 2011, pp. 117–128. isbn: 978-1-4503-0743-7. doi: 10.1145/1989493.1989508. url: http://doi.acm.org/10.1145/1989493.1989508.

[2] Matthias Christen, Olaf Schenk, and Helmar Burkhart. "PATUS: A Code Generation and Autotuning Framework for Parallel Iterative Stencil Computations on Modern Microarchitectures". In: Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium. IPDPS '11. Washington, DC, USA: IEEE Computer Society, 2011, pp. 676–687. isbn: 978-0-7695-4385-7. doi: 10.1109/IPDPS.2011.70. url: http://dx.doi.org/10.1109/IPDPS.2011.70.

[3] LLVM Compiler Infrastructure. url: http://llvm.org/.

[4] LLVM Language Reference Manual. url: http://llvm.org/docs/LangRef.html.

[5] Numba. url: http://numba.pydata.org/.

[6] Lex - A Lexical Analyzer Generator. url: http://dinosaur.compilertools.net/lex/.

[7] Yacc: Yet Another Compiler-Compiler. url: http://dinosaur.compilertools.net/yacc/.

[8] Adam A. Wheeler, William J. Boettinger, and Geoffrey B. McFadden. "Phase-field model for isothermal phase transitions in binary alloys". In: Physical Review A 45.10 (1992), p. 7424.

[9] W. J. Boettinger et al. "Phase-Field Simulation of Solidification". In: Annual Review of Materials Research 32.1 (Aug. 2002), pp. 163–194. doi: 10.1146/annurev.matsci.32.101901.155803.

[10] Nele Moelans, Bart Blanpain, and Patrick Wollants. "An introduction to phase-field modeling of microstructure evolution". In: Calphad 32.2 (2008), pp. 268–294.

[11] S. Bulent Biner. Programming Phase-Field Modeling. Springer, 2017.

[12] R. Kobayashi. "Modeling and numerical simulations of dendritic crystal growth". In: Physica D: Nonlinear Phenomena 63 (Mar. 1993), pp. 410–423. doi: 10.1016/0167-2789(93)90120-P.

[13] llvmcpy: Python bindings for the LLVM project. url: https://rev.ng/llvmcpy.


List of Listings

1.1  add-Instruction  4
2.1  Creation of simulation arrays and of Field objects  15
2.2  Definition of the weighted Jacobi kernel  16
2.3  Creation of a kernel from SymPy's expression  17
3.1  Loop generation with the custom class Loop  24
3.2  Generation, compilation and wrapping of one AST  25
3.3  Generation of one AST  26
3.4  Compilation of generated LLVM code  26
3.5  Generation and compilation of LLVM code  26
3.6  Running the simulation  27
4.1  Initialization of the concentration array  37
4.2  Definition of equations  38


List of Figures

1.1  LLVM toolchain  3
1.2  Conditional  4
1.3  SymPy expression  6
1.4  Interface Comparison  8
1.5  pystencils toolchain  9
2.1  Extensions of SymPy - Type System  10
2.2  Extensions of SymPy - Field and Access  11
2.3  Class diagram of AST  12
2.4  5-point stencil on a grid  14
2.5  Expression tree of the weighted Jacobi kernel  16
2.6  Extended AST with loops and function  18
2.7  Resolve field access  19
2.8  Move constants before loop  19
3.1  DeSymPyied AST  21
3.2  Conversion inside AST  21
3.3  Pointer arithmetic inside AST  22
3.4  Class diagram of the printing system  24
3.5  Vector type  25
3.6  Visualized output of the weighted Jacobi computation  28
4.1  Free energy density  30
4.2  Free energy density  35
4.3  Initial array content  37
4.4  Speed performance  38
4.5  Simulation output of spinodal decomposition for given time steps  40
4.6  Speed performance  41
4.7  Simulation output of φ of dendritic solidification for given time steps  42
4.8  Simulation output of T of dendritic solidification for given time steps  43


Jan Hönig

Personal Data
Date of birth: 13 June 1992 in Prague
Citizenship: Czech
Marital status: single, no children

Education
2004–2011  Abitur, Gymnasium Herzogenaurach, Herzogenaurach, grade 2.7.
2011–2015  Bachelor of Science, Friedrich-Alexander Universität, Erlangen, grade 2.4.

Work Experience
2012–2013  Working student, Siemens, Nürnberg. Support in the IT area.
2014–2015  Student assistant, Friedrich-Alexander Universität, Erlangen. Development of a GUI.
2016       Student assistant, Friedrich-Alexander Universität, Erlangen. Preparation of the print-ready version of a study companion book.
2016–2017  Student assistant, Friedrich-Alexander Universität, Erlangen. Correction of exercise sheets.

Languages
Czech: native language
German: business fluent (German Abitur)
English: business fluent (B2+/C1)
Russian: good knowledge
French: basic knowledge (B1+)

Additional Qualifications
Driving licence class B

Händelstraße 3 – 91074 – Herzogenaurach
+49 1575 210 57 64 • [email protected]