Identification of Hadronic Tau Lepton Decays at the ATLAS Detector Using Artificial Neural Networks

CERN-THESIS-2015-279 (12/10/2015)

Fakultät Mathematik und Naturwissenschaften, Fachrichtung Physik, Institut für Kern- und Teilchenphysik

Master's Thesis

Identification of Hadronic Tau Lepton Decays at the ATLAS Detector Using Artificial Neural Networks

Nico Madysa
Born on: 14 September 1990 in Hoyerswerda
Matriculation number: 3572486
Year of enrollment: 2012

Submitted for the academic degree of Master of Science (M.Sc.)

First examiner: Prof. Dr. Arno Straessner
Second examiner: Prof. Dr. Kai Zuber

Submitted on: 12 October 2015

Abstract

Tau leptons play an important role in a wide range of physics analyses at the LHC, such as the verification of the Standard Model at the TeV scale or the determination of Higgs boson properties. For the identification of hadronically decaying tau leptons with the ATLAS detector, a sophisticated, multi-variate algorithm is required. This is due to the high production cross section for QCD jets, the dominant background. Artificial neural networks (ANNs) have gained much attention in recent years by winning several pattern recognition contests. In this thesis, a survey of ANNs is given with a focus on developments of the past 20 years. Based on this work, a novel, ANN-based tau identification is presented which is competitive with the current BDT-based approach. The influence of various hyperparameters on the identification is studied and optimized. Both stability and performance are enhanced through the formation of ANN ensembles. Additionally, a score-flattening algorithm is presented that is beneficial to physics analyses with no defined working point in terms of signal identification efficiency.

Zusammenfassung

Tau leptons play an important role in many physics analyses at the LHC, among them the verification of the Standard Model at the TeV scale and the determination of properties of the Higgs boson. Identifying hadronically decaying tau leptons with the ATLAS detector requires complex, multivariate algorithms. The reason for this is the high production cross section for QCD jets, which constitute the dominant background. Artificial neural networks (ANNs) have recently won several pattern recognition competitions and thereby attracted broad attention. This thesis presents a survey of the theory of artificial neural networks with a particular focus on the developments of the last 20 years. Based on this, a novel, ANN-based tau identification is presented that achieves results comparable to the current BDT-based approach. Furthermore, the influence of various hyperparameters is investigated and optimized. Both the stability and the performance of the identification are increased by forming ANN ensembles. Additionally, a score-flattening algorithm is presented that transforms the ensemble output in favor of a predictable signal identification efficiency.

Contents

Abstract

1 Introduction
2 Theoretical Foundations
   2.1 The Standard Model of Particle Physics
   2.2 The Electroweak Theory
   2.3 Physics With Tau Leptons
3 Artificial Neural Networks
   3.1 Binary Classification and Training
   3.2 Artificial Neurons
      3.2.1 The Combination Function
      3.2.2 The Activation Function
   3.3 Feed-Forward Networks
   3.4 Single-Layer Networks
   3.5 Multi-Layer Networks
      3.5.1 Notation
      3.5.2 Nonlinear Activation Functions
      3.5.3 Decision Boundaries
      3.5.4 Deep Neural Networks
   3.6 Training Algorithms
      3.6.1 Back-Propagation
      3.6.2 Gradient Descent
      3.6.3 RPROP
      3.6.4 SARPROP
      3.6.5 Broyden–Fletcher–Goldfarb–Shanno Algorithm
   3.7 Weight Initialization
      3.7.1 Random Weight Initialization
      3.7.2 Activation Functions
   3.8 Error Functions
      3.8.1 Sum-of-Squares Error
      3.8.2 Cross-Entropy
      3.8.3 Error and Activation Functions
   3.9 Generalization
      3.9.1 The Bias–Variance Decomposition
      3.9.2 Weight Decay
      3.9.3 Early Stopping
      3.9.4 Neural Network Ensembles
4 The LHC and the ATLAS Experiment
   4.1 The Large Hadron Collider
   4.2 The ATLAS Detector
      4.2.1 The ATLAS Coordinate System
      4.2.2 The Inner Detector
      4.2.3 The Calorimeter System
      4.2.4 The Muon Spectrometer
      4.2.5 The Forward Detectors
      4.2.6 The Trigger System
5 Tau Reconstruction and Identification
   5.1 Reconstruction of Tau Leptons
   5.2 Identification of Tau Leptons
6 Optimization of ANN-Based Tau Identification
   6.1 Signal and Background Samples
   6.2 Evaluation of Identification Algorithms
      6.2.1 Figures of Merit
      6.2.2 Independence of Transverse Momentum and Pileup Effects
      6.2.3 Benchmark Algorithm
   6.3 Input Data Standardization
   6.4 Stabilization of Training Results
   6.5 Event Reweighting
   6.6 Score Flattening Algorithm
   6.7 Choice of Training Algorithm
   6.8 Choice of Error Function
   6.9 Choice of Activation Function
   6.10 Optimization of Network Topology
   6.11 Performance of the Optimized Classifier
7 Summary and Outlook
A Training Algorithms
B Derivation of the Back-Propagation Equations
C Bias–Variance Decomposition for the Cross-Entropy
List of Figures
List of Tables

1 Introduction

The physics of elementary particles seeks to answer two principal questions: What is matter made of and what are the fundamental forces acting on it? It emerged around the beginning of the twentieth century[1] with the discovery of the electron by J. J. Thomson in 1897; the scattering experiments by Rutherford, Geiger, and Marsden from 1908 till 1913; and Bohr's proposal of an atom model in 1914.

In the following century, an abundance of new elementary and composite particles was discovered. Many theories which aimed to organize the hundreds of new particles were both devised and discarded. By the early 70s, a set of theories now known as the Standard Model (SM) had been established. The established elementary particles were organized into four leptons and four quarks. Within each group, there were two families of two particles each. The interactions between these particles were described by a set of local gauge theories.

In 1975, a new particle was discovered by Perl et al.[2]. It turned out to be the charged lepton of a then-unknown third family of particles[3]. Its discovery incited subsequent searches for the remaining members of this family: the bottom quark (found in 1977[4]), the top quark (found in 1995[5]), and the tau neutrino (found in 2001[6]). At present, there is no evidence that suggests the existence of a fourth family of elementary particles[7, ch. 10].

The Large Hadron Collider (LHC)[8] currently is the largest and most powerful particle accelerator in existence. It is designed to search for rare processes at unprecedentedly high energies of up to 14 TeV. Its primary goals are the discovery of the Higgs boson¹ and the search for extensions of the Standard Model at high energies. The physics of tau leptons plays a vital role in these searches. It is hence important to identify the tau leptons which are produced in the particle collisions at the LHC as efficiently and correctly as possible.

1 The Higgs boson is the last particle required by the SM that had not been observed at the time of the LHC's construction. In 2012, a boson was found whose properties are compatible with the SM predictions for the Higgs boson[9, 10].


In the past years, multi-variate analysis (MVA) methods have emerged as an alternative to cut-based approaches to identification[11] due to their higher efficiency and flexibility. Artificial neural networks (ANNs) are an MVA method that has gained attention both outside of[12, 13] and within[14, 15, 16] the field of particle physics. Artificial neurons were first described by McCulloch and Pitts in 1943[17] and research into networks of such neurons soon followed. Since then, there has been continuous and ongoing research in this field with particular advances in the 90s and early 2000s. This makes ANNs interesting for the task of tau identification as well.

In this thesis, a modern survey of ANNs, their properties, and training methods is presented. The survey covers insights of the past 20 years into the theory of feed-forward neural networks. Based on this work, a novel, ANN-based tau identification is presented, which allows distinction between hadronically decaying tau leptons and particle jets from pure QCD processes at a center-of-mass energy of √s = 8 TeV. The signal sample of tau leptons is Monte-Carlo generated whereas the background sample has been extracted from a QCD di-jet selection of LHC data collected in 2012. This identification presents a first step of research into the application of deep learning to particle physics.

Several parameters of the ANNs and their effect on the signal identification efficiency are investigated. Furthermore, it is studied how the formation of ANN ensembles affects the identification performance and its stability with respect to the initial guess for the ANN's synapse weights. The ANNs' performance is compared to a state-of-the-art identification based on Boosted Decision Trees and achieves competitive results. Also, a score-flattening algorithm has been developed that is beneficial for physics analyses with no defined working point in terms of signal identification efficiency.

Chapter 2 gives a brief introduction to the theory of particle physics. The Standard Model, the Brout–Englert–Higgs mechanism, and the physics of tau leptons are described. Chapter 3 surveys the theory of ANNs and presents advances on the topics of training and generalization that have been made in the past 20 years. Chapter 4 presents the LHC particle collider and the multi-purpose particle detector ATLAS as the experimental framework. Chapter 5 gives an overview of the procedure of tau lepton reconstruction and identification. It also introduces the identification variables used by MVA methods. Chapter 6 presents an ANN-based approach to tau identification and the results of its optimization. The performances of different setups are compared and the score-flattening algorithm is presented. Chapter 7 gives a summary of this thesis and its results as well as an overview of possible further studies in this direction.


2 Theoretical Foundations

This chapter gives a brief summary of the theoretical concepts of modern particle physics. Most prominently, the Standard Model (SM) is introduced in Section 2.1. In Section 2.2, the electroweak model is described in more detail with particular attention to the Brout–Englert–Higgs mechanism. Finally, a list of processes involving tau leptons will be outlined in Section 2.3.

2.1 The Standard Model of Particle Physics

The Standard Model[18, 19, 20] currently is the most successful theory of particle physics. It has been verified many times and with great accuracy[5, 6, 9, 21]. The Standard Model is a relativistic quantum field theory (QFT), where matter particles are described as excitations of Dirac-spinor-valued quantum fields Ψ. These matter particles are: the charged leptons (e, µ, τ), the neutral leptons or neutrinos (νe, νµ, and ντ), and the quarks (u, d, c, s, t, and b).

The kinematics of these particles can be derived from a Lagrangian density L via the Euler–Lagrange equation. L depends on the quantum fields and their derivatives and is postulated to satisfy two distinct kinds of symmetries: first, the space-time symmetries, which are described by an invariance of L under transformations of the global Poincaré group; and secondly, the local gauge symmetries.

Local gauge symmetries arise as an invariance of L under transformations of a gauge group. For the Standard Model, this group is SU(3)C × SU(2)L × U(1)Y, where the indices C, L, and Y stand for color, left chirality, and weak hypercharge, respectively. Gauge invariance requires the introduction of boson fields, which can be interpreted as mediators of three of the four fundamental interactions¹.

1 Gravity, the fourth fundamental interaction, is not described by the Standard Model as it is expected to be far too weak to play any significant role in ordinary particle processes[1, ch. 2].

Requiring the invariance of L under SU(3)C transformations gives rise to eight gluons, the mediating particles of the strong interaction, which governs e.g. the physics inside protons and neutrons. The group SU(2)L × U(1)Y comes from the electroweak theory (cf. Section 2.2), which unifies the electromagnetic and weak interaction. Electromagnetism is mediated by the photon and governs most every-day phenomena (e.g. the physics of atoms and molecules). The weak interaction, mediated by the three massive bosons W± and Z, causes the β-decay of the neutron as well as a plethora of other particle decays.

According to Noether's theorem[22], each of the continuous symmetries satisfied by L leads to a conserved current. For the Poincaré group, 4-momentum and a generalized angular momentum are conserved. For the gauge symmetries, each current is related to a conserved charge associated with its respective interaction.

Finally, the Standard Model postulates a scalar complex field ϕ, which gives rise to particle masses as well as the scalar Higgs boson. The mechanism through which this occurs is explained in more detail in the following section.

2.2 The Electroweak Theory

The theory of the electroweak interaction is based on the work of Glashow, Salam, and Weinberg[18, 19, 20]. While Glashow's model initially could not account for the masses of the W and Z bosons, Salam and Weinberg introduced the Brout–Englert–Higgs mechanism and let both fermion and boson masses arise through spontaneous symmetry breaking. The Brout–Englert–Higgs mechanism had previously been developed independently by Robert Brout, François Englert, Peter Higgs, Gerald Guralnik, Carl Hagen and Tom Kibble[23, 24, 25].

In this model, the Lagrangian is invariant under local gauge transformations of the group SU(2)L × U(1)Y, which gives rise to the gauge fields $W^i$ ($i = 1, \dots, 3$), which only couple to the left-handed projection of fermion fields, and $B^0$, which couples to both left- and right-handed particles. The charges associated with these gauge symmetries are the weak isospin $\vec{T}$ and the weak hypercharge $Y_W$.

The model additionally postulates a Higgs field ϕ which is an isospin doublet $(\phi^+, \phi^0)$ of scalar complex fields. The potential term of this field, V(ϕ), is constructed in such a way that it has degenerate ground states with the same non-vanishing vacuum expectation value. Choosing any one of these ground states breaks the initial symmetry. This process leads to a mixing of the original gauge fields to new fields W±, Z, and A. At the same time, it causes three degrees of freedom introduced by the field ϕ to be absorbed by the fields W± and Z, resulting in mass terms for each of them. The fourth degree of freedom remains as a massive field H of the Higgs boson.

Electroweak symmetry breaking leaves an unbroken gauge symmetry described by the group U(1)Q. Its conserved charge is the electric charge $Q := T_3 - \frac{Y_W}{2}$, and its gauge boson A, the photon, remains massless.

Fermion masses may be introduced in a systematic manner by postulating Yukawa coupling terms between each fermion field and the Higgs field ϕ[7, ch. 11]. These Yukawa terms additionally lead to a coupling between each fermion and the Higgs boson H. The strength of each coupling is proportional to the mass of the participating fermion, which means that the Higgs boson is most likely to be found in processes involving heavy particles. Among others, the tau lepton is a promising candidate for such processes.

2.3 Physics With Tau Leptons

With a mass of (1776.82 ± 0.16) MeV[7], the tau lepton is the third-heaviest elementary fermion known, with only the top and bottom quarks being heavier. Its short lifetime of only (290.3 ± 0.5) × 10⁻¹⁵ s results in a proper decay length of about 87.0 µm. Due to this, most tau leptons decay before reaching the detector and may be detected only indirectly. (Cf. Page 53 for their decay channels.)
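The quoted decay length is simply the lifetime multiplied by the speed of light; as a quick cross-check (using c ≈ 2.998 × 10⁸ m/s):

$$c\tau \approx 2.998 \times 10^{8}\,\mathrm{m\,s^{-1}} \times 290.3 \times 10^{-15}\,\mathrm{s} \approx 8.70 \times 10^{-5}\,\mathrm{m} = 87.0\,\mathrm{\mu m}.$$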

Tau leptons are produced in several processes that are important to physics analyses. These processes include:

H → ττ: Due to the high tau mass, this channel is predicted to have a relatively high branching ratio of 6.32 %[26, p. 267]. This makes it attractive for the investigation of general Higgs boson properties (e.g. spin and CP quantum numbers). For the same reason, it is also the most accessible channel for studying the coupling of the Higgs boson to the charged leptons[27, 28]. Finally, H → ττ is used in physics analyses because the leading decay channel, H → bb, suffers from a high QCD multijet background[29].

Z → ττ: This channel is of high interest for two reasons: first, it complements the existing measurements for the electron and muon decay modes and thus allows verification of the Standard Model at very high energies; secondly, it gives an irreducible contribution to the H → ττ background and thus needs to be measured accurately in order to reduce uncertainties[30].

W → τντ: Similarly to Z → ττ, this channel is an important background process in searches for the Higgs boson or supersymmetry[31, 32] and complements existing measurements in the electron and muon decay mode[33].

Z′ → ττ: Some hypothetical extensions of the Standard Model, such as the Topcolor Assisted Technicolor model[34], predict a new heavy vector boson Z′ which couples preferentially to third-generation fermions, i.e. the top quark and the tau lepton. This makes searches for the channel Z′ → ττ a focus of ongoing research[35].

3 Artificial Neural Networks

Artificial neural networks (ANNs) are a computational model for machine learning inspired by biological neural networks as encountered in many animals' central nervous systems and, ultimately, the human brain[36, p. 30].

Research into modeling biological nervous systems began as early as 1943, when McCulloch and Pitts presented the Threshold Logic Unit (now also known as the McCulloch–Pitts neuron) as the first model of a formal neuron[17]. They proved that a network of such units could model any Boolean function.

The next step in the development of ANNs was the perceptron, invented by Frank Rosenblatt in 1957[37]. The perceptron can be viewed as a more general model of the McCulloch–Pitts neuron, accepting arbitrary real numbers as input and having fewer constraints on its adjustable parameters. In modern parlance, the perceptron is a single-layer feed-forward neural network with the Heaviside step function as its activation function.

Although the perceptron first seemed to give promising results, a combination of exaggerated expectations and the popularity of the highly critical book Perceptrons[38, 39] led to a sudden decline in interest and funding for neural network research throughout the 70s[40].

In 1986, the topic of neural networks gained traction again through the works of Rumelhart and McClelland[41, 42]. They popularized the method of back-propagation¹ and the usage of multi-layer neural networks to overcome the limitations of Rosenblatt's perceptron.

Partially spurred by the increase in computational power, neural networks gained popularity in the 90s. Many training algorithms were invented[46, 47], and optimization techniques studied in numerical mathematics were applied to the problem of ANN training[48, 49, 50].

1 Back-propagation was first invented in 1969 by Bryson and Ho[43], and independently by Werbos in 1974[44, 45].


Beginning in the early 2000s, both recurrent and deep neural networks gained popularity by winning international competitions in pattern recognition and machine learning in general[12, 13].

This chapter gives an overview of the theory of neural networks with a focus on the special case of binary classification tasks. It begins by explaining terms and concepts of binary classification and continues by describing the structure of an artificial neuron. After that, the possibilities and limitations of Rosenblatt's single-layer network, the perceptron, are explored; the discussion is then extended to multiple layers. Sections 3.6–3.9 are dedicated to exploring various concepts surrounding neural networks: training algorithms, error functions, and techniques to improve network generalization.

The final section closes with the introduction of neural network ensembles. This concept is highlighted here because of its importance to this thesis, as it has been a crucial step towards finding a stable method of classification.

3.1 Binary Classification and Training

This thesis is concerned with the task of distinguishing between jets from hadronically decaying tau leptons and jets from pure QCD processes (cf. Chapter 5). As the focus of this thesis lies on jets from tau decays, these will be called signal, while jets from QCD processes will be called background.

For the purpose of classification, these jets are represented as patterns, i.e. as vectors $x \in \mathbb{R}^n$ of n features of these jets. Features are various observables measured on each jet. (Cf. Section 5.2 for an overview of the observables used in this thesis.)

The classes that these patterns should be assigned to are encoded using real numbers. The class of patterns representing background jets, $C_\text{bkg}$, is encoded with 0, while the class of patterns representing signal jets, $C_\text{sig}$, is encoded with 1.

Using these definitions, the problem can be reformulated as follows:

Find a discriminant function $f : \mathbb{R}^n \to \{0, 1\}$ which assigns to each pattern $x \in \mathbb{R}^n$ a code that represents one of two classes $C_\text{sig}$ and $C_\text{bkg}$, depending on whether the pattern represents a signal or a background jet.

The concept of machine learning is to start out with a very general discriminant function f(x|w) which depends on a certain number of adaptive parameters w. In a process referred to as training, the parameters w are adjusted using an iterative algorithm and a data set $D \subset \mathbb{R}^n$ of already-labeled patterns in such a way that the discriminant function f misclassifies as few of the training patterns $x \in D$ as possible.


Another important concept of machine learning is generalization, which is described in more detail in Section 3.9. The easiest way to classify all training patterns correctly would be to remember each of the presented patterns and which class it belongs to. However, for all practical applications, it is necessary that the discriminant function also correctly classifies patterns not present in the data set D. To achieve that, the training algorithm must approximate the underlying model of the data set and ignore the statistical noise inevitably present in a data set of limited size.

3.2 Artificial Neurons

An artificial neuron is a simplified model of a biological neuron[36, p. 30]. On a mathematical level, an artificial neuron is represented as a vector-to-scalar mapping $y : \mathbb{R}^n \to \mathbb{R}$ with an additional set of n + 1 adjustable parameters. Of these parameters, n are the neuron's synapse weights and the remaining one is the neuron bias or threshold.

The function y describing the artificial neuron generally is a composition of two functions:

The combination function, which uses synapse weights and neuron bias to map the input vector x to a scalar z called the summed or net input;

The activation function, which maps the net input to the scalar y, the neuron activation or output. (The term transfer function is used as well.)

Figure 3.1 visualizes the flow of information from the input of a neuron to its output.

3.2.1 The Combination Function

In all cases relevant to this thesis, the combination function is the weighted sum:

$$z(x_1, \dots, x_n \,|\, b, w_1, \dots, w_n) := \sum_{i=1}^{n} w_i x_i + b, \qquad (3.1)$$

where the $x_i$ are the neuron input variables, $w_i$ are the synapse weights, b is the bias, and z is the net input.

Many authors[51, p. 118][36, p. 43] find it convenient to define an additional, constant input variable $x_0 := +1$ and redefine the bias² as the corresponding synapse weight $w_0 := b$.

2 Authors who call the bias a threshold often define $x_0 := -1$ instead.


Figure 3.1: Graphical representation of an artificial neuron. The combination function Σ performs a weighted sum of the inputs $x_1, \dots, x_n$ with the synapse weights $w_1, \dots, w_n$ and the bias b to produce the net input z. The net input is then passed through the activation function φ, which produces the activation y.

Then, by defining the (n + 1)-dimensional vectors

$$x := (x_0; x_1, \dots, x_n), \qquad w := (w_0; w_1, \dots, w_n),$$

the combination function becomes the inner product of these vectors:

$$z(x|w) := w^\top x. \qquad (3.2)$$

For clarity's sake, this thesis will not use this notation and will instead explicitly distinguish between synapse weights and biases. In sections where they are treated equivalently, this will be mentioned explicitly.

3.2.2 The Activation Function

The activation function maps a neuron's net input to its activation. With it, the full neuron function can be specified as:

$$y := \varphi(z) = \varphi(w^\top x + b), \qquad (3.3)$$

where y is the neuron activation, φ is the activation function, and x, w, and b are defined as above.
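As an illustration of (3.1) and (3.3), the following minimal Python sketch (not part of the thesis; names are chosen to mirror the notation above) evaluates a single artificial neuron with a logistic activation:

```python
import numpy as np

def logistic(z):
    """Logistic activation function, cf. (3.4)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b, activation=logistic):
    """Artificial neuron: weighted-sum combination (3.1)
    followed by an activation function (3.3)."""
    z = np.dot(w, x) + b      # net input
    return activation(z)      # neuron activation y

# Example with three inputs and arbitrary weights and bias.
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8, 0.1, -0.4])
print(neuron(x, w, b=0.2))
```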

There are many viable choices for an activation function, depending on the given problem. This section introduces only those relevant to this thesis.


Figure 3.2: Activation functions (solid lines) and their derivatives (dashed lines). (a) The logistic function. (b) The hyperbolic tangent.

The logistic function³ $\sigma : \mathbb{R} \to (0; 1)$ is defined as

$$\sigma(z) := \frac{1}{1 + e^{-z}}. \qquad (3.4)$$

Its derivative satisfies:

$$\sigma'(z) = \sigma(z)\,(1 - \sigma(z)). \qquad (3.5)$$

Both σ and σ′ are shown in Figure 3.2a. The logistic function is nonlinear, which is an important property for its usage in multi-layer networks. Subsection 3.5.2 describes the influence of this property in more detail. The relevance of the logistic function to this thesis is explained in Subsection 3.8.2.

Another common activation function is the hyperbolic tangent (cf. Figure 3.2b)

$$\tanh z := \frac{e^z - e^{-z}}{e^z + e^{-z}}, \qquad (3.6)$$

with the derivative:

$$\tanh' z = 1 - \tanh^2 z. \qquad (3.7)$$

It is related to the logistic function through

$$\sigma(z) = \frac{1}{2}\left(1 + \tanh\frac{z}{2}\right).$$

3 Because of its S-shaped graph, σ is also called a sigmoid function.


Therefore, for each neural network with tanh-activated neurons, there exists a network with logistically activated neurons and modified synapse weights and biases that performs the same mapping $y_\text{ANN} : \mathbb{R}^n \to \mathbb{R}^m$.

Despite this isomorphic relationship, there are applications where the hyperbolic tangent should be preferred over the logistic function due to numerical considerations. Section 3.7 will explore this matter in more detail.
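The identity underlying this equivalence is easy to verify numerically; a small sketch (assuming NumPy):

```python
import numpy as np

z = np.linspace(-5.0, 5.0, 101)
sigma = 1.0 / (1.0 + np.exp(-z))
# sigma(z) = (1 + tanh(z/2)) / 2, i.e. the two activations differ only by an
# affine rescaling that can be absorbed into weights and biases.
assert np.allclose(sigma, 0.5 * (1.0 + np.tanh(z / 2.0)))
```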

There are many other activation functions that are useful for other tasks. Linear and exponential activation functions may be used in regression tasks and the so-called softmax function is a good choice for multi-class classification tasks.

Finally, the Heaviside step function, defined as

$$\Theta(z) := \begin{cases} 0 & \text{if } z < 0 \\ 1 & \text{if } z \geq 0, \end{cases}$$

is of historical importance. It is the activation function used by the early neuron models by McCulloch, Pitts, and Rosenblatt. This function, however, is found undesirable for most modern training algorithms because its derivative does not exist at z = 0 and is zero everywhere else. Therefore, training algorithms cannot use the derivative to gain information about how to adjust the parameters of the neural network.

3.3 Feed-Forward Networks

One of the most widely used variants of neural networks are the feed-forward neural networks. Feed-forward means that the network – when understood as a directed graph – does not contain any cycles. The network's output then is a deterministic function of its input with no dependence on a time parameter, as would be the case if internal loops were allowed.

Feed-forward networks are typically constructed in such a way that their neurons are organized in layers, where each neuron receives its inputs only from the neurons of the preceding layer (cf. Figure 3.3). The first layer is called the input layer and consists of dummy neurons with no adjustable parameters; they do nothing but forward the network input to the neurons of the second layer. The final layer, called the output layer, consists of neurons whose outputs are not connected to other neurons. Instead, their activations are used as the whole network's output. All other layers are called hidden layers because both their inputs and outputs are only connected to other neurons and not to the outside. Input, output, and hidden neurons are consequently defined to be neurons which are part of the corresponding layers.

Figure 3.3: A typical feed-forward neural network. The three input neurons each receive and forward one input variable. The activation of the output layer neuron is the output of the whole neural network. Note that in each layer, all neurons receive input from the whole previous layer.

From Figure 3.3, it is clear that a network with N input neurons can be passed a pattern of N different variables. Then, by having each neuron calculate its respective output and pass it forward through the network, the ANN will produce an output of M different variables, where M is the number of output neurons. The network function

$$y_\text{ANN} : \mathbb{R}^N \to \mathbb{R}^M$$

of a neural network describes this mapping from an input pattern to the output neurons' activations.

3.4 Single-Layer Networks

In this thesis, the term single-layer network refers to feed-forward networks with a single layer of adjustable parameters⁴, i.e. without any hidden layers.

As discussed in Section 3.1, the two classes of signal and background jets can be encoded with just the two numbers 0 and 1. That means a single-layer network performing a binary classification task needs only one output variable and consequently only one output neuron. Using (3.3), the network function then simplifies to:

$$y_\text{ANN}(x|w; b) = \varphi(w^\top x + b), \qquad (3.8)$$

4 When counting the layers of a neural network, some authors include the input layer. Under this convention, a network as described in this section would be a two-layer network.

Figure 3.4: A visualization of the (linear) decision boundary created by a single-layer network. The task is to distinguish between signal and background patterns (squares and crosses, respectively). The vector w determines the orientation of the decision boundary while the bias b determines its distance from the point of origin.

where φ, w, x, and b are defined as in Section 3.2.

Because the network function $y_\text{ANN}$ maps $\mathbb{R}^n \to \mathbb{R}$ and all activation functions covered in this thesis are monotonically increasing, it is straightforward to define the discriminant function of a neural network as:

$$f(x|w; b) := \begin{cases} 0 & \text{if } y_\text{ANN}(x|w; b) < y_\text{thrs} \\ 1 & \text{if } y_\text{ANN}(x|w; b) \geq y_\text{thrs}. \end{cases} \qquad (3.9)$$

The quantity $y_\text{thrs}$ is an arbitrarily chosen threshold below which patterns are considered to be background. With a very high $y_\text{thrs}$, the probability of false positives is very low, but the probability of false negatives is very high. Vice versa, if $y_\text{thrs}$ is set to a low value, false positives are very likely and false negatives very unlikely. Hence, $y_\text{thrs}$ must be chosen depending on the given task.
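A direct transcription of (3.9); the default threshold shown here is only a placeholder, since the working point must be chosen per analysis:

```python
def discriminant(y_ann, y_thrs=0.5):
    """Map a network output to a class label, cf. (3.9): patterns with
    y_ann >= y_thrs are classified as signal (1), all others as background (0)."""
    return 1 if y_ann >= y_thrs else 0
```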

For the example case of N = 2 and Θ as the activation function, Figure 3.4 visualizes the network function $y_\text{ANN}$ in a 2D diagram. As can be seen, $y_\text{ANN}$ divides the x-space into two disjoint regions: one where $y_\text{ANN}(x|w_1, w_2, b) = 1$ and one where $y_\text{ANN}(x|w_1, w_2, b) = 0$. This is equivalent to assigning all patterns of one region to $C_\text{sig}$ and all patterns of the other region to $C_\text{bkg}$. Both regions are separated by a straight line, the decision boundary. Furthermore, the diagram shows that the synapse weights $(w_1, w_2)$ determine the orientation of the boundary line while the bias b determines its distance from the point of origin.

For N > 2, the decision boundary is an (N − 1)-dimensional hyperplane in $\mathbb{R}^N$.

Using a continuous activation function – like the logistic function σ(z) – will make the network's output change continuously between the two regions (with ∥w∥ determining the steepness of the transition), but will not change the shape of the decision boundary.

Consequently, single-layer networks perform best on classification problems where both classes can be separated with a single hyperplane. This type of problem is called linearly separable. An example of this type of problem is two Gaussian probability distributions with the same covariance matrix but different means[52].

3.5 Multi-Layer Networks

Feed-forward networks with at least one hidden layer are called multi-layer networks⁵. They may become arbitrarily complex since the only constraint on the number of neurons – given by the binary classification task – is that there be only one output neuron.

This section will give an overview of the properties and capabilities of multi-layer networks.

3.5.1 Notation

Let L be the number of adaptive layers in the neural network (i.e. excluding the input layer). Let further $n_i$ ($i = 0, \dots, L$) be the number of neurons in the i-th layer. Note that $n_0$ is the number of input neurons and, by extension, the number of features in the patterns $x \in \mathbb{R}^{n_0}$.

Let $w^{(k)}_{ij}$ be the synapse weight that connects the j-th neuron of the (k − 1)-th layer to the i-th neuron of the k-th layer. The indices iterate over the values $k = 1, \dots, L$, $i = 1, \dots, n_k$, and $j = 1, \dots, n_{k-1}$. Finally, let $b^{(k)}_i$ be the bias of the i-th neuron in the k-th layer.

5 Some authors use the term multi-layer perceptron (MLP), in reminiscence of Rosenblatt's perceptron[37].


The activation of the i-th neuron in the k-th layer is:

$$y^{(k)}_i := \varphi^{(k)}\left(\sum_{j=1}^{n_{k-1}} w^{(k)}_{ij}\, y^{(k-1)}_j + b^{(k)}_i\right), \qquad (3.10)$$

where $\varphi^{(k)}$ is the activation function of the neurons in the k-th layer⁶, and the $y^{(0)}_i := x_i$ are the input variables $x_i$.

The recursive formula (3.10) can be expressed more compactly by introducing the following vectors and matrices:

$$\mathbf{y}^{(k)} := \left(y^{(k)}_1, \dots, y^{(k)}_{n_k}\right)^\top \qquad (3.11a)$$

$$\mathbf{b}^{(k)} := \left(b^{(k)}_1, \dots, b^{(k)}_{n_k}\right)^\top \qquad (3.11b)$$

$$\boldsymbol{\phi}^{(k)}(\mathbf{z}) := \left(\varphi^{(k)}(z_1), \dots, \varphi^{(k)}(z_{n_k})\right)^\top \qquad (3.11c)$$

$$\mathbf{W}^{(k)} := \begin{pmatrix} w^{(k)}_{11} & \cdots & w^{(k)}_{1 n_{k-1}} \\ \vdots & \ddots & \vdots \\ w^{(k)}_{n_k 1} & \cdots & w^{(k)}_{n_k n_{k-1}} \end{pmatrix} \in \mathbb{R}^{n_k \times n_{k-1}}. \qquad (3.11d)$$

With them, (3.10) simplifies to:

$$\mathbf{y}^{(0)}(x) := x \qquad (3.12a)$$

$$\mathbf{y}^{(k)}(x) := \boldsymbol{\phi}^{(k)}\left(\mathbf{W}^{(k)}\, \mathbf{y}^{(k-1)}(x) + \mathbf{b}^{(k)}\right), \qquad (3.12b)$$

which is a series of simple matrix operations alternating with element-wise application of the activation function.
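A minimal sketch of the forward pass (3.12a)–(3.12b), assuming the weight matrices and bias vectors are stored as lists of NumPy arrays (one entry per adaptive layer) and that all layers share the same activation function; per-layer activations would be a straightforward extension:

```python
import numpy as np

def forward(x, weights, biases, activation=np.tanh):
    """Evaluate the network function: start from y^(0) = x (3.12a) and
    alternate the affine map with the element-wise activation (3.12b)."""
    y = x
    for W, b in zip(weights, biases):
        y = activation(W @ y + b)
    return y
```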

Using the fact that there is only one output neuron ($n_L = 1$), the network function can be written as:

$$y_\text{ANN}(x) := y^{(L)}_1(x), \qquad (3.13)$$

where the dependence of $y_\text{ANN}$ on the weights $\mathbf{W}^{(k)}$ and biases $\mathbf{b}^{(k)}$ ($k = 1, \dots, L$) has been suppressed for the sake of simplicity.

6 Different layers may have different activation functions. For example, Chapter 6 presents an ANN whose activation function is tanh in the hidden layer and σ in the output layer.


3.5.2 Nonlinear Activation Functions

As (3.12b) shows, each adaptive layer of a multi-layer network performs a linear transformation on the output of the previous layer and then applies its activation function element-wise. This has important implications for the choice of activation function.

Assume an L-layer network where the k-th hidden layer has an identity activation function⁷, $\varphi^{(k)}(z) = z$.

According to (3.12b), this layer performs a linear transformation on the previous layer's output $\mathbf{y}^{(k-1)}$:

$$\mathbf{y}^{(k)} = \mathbf{W}^{(k)}\, \mathbf{y}^{(k-1)}(x) + \mathbf{b}^{(k)}.$$

However, the (k + 1)-th layer also performs a linear transformation on $\mathbf{y}^{(k)}$ before applying its activation function. Because the composition of two linear transformations is a linear transformation as well, the k-th layer may be eliminated through a suitable transformation of the adjustable parameters in the (k + 1)-th layer.

The consequence of this property is that multi-layer networks can only gain additional capabilities beyond those of single-layer networks if at least one of their hidden layers has a nonlinear activation function. The discussion of the choice of activation function is continued in Section 3.7 for the hidden layers and in Subsection 3.8.2 for the output layer.

3.5.3 Decision Boundaries

As Section 3.4 has shown, a single artificial neuron can create a hyperplanar decision boundary. For multi-layer networks, the allowed complexity of the decision boundary grows with the number of adaptive layers. For example, it is easily shown that a three-layer network may approximate any desired decision boundary to arbitrary precision, and a two-layer network may still create any decision boundary that encloses a convex region in the pattern space[51, section 4.2f.].

More generally, the general approximation theorem[53, 54] states that with enough hidden neurons, two adaptive layers suffice to approximate any decision boundary to arbitrary precision.

7 The following argument can easily be generalized to any linear activation function $\varphi^{(k)}(z) = Az + B$ ($A, B \in \mathbb{R}$).


3.5.4 Deep Neural Networks

Although a two-layer network with sigmoid activation functions can represent any decision boundary (and, in fact, can represent any continuous mapping $\mathbb{R}^n \to \mathbb{R}^m$[55]), this representation may be inefficient in terms of the number of hidden neurons and, consequently, degrees of freedom. For example, Bengio[56] has presented a class of functions $f_{L,N} : \mathbb{R}^N \to \mathbb{R}$ which can be approximated by an L-layer network with M adaptive neurons, where M is polynomial in N. Approximating the same function with the same accuracy using an (L − 1)-layer network, on the other hand, would require M to be exponential in N.

Another, more informal argument is that deep neural networks may be trained in such a way that the first layer extracts features from the raw input data that facilitate classification for subsequent layers. Each layer would then reach a higher level of abstraction and make correlations between input variables visible that are nearly invisible in the raw data. However, this argument is of little relevance for the task of hadronic jet classification presented in this thesis. This is because the feature extraction and choice of input variables have been optimized by experimenters already and further optimization of the data representation will provide minor improvements at best.

3.6 Training Algorithms

The concept of training is crucial to artificial neural networks in particular and machine learning in general. Training a neural network means adjusting the parameters w of its network function $y_\text{ANN}(x|w)$ in such a way that it fits a certain target function t(x) as well as possible.

The similarity between $y_\text{ANN}$ and t is usually measured with an error function E(w). Training a neural network then reduces to the task of minimizing the error function with respect to w. In most cases, however, it is expensive or outright impossible to compute t(x) and only a set D of patterns $x_i \in D$ with the respective labels $t_i = t(x_i)$ is available.

Most if not all training algorithms for neural networks are iterative and proceed in a series of epochs τ. The basic scheme may be described as follows:

1. Divide the data set D into a training set S and a test set T with S ∩ T = ∅.

2. Initialize the vector w containing all synapse weights and biases of the ANN to some w(τ = 0).


3. Train the neural network for a number of epochs $N_\text{epochs}$, where τ is the index of the current epoch, using the following iterative procedure:

a) Forward propagation: Let some or all patterns $x_i \in S$ pass through the network to get $y_\text{ANN}(x_i|w(\tau))$.

b) Backward propagation: Calculate some measure of error δ, which may depend on the training algorithm. Let δ pass backwards through the network to get the vector of weight changes ∆w(τ).

c) Weight update: Perform w(τ + 1) = w(τ) + ∆w(τ).

d) Stop training if a certain, previously chosen stopping criterion is fulfilled.

4. Use the test set T to compute the error function E(w|T) and compare it to E(w|S) to get a measure of the network's ability to generalize on the given problem.

Note that sometimes the data set D is split into three disjoint subsets S, T, and V, where the validation set V is used for the stopping criterion (cf. Subsection 3.9.3).
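The scheme above can be condensed into a short sketch; `init_weights`, `compute_update`, `error`, and `stop_criterion` are hypothetical placeholders for the concrete choices discussed in the following subsections, and the weight vector is assumed to be a NumPy array:

```python
def train(network, train_set, test_set, n_epochs,
          init_weights, compute_update, error, stop_criterion):
    """Generic iterative training scheme: repeated forward/backward passes
    and weight updates, followed by a generalization estimate on the test set."""
    w = init_weights(network)                             # step 2
    for epoch in range(n_epochs):                         # step 3
        delta_w = compute_update(network, w, train_set)   # steps 3a and 3b
        w = w + delta_w                                   # step 3c
        if stop_criterion(network, w, train_set):         # step 3d
            break
    # Step 4: compare training and test error to judge generalization.
    return w, error(network, w, train_set), error(network, w, test_set)
```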

3.6.1 Back-Propagation

On a mathematical level, training algorithms iteratively minimize the error function E(w|S). The most straightforward approach to this is using the gradient ∇E. Computing the partial derivative of the error function with respect to a single weight is nontrivial for multi-layer networks due to their recursive nature. The back-propagation algorithm[43] provides an efficient way to compute this partial derivative even for deep neural networks⁸.

This section will only give the final equations of back-propagation. Their derivation is given thoroughly in Appendix B. Note that the notation is the same as given in Subsection 3.5.1.

Let the error function E(w|S) be the sum of per-pattern errors:

$$E(w|S) = \sum_{(x,t) \in S} e\left(t, y_\text{ANN}(x|w)\right). \qquad (3.14)$$

8 In other publications, the term back-propagation may refer to a training method that combines the algorithm described here with a simple gradient descent algorithm to find the minimum of the error function.


Then, the partial derivatives of e with respect to each synapse weight and each bias are:

$$\frac{\partial e}{\partial w^{(k)}_{ij}} = \delta^{(k)}_i\, y^{(k-1)}_j \qquad (3.15a)$$

$$\frac{\partial e}{\partial b^{(k)}_i} = \delta^{(k)}_i, \qquad (3.15b)$$

where the neuron deltas $\delta^{(k)}_i$ are auxiliary quantities and can be computed recursively:

$$\delta^{(k)}_i = \varphi^{(k)\prime}(z^{(k)}_i) \cdot \left(\sum_{j=1}^{n_{k+1}} \delta^{(k+1)}_j\, w^{(k+1)}_{ji}\right) \qquad (3.15c)$$

$$\delta^{(L)}_i = \varphi^{(L)\prime}(z^{(L)}_i) \cdot \frac{\partial e}{\partial y^{(L)}_i}. \qquad (3.15d)$$

The only unspecified term in these equations, $\partial e/\partial y^{(L)}_i$, is explored further in Section 3.8.
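A sketch of (3.15a)–(3.15d) for a network stored as lists of weight matrices and bias vectors (same conventions as in Subsection 3.5.1). It assumes a single shared activation function; `act_prime` is its derivative and `dE_dy` the derivative of the per-pattern error with respect to the output activations, both supplied by the caller:

```python
import numpy as np

def backprop(x, weights, biases, activation, act_prime, dE_dy):
    """Compute per-pattern gradients via back-propagation, cf. (3.15)."""
    # Forward pass: store the net inputs z^(k) and activations y^(k).
    ys, zs = [x], []
    for W, b in zip(weights, biases):
        z = W @ ys[-1] + b
        zs.append(z)
        ys.append(activation(z))
    # Output-layer deltas, cf. (3.15d).
    delta = act_prime(zs[-1]) * dE_dy(ys[-1])
    grads_w, grads_b = [], []
    for k in reversed(range(len(weights))):
        grads_w.insert(0, np.outer(delta, ys[k]))   # (3.15a)
        grads_b.insert(0, delta)                    # (3.15b)
        if k > 0:
            # Propagate the deltas one layer back, cf. (3.15c).
            delta = act_prime(zs[k - 1]) * (weights[k].T @ delta)
    return grads_w, grads_b
```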

3.6.2 Gradient Descent

With the ability to calculate the gradient ∇E of the error function $E : \mathbb{R}^W \to \mathbb{R}$ over the weight space, a straightforward approach to minimizing E is the method of gradient descent. Because −∇E(w) can be interpreted as an arrow pointing towards the steepest downwards slope, updating the adjustable parameters w of an ANN by:

$$w(\tau + 1) = w(\tau) - \alpha \nabla E(w(\tau)), \qquad (3.16)$$

will decrease E as long as the learning rate α is not too large.

Choosing the correct learning rate is both difficult and important. If α is too small, convergence will be very slow (cf. Figure 3.5a). If α is too large, E might increase after a weight update and even diverge over the course of the training (cf. Figure 3.5b). Because there is no algorithm to determine the optimal learning rate, the only way to determine it is by heuristics, i.e. by training several ANNs with different learning rates to see which one works best.

Besides the choice of α, gradient descent is troubled by several other problems:

• The error function of multi-layer networks is highly nonlinear and contains many local minima. Because gradient descent follows the steepest slope, it is prone to getting stuck in local minima and, thus, failing to find the global minimum. Furthermore, starting from different points in the weight space often results in finding different local minima, so the performance of the trained ANN will vary greatly, depending on the initial weights w(τ = 0).

Figure 3.5: The effects of a learning rate α that is either too small or too large. (a) If α is too small, convergence to a local minimum may be unacceptably slow. (b) If α is too large, gradient descent may step over a local minimum or even diverge.

• The error function of ANNs often contains plateaus, i.e. large non-minimal regions in weight space where the gradient is close to zero. Because ∆w ∝ ∇E, gradient descent will travel very slowly through such regions.

• The vanishing gradient problem[57] is the observation that in deep multi-layer networks, layers close to the input layer are almost unaffected by training. This is due to the fact that the $\delta^{(k)}_i$ of the back-propagation decrease exponentially with each layer that they are passed through. Consequently, this imposes a strict limit on the number of layers for gradient-descent-trained ANNs.

• In ill-conditioned problems, the curvature of E close to a minimum differs greatly along two different directions (cf. Figure 3.6). This results in a narrow, valley-like error surface in which gradient descent typically moves in a zig-zag pattern towards the minimum. This behavior slows down the convergence of the training process considerably.

Various ad hoc modifications to gradient descent have been developed to address these issues. While an extensive overview is presented by LeCun et al.[58], only a few of them are discussed within the context of this thesis.

Figure 3.6: In ill-conditioned problems, the gradient is nearly perpendicular to the direction of the local minimum w∗. As such, it takes gradient descent many weight updates to reach w∗. Depicted: contours of an error surface E(w) (gray ellipses), a local minimum w∗ (red square), and a series of consecutive weight updates ∆w(τ) (black arrows).

A method that addresses the problem of local minima is adding a momentum term to the algorithm. In essence, the gradient-descent definition of the update step

$$\Delta w(\tau) = -\alpha \nabla E(w(\tau)), \qquad (3.17)$$

is changed to:

$$\Delta w(\tau) = \mu \cdot \Delta w(\tau - 1) + (1 - \mu) \cdot \left[-\alpha \nabla E(w(\tau))\right], \qquad (3.18)$$

where µ ∈ [0; 1] is the momentum parameter. This additional term adds inertia to the motion through the weight space, making it resist sudden changes in the gradient. The momentum term also helps slightly in ill-conditioned problems as it reduces the risk of oscillatory behavior.
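A sketch of the update rules (3.16)–(3.18); `grad_E` stands for a user-supplied gradient function, e.g. built from the back-propagation sketch above:

```python
import numpy as np

def gradient_descent(w, grad_E, alpha=0.01, mu=0.9, n_epochs=1000):
    """Gradient descent (3.16) extended by a momentum term (3.18)."""
    delta_w = np.zeros_like(w)
    for _ in range(n_epochs):
        delta_w = mu * delta_w + (1.0 - mu) * (-alpha * grad_E(w))
        w = w + delta_w
    return w

# Example: minimize E(w) = ||w||^2 / 2, whose gradient is simply w.
print(gradient_descent(np.array([3.0, -2.0]), grad_E=lambda w: w))
```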

Incremental learning⁹ is a technique that can be applied if the data set is large and contains many redundant samples. The method described until now, often called batch learning, uses the gradient $\nabla_w E(w|S)$ to update the weights, effectively averaging over all per-pattern errors e(t, y). In incremental learning, the gradients $\nabla_w e(t, y(x|w))$ are used instead and the weights are updated after each pattern. Because these gradients typically are very different for different patterns x, incremental learning effectively adds stochastic noise to the motion of w(τ) through weight space. This stochastic noise allows the algorithm to escape bad local minima and has been shown to improve training results considerably[59].

9 Naming conventions in literature are inconsistent; this technique may also be called sequential or on-line learning, which in turn can also refer to categories of learning that are similar, but not identical to incremental learning.

3.6.3 RPROP

The RPROP algorithm (resilient back-propagation) was first published in 1992 by Riedmiller and Braun[47]. It is a response to the empirical finding that the gradient ∇E behaves erratically in multi-layer networks and that its magnitude is an unsuitable predictor for the step size $\Delta w^{(k)}_{ij}$. Instead, the RPROP algorithm determines $\Delta w^{(k)}_{ij}$ directly and for each weight $w^{(k)}_{ij}$ separately, based only on the sign of the derivative $\partial E/\partial w^{(k)}_{ij}$.

There are several variants of RPROP, four of which were classified and evaluated by Igel and Hüsken in 2000[60, 61]. This section will describe the iRprop− variant as introduced in [60, 62, 63].

First, let the weights and biases $w_i(\tau)$ be initialized to some values $w_i(0)$ (cf. Section 3.7). Note that the indices have been simplified for the sake of readability and i is assumed to enumerate all synapse weights and biases of the ANN. Further, initialize all weight- and epoch-specific step sizes $\Delta_i(\tau)$ to the same initial value $\Delta_0$.

Then, each epoch of training can be divided into three steps:

1. Calculate the gradient ∇E via back-propagation;

2. use ∇E to update the step sizes ∆i(τ);

3. use ∆i(τ) and the sign of ∂E/∂wi to update the weights wi(τ).

The first step has been sufficiently covered in Subsection 3.6.1. For the second step, one proceeds as follows: First, compare the sign of $\partial E/\partial w_i$ in this step to the sign in the previous step; if it has not changed, one can assume that the direction of the minimum has not changed either. In that case, increase the step size so that plateaus may be crossed quickly. If, however, the sign of $\partial E/\partial w_i$ has changed with respect to the previous epoch, it is likely that the previous update has stepped over the local minimum. In that case, the step size should be decreased so that the minimum may be approximated better. This results in the following formula:

$$\Delta_i(\tau) = \begin{cases} \eta^{+} \cdot \Delta_i(\tau - 1) & \text{if } \dfrac{\partial E}{\partial w_i}(\tau) \cdot \dfrac{\partial E}{\partial w_i}(\tau - 1) > 0 \\ \eta^{-} \cdot \Delta_i(\tau - 1) & \text{if } \dfrac{\partial E}{\partial w_i}(\tau) \cdot \dfrac{\partial E}{\partial w_i}(\tau - 1) < 0 \\ \Delta_i(\tau - 1) & \text{otherwise,} \end{cases} \qquad (3.19a)$$


where $\eta^-$ and $\eta^+$ are parameters of the training algorithm and $0 < \eta^- < 1 < \eta^+$. Two additional parameters are the minimal and maximal step size, $\Delta_\text{min}$ and $\Delta_\text{max}$.

An additional component of iRprop− is the weight update deferral. If the partial derivative changes its sign (and the step size is decreased), the weight update will be omitted in such an epoch. Instead, the algorithm ensures that the subsequent epoch will omit its step size update and instead update the weights using the decreased step size¹⁰.

10 In terms of implementation, this deferral of the weight update can be achieved by setting the variable holding $\frac{\partial E}{\partial w_i}(\tau)$ to zero. This will activate the third branch of both (3.19a) and (3.20).

Finally, the third step of each epoch is:

$$w_i(\tau + 1) = w_i(\tau) + \begin{cases} +\Delta_i(\tau) & \text{if } -\dfrac{\partial E}{\partial w_i}(\tau) > 0 \\ -\Delta_i(\tau) & \text{if } -\dfrac{\partial E}{\partial w_i}(\tau) < 0 \\ 0 & \text{otherwise.} \end{cases} \qquad (3.20)$$

A pseudocode implementation of the iRprop− algorithm is given in Listing 2 in the appendix.

The RPROP algorithm depends on five adjustable parameters in total. Their default values are $\eta^+ = 1.2$, $\eta^- = 0.5$, $\Delta_0 = 0.1$, $\Delta_\text{min} = 10^{-6}$, and $\Delta_\text{max} = 50.0$. One of the advantages of RPROP lies in the fact that its convergence behavior is largely independent of these parameters; the default values already provide good results on many problems[60].
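A sketch of one iRprop− epoch, combining (3.19a), the weight-update deferral, and (3.20); all quantities are NumPy arrays over a flattened weight vector, and the defaults are the values quoted above:

```python
import numpy as np

def irprop_minus_step(w, grad, prev_grad, step, eta_plus=1.2, eta_minus=0.5,
                      step_min=1e-6, step_max=50.0):
    """One epoch of iRprop-: adapt per-weight step sizes from the sign of the
    gradient (3.19a) and update the weights along -sign(grad) (3.20). Where
    the gradient changed sign, the stored gradient is zeroed, which both skips
    this weight update and defers the next step-size adaptation."""
    sign_change = grad * prev_grad
    step = np.where(sign_change > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(sign_change < 0, np.maximum(step * eta_minus, step_min), step)
    grad = np.where(sign_change < 0, 0.0, grad)
    w = w - np.sign(grad) * step
    return w, grad, step
```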

3.6.4 SARPROP

While RPROP and its variants speed up the convergence of training significantly, they still suffer from the problem of getting stuck in local minima far away from the global minimum. This problem is addressed by the SARPROP algorithm, which combines RPROP with the metaheuristic of simulated annealing[64].

Simulated annealing (SA) is a search technique designed to increase the probability of finding the global minimum[65, 66]. When applying SA, the minimization process is disturbed through a stochastic noise term which represents a thermodynamic behavior of the system. These disturbances may lead the algorithm away from a local minimum and thus allow for a more thorough search of the weight space. Annealing means that the influence of these stochastic perturbations is decreased in a controlled manner over time, so that convergence of the underlying training algorithm is maintained.


In SARPROP, the annealing process is realized through a temperature term:

$$T(\tau|\lambda) := 2^{-\lambda\tau}, \qquad (3.21)$$

where the cooling speed λ is a parameter of the algorithm and effectively determines after how many epochs the influence of the stochastic noise becomes negligible.

SARPROP adds two modifications to the RPROP algorithm: First, a noise term is added to the weights' step size $\Delta_i$ if the sign of $\partial E/\partial w_i$ has changed between two epochs and if the magnitude of $\Delta_i$ is below a certain, temperature-dependent limit. This measure leaves the RPROP convergence procedure undisturbed if it is far away from local minima. When approaching a bad local minimum, however, the noise term disturbs the motion through the weight space and allows the ANN to escape from the minimum.

Secondly, SARPROP regularizes the ANN using weight decay. Weight decay adds a term to the gradient ∇E:

$$\nabla E(w) := \nabla E(w) + \Omega(w), \qquad (3.22)$$

which favors small synapse weights over large ones (cf. Subsection 3.9.2). The weight decay term of SARPROP depends on the temperature to ensure that its influence vanishes after a sufficient number of epochs.

By penalizing large weights, the weight decay term makes sure that initially, the algorithm only focuses on very general features. Later, when the temperature T has dropped significantly, the influence of the weight decay term becomes negligible and the ANN is allowed to specialize and converge to a minimum of the original error function E(w).

The regulator term (cf. Subsection 3.9.2) used by SARPROP is:

$$\Omega(w) = k_1 T(\tau|\lambda) \cdot \frac{1}{2} \sum_i \log\left(w_i^2 + 1\right), \qquad (3.23)$$

which leads to an adjustment of the partial derivative:

$$\frac{\partial E}{\partial w_i} := \frac{\partial E}{\partial w_i} + k_1 T(\tau|\lambda) \cdot \frac{w_i}{w_i^2 + 1}, \qquad (3.24)$$

where $\partial E/\partial w_i$ is computed using back-propagation.

SARPROP adds a stochastic noise term $\Delta_\text{noise}$ to the step size $\Delta_i$ if the algorithm is close to a local minimum. This is indicated by the derivative $\partial E/\partial w_i$ changing its sign and by $\Delta_i$ being small compared to a certain limit $\Delta_\text{thrs}$. This limit is given by:

$$\Delta_\text{thrs} = k_2 \cdot T^2(\tau|\lambda), \qquad (3.25)$$

and the noise term by:

$$\Delta_\text{noise} = k_3 \cdot r \cdot T^2(\tau|\lambda), \qquad (3.26)$$

where r ∈ [0; 1] is a uniformly distributed random number.

Besides the cooling speed λ, the SARPROP algorithm depends on three further parameters $k_1$, $k_2$, and $k_3$. The value of $k_1$ determines the initial strength of the weight decay; $k_2$ decides how close to a local minimum w(τ) has to be before the noise term is added to the step size; and $k_3$ determines the initial size of the noise term. These constants have been determined to be largely problem-independent. The SARPROP authors recommend the default values $k_1 = 0.01$, $k_2 = 0.4$, and $k_3 = 0.8$.

A pseudocode implementation of the SARPROP algorithm is given in Listing 1 inthe appendix. Comments indicate differences to iRprop−.
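As a rough illustration of the update described above, the following is a minimal Python sketch of the temperature-dependent weight-decay and noise terms for a single weight. It is not the thesis's Listing 1; the function name sarprop_step, the argument names, and the η± defaults are chosen here purely for illustration.

    import math
    import random

    def sarprop_step(w, grad, prev_grad, delta, epoch,
                     lam=0.01, k1=0.01, k2=0.4, k3=0.8,
                     eta_plus=1.2, eta_minus=0.5,
                     delta_min=1e-6, delta_max=50.0):
        """One SARPROP-style update for a single weight (illustrative sketch)."""
        T = 2.0 ** (-lam * epoch)                      # temperature, eq. (3.21)
        grad = grad + k1 * T * w / (w * w + 1.0)       # weight-decay term, eq. (3.24)

        if prev_grad * grad > 0:                       # same sign: accelerate
            delta = min(delta * eta_plus, delta_max)
            w -= math.copysign(delta, grad)
        elif prev_grad * grad < 0:                     # sign change: near a minimum
            if delta < k2 * T * T:                     # below the limit of eq. (3.25)
                delta = delta * eta_minus + k3 * random.random() * T * T   # noise, eq. (3.26)
            else:
                delta = delta * eta_minus
            delta = max(delta, delta_min)
            grad = 0.0                                 # defer the weight update (iRprop- style)
        else:                                          # one of the gradients is zero
            w -= math.copysign(delta, grad) if grad != 0 else 0.0

        return w, grad, delta

In a full implementation, this update would be applied element-wise to the complete weight vector once per epoch.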

3.6.5 Broyden–Fletcher–Goldfarb–Shanno Algorithm

The BFGS algorithm was developed independently by Broyden, Fletcher, Goldfarb, and Shanno in 1970[48, 67, 68, 69]. It is a second-order minimization procedure, in contrast to the training algorithms presented so far. BFGS is a quasi-Newton method, i.e. it approximates the second-order information using first-order information collected over several training epochs.

First, consider the Taylor expansion of the error function E(w) around a point ŵ up to second order:

$E(\mathbf{w}) \approx E(\hat{\mathbf{w}}) + (\mathbf{w} - \hat{\mathbf{w}})^\top \mathbf{g} + \frac{1}{2}(\mathbf{w} - \hat{\mathbf{w}})^\top \mathbf{H}\,(\mathbf{w} - \hat{\mathbf{w}})$,  (3.27)

where g := ∇E(ŵ) is the gradient of E and

$(\mathbf{H})_{ij} := \left.\frac{\partial^2 E(\mathbf{w})}{\partial w_i\,\partial w_j}\right|_{\mathbf{w}=\hat{\mathbf{w}}}$  (3.28)

is the Hessian of E at the point ŵ. Using the requirement:

$\nabla E(\mathbf{w}^*) = \mathbf{g} + \mathbf{H}(\mathbf{w}^* - \hat{\mathbf{w}}) \overset{!}{=} 0$,  (3.29)


the extremum w∗ of this approximation is:

$\mathbf{w}^* = \hat{\mathbf{w}} - \mathbf{H}^{-1}\mathbf{g}$,  (3.30)

where −H⁻¹g is called the Newton direction or Newton step. By constructing such an approximation in every epoch and updating the weight vector w iteratively with the respective Newton step, one can hope to find a local minimum of E(w).

This approach has two disadvantages: First, computing and inverting the Hessian H is a computationally expensive operation, especially for ANNs with many synapse weights[51, p. 286]. Secondly, there is no guarantee that H is positive definite; taking a step along the Newton direction might actually increase the error and, in the worst case, the method may even converge to a saddle point or a local maximum.

Both problems are solved by quasi-Newton methods. These are derived from Newton's method by requiring that the Hessian be constant between two epochs τ and τ + 1:

$\mathbf{w}(\tau+1) - \mathbf{w}(\tau) = \mathbf{H}^{-1}\bigl(\mathbf{g}(\tau+1) - \mathbf{g}(\tau)\bigr)$,  (3.31)

which is called the secant condition. Quasi-Newton methods replace H⁻¹ with an approximate matrix G(τ) which also satisfies this condition. This matrix is built iteratively, starting out with an initial guess G(0) and approximating H⁻¹ with increasing accuracy.

For the BFGS algorithm, one of several quasi-Newton methods, the update rule for G is defined[67] as:

$\mathbf{G}(\tau+1) = \mathbf{G}(\tau) - \frac{\mathbf{d}\mathbf{v}^\top\mathbf{G}(\tau) + \mathbf{G}(\tau)\mathbf{v}\mathbf{d}^\top}{\mathbf{d}^\top\mathbf{v}} + \left(1 + \frac{\mathbf{v}^\top\mathbf{G}(\tau)\mathbf{v}}{\mathbf{d}^\top\mathbf{v}}\right)\frac{\mathbf{d}\mathbf{d}^\top}{\mathbf{d}^\top\mathbf{v}}$,  (3.32)

where d := w(τ + 1) − w(τ) and v := g(τ + 1) − g(τ). It can be shown that if G(τ) is positive definite, G(τ + 1) is positive definite as well.

This guarantees that the weight update:

$\mathbf{w}(\tau+1) = \mathbf{w}(\tau) - \alpha\,\mathbf{G}(\tau)\,\mathbf{g}(\tau)$,  (3.33)

which utilizes the Newton step (3.30), will decrease the error function E(w). The requirement is that the step size α be small enough that the quadratic approximation (3.27) holds.

An optimal step size α can be found through a line search procedure, i.e. an algorithm which minimizes the simplified function ε(α) := E(w − αGg). Any one-dimensional optimization algorithm may be used for line search, with a particularly efficient one being Wolfe's method[70].

A pseudocode implementation of the BFGS algorithm is given in Listing 3 in the appendix. Despite its apparent simplicity, BFGS is computationally more expensive than the previous algorithms due to the large number of matrix multiplications and network function evaluations. This is especially true for networks with a large number of weights and, by extension, a large Hessian matrix. Nocedal and Wright describe further implementation details[71, p. 200].
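To make the update rule (3.32) concrete, here is a minimal NumPy sketch of one quasi-Newton iteration. It is not the thesis's Listing 3; the simple backtracking search merely stands in for the Wolfe line search mentioned above, and the function names are chosen for this illustration.

    import numpy as np

    def bfgs_step(w, G, grad_fn, error_fn, alpha0=1.0, shrink=0.5, max_backtracks=20):
        """One BFGS iteration: quasi-Newton step with a backtracking line search,
        followed by the inverse-Hessian update of eq. (3.32)."""
        g = grad_fn(w)
        direction = -G @ g                       # quasi-Newton step, eq. (3.33)

        # Backtracking line search (stand-in for a Wolfe line search).
        alpha, e0 = alpha0, error_fn(w)
        for _ in range(max_backtracks):
            if error_fn(w + alpha * direction) < e0:
                break
            alpha *= shrink

        w_new = w + alpha * direction
        d = w_new - w                            # d := w(tau+1) - w(tau)
        v = grad_fn(w_new) - g                   # v := g(tau+1) - g(tau)

        dv = float(d @ v)
        if dv > 1e-12:                           # keep G positive definite
            Gv = G @ v
            G = (G
                 - (np.outer(d, Gv) + np.outer(Gv, d)) / dv
                 + (1.0 + float(v @ Gv) / dv) * np.outer(d, d) / dv)
        return w_new, G

    # Usage sketch on a toy quadratic error function:
    # A = np.diag([1.0, 10.0]); error_fn = lambda w: 0.5 * w @ A @ w
    # grad_fn = lambda w: A @ w
    # w, G = np.array([1.0, 1.0]), np.eye(2)
    # for _ in range(20):
    #     w, G = bfgs_step(w, G, grad_fn, error_fn)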

3.7 Weight Initialization

Training a neural network is the task of minimizing an error function E : ℝ^W → ℝ, where W is the total number of adjustable parameters in the network (synapse weights and biases) and ℝ^W is the weight space. The weight space is high-dimensional even for small networks (e.g. 2 adaptive layers, ∼ 10 neurons per layer). This makes finding the global minimum of any function over it nontrivial.

Furthermore, neural networks are highly symmetrical w.r.t. the synapse weights[72]. In a multi-layer network, one can switch any two hidden neurons within the same layer and get a different network with an identical network function. Similarly, if a neuron's activation function is anti-symmetric (as is the case for the hyperbolic tangent), one can flip the signs of all ingoing and outgoing synapse weights and get a different neural network with the same network function. For example, a two-layer network with M hidden neurons has a symmetry factor of M!·2^M for each point in weight space.

3.7.1 Random Weight Initialization

Training a multi-layer neural network is a difficult optimization problem with many local minima. Because of that, choosing a good starting point for the optimization algorithm can have a critical impact on the convergence speed and the quality of the minima found[58].

It is generally advisable to initialize a neural network's parameters using small random values, for two main reasons:

1. Training an ANN repeatedly with the same starting point w(0) results in finding the same local minimum, which might or might not be near the global minimum. Because there usually is no prior knowledge about the global minimum of the error function, there is no reason to prefer any starting point. Random initialization ensures that a diverse subset of the weight space is searched for local minima.

2. If all synapse weights and biases of an ANN are initialized with the same value (e.g. 1 or 0), any two neurons in the same layer will behave identically because they match in all parameters and receive the same input. Choosing random initial synapse weights breaks this intrinsic symmetry and allows the neurons of each layer to specialize on different features.

LeCun et al.[58] have given advice on the distribution from which to sample the initial weights. Assuming that the inputs to an individual neuron are uncorrelated and their distributions have zero mean and unit variance, the synapse weights and bias should be sampled from a Gaussian-distributed random variable with zero mean and a variance of:

$\sigma^2 = \frac{1}{m}$,  (3.34)

where m is the number of inputs to that neuron.
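As a minimal sketch of this prescription, the snippet below draws the weights and bias of one fully connected layer from a Gaussian with variance 1/m; the layer shape and seed are made-up example values.

    import numpy as np

    def init_layer(n_inputs, n_neurons, rng=None):
        """Initialize one fully connected layer with Gaussian weights of
        zero mean and variance 1/m, where m is the number of inputs (eq. 3.34)."""
        rng = rng or np.random.default_rng(42)
        sigma = 1.0 / np.sqrt(n_inputs)
        weights = rng.normal(loc=0.0, scale=sigma, size=(n_inputs, n_neurons))
        biases = rng.normal(loc=0.0, scale=sigma, size=n_neurons)
        return weights, biases

    # Example: a hidden layer with 12 standardized inputs and 10 neurons.
    W, b = init_layer(12, 10)
    print(W.std())   # should be close to 1/sqrt(12) ~ 0.29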

3.7.2 Activation Functions

The result in (3.34) can easily be generalized to multi-layer networks as long as each layer's input meets the requirements of having a mean close to zero and a variance close to one. In that case, each layer's parameters may be initialized separately with a different Gaussian distribution.

However, a problem arises if the hidden layers have a logistic activation function:

$\sigma(z) := \frac{1}{1 + e^{-z}}$,

because of its property that σ(0) = 0.5. If a logistically activated neuron's net input has a mean E[z] = 0, then its output has a mean of approximately 0.5 rather than 0. As a result, the input of all subsequent layers will not be standardized and the assumptions behind (3.34) no longer hold.

The easiest way to counter this effect is to use the hyperbolic tangent as the activation function in all hidden layers[58, section 4.4]. It is anti-symmetric and, as such, preserves the zero mean of its input. Its positive influence has been demonstrated experimentally numerous times[73, 74]. Section 6.9 verifies this for the task of tau identification.


3.8 Error Functions

Training a neural network means minimizing an error function E(w|S). There are several error functions that can be used, each with a different theoretical motivation. In this thesis, the two most commonly used error functions are discussed: the sum-of-squares error (SSE) and the cross-entropy (CE). Both error functions have the common property that they treat all training patterns equally:

$E(\mathbf{w}|S) = \sum_{(\mathbf{x},t)\in S} e\bigl(t, y_{\mathrm{ANN}}(\mathbf{x}|\mathbf{w})\bigr)$,  (3.35)

where S is the training set with patterns x and desired outputs t, e is a per-pattern error function, and yANN is the network function. This section discusses the theoretical aspects of error functions. Their efficacy when applied to tau identification is investigated in Section 6.8.

3.8.1 Sum-of-Squares Error

In statistical terms, an ANN is a model with parameters w ∈ ℝ^W, where W is the number of adjustable parameters in the network. Let S = {(xn, tn) | n = 1, . . . , N} be the training set, where the xn are the training patterns and tn ∈ {0, 1} are the respective desired outputs. Then, the likelihood L(w|S) is defined as:

L(w|S) := p(S|w), (3.36)

where p(S|w) is the model's probability of reproducing the training set with the parameters w. If each pattern in S is produced independently, the probability p(S|w) may be factorized:

$L(\mathbf{w}|S) := \prod_{n=1}^{N} p(t_n|\mathbf{x}_n, \mathbf{w})$.  (3.37)

The goal of training is to find the parameter set w∗ that maximizes the likelihood L(w∗|S) or, equivalently, minimizes its negative logarithm. This gives rise to the error function:

$E(\mathbf{w}|S) := -\sum_{n=1}^{N} \ln p(t_n|\mathbf{x}_n, \mathbf{w})$.  (3.38)


Assume now that p(tn|xn,w) follows a Gaussian distribution, i.e.:

$p(t_n|\mathbf{x}_n,\mathbf{w}) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{1}{2}\,\frac{\bigl(y_{\mathrm{ANN}}(\mathbf{x}_n|\mathbf{w}) - t_n\bigr)^2}{\sigma^2}\right)$.  (3.39)

Inserting this into (3.38) gives:

$E(\mathbf{w}|S) = \frac{N}{2}\ln\bigl(2\pi\sigma^2\bigr) + \frac{1}{2\sigma^2}\sum_{n=1}^{N}\bigl(y_{\mathrm{ANN}}(\mathbf{x}_n|\mathbf{w}) - t_n\bigr)^2$.  (3.40)

For the purpose of minimization in w, both the first term and the factor 1/σ² may be dropped. This gives the sum-of-squares error function:

$E_{\mathrm{SS}}(\mathbf{w}|S) = \frac{1}{2}\sum_{n=1}^{N}\bigl(y_{\mathrm{ANN}}(\mathbf{x}_n|\mathbf{w}) - t_n\bigr)^2$,  (3.41)

which is one of the most commonly used error functions for training neural networks. In its global minimum w∗, the following relation holds[75]:

yANN(x|w∗) = p(Csig|x), (3.42)

i.e. for each pattern vector x, the network function gives the Bayesian posterior probability p(Csig|x) that this pattern represents a signal jet.

3.8.2 Cross-Entropy

Assume again the case of a binary classification task, i.e. tn ∈ {0, 1} (n = 1, . . . , N). From the observations of the previous section, it is a sensible goal to find a parameter set w∗ such that yANN(x|w∗) approximates the Bayesian posterior probability p(Csig|x) that the pattern x represents a signal jet.

Using the fact that p(Csig|x) + p(Cbkg|x) = 1, one can write the likelihood:

$L(\mathbf{w}|S) = \prod_{\mathbf{x}\in S\cap C_{\mathrm{sig}}} p(C_{\mathrm{sig}}|\mathbf{x}) \prod_{\mathbf{x}\in S\cap C_{\mathrm{bkg}}} \bigl(1 - p(C_{\mathrm{sig}}|\mathbf{x})\bigr)$  (3.43)

$\phantom{L(\mathbf{w}|S)} = \prod_{\mathbf{x}\in S\cap C_{\mathrm{sig}}} y_{\mathrm{ANN}}(\mathbf{x}|\mathbf{w}) \prod_{\mathbf{x}\in S\cap C_{\mathrm{bkg}}} \bigl(1 - y_{\mathrm{ANN}}(\mathbf{x}|\mathbf{w})\bigr)$,  (3.44)


and again, its negative logarithm can then be used as an error function:

$E_{\mathrm{CE}}(\mathbf{w}|S) = -\sum_{\mathbf{x}\in S\cap C_{\mathrm{sig}}} \ln y_{\mathrm{ANN}}(\mathbf{x}|\mathbf{w}) - \sum_{\mathbf{x}\in S\cap C_{\mathrm{bkg}}} \ln\bigl(1 - y_{\mathrm{ANN}}(\mathbf{x}|\mathbf{w})\bigr)$.  (3.45)

This can be interpreted as the cross-entropy between the probability distribution generated by the ANN and the true distribution given by the labels tn. An alternative notation, which is also valid for continuous tn, is:

$E_{\mathrm{CE}}(\mathbf{w}|S) = -\sum_{n=1}^{N}\Bigl[t_n \ln y_{\mathrm{ANN}}(\mathbf{x}_n|\mathbf{w}) + (1 - t_n)\ln\bigl(1 - y_{\mathrm{ANN}}(\mathbf{x}_n|\mathbf{w})\bigr)\Bigr]$.  (3.46)

Similarly to ESS, ECE is minimal if yANN reproduces the Bayesian posterior probability p(Csig|x)[75].
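Both error functions are easy to evaluate directly; the following NumPy sketch computes (3.41) and (3.46) for a batch of network outputs and labels (the function names and the clipping constant are illustrative choices, not part of the thesis).

    import numpy as np

    def sum_of_squares_error(y, t):
        """Sum-of-squares error, eq. (3.41): 0.5 * sum_n (y_n - t_n)^2."""
        y, t = np.asarray(y, float), np.asarray(t, float)
        return 0.5 * np.sum((y - t) ** 2)

    def cross_entropy_error(y, t, eps=1e-12):
        """Cross-entropy error, eq. (3.46); outputs are clipped to avoid log(0)."""
        y = np.clip(np.asarray(y, float), eps, 1.0 - eps)
        t = np.asarray(t, float)
        return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

    # Example with three patterns (two signal, one background):
    y = [0.9, 0.6, 0.2]   # network outputs
    t = [1.0, 1.0, 0.0]   # desired outputs
    print(sum_of_squares_error(y, t), cross_entropy_error(y, t))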

3.8.3 Error and Activation Functions

In this section, the connection between the minimized error function and the output layer activation function is discussed. As described in Subsection 3.6.1, the neuron deltas of the output layer, defined in (3.15d), are the quantities from which all other derivatives are calculated:

$\delta_i^{(L)} = \varphi^{(L)\prime}\bigl(z_i^{(L)}\bigr)\cdot\frac{\partial e}{\partial y_i^{(L)}}$,

where i is the index of the output neuron, z_i^{(L)} is its net input, φ^{(L)′} is the derivative of its activation function, and ∂e/∂y_i^{(L)} is the derivative of the per-pattern error function.

For the sum-of-squares error (3.41), it is easy to see that:

$\frac{\partial e_{\mathrm{SS}}(t, y)}{\partial y} = y - t$.  (3.47)

Combining this with the identity activation function, i.e. setting φ(L)′ = 1, gives:

$\delta_i^{(L)} = y_i^{(L)} - t, \qquad t \in \{0, 1\},\; y_i^{(L)} \in \mathbb{R}$.  (3.48)

In contrast, using the logistic activation function σ results in:

$\delta_i^{(L)} = y_i^{(L)}\bigl(1 - y_i^{(L)}\bigr)\cdot\bigl(y_i^{(L)} - t\bigr), \qquad t \in \{0, 1\},\; y_i^{(L)} \in [0; 1]$.  (3.49)

Recall that δ_i^{(L)} should be large if y_i^{(L)} − t is large, i.e. if there is a large difference between the desired and the actual output. Equation (3.49) does not satisfy this requirement: for example, if t = 1 and y_i^{(L)} ≈ 0, then δ_i^{(L)} ≈ 0 as well. More generally, (3.49) vanishes both if y_i^{(L)} − t is close to zero and if it is very large.

For the cross-entropy (3.46), the derivative is:

$\frac{\partial e_{\mathrm{CE}}(t, y)}{\partial y} = \frac{y - t}{y(1 - y)}$,  (3.50)

and using the logistic activation function results in:

$\delta_i^{(L)} = y_i^{(L)} - t, \qquad t \in \{0, 1\},\; y_i^{(L)} \in [0; 1]$.  (3.51)

The identity activation function, on the other hand, gives:

$\delta_i^{(L)} = \frac{y_i^{(L)} - t}{y_i^{(L)}\bigl(1 - y_i^{(L)}\bigr)} = \begin{cases} \dfrac{1}{1 - y_i^{(L)}} & \text{if } t = 0 \\[1ex] -\dfrac{1}{y_i^{(L)}} & \text{if } t = 1 \end{cases}, \qquad y_i^{(L)} \in \mathbb{R}$.  (3.52)

This expression diverges for y_i^{(L)} → 1 − t. Additionally, it is non-zero for all finite values of y_i^{(L)}, which means that training an ANN will drive its output towards +∞ and −∞ for signal and background patterns, respectively.

In summary, it is numerically sensible to use the identity activation function in the output layer when minimizing ESS. Conversely, one should use the logistic activation function when minimizing ECE. Other combinations result in slow or poor training for the sum-of-squares error and in divergent behavior for the cross-entropy.
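A quick numerical check of the two recommended pairings (a sketch with illustrative names only): in both cases the output-layer delta reduces to y − t.

    import math

    def logistic(z):
        return 1.0 / (1.0 + math.exp(-z))

    def delta_sse_identity(z, t):
        """SSE with identity output activation: phi'(z) = 1, so delta = y - t, eq. (3.48)."""
        y = z
        return 1.0 * (y - t)

    def delta_ce_logistic(z, t):
        """Cross-entropy with logistic output activation: phi'(z) = y(1-y)
        cancels the denominator of eq. (3.50), leaving delta = y - t, eq. (3.51)."""
        y = logistic(z)
        de_dy = (y - t) / (y * (1.0 - y))   # eq. (3.50)
        dphi = y * (1.0 - y)                # derivative of the logistic function
        return dphi * de_dy

    z, t = 0.3, 1.0
    print(delta_sse_identity(z, t))   # z - t = -0.7
    print(delta_ce_logistic(z, t))    # logistic(z) - t ~ -0.43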

3.9 Generalization

One of the most crucial facets of machine learning models is the ability to generalize, i.e. to make correct predictions for patterns which have not been seen during training. This section discusses generalization and how training an ANN will influence its ability to generalize. The error function is first decomposed into three specific components in Subsection 3.9.1. The remainder of this section will then outline various methods to improve the generalization of an ANN.

3.9.1 The Bias–Variance Decomposition

Many error functions can be split into three distinct contributions: the bias, the variance, and an irreducible error. For the sum-of-squares error function, this decomposition has been proven and exemplified in the literature numerous times[76, 77]. In 2000, Domingos[78] extended its validity to a wide class of binary classification tasks. In Appendix C, the decomposition for the cross-entropy error is derived.

In statistical terms, the generalization of an ANN is its ability to predict the class of an arbitrary pattern x independently of the data set S that the ANN has been trained on. Thus, a measure of generalization is:

$\mathbb{E}_S\bigl[\mathbb{E}_t[e(t, y)]\bigr]$,  (3.53)

where e(t, y) is the per-pattern error function for the desired output t and the actual output y. E_t[·] is the expectation over t when treating t as a random variable with sample space Ω = {0, 1}. E_S[·] is the expectation over all possible training sets S of an arbitrary but fixed size.

The desired output t is treated as a random variable because in realistic applications, the classes Csig and Cbkg overlap. Thus, for a given pattern x in the overlap region, it is impossible to be perfectly certain whether x ∈ Csig or x ∈ Cbkg. Only the probabilities p(Csig|x) and p(Cbkg|x) can be specified.

As shown in Appendix C, the term (3.53) can be written as:

$\mathbb{E}_S\bigl[\mathbb{E}_t[e(t, y(\mathbf{x}, S))]\bigr] = H(p) + d_{\mathrm{B}}\bigl(p, \bar{y}(\mathbf{x})\bigr) + \mathbb{E}_S\bigl[d_{\mathrm{V}}\bigl(\bar{y}(\mathbf{x}), y(\mathbf{x}, S)\bigr)\bigr]$,  (3.54)

where p := p(Csig|x) is the optimal ANN output for a pattern x considering the statistical noise in t; y(x, S) is the actual output of an ANN trained on a training set S; and ȳ(x) := E_S[y(x, S)] is the expected ANN output when averaging over all possible S. The exact definition of the functions H, dB, and dV is given in Appendix C; they can be interpreted in the following way:

• The entropy H(p) does not depend on the ANN's performance and is an irreducible error. It stems from the fact that Csig and Cbkg overlap and there are patterns which cannot be identified with absolute certainty.

• dB(p, ȳ(x)) is the bias or underfitting error. It is a measure of the distance between the optimal network output p(Csig|x) and the expected output ȳ(x).

The bias becomes large if the classification error does not depend on the training set, i.e. when the neural networks are biased. Typically, this is the case if the ANNs do not have enough adjustable parameters to construct a decision boundary appropriate for the classification problem. (cf. Figure 3.7a) This can be alleviated by increasing the number of hidden layers or neurons.


Figure 3.7: A visualization of under- and overfitting. The task is to distinguish between signal and background patterns (squares and crosses respectively); the thick solid curve represents the decision boundary. (a) With too few adjustable parameters, an ANN is not able to resolve the structure of the optimal decision boundary; the error is almost independent of the used training set. (b) With too many adjustable parameters, an ANN will fit the noise in the training patterns instead of the underlying structure; the error greatly depends on the training set.

• E_S[dV(ȳ(x), y(x, S))] is the variance or overfitting error. It measures by how much an individual network deviates from the expected output on average.

The variance vanishes exactly if an ANN's output does not depend on the training set; it becomes large if an ANN focuses on the random noise in S instead of the underlying structure. This is typically the case if an ANN has too many adjustable parameters and is able to construct a decision boundary with very fine details. (cf. Figure 3.7b) Such an ANN is said to overfit the problem.

Consequently, for each classification task, there is an optimal point in neural network complexity. Finding the correct network complexity, and procedures to do so automatically, is the topic of the remainder of this section.

3.9.2 Weight Decay

Weight decay is a method to artificially reduce the complexity of a neural network and, thus, prevent overfitting. A theoretical motivation for weight decay has been given by Krogh and Hertz[79], while Gupta and Lam[80] as well as Hinton[81] have shown its experimental effectiveness.

Bartlett has shown[82] that overfitting in neural networks correlates with large absolute values of the synapse weights. An explanation is that such large values are needed to produce the high-curvature decision boundaries commonly associated with overtraining. (cf. Figure 3.7b)

To avoid such behavior, weight decay modifies the error function:

$\widetilde{E}(\mathbf{w}) := E(\mathbf{w}) + \nu\,\Omega(\mathbf{w})$,  (3.55)

where Ω is a regulator term which penalizes large weights and ν is a coefficient that determines the importance of weight decay during training.

The simplest and most commonly chosen regulator term is:

$\Omega(\mathbf{w}) = \frac{1}{2}\lVert\mathbf{w}\rVert^2$,  (3.56)

which gives rise to the modified gradient:

$\nabla\widetilde{E}(\mathbf{w}) = \nabla E(\mathbf{w}) + \nu\,\mathbf{w}$.  (3.57)

This changes the weight update rule (3.17) to:

$\mathbf{w}(\tau+1) = (1 - \alpha\nu)\cdot\mathbf{w}(\tau) - \alpha\,\nabla E(\mathbf{w}(\tau))$.  (3.58)

In the absence of a gradient, this rule exponentially decays synapse weights and biases to zero, which has given weight decay its name. Regularization adds an additional hyperparameter ν to the training algorithm; it controls the strength of weight decay's smoothing effect on the decision boundary and must be determined empirically.
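For illustration, a minimal NumPy sketch of the modified gradient-descent update (3.58); the zero gradient in the toy example is a made-up stand-in for a back-propagated gradient.

    import numpy as np

    def weight_decay_step(w, grad, alpha=0.1, nu=0.01):
        """Gradient-descent update with weight decay, eq. (3.58):
        w <- (1 - alpha*nu) * w - alpha * grad(E)."""
        return (1.0 - alpha * nu) * w - alpha * grad

    # Toy example: without a gradient, the weights decay exponentially towards zero.
    w = np.array([1.0, -2.0, 0.5])
    for _ in range(5):
        w = weight_decay_step(w, grad=np.zeros_like(w))
    print(w)   # each step multiplies the weights by (1 - alpha*nu) = 0.999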

3.9.3 Early Stopping

When training a neural network, it is often insightful to examine the value of the error function E both for the training set S and for an independent validation set V. Plotting both E(w(τ), S) and E(w(τ), V) over the training epoch τ will typically give curves similar to the ones shown in Figure 3.8.

The training error E(w(τ), S) decreases monotonically, whereas the validation error E(w(τ), V) has a minimum and begins increasing after a certain number of epochs. This behavior visualizes the bias–variance trade-off described in Subsection 3.9.1. While the bias typically decreases continuously, the training process gives yANN a dependence on S and thus increases the variance. The minimum of E(w(τ), V) is where the increase in variance overtakes the decrease in bias.

Figure 3.8: The typical behavior of the training set error and the validation set error during training, plotted over the number of passed epochs τ; the minimal validation error is marked.

The idea of early stopping is to stop the training process when E(w(τ), V) begins increasing and to keep those synapse weights and biases that resulted in the minimal validation error. Wang et al. showed for gradient descent training that early stopping effectively reduces the complexity of an ANN[83].

The difficulty of early stopping lies in determining when the validation error is minimal. It is a nontrivial problem because in many realistic cases, the curve of E(w(τ), V) is noisy and may have many local minima. Various stopping criteria and their performance under a noisy validation error were discussed by Prechelt in 1998[84].
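The following is a minimal sketch of an early-stopping training loop with a simple patience criterion; train_one_epoch and validation_error are placeholders for the actual training algorithm and error evaluation, and the patience rule is only one of the stopping criteria discussed by Prechelt.

    import copy

    def train_with_early_stopping(network, train_one_epoch, validation_error,
                                  max_epochs=1000, patience=20):
        """Train until the validation error has not improved for `patience` epochs;
        return the parameters that gave the minimal validation error."""
        best_error = float("inf")
        best_network = copy.deepcopy(network)
        epochs_since_best = 0

        for epoch in range(max_epochs):
            train_one_epoch(network)                 # e.g. one RPROP or BFGS epoch
            error = validation_error(network)
            if error < best_error:
                best_error = error
                best_network = copy.deepcopy(network)
                epochs_since_best = 0
            else:
                epochs_since_best += 1
                if epochs_since_best >= patience:    # validation error no longer improves
                    break
        return best_network, best_error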

3.9.4 Neural Network Ensembles

The bias–variance trade-off formula (3.54) is defined for an ensemble of ANNs where each network has the same architecture and has been trained by the same algorithm, but on different training sets S. A central quantity of this formula is ȳ(x) := E_S[y(x, S)], the expected network output over all possible training sets. Its distance from the optimal output p(Csig|x) gives a measure for the bias of the ANN ensemble. At the same time, its error is a lower bound on the average error of each ANN taken individually.

This suggests that one could train a group of ANNs so that each has a low bias and a high variance and use their network functions to approximate ȳ(x). Such a composite classifier, whose output for a pattern x is ȳ(x), might still have a low bias and additionally a decreased variance contribution.

These considerations lead to the concept of ANN ensembles or committees12. In an ANN ensemble, NC different neural networks are trained separately from each other. After training, their outputs for a pattern x are combined with an averaging function, the simplest of which is the arithmetic mean:

$y_{\mathrm{Ens}}(\mathbf{x}) = \frac{1}{N_C}\sum_{i=1}^{N_C} y_i(\mathbf{x})$,  (3.59)

where yi is the network function of the i-th ensemble member. Other options are the weighted average[85] or, for classification only, the majority vote[86].

Let e(t, y) be the per-pattern error function which is minimized during training, where t is the desired output for a pattern and y is the actual network output. If e is a convex function in y, it follows from the definition of convexity (Jensen's inequality) that:

$e\!\left(t, \frac{1}{N_C}\sum_{i=1}^{N_C} y_i\right) \le \frac{1}{N_C}\sum_{i=1}^{N_C} e(t, y_i)$,  (3.60)

i.e. the error of an arithmetic-mean ensemble is never above the average error of the ensemble members taken individually. Because both the sum-of-squares error and the cross-entropy are convex in y, this important result holds for both error functions.13

The regularizing effect of ensembles can be visualized as follows: an overfitted network will typically produce a decision boundary with high curvature in such a way that it correctly classifies patterns it has seen during training; patterns it has not seen before, in contrast, are likely to be misclassified. In a hypothetical ensemble of ANNs that averages over all possible training sets, each pattern has been seen during training by at least some members of the ensemble. Averaging over those ANNs which have seen a certain, critical pattern (and which will classify it correctly) and those ANNs which have not (and might misclassify it) results in an average value neither close to zero nor close to one; this effectively smoothens the decision boundary and, thus, reduces the variance.

An important consequence of this behavior is that correlation between the members of an ANN ensemble sets a natural limit to the effectiveness of this technique; if all members of an ensemble misclassify a pattern, the ensemble will do so as well. Such correlation can be caused not only by the member networks underfitting the problem, but also by a training set that is not representative of the whole pattern space.

Methods to create ANN ensembles are a topic of ongoing research[87, 88]. In most cases, ensembles are constructed by creating random subsets Si ⊆ S of the full training set S and training each ensemble member on one of them. For example, the bagging meta-algorithm creates these subsets by sampling with replacement[89]. It has been shown, however, that this procedure is approximately equivalent to creating smaller subsets by sampling without replacement[90].

12 While the term committee rarely appears in the literature, it is used in Bishop's popular textbook on neural networks[51, p. 364].

13 For the sum-of-squares error, this result can be easily confirmed by evaluating e(t, y)[85]. For the cross-entropy, one can proceed similarly and use the inequality of arithmetic and geometric means.

A different method of building an ensemble is to train several ANNs on the full training set S, but to initialize each of them with a different set of random synapse weights. This essentially means that each ANN minimizes the same error function, but starts at a different point in weight space and, thus, typically converges to a different local minimum. This method has been shown to yield results comparable to, though slightly worse than, bagging[91].
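As a sketch of the two construction methods just described, the snippet below builds an ensemble either from bootstrap resamples (bagging) or from different random initializations, and combines the members with the arithmetic mean (3.59). The train_network callable and the array inputs are assumptions for this illustration, not part of the thesis's implementation.

    import numpy as np

    def build_ensemble(train_network, X, t, n_members=10, bagging=True, rng=None):
        """Train n_members networks, each either on a bootstrap resample of the
        training set (bagging) or on the full set with a different random seed."""
        rng = rng or np.random.default_rng(0)
        members = []
        for i in range(n_members):
            if bagging:
                idx = rng.integers(0, len(X), size=len(X))   # sample with replacement
                members.append(train_network(X[idx], t[idx], seed=i))
            else:
                members.append(train_network(X, t, seed=i))  # different random init
        return members

    def ensemble_output(members, x):
        """Arithmetic-mean ensemble output, eq. (3.59); each member is assumed callable."""
        return np.mean([member(x) for member in members], axis=0)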


4 The LHC and the ATLAS Experiment

The Large Hadron Collider (LHC)[8] is currently the largest and most powerful particle accelerator in existence. It is located at the Franco-Swiss border near Geneva, Switzerland. It was built by CERN, the European Organization for Nuclear Research, from 1998 until 2008. It is designed for proton and ion collisions to test the Standard Model at the TeV scale and to search for new physics beyond the Standard Model.

After its ten-year-long construction phase, the LHC's first beam was launched in September 2008[93]. On September 19, 2008, however, a faulty electrical connection caused several magnets to quench, resulting in mechanical damage and the leakage of about 2 t of liquid helium[94]. This incident postponed the first run to November 2009[95]. From then on, the LHC was operating with only short breaks until February 2013[96]. During that time, the center-of-mass energy of the proton–proton collisions was increased continuously from √s = 0.9 TeV to eventually 8 TeV. Over the course of 2012, data was taken with a total integrated luminosity of 22.8 fb⁻¹. (cf. Figure 4.1)

From February 2013 until April 2015, the LHC was in the Long Shutdown I phase of scheduled maintenance and upgrades[97]. These upgrades enhanced the precision of the detectors and enabled the LHC to perform particle collisions at its design energy of √s = 14 TeV with a luminosity of L = 10³⁴ cm⁻² s⁻¹. On April 5, 2015, the first proton beams of Run II were injected into the LHC[98]. In June, data-taking was resumed at a center-of-mass energy of 13 TeV and is expected to continue for three years[99].

4.1 The Large Hadron Collider

The LHC is installed in the tunnel of its predecessor LEP, the Large Electron–Positron Collider[100]. The tunnel has a circumference of 26.7 km, lies between 45 m and 170 m below the surface, and has a slope of 1.4 % towards Lake Geneva[8].


Figure 4.1: Cumulative luminosity versus day in 2012, as delivered by the LHC (green), recorded by the ATLAS detector (yellow), and certified to be of good quality (blue), at √s = 8 TeV. In total, 22.8 fb⁻¹ were delivered, 21.3 fb⁻¹ recorded, and 20.3 fb⁻¹ certified as good for physics. The luminosity values represent the preliminary luminosity calibration[92].

The LHC is designed to search for rare processes at high energy scales. The event rate of a process is given by:

$\frac{\partial N}{\partial t} = L \cdot \sigma$,  (4.1)

where σ is the cross section of the process and L is an experiment-dependent parameter called the luminosity. The LHC is designed to provide proton–proton collisions with a peak center-of-mass energy of √s = 14 TeV and a luminosity of L = 10³⁴ cm⁻² s⁻¹.
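As a numerical illustration of (4.1) (the cross-section value is an arbitrary example, not a number from this thesis): at the design luminosity, a process with a cross section of 1 pb = 10⁻³⁶ cm² is produced at a rate of about one event per hundred seconds,

    \frac{\partial N}{\partial t}
      = L \cdot \sigma
      = 10^{34}\,\mathrm{cm^{-2}\,s^{-1}} \times 10^{-36}\,\mathrm{cm^{2}}
      = 10^{-2}\,\mathrm{s^{-1}}.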

Because the LHC collides same-sign charged particles, the protons must be separated into two beams with separate magnetic bending systems. Furthermore, the ring of the LHC is separated into eight straight sections which are connected by eight arcs. On the arcs, superconducting dipole magnets bend the proton beams to keep them on track. On four of the eight straight sections, the beam pipes cross and the beams are allowed to interact. These are the locations of the four main experiments at the LHC. Of the other four straight sections, one contains the beam dump mechanism, two contain the beam insertion points, and one houses the RF and feedback systems[8, ch. 2].

The proton beams are injected into the LHC via a chain of several preaccelerators. (cf. Figure 4.2) First, hydrogen atoms are ionized, leaving bare protons, which are then accelerated by the Linac2 to an energy of 50 MeV. After that, the protons are injected into the Proton Synchrotron Booster (PSB) and further accelerated to an energy of 1.4 GeV. The PSB is followed by the Proton Synchrotron (PS) and the Super Proton Synchrotron (SPS), with proton energies at ejection of 25 GeV and 450 GeV respectively. The SPS eventually injects the proton beams into the LHC in two separate rings with opposite directions, where they are accelerated to their final energy.

Figure 4.2: Schematic illustration of the CERN accelerator complex[101].

The four main experiments at the LHC are located at the beam crossing points. These are ATLAS[102] and CMS[103], two general-purpose detectors for physics at the TeV scale; LHCb[104], which focuses on bottom quark physics and possible CP violations; and ALICE[105], which studies heavy-ion collisions for research on strongly interacting matter at high energies. The three smaller experiments at the LHC are: LHCf[106], which is built on both sides of the ATLAS detector and measures particles in the very forward region of collisions; MoEDAL[107], which searches for magnetic monopoles and other exotic particles; and TOTEM[108], which aims to determine the total proton–proton cross section and to study elastic scattering as well as diffractive dissociation.


Figure 4.3: Cut-away view of the ATLAS detector and its subsystems[102, p. 4]. It has a height of 25 m and a length of 44 m, weighing approximately 7000 t in total.

4.2 The ATLAS Detector

The ATLAS detector is the largest experiment at the LHC and one of its two general-purpose detectors. Its purpose is to study Standard-Model physics at high energies and to search for physics beyond the Standard Model.

A diagram of ATLAS is shown in Figure 4.3. The detector is forward–backward symmetric w.r.t. the nominal interaction point of the proton beams. Its subsystems are arranged concentrically around the beam axis and generally divided into a cylindrical barrel region and two disk-shaped end-caps.

The ATLAS detector consists of three major components. The inner-most part is the tracking system, which allows precise measurements on charged particles as well as the reconstruction of primary and secondary decay vertices. Surrounding it is the calorimeter system, which measures particle energies. A muon spectrometer finally surrounds the inner components and is used to measure the momentum of outgoing muons with as high a precision as possible. In addition to these components, the ATLAS detector has three smaller detectors installed in the forward direction close to the beam pipe. A sophisticated trigger system filters the data produced by the detectors and enables experimenters to handle the enormous data flow of an estimated 1 PiB per second[109] (raw, uncompressed).


4.2.1 The ATLAS Coordinate System

The ATLAS coordinate system is right-handed with its point of origin being the nominal interaction point, i.e. the center of the detector[102]. The x-axis is defined to point towards the center of the LHC ring while the y-axis is defined to point upwards. Hence, the z-axis is parallel to the beam direction.

Due to the detector's design, many quantities are expressed in cylindrical coordinates. The azimuth ϕ is defined as usual and the polar angle θ is measured from the positive z-axis. The rapidity, an important concept of particle collider physics, is defined as:

$y := \frac{1}{2}\ln\frac{E + p_z}{E - p_z}$.  (4.2)

It should not be confused with the Cartesian coordinate y. If a particle's mass is negligibly small, the rapidity may be approximated by the pseudorapidity:

$\eta := -\ln\!\left(\tan\frac{\theta}{2}\right)$,  (4.3)

which is only a function of the polar angle θ. As a measure of angular distance between two objects, it is common to use the quantity ∆R. It is defined as:

$\Delta R := \sqrt{(\Delta\eta)^2 + (\Delta\phi)^2}$.  (4.4)

Finally, it must be considered that, while the longitudinal momentum of the initial protons' partons is indeterminate, the vector sum of the partons' transverse momenta is zero (assuming a vanishing beam-crossing angle). Thus, particular attention is paid to a number of transverse variables, which are defined as the projection of their original variables onto the xy-plane. Examples are the transverse momentum pT, the transverse energy ET, and the missing transverse energy E_T^miss.
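These coordinate definitions translate directly into code; the following short Python sketch (with illustrative helper names) evaluates the pseudorapidity (4.3) and the angular distance (4.4), taking care of the periodicity of ϕ.

    import math

    def pseudorapidity(theta):
        """Pseudorapidity eta = -ln(tan(theta/2)), eq. (4.3); theta in radians."""
        return -math.log(math.tan(0.5 * theta))

    def delta_r(eta1, phi1, eta2, phi2):
        """Angular distance Delta R = sqrt(d_eta^2 + d_phi^2), eq. (4.4),
        with d_phi wrapped into the interval (-pi, pi]."""
        d_eta = eta1 - eta2
        d_phi = (phi1 - phi2 + math.pi) % (2.0 * math.pi) - math.pi
        return math.hypot(d_eta, d_phi)

    # Example: a central object (theta = 90 degrees) has eta = 0.
    print(pseudorapidity(math.pi / 2))        # (close to) 0.0
    print(delta_r(0.0, 0.1, 0.15, -0.1))      # 0.25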

4.2.2 The Inner Detector

The inner-most component of the ATLAS detector is a tracking system of 7 m in length and 2.3 m in diameter, covering in total a range of |η| < 2.5. (cf. Figure 4.4) It consists of three main components: a pixel detector, a semiconductor tracker (SCT), and a transition radiation tracker (TRT). These components are described in more detail further below. In order to measure the charge and momentum of the particles traversing the inner detector, it is immersed in a magnetic field of approximately 2 T, generated by a surrounding solenoid. (cf. Figure 4.3)


Figure 4.4: The ATLAS inner detector[102, pp. 55–56]. (a) Cut-away view of the ATLAS inner detector and its subsystems. (b) Schematic view of the inner detector as a charged particle passes through (red line).

The purpose of the tracking system is to measure the tracks of charged particles with the highest possible precision. Using these tracks, primary and secondary vertices of interaction as well as particle momenta may be reconstructed. Doing so with high accuracy is vital to many applications, e.g. b-tagging of hadron jets or the identification of tau leptons. The accuracy of the tracking system has been evaluated in 2008[110].

The Pixel Detector

Of the three tracking subsystems, the pixel detector is the one closest to the beam pipe. With approximately 1000 particles emerging from the collision point during each bunch crossing, it is also the most affected by irradiation effects. This places strict constraints on its material and geometry as well as on its operating temperature and voltage[111].

Nonetheless, a high intrinsic accuracy can be reached. In the barrel region, it is 10 µm in the Rϕ-plane and 115 µm along the z-axis[102]. In the end-caps, the accuracy is 10 µm in the Rϕ-plane and 115 µm in R. A total of 1744 pixel modules are used in the pixel detector, with 47 232 pixels each. These sensors are arranged in three layers around the beam pipe and three end-cap disks on each side, covering the full range of |η| < 2.5.

If a pixel is hit by a charged particle, it creates an electron–hole pair. Electron and hole are then separated and collected by the pixel's electrodes, resulting in a measurable electrical current. The pixels add up to a total of about 80.4 million channels which need to be read out by the detector electronics.


The Semiconductor Tracker

The semiconductor tracker consists of 4088 modules (2112 in the barrel, 1976 in the end-caps) with approximately 6.3 million read-out channels. The modules are arranged in four cylindrical layers in the barrel region and nine disks on each end-cap, covering the whole fiducial range of |η| < 2.5 of the inner detector.

Each module consists of two layers of micro-strip sensors which are rotated against each other by an angle of 40 mrad[112]. This rotation allows precise detection of charged particles with an intrinsic accuracy of 17 µm in the Rϕ-plane and 580 µm along the z-axis (barrel region) and radially (in the end-caps).

The Transition Radiation Tracker

The transition radiation tracker is the outer-most component of the ATLAS tracking system. It consists of straw tubes which are 144 cm long in the barrel and 37 cm long in the end-caps, and it covers a range of |η| < 2.0. In the barrel region, the straw tubes are arranged in 73 layers parallel to the beam pipe; in the end-caps, they are arranged in 160 layers and positioned radially. As a result, each charged particle crosses a minimum of 22 straws.

The straw tubes have a diameter of about 4 mm. They are filled with a gas mixture of 70 % xenon, 27 % carbon dioxide, and 3 % oxygen, and each contains a gold-plated tungsten anode. The tube walls are coated with a 0.2-µm-thick aluminum film and act as a cathode. A high voltage of 1530 V between the electrodes allows electron avalanches to occur whenever a charged particle passes through a tube and ionizes the gas atoms in its path.

Additionally, the straw tubes are interleaved with polypropylene fibres (barrel) or foils (end-cap region). Due to its differing refractive index, polypropylene causes ultra-relativistic particles (with a high Lorentz factor γ) to emit transition radiation at its surface[113]. Although this radiation consists of low-energy photons, it is absorbed very efficiently by the xenon atoms in the straw tubes and hence yields a much larger signal amplitude than electrons from ionization tracks. This allows distinction between these two kinds of signals and, by extension, between light electrons (γ > 1000) and heavier charged particles, e.g. pions (γ < 1000).

Due to its design, the TRT can only locate charged particles in the Rϕ-plane, in which it has an intrinsic accuracy of 130 µm. Nonetheless, its data is highly important because the large number of crossed straw tubes per charged particle allows track-following and precise measurements of particle momenta. The output of all straw tubes adds up to a total of 351 000 read-out channels.

Figure 4.5: Cut-away view of the ATLAS calorimeter system[102, p. 8].

4.2.3 The Calorimeter System

The ATLAS calorimeter measures the location and energy of particles by absorbing their energy. The calorimeter stops all particles coming from the interaction point with two notable exceptions: the heavy, high-momentum muons, which are detected in the muon spectrometer (cf. Subsection 4.2.4); and the neutrinos, which generally leave the detector completely undetected. The energy of the latter may only be reconstructed as missing transverse energy E_T^miss, i.e. using the negative sum of all other particles' four-momenta.

For all other particles, it is important that the calorimeter contains their electromagnetic and hadronic showers as reliably as possible in order to measure their energy correctly and to avoid punch-through to the muon spectrometer.

The calorimeter system surrounds the inner detector (cf. Figure 4.5) and covers a region of |η| < 4.9. It is commonly divided into two subsystems: the electromagnetic calorimeter, which is designed to stop electrons and photons; and the hadronic calorimeter, which is specialized for strongly interacting particles. Both subsystems are described briefly below.


The Electromagnetic Calorimeter

The electromagnetic calorimeter (ECAL) covers a region of |η| < 1.475 in the barrel and 1.375 < |η| < 3.2 in the end-caps. The electromagnetic end-cap calorimeter (EMEC) is additionally divided at |η| = 2.5 into an outer and an inner wheel due to the different requirements in e.g. radiation hardness. The barrel and end-cap calorimeters overlap to allow a smooth transition between the two regions.

The barrel ECAL consists of 1024 steel-reinforced lead plates, which are bent into an accordion shape in order to provide coverage uniform in ϕ[102, section 5.2]. The gaps between the plates are filled with liquid argon (LAr) – which is used as the active material – and copper electrodes to read out signals. Steel and lead act as an absorber, in which photons and charged particles deposit their energy through an alternation of bremsstrahlung and pair production. This results in characteristic electromagnetic showers of secondary particles. These particles ionize the liquid argon and produce an electric signal in the electrodes which is proportional to the deposited energy.

The ECAL is divided into three radial layers of different granularity. The first layer is finely segmented in η and ϕ, providing precise measurement of these observables. The second layer is coarser and collects the majority of the particles' energy, while the third layer is the least precise and serves to contain the tails of the electromagnetic showers.

Additionally, a thin presampler detector is used in the region |η| < 1.8 to correct for the energy that photons and electrons lose in the inner detector. The presampler consists of an LAr layer and has a thickness of 1.1 cm and 0.5 cm in the barrel and end-cap region respectively.

The Hadronic Calorimeter

The hadronic calorimeter surrounds the electromagnetic calorimeter and serves the purpose of stopping and measuring the energy of strongly interacting particles. It consists of three subsystems:

The tile calorimeter is divided into a barrel and an extended-barrel component, which cover a region of |η| < 1.0 and 0.8 < |η| < 1.7 respectively. It consists of stainless-steel absorbers and plastic scintillators as the active medium. Hadronic particles traversing the steel plates interact with the medium and produce hadronic particle showers. The electrically charged fraction of these secondary particles stimulates the scintillator; the resulting scintillation light is collected by photomultiplier tubes and produces a signal proportional to the deposited energy.


Figure 4.6: Cut-away view of the ATLAS muon spectrometer system[102, p. 11].

The hadronic end-cap calorimeter (HEC) covers a region of 1.5 < |η| < 3.2. Because of the higher radiation levels, it has been designed as a liquid argon (LAr) detector with copper plates acting as the absorber material. In each end-cap, the HEC is divided into a front and a rear wheel, with the latter having a coarser granularity and thicker copper plates.

The forward calorimeter (FCAL) covers a region of 3.1 < |η| < 4.9 and is positioned coaxially inside the HEC. The FCAL is divided into three distinct LAr detectors. The first one contains copper absorbers and is optimized for measurements on electrons and photons. The second and third contain tungsten absorbers and primarily measure the energy of hadronic particles.

4.2.4 The Muon Spectrometer

Due to their low interaction cross section, most muons with pT > 5 GeV exit the calorimeter system without being stopped[102, ch. 6]. The muon spectrometer is the outer-most subsystem of the ATLAS detector and covers a region of |η| < 2.7. Its purpose is to detect these particles and to measure their momentum as accurately as possible.

Extending from a radius of about 5 m to 10 m and in the z-direction from 7.4 m to 21.5 m, the muon spectrometer defines the dimensions of the detector as a whole. Its large and open structure is necessary to minimize multiple-scattering effects. In its volume, a strong magnetic field of up to 1 T is generated by three superconducting toroidal air-core magnets – one in the barrel and two in the end-cap region. This magnetic field bends the trajectories of incoming charged particles and allows measurement of their charge and momentum.

The muon system is built from individual chambers, which are arranged in three coaxial layers in the barrel region and in three concentric wheels in the end-cap region. Four distinct types of detector chambers are operated, each serving a different purpose. (cf. Figure 4.6) Monitored drift tubes (MDT) deliver high-precision tracking information over the whole range of |η| < 2.7. Only in the inner-most end-cap wheels, in the range 2.0 < |η| < 2.7, are they replaced by cathode strip chambers (CSC), because the counting rates there would be intolerable for the MDTs.

In addition to these precision tracking chambers, a system of fast trigger chambers is necessary in order to provide information to the L1 trigger (cf. Subsection 4.2.6) within ∼ 20 ns. The muon trigger system consists of resistive plate chambers (RPC) in the region |η| < 1.05 and thin-gap chambers (TGC) in the region 1.05 < |η| < 2.4.

Additionally, as the MDTs can determine track coordinates only in the plane in which the track is bent by the magnetic field, the trigger chambers are used to measure the second, complementary coordinate. Potential ambiguities in the match-up of these two measurements are resolved by matching the muon track candidates with a track in the inner detector.

4.2.5 The Forward Detectors

In addition to the main ATLAS detector, three smaller detectors cover the very forward region, being positioned on both sides of the main detector at distances of 17 m, 140 m, and 240 m from the interaction point respectively[102, section 1.5].

LUCID (Luminosity Measurement Using Cerenkov Integrating Detector) is a relative-luminosity detector and is designed to detect inelastic proton–proton scattering. It is used for online monitoring of the luminosity and beam conditions.

The ZDC (Zero-Degree Calorimeter) has the primary purpose of detecting neutrons from heavy-ion collisions in the very forward region of |η| > 8.3. It was further used during the start-up phase of the ATLAS detector to improve its acceptance for diffractive processes.

ALFA (Absolute Luminosity for ATLAS) measures the elastic-scattering amplitude in the very forward region (|η| > 13.4). Via the optical theorem, this allows calculating the total cross section and, by extension, the absolute luminosity.


4.2.6 The Trigger System

With an event rate of 40 MHz and an average data volume of 1.3 MiB per event after compression[102, section 1.6], the ATLAS detector produces about 50 TiB of data per second. It is impossible to write data to disk at such high speeds; hence, a sophisticated Trigger and Data Acquisition (TDAQ) system is applied, which retains only the events most interesting to physics analyses.

The trigger system is separated into three levels: Level-1 (L1), Level-2 (L2), and the event filter. The L1 trigger is hardware-based and implemented with custom electronics. The L2 trigger and the event filter – which together form the High-Level Trigger (HLT) – are software-based and run on commercially available hardware[102, ch. 8].

The L1 trigger forms its decision on data with reduced granularity and from a subset of the detectors (the calorimeters, the TGCs, and the RPCs). It selects events with high-pT hits in the muon spectrometer, high-ET objects in the calorimeter, a large total transverse energy, or a large missing transverse energy. Furthermore, it allows pre-scaling, i.e. selecting only every n-th event of a category, where n ∈ [1; 2²⁴]. The L1 trigger has to make a decision within a frame of 2.5 µs and reduces the event rate to 75 kHz.

The L1 trigger retains information about the location and type of identified features in an event. If an event passes the trigger, this information is sent to the L2 trigger as regions of interest (RoI). The L2 trigger is then seeded by these RoIs and further filters events. To do so, it may access the detector data at full precision within the RoIs. It has a latency of about 40 ms and reduces the event rate further to 3.5 kHz.

The event filter reduces the event rate to its final level of 200 Hz, with a processing time of approximately 4 s per event. It is based on the standard off-line analysis algorithms and performs almost a full reconstruction of each event. It additionally tags all passing events based on the physics analyses to which they may be relevant. The events are then written to several data streams based on their tags.


5 Tau Reconstruction and Identification

For many physics analyses, it is of great importance to reconstruct and correctly identify tau leptons. (cf. Section 2.3) In this chapter, an overview is given of the state-of-the-art tau reconstruction and identification at the ATLAS detector. Particular attention is paid to the role of multi-variate analysis methods during this process.

Table 5.1 summarizes the branching ratios of the most common tau decay channels. Due to their high mass, tau leptons may decay both to leptonic and to hadronic final states, with branching ratios of 35.2 % and 64.8 % respectively. However, leptonic tau decays are not considered for tau reconstruction and identification because their decay products are nearly indistinguishable from prompt electrons and muons[114].

In the hadronic mode, tau leptons decay to a tau neutrino plus a number of charged and neutral pions. Decays involving K mesons are possible, but rare. Based on the number of charged particles in the final state, hadronic tau decays are classified into 1-prong and 3-prong decays.1 While decays into more than three charged particles are possible, their branching ratio is small and they are hence typically not considered for identification purposes.

The hadrons in the final state of tau decays are reconstructed as jets of particles. The challenge of tau reconstruction lies in the similarity between these jets from tau decays and gluon- or quark-initiated jets. The production cross section for these so-called QCD jets exceeds the cross section for tau leptons by two orders of magnitude[115]. This, together with the fact that tau reconstruction provides almost no rejection against this QCD background, is the reason why an additional identification step is performed after tau reconstruction. Also, while not used for this thesis, a dedicated tau trigger[116] exists that may be used to further reduce the multi-jet background.

1 Cf. Section 5.2 for an exact definition of 1- and 3-prong tau candidates.


Table 5.1: Decay channels and their branching ratios for leptonic and hadronic tau decays[7]. The row "others" includes both decays to more than 3 charged hadrons and decays involving at least one kaon.

Decay mode                              Branching ratio Γi/Γ

Leptonic decays
  τ± → e± + νe + ντ                     17.82 %
  τ± → µ± + νµ + ντ                     17.41 %
  total                                 35.23 %

Hadronic decays
  τ± → π± + ντ                          10.83 %
  τ± → π± + π0 + ντ                     25.52 %
  τ± → π± + 2π0 + ντ                     9.30 %
  τ± → π± + 3π0 + ντ                     1.05 %
  1-prong decays                        46.70 %
  τ± → 2π± + π∓ + ντ                     8.99 %
  τ± → 2π± + π∓ + π0 + ντ                2.70 %
  3-prong decays                        11.69 %
  others                                 6.38 %
  total                                 64.77 %

5.1 Reconstruction of Tau Leptons

During the reconstruction step, tau candidates are built and identification variables are calculated based on the information given by the tracking and calorimeter systems[116].

To do so, the energy deposits in individual calorimeter cells are first combined into topological clusters using a dedicated algorithm[117]. The calorimeter cells have been calibrated using the Local Hadron Calibration[118, 119]. The anti-kt algorithm[120] combines these clusters into jets; its distance parameter is R = 0.4. Jets which fulfill the requirements pT > 10 GeV and |η| < 2.5 serve as seeds for the dedicated tau reconstruction.

Using the seed jet, the four-momentum of the tau candidate is reconstructed from three degrees of freedom: pT, η, and ϕ; the mass m of the tau candidate (and thus of all constituents of the jet) is defined to be zero[114]. First, a barycenter is formed by summing up the four-momenta of all constituent topological clusters. Then, for the calculation of pT, η, and ϕ, only clusters with a distance ∆R < 0.2 to the barycenter are considered, where:

$\Delta R := \sqrt{(\Delta\eta)^2 + (\Delta\phi)^2} < 0.2$.  (5.1)

The variables η and ϕ are taken directly from the average over the considered clusters, whereas pT is calculated at the tau energy scale using a dedicated energy calibration scheme[116]. This calibration also corrects the η of the tau candidate by a few percent.2

Based on the seed jet and the intermediate tau axis given by pT, ϕ, and η, the Tau Jet Vertex Association (TJVA) algorithm[121] finds the most likely origin vertex for each jet by maximizing the tau jet vertex fraction:

$f_{\mathrm{TJVF}}(j|v) := \frac{\sum_{t|v} p_{\mathrm{T}}^{t|v}}{\sum_{t} p_{\mathrm{T}}^{t}}$,  (5.2)

where j designates the jet and v is the candidate tau vertex.3 The sum in the denominator iterates over all tracks t in the jet j while the sum in the numerator only goes over those tracks in j which have been matched to the vertex v. Additionally, all tracks t have to meet the following criteria:

• their distance to the intermediate axis is ∆R < 0.2 (core cone criterion);

• their transverse momentum is pT ≥ 1 GeV;

• they have a number of pixel hits Npixel ≥ 2; and

• the total number of pixel and SCT hits is Npixel+SCT ≥ 7.

Once a tau vertex has been chosen, a track association algorithm selects tracks for each jet. To be associated with a jet, a track must fulfill the same requirements as listed above plus the following two:

• the distance of closest approach to the tau vertex in the transverse plane is |d0| < 1.0 mm;

• the distance of closest approach to the tau vertex in the longitudinal direction is |z0 sin θ| < 1.5 mm.

The track association procedure may give small corrections to the intermediate tau axis depending on the chosen tau vertex.
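As a hedged sketch of the track selection criteria listed above, applied to a simple record per track (the Track structure and attribute names are invented for this illustration; the actual ATLAS software uses different data structures):

    from dataclasses import dataclass

    @dataclass
    class Track:
        delta_r: float        # distance to the intermediate tau axis
        pt: float             # transverse momentum in GeV
        n_pixel_hits: int
        n_sct_hits: int
        d0: float             # transverse impact parameter w.r.t. the tau vertex, in mm
        z0_sin_theta: float   # longitudinal impact parameter times sin(theta), in mm

    def passes_core_track_selection(trk: Track) -> bool:
        """Quality criteria used for the tau jet vertex fraction (5.2)."""
        return (trk.delta_r < 0.2 and trk.pt >= 1.0
                and trk.n_pixel_hits >= 2
                and trk.n_pixel_hits + trk.n_sct_hits >= 7)

    def is_associated_to_tau(trk: Track) -> bool:
        """Track association: the quality criteria plus the impact-parameter cuts."""
        return (passes_core_track_selection(trk)
                and abs(trk.d0) < 1.0
                and abs(trk.z0_sin_theta) < 1.5)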

Finally, the identification variables given in the following section are calculated. To decrease the influence of pile-up effects, only tracks in the core region (∆R < 0.2) are considered when calculating calorimeter-based variables. For tracking-based variables, tracks in the isolation annulus (0.2 < ∆R < 0.4) are considered as well. The variables based on the number and properties of neutral pions in the jet are calculated with the Neutral Pion Finder algorithm[123], which utilizes the multi-variate technique of Boosted Decision Trees.

2 The reason that this is necessary is an insufficient instrumentation of the calorimeter in the transition region between the electromagnetic barrel and end-cap. These regions tend to underestimate energies, which leads to a bias in the pseudorapidity η.

3 Candidate vertices are required to have at least four tracks[114]. More information on vertex reconstruction is given in [122].

5.2 Identification of Tau Leptons

The identification of hadronic tau decays is based on multi-variate analysis (MVA) methods. MVA methods exploit differences in the shape and kinematics of jets from tau decays and QCD jets to discriminate between them; additionally, they are more flexible than a cut-based approach, further increasing their efficiency. The current tau identification procedure pursues two independent approaches based on different MVA methods[114, 121]: one is based on Boosted Decision Trees (BDT) while the other one uses a projective Log-Likelihood (LLH) function. Furthermore, an identification based on artificial neural networks, a.k.a. multi-layer perceptrons (MLP), is explored in this thesis.

Common to all three approaches is that they need to be trained, i.e. adjusted to their task using a training sample of tau candidates. In order to ensure clean signatures, the following requirements need to be met by events in the training sample:

• at least one reconstructed primary vertex with at least four assigned tracks;

• at least one reconstructed tau candidate with pT > 15 GeV and |η| < 2.3;

• (for signal only) on generator level, the visible (i.e. ignoring the neutrino) true variables satisfy p_{T,vis}^{true} > 10 GeV and |η_{vis}^{true}| < 2.5;

• (for signal only) the distance between the reconstructed jet and the true jet on generator level must be ∆R < 0.2 (truth-matching);

• (for background only) the event was recorded at a time when the detector was fully operational and stable beam conditions were ensured.

After these cuts have been applied, tau candidates are categorized by their prongness:

1-prong tau candidates with one reconstructed track, each of which has been matched to a generated tau that decays into one charged hadron;

3-prong tau candidates with three reconstructed tracks which have been matched to a generated tau that decays into three charged hadrons;

multi-prong tau candidates with two or three reconstructed tracks which have been matched to a generated tau that decays into three charged hadrons.


In order to improve their performance, the MVA methods are applied to decays into one and three charged hadrons independently. While this increases computational complexity, it allows each instance to specialize on certain properties of the different decay modes (e.g. use different identification variables). In order to ensure clean signatures during training, the MVA method instance applied to multi-prong candidates is trained using only 3-prong candidates.

Tau identification exploits various differences in the shape and characteristics of the reconstructed jets in order to discriminate against QCD jets. For example, jets from tau decays have a characteristic distribution of energy between their constituents. (cf. Table 5.1) Furthermore, they tend to be narrower than QCD jets and contain fewer particles. The following variables are used in tau identification and make use of these distinctive features.


Figure 5.1: The pile-up-corrected core energy fraction (f^corr_core) for reconstructed 1-prong (left) and 3-prong (right) tau candidates. The distribution of the signal sample is hatched in red while for the background sample, it is blotted with black dots. These colors are used in all subsequent histograms of this chapter.

f^corr_core, the pile-up-corrected core energy fraction: The core energy fraction f_core is given as the ratio of the energy deposited in the central region (∆R < 0.1) and the energy deposited in the core cone (∆R < 0.2) around the intermediate tau axis:

    f_{\mathrm{core}} = \frac{\sum_{i}^{\Delta R_i < 0.1} E^{\mathrm{EM}}_{T,i}}{\sum_{j}^{\Delta R_j < 0.2} E^{\mathrm{EM}}_{T,j}} .    (5.3)

E^EM_{T,i} is the transverse energy which has been deposited in cell i, calibrated at the EM energy scale[124]. The indices i and j run over all calorimeter cells associated with the tau candidate which are within the specified radius.


The pile-up correction is given by:

    f^{\mathrm{corr}}_{\mathrm{core}} = \begin{cases} f_{\mathrm{core}} + 0.003 \cdot N_{\mathrm{vtx}} & \text{if } p_T < 80\,\mathrm{GeV} \\ f_{\mathrm{core}} & \text{otherwise} \end{cases} ,    (5.4)

where p_T is the tau candidate's transverse momentum (calibrated at the tau energy scale) and N_vtx is the number of reconstructed vertices with at least two associated tracks.

Figure 5.1 shows that jets from tau decays are narrower than QCD jets, hence depositing more energy close to the central axis. The value of f^corr_core may be greater than 1.0 due to the pile-up correction.
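To make the definition concrete, the following minimal C++ sketch evaluates Eqs. (5.3) and (5.4) from a list of cell records. The struct CaloCell and its fields are illustrative assumptions for this sketch only and are not part of the ATLAS software.

    #include <vector>

    // Hypothetical cell record: EM-scale transverse energy and distance to the intermediate tau axis.
    struct CaloCell {
        double etEM; // E^EM_T of the cell in GeV
        double dR;   // Delta R between the cell and the intermediate tau axis
    };

    // Core energy fraction, Eq. (5.3): energy within Delta R < 0.1 over energy within Delta R < 0.2.
    double fCore(const std::vector<CaloCell>& cells) {
        double central = 0.0, core = 0.0;
        for (const CaloCell& c : cells) {
            if (c.dR < 0.2) core += c.etEM;
            if (c.dR < 0.1) central += c.etEM;
        }
        return core > 0.0 ? central / core : 0.0;
    }

    // Pile-up correction, Eq. (5.4): only applied below 80 GeV.
    double fCoreCorr(double fcore, double tauPtGeV, int nVtx) {
        return tauPtGeV < 80.0 ? fcore + 0.003 * nVtx : fcore;
    }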


Figure 5.2: The pile-up-corrected leading track momentum fraction (f^corr_track) for reconstructed 1-prong (left) and 3-prong (right) tau candidates.

f^corr_track, the pile-up-corrected leading track momentum fraction: The leading track momentum fraction f_track is given as:

    f_{\mathrm{track}} = \frac{p^{\mathrm{leadtrk}}_T}{\sum_{j}^{\Delta R_j < 0.2} E^{\mathrm{EM}}_{T,j}} ,    (5.5)

where p^leadtrk_T is the highest transverse momentum of any single track in the tau candidate's core region ∆R < 0.1 and the denominator is the same as for f^corr_core. The pile-up correction is given by:

    f^{\mathrm{corr}}_{\mathrm{track}} = \begin{cases} f_{\mathrm{track}} + 0.003 \cdot N_{\mathrm{vtx}} & \text{if } p_T < 80\,\mathrm{GeV} \\ f_{\mathrm{track}} & \text{otherwise} \end{cases} .    (5.6)

Jets from tau decays typically consist of fewer decay products than QCD jets and so, their f_track is close to 1. As Figure 5.2 shows, this is only partially true for 1-prong decays due to the high branching ratio of decays with at least one neutral pion.


Figure 5.3: The track radius (Rtrack) for reconstructed 1-prong (left) and 3-prong (right) tau candidates.

Rtrack, the pT-weighted track radius: Rtrack is defined as:

    R_{\mathrm{track}} = \frac{\sum_{i}^{\Delta R_i < 0.4} p_{T,i} \cdot \Delta R_i}{\sum_{i}^{\Delta R_i < 0.4} p_{T,i}} .    (5.7)

The index i iterates over all tracks in the isolation cone (∆R < 0.4 w.r.t. the intermediate tau axis). Because of their narrow shower profile, jets from tau decays typically have a smaller Rtrack than QCD jets, as shown in Figure 5.3.
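As an illustration, Eq. (5.7) can be evaluated with a few lines of C++; the Track struct below is a hypothetical stand-in for the reconstructed track objects and is not part of the ATLAS software.

    #include <vector>

    struct Track {
        double pt; // transverse momentum in GeV
        double dR; // Delta R w.r.t. the intermediate tau axis
    };

    // pT-weighted track radius, Eq. (5.7), using all tracks with Delta R < 0.4.
    double rTrack(const std::vector<Track>& tracks) {
        double num = 0.0, den = 0.0;
        for (const Track& t : tracks) {
            if (t.dR < 0.4) {
                num += t.pt * t.dR;
                den += t.pt;
            }
        }
        return den > 0.0 ? num / den : 0.0;
    }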

Figure 5.4: (a) The maximum angular distance ∆Rmax and (b) the transverse flight path significance S^flight_T for reconstructed 3-prong tau candidates.

∆Rmax, the maximum angular distance: ∆Rmax is defined as the maximum distance between a track inside the core cone (∆R < 0.2) and the intermediate tau axis. Similarly to Rtrack, this variable uses the collimated nature of jets from tau decays and, as such, is highly correlated with Rtrack. Figure 5.4a shows the distributions in ∆Rmax of the signal and background samples.

S^flight_T, the transverse flight path significance: S^flight_T is only defined for multi-prong tau candidates. It is given as the decay length significance of the reconstructed secondary vertex in the transverse plane:

    S^{\mathrm{flight}}_T = \frac{L^{\mathrm{flight}}_T}{\delta L^{\mathrm{flight}}_T} ,    (5.8)

where L^flight_T is the reconstructed signed⁴ decay length and δL^flight_T is its estimated uncertainty.

The non-negligible lifetime of the tau lepton results in a typically positive decay length and hence in S^flight_T > 0. For QCD jets, on the other hand, the secondary vertex is often reconstructed very close to the primary vertex and S^flight_T ≈ 0. This is illustrated in Figure 5.4b.


Figure 5.5: The invariant mass (mtracks) for reconstructed 3-prong tau candidates.

mtracks, the invariant mass of the track system: For the calculation of the invariant mass, all tracks in the isolation cone (∆R < 0.4) are considered. For jets from tau decays, this quantity is expected to have a narrow peak below the tau mass because neutrinos and neutral pions are not visible as tracks and thus do not contribute to mtracks. For QCD jets, on the other hand, this measure is expected to have a wide distribution due to QCD jet fragmentation. These expectations are verified by Figure 5.5.

4 The sign of L^flight_T is defined to be positive (negative) if the secondary vertex is reconstructed downstream (upstream) of the primary vertex, i.e. if the dot product of the jet direction and the impact parameter is positive (negative).

Figure 5.6: (a) The leading track impact parameter significance S^IP_leadtrk and (b) the number of isolation tracks N^iso_track for reconstructed 1-prong tau candidates.

S^IP_leadtrk, the significance of the leading track's IP: This variable may be calculated as:

    S^{\mathrm{IP}}_{\mathrm{leadtrk}} = \frac{d_0}{\delta d_0} ,    (5.9)

for the highest-pT track in the core region (∆R < 0.2). The impact parameter d0 is the transverse distance of closest approach of the track to the tau vertex; δd0 is its estimated uncertainty. As Figure 5.6a shows, S^IP_leadtrk tends to smaller absolute values for QCD jets, whereas its absolute value is larger for jets from tau decays.

N^iso_track, the number of tracks in the isolation annulus: The isolation annulus is defined as the region 0.2 < ∆R < 0.4 w.r.t. the intermediate tau axis. Due to their low track multiplicity and narrow shape, jets from tau decays have a lower N^iso_track than QCD jets, as Figure 5.6b shows.

Nπ0, the number of reconstructed neutral pions: This number only includes pions reconstructed in the core region (∆R < 0.2). Figure 5.7 shows the distributions of signal and background samples. Its separating power is particularly high for 3-prong tau candidates because of the low branching ratio of the channel τ± → π±π+π−π0.

m^vis_τ, the visible mass of the tau candidate: This is the invariant mass of all tracks and neutral pions in the core region (∆R < 0.2). Its distribution, as shown in Figure 5.8, is similar to that of mtracks (cf. Figure 5.5). Despite the similarities to mtracks, however, m^vis_τ is almost uncorrelated to all other identification variables and so, one may expect an MVA method to gain additional information from it.



Figure 5.7: The number of neutral pions (Nπ0) for reconstructed 1-prong (left) and 3-prong (right) tau candidates.


Figure 5.8: Visible tau candidate mass (m^vis_τ) for reconstructed 1-prong (left) and 3-prong (right) tau candidates.

f_vis−pT, the visible fraction of transverse momentum: This variable is given by:

    f_{\mathrm{vis\text{-}}p_T} = \frac{\sum_{i}^{\Delta R_i < 0.2} p_{T,i}}{p^{\tau}_T} ,    (5.10)

where p^τ_T is the total transverse momentum of the tau candidate. The index i iterates over tracks as well as neutral pions reconstructed by the Neutral Pion Finder[123]. As Figure 5.9 shows, the value of this variable is close to one for jets from tau decays; for QCD jets, it tends to smaller values due to additional neutral hadrons not reconstructed by the Neutral Pion Finder.

Table 5.2 summarizes which of these variables are used when training the identification algorithm for 1-prong and multi-prong tau candidates respectively.



Figure 5.9: Visible fraction of transverse momentum (f_vis−pT) for reconstructed 1-prong (left) and 3-prong (right) tau candidates.

Table 5.2: Identification variables used by the MVA methods, depending on the prongness of the training sample.

Training on 1-prong tau candidates:  f^corr_core, f^corr_track, Rtrack, S^IP_leadtrk, N^iso_track, Nπ0, m^vis_τ, f_vis−pT
Training on 3-prong tau candidates:  f^corr_core, f^corr_track, Rtrack, ∆Rmax, S^flight_T, mtracks, Nπ0, m^vis_τ, f_vis−pT

The identification algorithm assigns to each candidate a tau score, a number typically in the interval [0; 1]. A score close to 1 indicates that a candidate likely is a true tau lepton, whereas a score close to 0 is given to candidates likely to be QCD jets. Depending on the desired purity and amount of data, experimenters may require a certain minimum score for a tau candidate to pass identification.

In addition to the discrimination against QCD jets, dedicated algorithms exist that discriminate against background from channels such as Z → e+e− and Z → µ+µ−. These electron veto and muon veto algorithms are described in more detail in [114].


6 Optimization of ANN-Based Tau Identification

When applying artificial neural networks to the problem of identification of jets from hadronically decaying tau leptons, there are several issues to be considered and hyperparameters must be set, which affect the outcome of training. Among other things, one must choose a network topology fit for the given problem and both an error function and a minimization algorithm for training.

In Sections 6.1 and 6.2, the samples used and the evaluation procedure are defined. In Sections 6.3 through 6.6, several common obstacles encountered when applying ANNs to tau identification are discussed. In Sections 6.7 through 6.10, hyperparameters are optimized and comparisons for various choices are presented. Section 6.11 summarizes the results of this optimization.

The analysis has been carried out using the ROOT framework, version 5.34[125]. For neural networks, the MLP class of TMVA v4.2.0 has been used[126]. As TMVA does not provide the SARPROP training algorithm for neural networks, the source code has been modified to implement it. The implementation follows the pseudocode provided in Listing 1 in the appendix. Finally, the evaluation as described in Section 6.2 is based on work by Stefanie Hanisch[127]. It has been implemented using the SFrame framework[128].

6.1 Signal and Background Samples

The data set used for training and evaluation of the MVA methods in this thesis has been constructed from several sources. Table 6.1 gives an overview of the subsamples which have been used and their respective sizes.

The signal samples have been generated by Monte Carlo simulations. Event generation has been carried out using PYTHIA8[129] while for the detector simulation, Geant4[130] and the ATHENA framework[131, section 3.3] have been used.

Table 6.1: ID and size of each sample used for training and evaluation of multi-variate analysis methods. The index of Z′ denotes its mass in GeV.

                                 Training/Validation set          Test set
Process            Sample ID     1-prong     3-prong     1-prong     multi-prong
Signal samples
W → τντ            147812        24 065      7959        32 116      14 382
Z → τ+τ−           147818        26 684      7714        35 598      13 878
Z′250 → τ+τ−       170201        24 137      6718        32 237      11 069
Z′500 → τ+τ−       170202        24 713      6398        32 605      10 725
Z′1000 → τ+τ−      170204        25 239      5324        33 484      10 588
Total                            124 838     34 113      166 040     60 642
Background sample
QCD dijets         PeriodD       66 887      86 165      89 913      259 552

The events for the signal sample primarily stem from the processes W → τντ and Z → τ+τ−. In order to enhance statistics in the high-pT region, tau candidates from the decay Z′ → τ+τ− of a hypothetical additional gauge boson have been added. Three different masses have been assumed for Z′: 250 GeV, 500 GeV, and 1000 GeV. The sizes of these 5 subsamples have been chosen so that their contributions to the whole signal sample are approximately equal (cf. Table 6.1).

The background tau candidates have been sampled from a QCD di-jet selection[132] of LHC data collected in the PeriodD run from July 24 to August 23, 2012 with a center-of-mass energy of √s = 8 TeV[114]. The background sample contains jets from tau decays only with a fraction of ∼ 10⁻². This is because the cross section for production of quark- or gluon-initiated jets is larger than the cross section for tau lepton production by two orders of magnitude (cf. Chapter 5). The size of the background sample has been chosen to be of the same order of magnitude as the signal sample.

6.2 Evaluation of Identification Algorithms

In order to evaluate and compare different approaches to ANN-based tau identification, one needs a labeled data set which is statistically independent of the training set as well as a figure of merit. Furthermore, the threshold on the tau score, which ultimately determines which tau candidates are considered signal and which background (first defined in Equation 3.9), needs to be taken into account. Thus, this section is dedicated to describing the framework of evaluation used in this thesis.


6.2.1 Figures of Merit

To accommodate the needs of physics analyses, the selection criteria on the test set are less strict than on the training set. This means that, while multi-prong events with two reconstructed tracks were excluded from the training (cf. Section 5.2), they are included in the evaluation. Hence, the classifier which has been trained on 3-prong events is applied to all multi-prong events.

In order to measure the performance of an identification algorithm, it is vital to define the signal efficiency:

    \varepsilon^{\mathrm{1(multi)\text{-}prong}}_{\mathrm{sig}} = \frac{N^{\mathrm{1(multi)\text{-}prong}}_{\mathrm{sig,pass}}}{N^{\mathrm{1(3)\text{-}prong}}_{\mathrm{sig,gen}}} .    (6.1)

N^{1(multi)-prong}_{sig,pass} is the number of successfully reconstructed signal sample tau candidates (1- or multi-prong) which passed identification¹, and N^{1(3)-prong}_{sig,gen} is the number of generated² signal tau candidates (1- or 3-prong). Similarly, the background efficiency is defined as:

    \varepsilon^{\mathrm{1(multi)\text{-}prong}}_{\mathrm{bkg}} = \frac{N^{\mathrm{1(multi)\text{-}prong}}_{\mathrm{bkg,pass}}}{N^{\mathrm{1(multi)\text{-}prong}}_{\mathrm{bkg,rec}}} ,    (6.2)

where N^{1(multi)-prong}_{bkg,pass} is the number of reconstructed tau candidates (1- or multi-prong) from the background samples which passed identification, and N^{1(multi)-prong}_{bkg,rec} is the number of successfully reconstructed background tau candidates. (Note that the background samples are not MC-generated and thus, truth-matching is not possible.)

Both signal and background efficiency depend on the threshold ythrs, i.e. the minimum ANN output for an event necessary to pass the identification. If ythrs is high, εsig and εbkg are low, and vice versa. This dependency may be visualized by plotting (εbkg)⁻¹ over εsig (cf. Figure 6.1). In physics, this graph is often called a ROC³ curve. It diverges for εsig → 0 (threshold goes to 1) and has a global minimum of 1 if εsig is maximal (threshold goes to 0). The maximum value of εsig is the signal reconstruction efficiency, which is always lower than 1.

In machine learning, the area under the ROC curve (AUC) is regularly used as a figure of merit[133].

1 Note that εsig technically conflates the efficiency of the reconstruction and the identification step.
2 In order to count towards N^{1(3)-prong}_{sig,gen}, a generated tau lepton must still pass the acceptance cuts listed in Section 5.2.
3 Receiver operating characteristic. Note that in most other fields, the ROC curve is defined as εsig over εbkg.



Figure 6.1: An exemplary ROC curve of a neural network. The abbreviation MLP stands for Multi-Layer Perceptron, an informal term for feed-forward neural networks. Note that εsig does not go to 1 but instead is limited by the reconstruction efficiency.

While the definition of the ROC curve used here differs from the standard, it is still desirable to maximize the area under it, i.e. to have as high an inverse background efficiency⁴ as possible for any given signal efficiency. Within the scope of this thesis, the AUC is defined as the integral of the ROC curve over the interval [0.35; 0.70]. This interval covers the range of εsig typically used in physics analyses and avoids the divergent behavior at εsig → 0.

4 Some authors refer to (εbkg)⁻¹ as background rejection, but the same term is also used for 1 − εbkg. Therefore, this term will be avoided entirely in this thesis.
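The following C++ sketch illustrates how such an AUC can be estimated: for a scan of score thresholds, εsig and (εbkg)⁻¹ are computed according to Eqs. (6.1) and (6.2) and the resulting curve is integrated over [0.35; 0.70] with the trapezoidal rule. The containers and the threshold granularity are simplifying assumptions, and per-event weights (cf. Section 6.5) are omitted for brevity.

    #include <algorithm>
    #include <utility>
    #include <vector>

    struct RocInput {
        std::vector<double> sigScores;   // scores of reconstructed signal candidates
        std::vector<double> bkgScores;   // scores of reconstructed background candidates
        double nSigGenerated;            // denominator of Eq. (6.1)
        double nBkgReconstructed;        // denominator of Eq. (6.2)
    };

    // Integrate 1/eps_bkg over eps_sig in [lo, hi] using a scan of score thresholds.
    double aucInverseBkg(const RocInput& in, double lo = 0.35, double hi = 0.70, int nSteps = 1000) {
        std::vector<std::pair<double, double> > curve; // (eps_sig, 1/eps_bkg)
        for (int i = 0; i <= nSteps; ++i) {
            const double thr = static_cast<double>(i) / nSteps;
            const double nSigPass = std::count_if(in.sigScores.begin(), in.sigScores.end(),
                                                  [thr](double s) { return s > thr; });
            const double nBkgPass = std::count_if(in.bkgScores.begin(), in.bkgScores.end(),
                                                  [thr](double s) { return s > thr; });
            const double epsSig = nSigPass / in.nSigGenerated;
            const double epsBkg = nBkgPass / in.nBkgReconstructed;
            if (epsBkg > 0.0 && epsSig >= lo && epsSig <= hi)
                curve.push_back(std::make_pair(epsSig, 1.0 / epsBkg));
        }
        std::sort(curve.begin(), curve.end());
        double auc = 0.0;
        for (std::size_t k = 1; k < curve.size(); ++k)  // trapezoidal rule
            auc += 0.5 * (curve[k].second + curve[k - 1].second) * (curve[k].first - curve[k - 1].first);
        return auc;
    }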

6.2.2 Independence of Transverse Momentum and Pileup Effects

For the application of tau identification algorithms, it is desirable that the algorithm performs equally well in different ranges of transverse momentum pT. It turns out, however, that for any fixed score threshold, the signal efficiency shows a great dependence on the transverse momentum (cf. Figure 6.2a).

In order to cancel out this effect, one has to calculate the threshold for a given target signal efficiency in bins of pT. Using these pT-dependent thresholds ythrs(εsig, pT), the signal efficiency can be kept constant over a wide range of pT (cf. Figure 6.2b). Because the background sample has a different pT distribution than the signal sample, the curve εbkg(pT) will typically not be flat. However, since working points are defined in terms of a given signal efficiency, this behavior is not problematic.
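A sketch of how such pT-dependent thresholds could be derived is shown below: in each pT bin, the threshold is taken as the (1 − εsig)-quantile of the signal score distribution, so that a fraction εsig of the reconstructed signal candidates in that bin passes. The binning scheme and containers are illustrative assumptions; normalizing to generated candidates as in Eq. (6.1) would additionally require the per-bin reconstruction efficiency.

    #include <algorithm>
    #include <map>
    #include <vector>

    // Signal scores grouped into pT bins: bin index -> scores of signal candidates in that bin.
    typedef std::map<int, std::vector<double> > BinnedScores;

    // For each pT bin, return the score threshold that keeps a fraction targetEff of the
    // reconstructed signal candidates, i.e. the (1 - targetEff)-quantile of the score distribution.
    std::map<int, double> thresholdsPerBin(BinnedScores binned, double targetEff) {
        std::map<int, double> thresholds;
        for (BinnedScores::iterator it = binned.begin(); it != binned.end(); ++it) {
            std::vector<double>& scores = it->second;
            if (scores.empty()) continue;
            std::sort(scores.begin(), scores.end());
            const std::size_t idx =
                static_cast<std::size_t>((1.0 - targetEff) * (scores.size() - 1));
            thresholds[it->first] = scores[idx];
        }
        return thresholds;
    }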



Figure 6.2: Combined signal efficiency of reconstruction and identification over the generated visible transverse momentum. Depicted are curves for three working points loose (70 %), medium (60 %), and tight (40 %). The identification algorithm is a neural network trained and evaluated on 1-prong tau candidates. (a) For fixed score thresholds, the signal efficiency shows a strong dependence on the transverse momentum. (b) By choosing appropriate score thresholds, εsig can be kept nearly constant; for p^vis_T < 20 GeV, εsig is limited by the reconstruction efficiency.

Figure 6.3: Combined signal efficiency of reconstruction and identification over µ, the average number of interactions per bunch crossing, for (a) 1-prong and (b) multi-prong tau candidates. Depicted are curves for three working points loose, medium, and tight. The identification is based on an ANN trained on 1-prong (left) and 3-prong (right) tau candidates respectively. The working points are defined as 70 %, 60 %, and 40 % (left) and 65 %, 55 %, and 35 % (right) respectively.


Table 6.2: TMVA options[126, table 22ff.] used to train the benchmark BDT.

Option            Value
NTrees            100
nCuts             200
MaxDepth          8
MinNodeSize       0.1
UseYesNoLeaf      False
SeparationType    GiniIndex
BoostType         AdaBoost
AdaBoostBeta      0.2
PruneMethod       NoPruning

Figure 6.3 shows the dependence of εsig and εbkg on µ, the average number of interactions per bunch crossing. Due to the pile-up corrections described in Section 5.2, all graphs show that εsig is largely independent of µ. This means that the identification algorithm is not affected by pile-up effects.

6.2.3 Benchmark Algorithm

The state-of-the-art tau identification is based on a Boosted Decision Tree (BDT)[114]. An introduction to BDTs is given in the TMVA Users Guide[126, p. 108].

As a benchmark for the performance of neural networks, a BDT has been trained with the options given in Table 6.2. They are the same as given by Hanisch[127], except pruning has been turned off as advised by the TMVA Users Guide[126, p. 115]. The performance is nonetheless approximately equal, as Figure 6.4 shows. The AUC as defined in Subsection 6.2.1 is 9.17 for 1-prong and 22.28 for multi-prong tau candidates. This BDT will be used throughout this chapter as a benchmark.
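For reference, booking such a BDT with TMVA could look roughly as follows. The sketch assumes a TMVA::Factory that has already been configured (output file, AddVariable and signal/background tree registration are omitted); the option string simply mirrors Table 6.2.

    #include "TMVA/Factory.h"
    #include "TMVA/Types.h"

    // 'factory' is assumed to be a fully configured TMVA::Factory.
    void bookBenchmarkBDT(TMVA::Factory& factory) {
        factory.BookMethod(TMVA::Types::kBDT, "BDT",
            "NTrees=100:nCuts=200:MaxDepth=8:MinNodeSize=0.1:"
            "UseYesNoLeaf=False:SeparationType=GiniIndex:"
            "BoostType=AdaBoost:AdaBoostBeta=0.2:PruneMethod=NoPruning");
    }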

6.3 Input Data Standardization

In order to achieve competitive results with neural network training, it is important to properly prepare the input data. For example, the variable mtracks is measured in MeV and is numerically larger than other variables (e.g. f^corr_core) by several orders of magnitude. Since an ANN's synapse weights initially are of the same order of magnitude, the net input z^(k)_i of any neuron would be dominated by the contribution of mtracks.

A reasonable measure to counter this effect is to standardize⁵ the input variables, i.e. to perform a linear transformation on them so that their means are close to zero and their variances close to one. This procedure is suggested by several authors[51, p. 298][58].


Figure 6.4: Comparing the BDT used in [127] (black) with the benchmark BDT of this thesis (red) for 1-prong (top) and multi-prong (bottom) tau candidates. The ratios below each plot show the performance of the latter BDT divided by the former BDT's performance for easier comparison.


Table 6.3: TMVA options[126, table 19f.] used to train the neural networks used in Section 6.4.

Option            Value
HiddenLayers      20,20,20,20
EstimatorType     CE
NeuronType        tanh
TrainingMethod    SARP
Temperature       0.02
NCycles           3000


TMVA handles this procedure slightly differently. Its input pre-processing option VarTransform=Normalize applies a linear transformation in such a way that each variable's values are guaranteed to be in the range [−1; 1]:

    y^n_i = 2 \cdot \frac{x^n_i - x^{\min}_i}{x^{\max}_i - x^{\min}_i} - 1 ,    (6.3)

where x^n_i is the value of the i-th input variable for the n-th event, x^min_i and x^max_i are the minimal and maximal values of that variable, and y^n_i is the normalized value.

While the difference between actual standardization and the TMVA normalization procedure has not been investigated in terms of effect on the performance of trained ANNs, the results are expected to be comparable. The reason for this assumption is that the main motivation for standardization is to avoid net inputs that are in the saturation region of the hyperbolic tangent activation function, i.e. to avoid |z^(k)_i| ≫ 1. This basic requirement is fulfilled by both standardization and TMVA's normalization.

6.4 Stabilization of Training Results

Training an ANN most often leads to a local minimum of its error function different from the global minimum. Consequently, the result of the training process and the performance of the resulting classifier depend on the ANN's initial position in the weight space. However, the synapse weights and biases of an ANN are typically initialized with random values.⁶ Hence, the performance of ANNs during the evaluation is random to some extent as well.

5 The term normalization is often used interchangeably.


Table 6.4: Sample mean and standard deviation of the area under the ROC curve (AUC) in the interval [0.35; 0.70] for individual ANNs and ensembles of two different sizes. ANNs have been trained on 1-prong tau candidates for the third column and on 3-prong tau candidates for the fourth column.

                        Sample size    1-prong AUC    Multi-prong AUC
Individual ANNs         50             8.97 ± 0.21    19.2 ± 1.7
5-member ensembles      10             9.73 ± 0.06    22.2 ± 0.9
10-member ensembles     5              9.83 ± 0.05    22.6 ± 0.8

Therefore, when comparing ANNs trained with different hyperparameters, it is difficult to tell if a difference in performance has been caused by the different training or simply because one ANN has happened to converge to a better local minimum. This makes it necessary to find a means to stabilize the neural networks used, i.e. to decrease the dependence of the training outcome on the initial values.

Of the methods to improve generalization introduced in Section 3.9, the concept of neural network ensembles turns out to also have a stabilizing effect in terms of local minima; this is true even if the member networks are all trained on the same training set, contrary to the ensembles' theoretical motivation.

In order to investigate the stabilizing effect of network ensembles, 50 neural networks which are identical except for their initial synapse weights and biases have been trained. For the training, the SARPROP algorithm has been used (cf. Subsection 3.6.4). For a list of all parameters and their exact values, refer to Table 6.3. After training, these ANNs have been divided randomly into ensembles of equal size. After that, each ensemble has been evaluated according to Section 6.2. The evaluation results are given in Table 6.4.
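A minimal sketch of both steps — the random grouping of the trained networks into equal-size ensembles and the ensemble score as the arithmetic mean of the member outputs (cf. Subsection 3.9.4) — might look as follows; networks are identified by index here, which is an illustrative simplification.

    #include <algorithm>
    #include <numeric>
    #include <random>
    #include <vector>

    // Randomly partition nNetworks trained networks (identified by index) into ensembles.
    std::vector<std::vector<int> > formEnsembles(int nNetworks, int ensembleSize, unsigned seed) {
        std::vector<int> idx(nNetworks);
        std::iota(idx.begin(), idx.end(), 0);
        std::shuffle(idx.begin(), idx.end(), std::mt19937(seed));
        std::vector<std::vector<int> > ensembles;
        for (int start = 0; start + ensembleSize <= nNetworks; start += ensembleSize)
            ensembles.push_back(std::vector<int>(idx.begin() + start,
                                                 idx.begin() + start + ensembleSize));
        return ensembles;
    }

    // Ensemble tau score for one candidate: arithmetic mean of the member ANN outputs.
    double ensembleScore(const std::vector<double>& memberScores) {
        if (memberScores.empty()) return 0.0;
        return std::accumulate(memberScores.begin(), memberScores.end(), 0.0) / memberScores.size();
    }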

The results show a clear improvement of both average performance and stability when using neural network ensembles. Not only does the standard deviation decrease by 50 % and more, there also is a significant improvement in the average AUC. Comparing the distribution of AUCs for individual ANNs and for 5-member ensembles with the two-sample Kolmogorov–Smirnov test gives a significance of the difference of p < 0.001 for both 1-prong and multi-prong tau candidates. These results are further visualized by Figures 6.5 and 6.6.

6 TMVA initializes all synapse weights and biases by sampling from a uniform distribution in the interval [−2; 2].



Figure 6.5: ROC curves comparing individual ANNs (top) and 5-member ensembles (bottom) with the benchmark BDT (black curve). Depicted are the identification algorithms with the smallest (red) and the largest (blue) AUC among the sample of trained ANNs. The ratio plots below each diagram show that the performance of the ensembles is more stable than for individual ANNs. All algorithms have been trained and evaluated on 1-prong tau candidates.



Figure 6.6: The same curves as in Figure 6.5, but the algorithms have been trained on 3-prong and evaluated on multi-prong tau candidates. While the effect is less prominent, ensembles still are more stable than individual ANNs. Note that the erratic behavior for εsig ≲ 0.3 is caused by a combination of low statistics (cf. Section 6.1) and numerical issues (many true tau leptons get a score very close to 1.0).


Doubling the ensemble size from 5 to 10 ANNs further increases the average AUC and decreases the standard deviation, as Table 6.4 shows. However, testing the significance of the AUC increase with the Kolmogorov–Smirnov test gives mixed results: p = 0.007 for 1-prong and p = 0.975 for multi-prong tau candidates, indicating a lack of significance for the latter. Additionally, the standard deviations decrease only by 17 % and 11 % respectively. Comparing this to the 100 % increase in computational complexity suggests 5-member ensembles of neural networks as the more efficient choice. Thus, in the following sections, 5-member ensembles will be used for comparison purposes.

6.5 Event Reweighting

Another important desideratum for tau identification is that the performance of classification algorithms be independent of pT, the transverse momentum, and µ, the average number of interactions per bunch crossing. To guarantee such independence, it is crucial that the distributions of signal and background training data are equal both in pT and in µ. This can be achieved through application of weights w(pT, µ) on each event during both training and evaluation.

Figure 6.7a shows the pT distributions for 1-prong signal and background events. As can be seen, there is an exponential decrease in background events with increasing pT whereas the signal distribution exhibits a resonance structure. This means that for low pT, the influence of fake taus in the background data is far greater than for high pT, which is a source of bias in the classification algorithm.

Figure 6.7b shows the µ distributions for 1-prong signal and background events. It is important to note here that the signal events have been created in Monte Carlo simulations and as such, their µ distribution does not take into account the 2012 run conditions. Thus, it differs significantly from the µ distribution of background events, which have been extracted from 2012 data using a QCD di-jet data set selection.

These considerations lead to the following procedure: Each signal event is given a weight wsig(µ) such that the weighted µ distribution equals the background µ distribution (cf. Figure 6.7b). And each background event is given a weight wbkg(pT) such that the weighted pT distribution equals the one for signal events (cf. Figure 6.7a).

The advantage of this split is two-fold: First, reweighting the signal events in µ gives them a more realistic dependence on this variable. Secondly, reweighting the signal events in pT would invalidate the pT-dependent thresholds derived in Subsection 6.2.2. Thus, pT-reweighting is applied to the background events instead.
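Using ROOT histograms of the two distributions, per-event weights can be obtained from bin-wise ratios, as in the following sketch. The histogram arguments and their binning are illustrative assumptions, and both histograms are assumed to be normalized to unit area.

    #include "TH1D.h"

    // Weight for a signal event with pile-up value mu: ratio of the (normalized) background
    // and signal mu distributions, so that the weighted signal matches the background shape.
    double signalWeightMu(const TH1D& hMuBkg, const TH1D& hMuSig, double mu) {
        const double sig = hMuSig.GetBinContent(hMuSig.FindFixBin(mu));
        const double bkg = hMuBkg.GetBinContent(hMuBkg.FindFixBin(mu));
        return sig > 0.0 ? bkg / sig : 0.0;
    }

    // Weight for a background event with transverse momentum pt: ratio of the (normalized)
    // signal and background pT distributions, so that the weighted background matches the signal.
    double backgroundWeightPt(const TH1D& hPtSig, const TH1D& hPtBkg, double pt) {
        const double bkg = hPtBkg.GetBinContent(hPtBkg.FindFixBin(pt));
        const double sig = hPtSig.GetBinContent(hPtSig.FindFixBin(pt));
        return bkg > 0.0 ? sig / bkg : 0.0;
    }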

Finally, Figure 6.8 presents the effect of the reweighting procedure on the classifier's performance. The diagrams show a significant decrease in the AUC. This is to be expected, as the reweighting prevents classifiers from indirectly using pT as a classification criterion. Without reweighting, this would have been possible because a difference between signal and background in the pT distribution means that the input variable space is sampled differently. The reweighting procedure ensures that sampling is done independently of pT and thus gives a more realistic prediction of how well the classifiers would perform on real data.

Figure 6.7: Distributions of 1-prong tau candidates over (a) pT and (b) µ, extracted from the signal (red) and the background sample (black). Below each plot, the per-candidate weights resulting from these distributions are given.



Figure 6.8: ROC curves of the benchmark BDT without (black) and with (red) pT and µ weights applied during training and evaluation. 1-prong (top) and multi-prong (bottom) tau candidates have been used for evaluation. Below each plot, the performance of the latter BDT divided by the former's is shown in red for easier comparison.


Figure 6.9: An example plot of target signal efficiency εsig over the necessary score threshold ythrs for two different pT ranges: (a) only tau candidates with pT < 40 GeV, (b) only tau candidates with pT > 40 GeV. It can be seen that setting the threshold to one specific value yields very different signal efficiencies in the two pT ranges. The depicted classifier is the benchmark BDT trained on 3-prong and evaluated on multi-prong tau candidates.


6.6 Score Flattening Algorithm

As detailed in Subsection 6.2.2, the signal efficiency εsig of a classifier does not only depend on the chosen score threshold ythrs, but also on the transverse momentum of the tau candidates that the classifier is used on (cf. Figure 6.9). Thus, to achieve a constant signal efficiency over a wide range of pT, it is necessary to choose different ythrs for different pT bins.


Another approach which has been explored during the work on this thesis is transforming the tau scores in such a way that the signal efficiency becomes independent of pT. Let

    f : (y, p_T) \mapsto y' ,    (6.4)

be this transformation, where y denotes the score originally given to a particular tau candidate, pT its reconstructed transverse momentum, and y′ the transformed score. It is clear that f must be monotonically increasing in y so that tau candidates with a low y have a low y′ as well. Additionally, because the pT-dependent cuts calculated in Subsection 6.2.2 are binned in pT, f needs to be binned as well. Finally, the transformation should be as simple as possible; this requirement is fulfilled e.g. by f being piece-wise linear in y.

Thus, the following procedure has been developed: First, calculate the function ythrs(pT, εsig) with as high a precision as necessary. Secondly, assume a series of target signal efficiencies εi (i = 0, . . . , N), where ε0 = 0.0, εN ≤ ε^reco_sig, and εi < εi+1 for all i. These εi should cover the range of interesting signal efficiencies as finely as necessary. Then, for each tau candidate with a tau score y, find an i such that

    y_{\mathrm{thrs}}(p_T, \varepsilon_{i+1}) < y < y_{\mathrm{thrs}}(p_T, \varepsilon_i) ,    (6.5)

i.e. the tau candidate passes the ID for a target efficiency of εi+1, but not for the stricter efficiency εi. With this information, one can apply the following linear transformation:

    y' = \frac{y - y_{\mathrm{thrs}}(p_T, \varepsilon_{i+1})}{y_{\mathrm{thrs}}(p_T, \varepsilon_i) - y_{\mathrm{thrs}}(p_T, \varepsilon_{i+1})} \cdot (\varepsilon_{i+1} - \varepsilon_i) + (1 - \varepsilon_{i+1}) .    (6.6)

Using (6.5), it is trivial to verify that y′ ∈ [1 − εi+1; 1 − εi].

A result of this transformation is that there is a very simple relation between thresholds on y′ and εsig over the range [0.0, εN]. Only accepting tau candidates with y′ > ythrs results in a signal efficiency of εsig ≈ 1 − ythrs (cf. Figure 6.10). Moreover, if the above transformation is applied to all pT bins independently, this simple relation holds independently of pT (cf. Figure 6.11). Note that, while the score yANN given by a neural network to a tau candidate could be interpreted as an approximation of the Bayesian posterior probability (cf. Section 3.8), this is no longer true for the modified score y′ANN.


Figure 6.10: Target signal efficiency εsig plotted over the necessary score threshold ythrs after the transformation y 7→ y′ has been applied, for (a) 1-prong and (b) multi-prong tau candidates. The depicted classifier is the benchmark BDT trained on 1-prong (left) and 3-prong (right) tau candidates respectively.

Figure 6.11: Signal efficiency after the score transformation plotted over the generated visible transverse momentum for the three working points loose, medium, and tight, for (a) 1-prong and (b) multi-prong tau candidates. The working points are defined as 70 %, 60 %, and 40 % respectively for 1-prong tau candidates, and 65 %, 55 %, and 35 % respectively for multi-prong candidates. The depicted classifier is the benchmark BDT.


Table 6.5: TMVA options used in the comparison of different training algorithms.

(a) Options for stochastic gradient descent:
Option            Value
HiddenLayers      20,20,20,20
EstimatorType     CE
NeuronType        tanh
TrainingMethod    BP
BatchMode         sequential
LearningRate      0.03
DecayRate         0.005
NCycles           2000

(b) Options for SARPROP:
Option            Value
HiddenLayers      20,20,20,20
EstimatorType     CE
NeuronType        tanh
TrainingMethod    SARP
CoolingSpeed      0.02
NCycles           3000

(c) Options for BFGS:
Option            Value
HiddenLayers      20,20,20,20
EstimatorType     CE
NeuronType        tanh
TrainingMethod    BFGS
NCycles           2000

6.7 Choice of Training Algorithm

Section 3.6 introduced several algorithms which can be used to train an ANN for a particular problem. Each of them has its own advantages and disadvantages. This section compares how well neural network ensembles trained with different training algorithms perform. In particular, three different algorithms are compared: stochastic gradient descent (cf. Subsection 3.6.2), SARPROP (cf. Subsection 3.6.4), and the BFGS quasi-Newton method (cf. Subsection 3.6.5).

To do so, three five-member ANN ensembles have been trained. The options passed to TMVA are given in Table 6.5. For the stochastic gradient descent, the hyperparameters have been chosen so that training converges in approximately 2000 epochs. For SARPROP, the cooling speed has been chosen so that training converges in approximately 3000 epochs. The number of training epochs has been chosen for each training algorithm such as to ensure all ANNs reach the region of overtraining⁷.
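For illustration, booking the three configurations of Table 6.5 could look roughly as follows. The factory setup is omitted, and note that TrainingMethod=SARP together with the CoolingSpeed option is only available in the locally modified TMVA version described at the beginning of this chapter.

    #include "TMVA/Factory.h"
    #include "TMVA/Types.h"

    // 'factory' is assumed to be a fully configured TMVA::Factory.
    void bookTrainingAlgorithmComparison(TMVA::Factory& factory) {
        factory.BookMethod(TMVA::Types::kMLP, "MLP_BP",
            "HiddenLayers=20,20,20,20:EstimatorType=CE:NeuronType=tanh:"
            "TrainingMethod=BP:BatchMode=sequential:LearningRate=0.03:DecayRate=0.005:NCycles=2000");
        factory.BookMethod(TMVA::Types::kMLP, "MLP_SARP",
            "HiddenLayers=20,20,20,20:EstimatorType=CE:NeuronType=tanh:"
            "TrainingMethod=SARP:CoolingSpeed=0.02:NCycles=3000");
        factory.BookMethod(TMVA::Types::kMLP, "MLP_BFGS",
            "HiddenLayers=20,20,20,20:EstimatorType=CE:NeuronType=tanh:"
            "TrainingMethod=BFGS:NCycles=2000");
    }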

After training, all ensembles have been evaluated as described in Section 6.2. The ensemble output has been computed as the arithmetic mean of the output of all member ANNs as given in Subsection 3.9.4. The resulting ROC curves are given in Figure 6.12. The AUCs of the ensembles are compared in Table 6.6.

7 Cf. Subsection 3.9.4 for why it is desirable to finish training inside the region of overtraining.



Figure 6.12: ROC curves of the BP (red), the SARP (blue) and the BFGS (green) ensembles as well as the benchmark BDT (black) for 1-prong (top) and multi-prong (bottom) tau candidates. The ratios below each plot show the performance of each ensemble divided by the benchmark performance for easier comparison.


Table 6.6: Area under the ROC curve (AUC) in the interval [0.35; 0.70] for the ensembles BP, SARP, and BFGS. The first column gives the mean AUC of the member ANNs; the second column gives the AUC of the ensemble output.

(a) Results for 1-prong candidates:
        Average AUC    Ensemble AUC
BP      5.7 ± 2.9      7.91
SARP    9.0 ± 0.3      9.78
BFGS    8.4 ± 0.3      9.53

(b) Results for multi-prong candidates:
        Average AUC    Ensemble AUC
BP      11.9 ± 1.6     11.9
SARP    19.2 ± 1.4     22.4
BFGS    15.5 ± 1.3     20.5


From the results, it is clearly visible that stochastic gradient descent performs much worse than the other two algorithms. SARPROP and BFGS, on the other hand, result in approximately equally well-performing ANNs when considering the standard deviation of ensemble AUCs found in Table 6.4.

However, SARPROP and BFGS differ significantly in their computational complexity. The member ANNs of the SARP ensemble which have been trained on 1-prong candidates finished training after (8.54 ± 0.05) h on average. Measuring on the same system and under similar conditions, training the member ANNs of the BFGS ensemble took approximately (36.8 ± 0.3) h. While these training times depend on the computing infrastructure being utilized, the large difference shows that SARPROP is computationally more efficient than BFGS while achieving comparable results.

6.8 Choice of Error Function

In Section 3.8, two different error functions have been presented: the sum-of-squares error MSE⁸ and the cross-entropy CE. It has been shown that it is sensible to combine the CE error with the logistic output layer activation function and the MSE error with the identity activation function. This section investigates whether choosing one error function over the other has a significant impact on the performance of the ANNs.

To do so, two five-member ANN ensembles have been trained. The options passed to TMVA are given in Table 6.7. The pseudorandom number generator has been seeded with the same values for both ensembles; this means that for each ANN in the MSE ensemble, there is exactly one ANN in the CE ensemble with the same initial synapse weights and biases. After training, both ensembles have been evaluated in the same way as in Section 6.7. The resulting ROC curves are given in Figure 6.13. The AUCs of the ensembles are compared in Table 6.8.

8 From "mean squared error", an alternative name found in the ANN literature.



Figure 6.13: ROC curves of the MSE (red) and the CE (blue) ensembles as well as the benchmark BDT (black) for 1-prong (top) and multi-prong (bottom) tau candidates. The ratios below each plot show the performance of each ensemble divided by the benchmark performance for easier comparison.


Table 6.7: TMVA options used in the comparison of different error functions.

Option            Value
HiddenLayers      20,20,20,20
EstimatorType     MSE | CE
NeuronType        tanh
TrainingMethod    SARPROP
Temperature       0.02
NCycles           3000

Table 6.8: Area under the ROC curve (AUC) in the interval [0.35; 0.70] for the MSE and the CE ensemble. The first column gives the mean AUC of the member ANNs; the second column gives the AUC of the ensemble output.

(a) Results for 1-prong candidates:
       Average AUC    Ensemble AUC
MSE    8.1 ± 0.3      7.61
CE     9.0 ± 0.3      9.85

(b) Results for multi-prong candidates:
       Average AUC    Ensemble AUC
MSE    16.0 ± 0.5     16.59
CE     20.1 ± 2.0     23.19


As can be seen, using the cross-entropy and the logistic output layer activation function gives a consistent improvement of the ROC curve in comparison to the sum-of-squares error and the identity activation function. This suggests that, besides always having an output in the interval [0; 1], CE neural networks generally find better local minima during training than MSE networks.

6.9 Choice of Activation Function

In Subsection 3.7.2, it has been argued that using the hyperbolic tangent as the hidden layer activation function leads to better training results in comparison with the logistic function σ. The given reason is that the non-zero value of the logistic function σ(z) at z = 0 causes all adaptive layers except the first one to start out with large net inputs and, thus, a small gradient in weight space, which is commonly associated with slow training. This section experimentally verifies this claim.

To do so, two five-member ANN ensembles have been trained with an almost identical configuration (cf. Table 6.9). The hidden layer activation function has been chosen to be σ(z) for one ensemble and tanh(z) for the other. As in Section 6.8, the pseudorandom number generator has been seeded with the same values for both ensembles. After training, both ensembles have been evaluated in the same way as in Section 6.7. The results are given in Figure 6.14. The AUCs of the ensembles are compared in Table 6.10.


Table 6.9: TMVA options used in the comparison of different activation functions.

Option            Value
HiddenLayers      20,20,20,20
EstimatorType     CE
NeuronType        sigmoid | tanh
TrainingMethod    SARPROP
Temperature       0.02
NCycles           3000

Table 6.10: Area under the ROC curve (AUC) in the interval [0.35; 0.70] for the sigmoid and the tanh ensemble. The first column gives the mean AUC of the member ANNs; the second column gives the AUC of the ensemble output.

(a) Results for 1-prong candidates:
           Average AUC    Ensemble AUC
sigmoid    8.2 ± 0.4      8.70
tanh       9.0 ± 0.3      9.85

(b) Results for multi-prong candidates:
           Average AUC    Ensemble AUC
sigmoid    15.9 ± 1.5     18.06
tanh       20.1 ± 2.0     23.19


As predicted, the tanh ensemble performs significantly better than the sigmoid ensemble. The reason for this can be seen in Figure 6.15. While the convergence behavior of the error function is qualitatively similar for both sigmoid and tanh ANNs, convergence is much slower for the former. This is further proven by the fact that increasing the training time of the sigmoid ANNs gives a significant boost in performance. Training them for 6000 instead of 3000 epochs gives an ensemble AUC of 9.28 (vs. 8.70) for 1-prong and an AUC of 19.54 (vs. 18.06) for multi-prong tau candidates.

Hence, while using an antisymmetric activation function in the hidden layers does not lead to better-performing ANNs per se, it does speed up training significantly and thus allows training more complex neural networks in the same time. Other nonlinear antisymmetric activation functions are expected to give similar results to the hyperbolic tangent, but have not been investigated due to a lack of implementation in TMVA.



Figure 6.14: ROC curves of the sigmoid (red) and the tanh (blue) ensembles as well as the benchmark BDT (black) for 1-prong (top) and multi-prong (bottom) tau candidates. The ratios below each plot show the performance of each ensemble divided by the benchmark performance for easier comparison.


Figure 6.15: The error function of two exemplary ANNs, (a) a member of the sigmoid ensemble and (b) a member of the tanh ensemble, over the number of passed training epochs with one evaluation per 50 epochs. The plots start at 300 epochs due to scaling issues. The neural networks in these graphs have been trained and evaluated on 1-prong tau candidates.

Table 6.11: TMVA options used in the optimization of the network topology. Refer to the text for information on the HiddenLayers option.

Option            Value
HiddenLayers      <various values>
EstimatorType     CE
NeuronType        tanh
TrainingMethod    SARP
Temperature       0.02
NCycles           3000

6.10 Optimization of Network Topology

One of the most crucial hyperparameters to optimize in an ANN-based tau identification is the topology of the networks, i.e. the number of hidden layers H and the number of neurons per hidden layer NH. Choosing too simple a topology results in the ANNs underfitting the optimal decision boundary; making networks too complex will not improve the performance (or even deteriorate it) while unnecessarily increasing training time. Thus, it is important to investigate how network performance depends on the chosen topology.

The comparison has been carried out by training five-member ANN ensembles for different values of H and NH (cf. Table 6.11). For H, the values 2 to 5 have been tested, while NH has been varied from 10 to 30 in increments of 5. Within an ANN, each hidden layer has the same number of neurons since there is no theoretical motivation to do otherwise. The AUCs of the resulting network ensembles are presented in Figure 6.16.


Figure 6.16: Area under the ROC curve (AUC) in the interval [0.35; 0.70] plotted over NH for ensembles of 5 ANNs trained on (a) 1-prong and (b) 3-prong tau candidates. H is 2 (black circles), 3 (red squares), 4 (green diamonds), or 5 (blue triangles). The error bars are given under the assumption that the AUC of the ensembles trained here has the same standard deviation as in Section 6.4.

While the results for the 1-prong and multi-prong classifiers are qualitatively similar, the standard deviation of the AUC for the multi-prong classifier is rather large (cf. Figure 6.16b). Therefore, the following analysis focuses on the results gained using 1-prong tau candidates.

Figure 6.16a shows that the ensembles perform poorly for NH ≤ 10, independently of H. Furthermore, increasing NH above a minimum value of ∼ 15 . . . 20 does not increase the AUC significantly⁹. This transition from under- to overfitting is further visualized in Figure 6.17:

• For NH = 10 (cf. Figure 6.17a), the training and validation set errors are approximately equal at all times. This indicates that the ANN's error has a high bias and a low variance, i.e. its network function is almost independent of the data set and thus underfits the problem.

• For NH = 15 (cf. Figure 6.17b), the validation set error decreases more slowly than the training set error. This shows that the ANN's network function depends to some degree on the data set.

• For NH = 20 (cf. Figure 6.17c), the validation set error increases in the last ∼ 700 epochs. This indicates that the ANN's error has a low bias but a high variance, i.e. it overfits the problem.


Figure 6.17: The error function of exemplary ANNs over the number of passed training epochs with one evaluation per 50 epochs, for (a) 10, (b) 15, and (c) 20 neurons per hidden layer. The plots start at 300 epochs due to scaling issues. The neural networks in these graphs have four hidden layers and have been trained and evaluated on 1-prong tau candidates.

Thus, one should choose NH ≥ 20 to accurately approximate the optimal decision boundary. As described in Subsection 3.9.4, ANNs should be in the region of overtraining when used in ensembles.

Figure 6.16a further shows how the AUC varies with the number of hidden layers H. Generally, increasing H leads to an improvement of the ANN performance up to a value of H = 4. Adding a fifth hidden layer does not lead to a significant increase of the AUC, suggesting that such a layer cannot further optimize the internal representation of the input variables. Thus, an optimal trade-off between ensemble performance and training speed is given by the hyperparameters H = 4 and NH = 20.

9 Figure 6.16b shows similar behavior for the ANNs trained on 3-prong tau candidates.


6.11 Performance of the Optimized Classifier

In this chapter, a series of hyperparameters and their influence on the performance of the tau identification classifier have been discussed. It has been shown that the formation of five-member ANN ensembles significantly improves the predictive power of the resulting classifier. At the same time, the influence of the ANN's initial synapse weights before training is decreased significantly. Doubling the size of the ensembles improves these results only marginally.

Three training algorithms have been compared: stochastic gradient descent (SGD), SARPROP, and BFGS. The latter two result in classifiers which perform equally well, while SGD-trained ANNs perform significantly worse. Furthermore, SARPROP has been shown to be computationally more efficient than BFGS.

The influence of the error function on the results of training has been investigated as well. Minimizing the cross-entropy error function during the training process results in ANNs with a higher AUC than minimizing the sum-of-squares error function. It has further been shown that hyperbolic tangent as a hidden layer activation function is superior to the logistic function σ.

Finally, a systematic search for the optimal network topology has been conducted. The results suggest 4 hidden layers with 20 non-bias neurons each as the optimal choice. Less complex ANNs result in a decreased AUC of the resulting ensemble. More complex ANNs are computationally more expensive while not improving performance significantly.

Figure 6.18 shows the ROC curve of an optimized five-member ensemble. The average AUC of the optimized ensembles has been found to be 9.73 for 1-prong and 22.2 for multi-prong tau candidates. The relative standard deviation of these values is 0.6 % and 4.1 % respectively. It is caused by the random initial weights of the ANNs. The relative difference to the AUC of the benchmark BDT (9.17 for 1-prong, 22.3 for multi-prong tau candidates) is +6.1 % and −0.4 % respectively. This means that the ANN-based tau identification is superior to the benchmark BDT for 1-prong and approximately equal for multi-prong tau candidates.
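
These relative differences follow directly from the quoted AUC values:

    (9.73 − 9.17)/9.17 ≈ +6.1 % (1-prong),    (22.2 − 22.3)/22.3 ≈ −0.4 % (multi-prong).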

The large difference in the relative standard deviation between the classifiers for 1-prong and multi-prong tau candidates may be caused by the different sizes of the training sets (cf. Section 6.1). Another possible reason is inherent differences between the two classifiers (e.g. the utilized identification variables). Due to time constraints, this issue has not been further investigated.

Figure 6.18: ROC curves (inverse background efficiency over signal efficiency at √s = 8 TeV) of the benchmark BDT (black) and an optimized ANN ensemble (red) for 1-prong (top) and multi-prong (bottom) tau candidates. The ratios below each plot show the performance of the ensemble divided by the benchmark performance for easier comparison.


7 Summary and Outlook

Efficient tau identification is important for a wide range of physics analyses. Methods of multi-variate analysis provide superior performance in comparison to previous cut-based approaches [11]. Of the many algorithms available, Boosted Decision Trees have been the method of choice for tau identification [114]. In recent years, however, artificial neural networks have improved both in performance and in efficiency due to newly-developed training algorithms and new theoretical insights. The continuous increase in available computational power has made deep neural networks a feasible approach to classification tasks. Hence, it has become worthwhile to study the application of ANNs to the problem of tau identification.

For this thesis, ANNs have been used to identify hadronically decaying tau leptons at a center-of-mass energy of √s = 8 TeV. The ANN classifier has been used to distinguish between Monte-Carlo-generated tau leptons and the dominant QCD-jet background, for which di-jet data collected at ATLAS in 2012 has been used. Classification has been carried out separately for 1-prong and multi-prong tau candidates.

Several parameters of the ANNs were varied in order to optimize the classifier performance. These parameters and their final values were:

• the training algorithm (SARPROP);

• the error function being minimized during training (cross-entropy);

• the activation function of the hidden neurons (hyperbolic tangent);

• the ANN’s number of hidden layers (4); and

• the number of neurons per hidden layer (20).

Trained ANNs have been grouped into ensembles of five in order to both improve performance and reduce performance fluctuations due to the randomization of initial weights.

The TMVA package, extended by a custom SARPROP implementation, has been used to carry out the studies in this thesis. The AUC (Area Under ROC Curve) has been chosen as a figure of merit in order to compare the ANN performance to a benchmark BDT. For the final five-member ensembles, the average AUC has been found to be 9.73 ± 0.06 for 1-prong and 22.2 ± 0.9 for multi-prong tau candidates. This improves the average AUC of individual ANNs by 8.5 % and 15.6 % respectively and is competitive or even superior to the AUC of the benchmark BDT (9.17 and 22.3 respectively).

Finally, a new score-flattening transformation has been developed which facilitates physics analyses with no defined working point in terms of εsig. It has been shown to keep the tau score independent of pT.
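
For illustration only: a generic way to obtain such a flat behaviour is to replace the raw ensemble score by its cumulative distribution among signal candidates in bins of pT, so that any threshold on the transformed score corresponds to the same signal efficiency in every bin. The sketch below implements this generic recipe; it is not the exact score-flattening algorithm developed in this thesis, and all names are illustrative:

import numpy as np

def flatten_scores(scores, pt, signal_scores, signal_pt, pt_bins):
    """Map raw scores to a pT-independent scale via the signal score CDF per pT bin.

    After the mapping, a cut at y_thrs keeps a fraction (1 - y_thrs) of the signal
    in every pT bin, i.e. the signal efficiency no longer depends on pT.
    """
    scores, pt = np.asarray(scores, float), np.asarray(pt, float)
    signal_scores, signal_pt = np.asarray(signal_scores, float), np.asarray(signal_pt, float)
    flat = np.empty_like(scores)
    for lo, hi in zip(pt_bins[:-1], pt_bins[1:]):
        ref = np.sort(signal_scores[(signal_pt >= lo) & (signal_pt < hi)])
        sel = (pt >= lo) & (pt < hi)
        # fraction of reference signal candidates with a score below the raw score
        flat[sel] = np.searchsorted(ref, scores[sel], side="right") / len(ref)
    return flat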

Despite these results, further studies are advised for several reasons. First, training has been carried out only on comparably small data sets due to time constraints. Increasing the amount of signal and background samples may further boost ANN performance. It may also affect the optimal network topology, as large ANNs usually benefit more from an increase in available data than small ones.

Secondly, through continuous research in the past years, several new training algorithms have emerged [134, 135, 136, 137] which have outperformed SARPROP on several benchmark problems. Also, a limited-memory implementation of BFGS has been developed [138] and may reduce the computational complexity which has been the main disadvantage of BFGS. Furthermore, recent research [139, 140] has presented hidden layer activation functions which result in better-performing ANNs than the hyperbolic tangent.

Thirdly, the ANNs presented in this thesis operate on input variables which already are highly optimized in terms of separation power. Research on deep neural networks, however, generally focuses on problems with a large amount of unprocessed data [141] on which the ANN has to perform feature extraction itself. Thus, it might be worthwhile to study the performance of deep neural networks which act on a larger set of unprocessed kinematic variables.

Finally, it has recently become possible to utilize information about the products of a tau decay (substructure information) in the identification procedure [142]. Research has shown that a BDT using only substructure information is competitive to the current approach [127], but the effect on ANNs still needs to be studied. Due to their feature extraction capabilities, ANNs might benefit more from substructure information than BDTs.


A Training Algorithms

Listing 1 Pseudo-code implementation of the SARPROP algorithm. Differences to iRprop− are indicated by comments.

 1: for τ = 0 to N_epochs do
 2:     Calculate all ∂E/∂w_i(τ)
 3:     T = 2^(−λτ)                                             ▷ Calculate current temperature.
 4:     for all w_i do
 5:         ∂E/∂w_i(τ) = ∂E/∂w_i(τ) + k_1 T · w_i/(1 + w_i²)     ▷ Add ∂Ω/∂w_i for weight decay.
 6:         s = ∂E/∂w_i(τ) · ∂E/∂w_i(τ − 1)
 7:         if s > 0 then
 8:             ∆_i(τ) = min(∆_max, η+ · ∆_i(τ − 1))
 9:         else if s < 0 then
10:             if ∆_i(τ − 1) < k_2 T² then                      ▷ Check if w_i is close to a local minimum.
11:                 ∆_i(τ) = max(∆_min, η− · ∆_i(τ − 1) + r · k_3 T²)   ▷ Add noise term.
12:             else
13:                 ∆_i(τ) = max(∆_min, η− · ∆_i(τ − 1))
14:             end if
15:             ∂E/∂w_i(τ) = 0
16:         else
17:             ∆_i(τ) = ∆_i(τ − 1)
18:         end if
19:         ∆w_i(τ) = sign(−∂E/∂w_i(τ)) · ∆_i(τ)
20:         w_i(τ + 1) = w_i(τ) + ∆w_i(τ)
21:     end for
22: end for
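
For illustration, a compact NumPy version of one SARPROP epoch is sketched below. It follows Listing 1, but the function name, the default parameter values, and the uniform random noise r are illustrative assumptions rather than the custom TMVA implementation used in this thesis. Dropping the temperature, noise, and weight-decay terms recovers iRprop− (Listing 2).

import numpy as np

def sarprop_epoch(w, delta, prev_grad, grad_fn, tau,
                  lam=0.01, k1=1e-3, k2=1e-3, k3=1e-3,
                  eta_plus=1.2, eta_minus=0.5,
                  delta_min=1e-6, delta_max=50.0, rng=None):
    """One SARPROP epoch on a flat weight vector w (cf. Listing 1).

    grad_fn(w) must return dE/dw on the full training set. The returned triple
    (w, delta, grad) is passed to the next epoch; grad plays the role of
    dE/dw(tau - 1) there.
    """
    if rng is None:
        rng = np.random.default_rng()
    grad = grad_fn(w)
    T = 2.0 ** (-lam * tau)                      # current temperature
    grad = grad + k1 * T * w / (1.0 + w**2)      # weight-decay term dOmega/dw
    s = grad * prev_grad                         # sign-change indicator

    pos, neg = s > 0, s < 0
    delta[pos] = np.minimum(delta_max, eta_plus * delta[pos])

    near_min = neg & (delta < k2 * T**2)         # step size already small: near a local minimum
    noise = rng.random(w.shape) * k3 * T**2      # simulated-annealing noise term
    delta[near_min] = np.maximum(delta_min, eta_minus * delta[near_min] + noise[near_min])
    delta[neg & ~near_min] = np.maximum(delta_min, eta_minus * delta[neg & ~near_min])
    grad[neg] = 0.0                              # defer the update after a sign change

    w = w - np.sign(grad) * delta                # w(tau+1) = w(tau) + sign(-dE/dw) * delta
    return w, delta, grad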


Listing 2 Pseudo-code implementation of the iRprop− algorithm.

 1: for τ = 0 to N_epochs do
 2:     Calculate all ∂E/∂w_i(τ)
 3:     for all w_i do
 4:         s = ∂E/∂w_i(τ) · ∂E/∂w_i(τ − 1)
 5:         if s > 0 then
 6:             ∆_i(τ) = min(∆_max, η+ · ∆_i(τ − 1))
 7:         else if s < 0 then
 8:             ∆_i(τ) = max(∆_min, η− · ∆_i(τ − 1))
 9:             ∂E/∂w_i(τ) = 0                                   ▷ Defer weight update.
10:         else
11:             ∆_i(τ) = ∆_i(τ − 1)
12:         end if
13:         ∆w_i(τ) = sign(−∂E/∂w_i(τ)) · ∆_i(τ)                 ▷ Note that ∆w_i(τ) = 0 if ∂E/∂w_i(τ) = 0.
14:         w_i(τ + 1) = w_i(τ) + ∆w_i(τ)
15:     end for
16: end for

Listing 3 Pseudo-code implementation of the BFGS algorithm.

 1: for τ = 0 to N_epochs do
 2:     Calculate g(τ) = ∇E(τ)
 3:     if τ = 0 then
 4:         G(τ) = 1
 5:     else
 6:         d = w(τ) − w(τ − 1)
 7:         v = g(τ) − g(τ − 1)
 8:         G(τ) = G(τ − 1) + ∆G(G(τ − 1), d, v)
 9:     end if
10:     p = −G(τ) g(τ)
11:     α = LineSearch(w(τ), p)
12:     w(τ + 1) = w(τ) + αp
13: end for
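
Listing 3 leaves the update term ∆G implicit. The sketch below spells out one iteration using the standard BFGS update of the inverse-Hessian approximation; the function names and the use of a line-search callback are illustrative assumptions, not a transcription of the TMVA implementation:

import numpy as np

def bfgs_inverse_update(G, d, v):
    """Standard BFGS update of the inverse-Hessian approximation G.

    d = w(tau) - w(tau - 1), v = g(tau) - g(tau - 1); the pseudo-code's
    DeltaG corresponds to the difference between the returned matrix and G.
    """
    rho = 1.0 / np.dot(v, d)
    I = np.eye(len(d))
    A = I - rho * np.outer(d, v)
    return A @ G @ A.T + rho * np.outer(d, d)

def bfgs_step(w, G, g, line_search):
    """One iteration of Listing 3 after G has been updated."""
    p = -G @ g                   # quasi-Newton search direction
    alpha = line_search(w, p)    # e.g. a backtracking search satisfying the Wolfe conditions
    return w + alpha * p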


B Derivation of the Back-Propagation Equations

To illustrate back-propagation, assume an L-layer network as defined in Section 3.5 with N_0 = N input variables and N_L = M outputs. M > 1 is allowed in this derivation as it simplifies the notation. Let w_{ij}^{(k)} be the synapse weights and b_i^{(k)} the biases of the network. As a reminder, the k-th layer of an ANN computes its output y_i^{(k)} as follows:

    z_i^{(k)} = \sum_{j=1}^{N_{k-1}} w_{ij}^{(k)} y_j^{(k-1)} + b_i^{(k)},    (B.1)

    y_i^{(k)} = φ^{(k)}(z_i^{(k)}),    (B.2)

where z is the net input, y is the neuron activation, and i is the index of the neuron with i = 1, …, N_k. The dependencies of the functions z_i^{(k)} and y_i^{(k)} on their arguments are being suppressed for legibility’s sake.

The task is to find the partial derivatives ∂E/∂w_{ij}^{(k)} and ∂E/∂b_i^{(k)} for any synapse weight or bias of the ANN. To do so, assume first that the error function E is of the following structure:

    E(w|S) = \sum_{(x,t)∈S} e(t, y_ANN(x|w)),    (B.3)

i.e. the total error function can be written as a sum of per-pattern errors e(t, y) that can be computed from the ANN output for each training pattern independently¹. This allows us to calculate the derivative for each pattern separately and obtain the derivative of the total error as:

    ∂E(w|S)/∂w_{ij}^{(k)} = \sum_{(x,t)∈S} ∂e(t, y_ANN(x|w))/∂w_{ij}^{(k)}.    (B.4)

¹ This is true for all error functions presented in Section 3.8.


The first step is to use the chain rule of differentiation, and insert a derivative w.r.t. the net input z of the neuron with which the parameter is associated:

    ∂e/∂w_{ij}^{(k)} = (∂e/∂z_i^{(k)}) · (∂z_i^{(k)}/∂w_{ij}^{(k)}).    (B.5)

Using (B.1), it is clear that:

    ∂z_i^{(k)}/∂w_{ij}^{(k)} = y_j^{(k-1)}.    (B.6)

This can be inserted in (B.5) to get:

    ∂e/∂w_{ij}^{(k)} = (∂e/∂z_i^{(k)}) · y_j^{(k-1)},    (B.7)

    ∂e/∂b_i^{(k)} = (∂e/∂z_i^{(k)}) · 1.    (B.8)

At this point, it is useful to define:

    δ_i^{(k)} := ∂e/∂z_i^{(k)},    (B.9)

and find a recursive formula for these δ_i^{(k)}, which are informally called either neuron errors or neuron deltas.

The neuron deltas of the k-th layer can be calculated using the deltas of the (k+1)-th layer. To do so, one can use the chain rule once more, being aware of the fact that all neurons of the (k+1)-th layer depend on the output of the neuron associated with δ_i^{(k)}:

    δ_i^{(k)} = ∂e/∂z_i^{(k)} = \sum_{j=1}^{N_{k+1}} (∂e/∂z_j^{(k+1)}) (∂z_j^{(k+1)}/∂z_i^{(k)})
              = \sum_{j=1}^{N_{k+1}} (∂e/∂z_j^{(k+1)}) (∂z_j^{(k+1)}/∂y_i^{(k)}) (∂y_i^{(k)}/∂z_i^{(k)}).    (B.10)

This sum contains three derivatives, one of which does not depend on the summation index j. The first term is easily identified as δ_j^{(k+1)}. Using (B.1), the second term can be calculated as w_{ji}^{(k+1)} (note the swapped indices j and i). And using (B.2), the third term turns out to be the derivative of the activation function, φ^{(k)′}. Put together, this gives the recursive formula:

    δ_i^{(k)} = ( \sum_{j=1}^{N_{k+1}} δ_j^{(k+1)} w_{ji}^{(k+1)} ) · φ^{(k)′}(z_i^{(k)}).    (B.11)

Finally, consider k = L, the base case of this formula. In (B.4), the (reasonable) assumption has been made that the error function e depends in some way on the output of the neural network. Thus, the chain rule can be applied for a third time:

    δ_i^{(L)} = ∂e/∂z_i^{(L)} = (∂e/∂y_i^{(L)}) · (∂y_i^{(L)}/∂z_i^{(L)}).    (B.12)

With (B.2), the second term is, as before, the derivative of the activation function. The first term is the derivative of the error function w.r.t. the network output; cf. Section 3.8 for the formulas of the relevant error functions.

The results are summarized in Subsection 3.6.1. Other derivatives of the neural network (e.g. the Jacobian ∂y_k/∂x_i or the Hessian ∂²E/∂w_{ij}∂w_{kl}) can be calculated using a similar approach.
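
The recursion can be transcribed almost literally into NumPy. The sketch below assumes, for simplicity, the same activation function φ for every layer and takes the derivative of the per-pattern error w.r.t. the network output as a callback; all names are illustrative:

import numpy as np

def per_pattern_gradients(weights, biases, x, t, phi, dphi, de_dy):
    """Gradients of the per-pattern error e via the delta recursion (B.7)-(B.12).

    weights[k-1] and biases[k-1] belong to layer k of the network; phi and dphi
    are the activation function and its derivative; de_dy(t, y) is de/dy for the
    output layer (Section 3.8).
    """
    # Forward pass, storing net inputs z and activations y for every layer.
    ys, zs = [x], []
    for W, b in zip(weights, biases):
        zs.append(W @ ys[-1] + b)        # (B.1)
        ys.append(phi(zs[-1]))           # (B.2)

    grads_w, grads_b = [], []
    delta = de_dy(t, ys[-1]) * dphi(zs[-1])            # (B.12), base case k = L
    for k in range(len(weights) - 1, -1, -1):
        grads_w.insert(0, np.outer(delta, ys[k]))      # (B.7): de/dw_ij = delta_i * y_j
        grads_b.insert(0, delta)                       # (B.8)
        if k > 0:
            delta = (weights[k].T @ delta) * dphi(zs[k - 1])   # (B.11)
    return grads_w, grads_b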


C Bias–Variance Decomposition for the Cross-Entropy

According to (3.46) in Subsection 3.8.2, the per-pattern cross-entropy is defined as:

    e(t, x) = −t ln y(x, S) − (1 − t) ln(1 − y(x, S)),    (C.1)

where it has been made explicit that the exact result of the network function y := y_ANN will depend on its training set S.

As stated in Subsection 3.9.1, the expression under consideration is:

    E_S[E_t[e(t, x)]],    (C.2)

where E_S[·] is the expectation over all training sets S of an arbitrary but fixed size and E_t[·] is the expectation over the random variable t and makes sure that the overlap between C_sig and C_bkg is handled correctly.

As noted in Subsection 3.9.1, t is a discrete random variable with P(t = 0) = p(C_bkg|x) and P(t = 1) = p(C_sig|x). Together with (C.1), this gives:

    E_t[e(t, x)] = −p(C_sig|x) ln y(x, S) − p(C_bkg|x) ln(1 − y(x, S)).    (C.3)

This formula can be split into two different contributions:

    E_t[e(t, x)] = H(p) + D_KL(p‖y(x, S)),    (C.4)

where:

    H(p) := −p(C_sig|x) ln(p(C_sig|x)) − p(C_bkg|x) ln(p(C_bkg|x)),    (C.5a)


is the entropy of the probability distribution of t, while:

    D_KL(p‖y(x, S)) := p(C_sig|x) ln[ p(C_sig|x) / y(x, S) ] + p(C_bkg|x) ln[ p(C_bkg|x) / (1 − y(x, S)) ],    (C.5b)

is the Kullback–Leibler divergence of y from p. Intuitively, it is a measure of the information that is lost when using y to approximate a “true” probability distribution p [143, p. 51]. D_KL(p‖y) is always non-negative, and the smaller it is, the more similar the distributions p and y are.

Using this decomposition (C.4) and knowing that H(p) does not depend on S, (C.2) can be written as:

    E_S[E_t[e(t, x)]] = H(p) + E_S[D_KL(p‖y(x, S))].    (C.6)

By adding and subtracting the term D_KL(p‖E_S[y(x, S)]) (note the difference to the second term in (C.6)), one arrives at:

    E_S[E_t[e(t, x)]] = H(p) + D_KL(p‖E_S[y(x, S)]) + { E_S[D_KL(p‖y(x, S))] − D_KL(p‖E_S[y(x, S)]) }.    (C.7)

It is important to note that the braced difference is always non-negative. The reason is that the probability distribution p does not depend on S and D_KL(p‖·) is a convex function¹; Jensen’s inequality then gives:

    E_S[D_KL(p‖y)] ≥ D_KL(p‖E_S[y]).

Defining ȳ(x) := E_S[y(x, S)] for brevity’s sake, it is trivial to define functions d_B and d_V so that:

    E_S[E_t[e(t, x)]] = H(p) + d_B(p, ȳ(x)) + E_S[ d_V(ȳ(x), y(x, S)) ],    (C.8)

where all three contributions are non-negative.

¹ D_KL(p‖x) depends on x through ln(1/x). Because the logarithm ln x is a concave function, it follows that ln(1/x) = − ln x is a convex function. Thus, D_KL(p‖x) is convex in x, as well.
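
The decomposition (C.4) can be checked numerically for a single input x. In the sketch below, the signal probability p(C_sig|x) and the network output y(x, S) are arbitrary example values:

import numpy as np

def cross_entropy_decomposition(p_sig, y):
    """Verify E_t[e(t, x)] = H(p) + D_KL(p || y) for one tau candidate."""
    p = np.array([p_sig, 1.0 - p_sig])   # P(t = 1), P(t = 0)
    q = np.array([y, 1.0 - y])           # network output read as a Bernoulli distribution
    expected_error = -np.sum(p * np.log(q))    # (C.3)
    entropy = -np.sum(p * np.log(p))           # (C.5a)
    kl = np.sum(p * np.log(p / q))             # (C.5b)
    assert np.isclose(expected_error, entropy + kl)   # (C.4)
    return entropy, kl

# Example: p(C_sig|x) = 0.7 and y(x, S) = 0.6
print(cross_entropy_decomposition(0.7, 0.6))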


List of Figures

3.1 Graphical representation of an artificial neuron
3.2 Activation functions and their derivatives
3.3 A typical feed-forward neural network
3.4 The decision boundary created by a single-layer network
3.5 The influence of alpha on stochastic gradient descent
3.6 Gradient descent in an ill-conditioned problem
3.7 A visualization of under- and overfitting
3.8 Typical training and validation set error during training

4.1 Cumulative luminosity over time for the year 2012
4.2 Schematic illustration of the CERN accelerator complex
4.3 Cut-away view of the ATLAS detector and its subsystems
4.4 The ATLAS inner detector
4.5 Cut-away view of the ATLAS calorimeter system
4.6 Cut-away view of the ATLAS muon spectrometer system

5.1 f_core^corr for 1- and 3-prong tau candidates
5.2 f_track^corr for 1- and 3-prong tau candidates
5.3 R_track for 1- and 3-prong tau candidates
5.4 ∆R_max and S_T^flight for 3-prong tau candidates
5.5 m_tracks for 3-prong tau candidates
5.6 S_leadtrk^IP and N_track^iso for 1-prong tau candidates
5.7 N_π0 for 1- and 3-prong tau candidates
5.8 m_τ^vis for 1- and 3-prong tau candidates
5.9 f_vis−pT for 1- and 3-prong tau candidates

6.1 An exemplary ROC curve
6.2 (Un-)corrected signal efficiency over pT (1-prong)
6.3 Signal efficiency over µ for 1- and multi-prong
6.4 ROC curves of the benchmark BDT (1-/multi-prong)
6.5 ROC curves of the best and the worst 5-member ensemble (1-prong)
6.6 ROC curves of the best and the worst 5-member ensemble (multi-prong)
6.7 pT and µ distributions of the signal and background samples as well as the resulting per-event weights (1-prong)
6.8 Influence of pT- and µ-reweighting
6.9 εsig over ythrs for two different pT ranges (multi-prong)
6.10 εsig over ythrs after score transformation (1-/multi-prong)
6.11 εsig over pT after score transformation
6.12 Comparison plot of different training algorithms
6.13 Comparison plot of different error functions
6.14 Comparison plot of different activation functions
6.15 Comparison of error evolution during training for ANNs with different activation functions
6.16 Comparison plot of different network topologies
6.17 Training and validation set error for different network topologies
6.18 Comparison of an optimized ANN ensemble with the benchmark BDT


List of Tables

5.1 Tau decay channels and branching ratios
5.2 Identification variables used by the MVA methods

6.1 Sample IDs and sizes
6.2 TMVA options used to train the benchmark BDT
6.3 TMVA options used to train the ANN ensembles
6.4 Mean AUC of individual ANNs and ensembles
6.5 TMVA options used in the comparison of different training algorithms
6.6 AUCs of the ensembles trained with different training algorithms
6.7 TMVA options used in the comparison of different error functions
6.8 AUCs of ensembles trained with different error functions
6.9 TMVA options used in the comparison of different activation functions
6.10 AUCs of ensembles with different activation functions
6.11 TMVA options used in the optimization of the network topology


Bibliography

All web links were last accessed on October 9, 2015.

[1] D. Griffiths, Introduction to Elementary Particles. Wiley-VCH, 2nd revised ed., 2008. http://eu.wiley.com/WileyCDA/WileyTitle/productCd-3527406018.html.

[2] SLAC-LBL Collaboration, M. L. Perl et al., Evidence for Anomalous Lepton Production in e+e− Annihilation, Phys. Rev. Lett. 35 (Dec, 1975) 1489–1492.

[3] M. L. Perl, Evidence for, and Properties of, the New Charged Heavy Lepton, in Twelfth Rencontre De Moriond, vol. 1, pp. 75–97. 1977. http://www-public.slac.stanford.edu/sciDoc/docMeta.aspx?slacPubNumber=SLAC-PUB-1923.

[4] L. M. Lederman et al., Observation of a Dimuon Resonance at 9.5 GeV in 400-GeV Proton-Nucleus Collisions, Phys. Rev. Lett. 39 (Aug, 1977) 252–255.

[5] CDF Collaboration, F. Abe et al., Observation of top quark production in pp collisions, Phys. Rev. Lett. 74 (1995) 2626–2631, arXiv:hep-ex/9503002 [hep-ex].

[6] DONUT Collaboration, K. Kodama et al., Observation of tau neutrino interactions, Phys. Lett. B504 (2001) 218–224, arXiv:hep-ex/0012035 [hep-ex].

[7] Particle Data Group Collaboration, K. A. Olive et al., Review of Particle Physics, Chin. Phys. C38 (2014) 090001.

[8] L. Evans and P. Bryant, LHC Machine, Journal of Instrumentation 3 (2008) S08001.

[9] ATLAS Collaboration, G. Aad et al., Observation of a new particle in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC, Phys. Lett. B716 (2012) 1–29, arXiv:1207.7214 [hep-ex].

[10] CMS Collaboration, S. Chatrchyan et al., Observation of a new boson at a mass of 125 GeV with the CMS experiment at the LHC, Phys. Lett. B716 (2012) 30–61, arXiv:1207.7235 [hep-ex].

[11] M. Wolter, Tau identification using multivariate techniques in ATLAS, Tech. Rep. ATL-PHYS-PROC-2009-016. ATL-COM-PHYS-2008-286, CERN, Geneva, Dec, 2008. http://cds.cern.ch/record/1152704.

[12] D. Ciresan, U. Meier, J. Masci, and J. Schmidhuber, A committee of neural networks for traffic sign classification, in IJCNN, pp. 1918–1921. Jul, 2011.

[13] J. Markoff, Scientists See Promise in Deep-Learning Programs. http://nyti.ms/18mxReu.

[14] T. Ciodaro, D. Deva, D. Damazio, and J. de Seixas, Online Particle Detection by Neural Networks Based on Topologic Calorimetry Information, Tech. Rep. ATL-DAQ-PROC-2011-049, CERN, Geneva, Nov, 2011. http://cds.cern.ch/record/1402984.

[15] G. Sartisohn and W. Wagner, Higgs Boson Search in the H → WW(∗) → ℓνℓν Channel using Neural Networks with the ATLAS Detector at 7 TeV. PhD thesis, Wuppertal U., Mar, 2012. http://cds.cern.ch/record/1456081.

[16] Measurement of performance of the pixel neural network clustering algorithm of the ATLAS experiment at √s = 13 TeV, Tech. Rep. ATL-PHYS-PUB-2015-044, CERN, Geneva, Sep, 2015. http://cds.cern.ch/record/2054921.

[17] W. S. McCulloch and W. Pitts, A logical calculus of the ideas immanent in nervous activity, The Bulletin of Mathematical Biophysics 5 (1943) no. 4, 115–133.

[18] S. L. Glashow, Partial Symmetries of Weak Interactions, Nucl. Phys. 22 (1961) 579–588.

[19] A. Salam and J. C. Ward, Electromagnetic and weak interactions, Phys. Lett. 13 (1964) 168–171.

[20] S. Weinberg, A Model of Leptons, Phys. Rev. Lett. 19 (Nov, 1967) 1264–1266.

[21] SLAC-SP-017 Collaboration, J. E. Augustin et al., Discovery of a Narrow Resonance in e+e− Annihilation, Phys. Rev. Lett. 33 (1974) 1406–1408.

[22] E. Noether, Invarianten beliebiger Differentialausdrücke, Nachrichten von der Gesellschaft der Wissenschaften zu Göttingen, Mathematisch-Physikalische Klasse 1918 (1918) 37–44. http://eudml.org/doc/59011.

[23] F. Englert and R. Brout, Broken Symmetry and the Mass of Gauge Vector Mesons, Phys. Rev. Lett. 13 (Aug, 1964) 321–323.

[24] P. W. Higgs, Broken Symmetries and the Masses of Gauge Bosons, Phys. Rev. Lett. 13 (Oct, 1964) 508–509.

[25] G. S. Guralnik, C. R. Hagen, and T. W. B. Kibble, Global Conservation Laws and Massless Particles, Phys. Rev. Lett. 13 (Nov, 1964) 585–587.

[26] LHC Higgs Cross Section Working Group Collaboration, J. R. Andersen et al., Handbook of LHC Higgs Cross Sections: 3. Higgs Properties, arXiv:1307.1347 [hep-ph].

[27] C. Boddy, S. Farrington, and C. Hays, Higgs boson coupling sensitivity at the LHC using H → ττ decays, Phys. Rev. D86 (2012) 073009, arXiv:1208.0769 [hep-ph].

[28] S. Berge, W. Bernreuther, and S. Kirchner, Determination of the Higgs CP-mixing angle in the tau decay channels at the LHC including the Drell–Yan background, European Physical Journal C 74 (2014) no. 11, 3164, arXiv:1408.0798 [hep-ph].

[29] CMS Collaboration, V. Khachatryan et al., Search for the Standard Model Higgs Boson Produced through Vector Boson Fusion and Decaying to bb, arXiv:1506.01010 [hep-ex].

[30] ATLAS Collaboration, G. Aad et al., Measurement of the Z to tau tau Cross Section with the ATLAS Detector, Phys. Rev. D84 (2011) 112006, arXiv:1108.2016 [hep-ex].

[31] ATLAS Collaboration, G. Aad et al., Search for neutral MSSM Higgs bosons decaying to τ+τ− pairs in proton-proton collisions at √s = 7 TeV with the ATLAS detector, Phys. Lett. B705 (2011) 174–192, arXiv:1107.5003 [hep-ex].

[32] ATLAS Collaboration, Expected Sensitivity in Light Charged Higgs Boson Searches for H+ to tau+nu and H+ to c+sbar with Early LHC Data at the ATLAS Experiment, Tech. Rep. ATL-PHYS-PUB-2010-006, CERN, Geneva, Jun, 2010. http://cds.cern.ch/record/1272425.

[33] ATLAS Collaboration, G. Aad et al., Measurement of the inclusive W± and Z/gamma cross sections in the electron and muon decay channels in pp collisions at √s = 7 TeV with the ATLAS detector, Phys. Rev. D85 (2012) 072004, arXiv:1109.5141 [hep-ex].

[34] G. Buchalla, G. Burdman, C. T. Hill, and D. Kominis, GIM violation and new dynamics of the third generation, Phys. Rev. D53 (1996) 5185–5200, arXiv:hep-ph/9510376 [hep-ph].

[35] L. Edelhäuser and A. Knochel, Observing nonstandard W′ and Z′ through the third generation and Higgs lens, arXiv:1408.0914 [hep-ph].

[36] D. Kriesel, A Brief Introduction to Neural Networks. 2007. http://www.dkriesel.com/.

[37] F. Rosenblatt, The Perceptron: A perceiving and recognizing automaton (Project PARA), Tech. Rep. 85-460-1, Cornell Aeronautical Laboratory, 1957.

[38] M. L. Minsky and S. Papert, Perceptrons: An Introduction to Computational Geometry. MIT Press, 1st ed., 1969. https://books.google.de/books?id=36E0QgAACAAJ.

[39] M. L. Minsky and S. Papert, Perceptrons: An Introduction to Computational Geometry. MIT Press, expanded ed., 4. print. ed., 1990. https://mitpress.mit.edu/books/perceptrons.

[40] M. Olazaran, A Sociological Study of the Official History of the Perceptrons Controversy, Social Studies of Science 26 (1996) no. 3, pp. 611–659. http://www.jstor.org/stable/285702.

[41] D. E. Rumelhart, J. L. McClelland, and P. R. Group, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1. MIT Press, 1986.

[42] D. E. Rumelhart, J. L. McClelland, and P. R. Group, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 2. MIT Press, 1986.

[43] A. E. Bryson and Y.-C. Ho, Applied Optimal Control: optimization, estimation, and control. Blaisdell Publishing Company, revised ed., 1969.

[44] P. J. Werbos, Beyond regression: new tools for regression and analysis in the behavioral sciences. PhD thesis, Harvard University, Division of Engineering and Applied Physics, 1974.

[45] P. Werbos, The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting. Wiley, New York, 1994.

[46] S. E. Fahlman, An empirical study of learning speed in back-propagation networks, tech. rep., 1988. http://repository.cmu.edu/compsci/1800/.

[47] M. Riedmiller and H. Braun, RPROP: A Fast Adaptive Learning Algorithm, in International Symposium on Computer and Information Science VII, E. Gelenbe, ed., pp. 279–286. Antalya, Turkey, 1992.

[48] C. G. Broyden, The Convergence of a Class of Double-rank Minimization Algorithms, IMA Journal of Applied Mathematics 6 (1970) no. 1, 76–90.

[49] M. R. Hestenes and E. Stiefel, Methods of conjugate gradients for solving linear systems, Journal of the National Bureau of Standards 49 (1952) no. 6, 409–436.

[50] K. A. Levenberg, A method for the solution of certain problems in least squares, Quart. Appl. Math. 2 (1944) 164–168.

[51] C. M. Bishop, Neural Networks for Pattern Recognition. Clarendon Press, 1995. http://ukcatalogue.oup.com/product/9780198538646.do.

[52] O. Hamsici and A. Martinez, Bayes Optimality in Linear Discriminant Analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (Apr, 2008) 647–657.

[53] G. Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of Control, Signals and Systems 2 (1989) no. 4, 303–314.

[54] E. B. Baum, On the capabilities of multilayer perceptrons, Journal of Complexity 4 (1988) no. 3, 193–215.

[55] K. Hornik, Approximation capabilities of multilayer feedforward networks, Neural Networks 4 (1991) no. 2, 251–257.

[56] Y. Bengio, Learning Deep Architectures for AI, Foundations and Trends in Machine Learning 2 (Jan, 2009) 1–127.

[57] S. Hochreiter, The vanishing gradient problem during learning recurrent neural nets and problem solutions, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 6 (1998) no. 2, 107–116.

[58] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, Efficient BackProp, in Neural Networks: Tricks of the Trade, G. Montavon, G. B. Orr, and K.-R. Müller, eds., vol. 7700 of Lecture Notes in Computer Science, pp. 9–48. Springer Berlin Heidelberg, 2012.

[59] G. B. Orr, Dynamics and algorithms for stochastic search. PhD thesis, Oregon Graduate Institute, Beaverton, OR, 1995.

[60] C. Igel and M. Hüsken, Improving the Rprop learning algorithm, in Second International Symposium on Neural Computation (NC 2000), vol. 2000, pp. 115–121. 2000.

[61] C. Igel and M. Hüsken, Empirical evaluation of the improved Rprop learning algorithm, Neurocomputing 50 (2003) 105–123.

[62] M. Riedmiller, Rprop—Description and Implementation Details, tech. rep., University Karlsruhe, Germany, 1994.

[63] M. Riedmiller, Advanced Supervised Learning in Multi-layer Perceptrons – From Backpropagation to Adaptive Learning Algorithms, Int. Journal of Computer Standards and Interfaces 16 (1994) 265–278. Special Issue on Neural Networks.

[64] N. K. Treadgold and T. D. Gedeon, Simulated annealing and weight decay in adaptive learning: the SARPROP algorithm, IEEE Transactions on Neural Networks 9 (Jul, 1998) 662–668.

[65] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, Optimization by Simulated Annealing, Science 220 (1983) no. 4598, 671–680.

[66] V. Černý, Thermodynamical approach to the traveling salesman problem: An efficient simulation algorithm, Journal of Optimization Theory and Applications 45 (1985) no. 1, 41–51.

[67] R. Fletcher, A new approach to variable metric algorithms, The Computer Journal 13 (1970) no. 3, 317–322.

[68] D. Goldfarb, A family of variable-metric methods derived by variational means, Mathematics of Computation 24 (1970) no. 109, 23–26.

[69] D. F. Shanno, Conditioning of quasi-Newton methods for function minimization, Mathematics of Computation 24 (1970) no. 111, 647–656.

[70] P. Wolfe, Convergence Conditions for Ascent Methods. II: Some Corrections, SIAM Review 13 (1971) no. 2, 185–188.

[71] J. Nocedal and S. J. Wright, Numerical optimization, vol. 2. Springer New York, 1999.

[72] A. M. Chen and H.-m. Lu, On the Geometry of Feedforward Neural Network Error Surfaces, Neural Computation 5 (1993) no. 6, 910–927.

[73] B. L. Kalman and S. C. Kwasny, Why tanh: choosing a sigmoidal function, in IJCNN, vol. 4, pp. 578–581. Jun, 1992.

[74] I. Isa, Z. Saad, S. Omar, M. Osman, K. Ahmad, and H. Sakim, Suitable MLP Network Activation Functions for Breast Cancer and Thyroid Disease Detection, in CIMSim, pp. 39–44. Sep, 2010.

[75] M. D. Richard and R. P. Lippman, Neural network classifiers estimate Bayesian discriminant function, Neural Computation Concepts and Theory 3 (1991) 461–483.

[76] S. Geman, E. Bienenstock, and R. Doursat, Neural Networks and the Bias/Variance Dilemma, Neural Computation 4 (1992) no. 1, 1–58.

[77] G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning. Springer New York, 2013.

[78] P. Domingos, A unified bias-variance decomposition, in 17th International Conference on Machine Learning, pp. 231–238. AAAI, 2000. http://homes.cs.washington.edu/~pedrod/bvd.pdf.

[79] A. Krogh and J. A. Hertz, A simple weight decay can improve generalization, Advances in Neural Information Processing Systems 4 (1992) 950–957.

[80] A. Gupta and S. M. Lam, Weight decay backpropagation for noisy data, Neural Networks 11 (1998) no. 6, 1127–1138.

[81] G. E. Hinton, Learning translation invariant recognition in a massively parallel networks, in Parallel Architectures and Languages Europe, J. de Bakker, A. Nijman, and P. Treleaven, eds., vol. 258 of Lecture Notes in Computer Science, pp. 1–13. Springer Berlin Heidelberg, 1987.

[82] P. L. Bartlett, For Valid Generalization the Size of the Weights is More Important than the Size of the Network, in Advances in Neural Information Processing Systems, M. Mozer, M. Jordan, and T. Petsche, eds., vol. 9, pp. 134–140. MIT Press, 1997. https://papers.nips.cc/paper/1204-unneccessarily-long-url.

[83] C. Wang, S. S. Venkatesh, and J. S. Judd, Optimal Stopping and Effective Machine Complexity in Learning, in Advances in Neural Information Processing Systems, vol. 6, pp. 303–310. Morgan Kaufmann Publishers, San Mateo, CA, 1994.

[84] L. Prechelt, Early Stopping – But When?, in Neural Networks: Tricks of the Trade, G. B. Orr and K.-R. Müller, eds., vol. 1524 of Lecture Notes in Computer Science, pp. 55–69. Springer Berlin Heidelberg, 1998.

[85] M. P. Perrone and L. N. Cooper, When Networks Disagree: Ensemble Methods for Hybrid Neural Networks, Tech. Rep. 61, 1992. http://www.dtic.mil/dtic/tr/fulltext/u2/a260045.pdf.

[86] L. K. Hansen and P. Salamon, Neural Network Ensembles, IEEE Transactions on Pattern Analysis and Machine Intelligence 12 (1990) no. 10, 993–1001.

[87] D. W. Opitz and J. W. Shavlik, Generating Accurate and Diverse Members of a Neural-Network Ensemble, in Advances in Neural Information Processing Systems, vol. 9, pp. 535–541. MIT Press, 1997.

[88] D. Barrow and S. Crone, Crogging (cross-validation aggregation) for forecasting – A novel algorithm of neural network ensembles on time series subsamples, in IJCNN, pp. 1–8. Aug, 2013.

[89] L. Breiman, Bagging predictors, Machine Learning 24 (1996) no. 2, 123–140.

[90] A. Buja and W. Stuetzle, Observations on bagging, Statistica Sinica 16 (2006) no. 2, 323–352. http://www3.stat.sinica.edu.tw/statistica/J16N2/J16N21/J16N21.html.

[91] D. Opitz and R. Maclin, Popular ensemble methods: An empirical study, Journal of Artificial Intelligence Research 11 (1999) 169–198.

[92] ATLAS public luminosity results, https://twiki.cern.ch/twiki/bin/view/AtlasPublic/LuminosityPublicResults.

[93] The LHC sees its first circulating beam, CERN Courier 48 (Oct, 2008) 7–8. http://cerncourier.com/cws/article/cern/35864.

[94] Incident in sector 3-4 of the LHC, CERN Courier 48 (Nov, 2008) 5–9. http://cerncourier.com/cws/article/cern/36274.

[95] Protons are back in the LHC, CERN Courier 49 (Dec, 2009) 5–9. http://cerncourier.com/cws/article/cern/40995.

[96] LHC: access required, time estimate about two years, CERN Courier 53 (Apr, 2013) 5–10. http://cerncourier.com/cws/article/cern/52729.

[97] Work for the LHC’s first long shutdown gets under way, CERN Courier 53 (Mar, 2013) 26–28. http://cerncourier.com/cws/article/cern/52361.

[98] Proton beams are back in the LHC, CERN Courier 55 (May, 2015) 5–14. http://cerncourier.com/cws/article/cern/60857.

[99] Stable beams at 13 TeV, CERN Courier 55 (Aug, 2015) 25–28. http://cerncourier.com/cws/article/cern/61866.

[100] S. Myers and E. Picasso, The design, construction and commissioning of the CERN Large Electron–Positron Collider, Contemporary Physics 31 (1990) no. 6, 387–403.

[101] J. Haffner, The CERN accelerator complex. Complexe des accélérateurs du CERN, Oct, 2013. https://cds.cern.ch/record/1621894. General Photo.

[102] ATLAS Collaboration, G. Aad et al., The ATLAS Experiment at the CERN Large Hadron Collider, Journal of Instrumentation 3 (2008) no. 08, S08003.

[103] CMS Collaboration, S. Chatrchyan et al., The CMS experiment at the CERN LHC, Journal of Instrumentation 3 (2008) no. 08, S08004.

[104] LHCb Collaboration, A. A. Alves Jr. et al., The LHCb Detector at the LHC, Journal of Instrumentation 3 (2008) no. 08, S08005.

[105] ALICE Collaboration, K. Aamodt et al., The ALICE experiment at the CERN LHC, Journal of Instrumentation 3 (2008) no. 08, S08002.

[106] LHCf Collaboration, O. Adriani et al., The LHCf detector at the CERN Large Hadron Collider, Journal of Instrumentation 3 (2008) no. 08, S08006.

[107] MoEDAL Collaboration, J. Pinfold et al., Technical Design Report of the MoEDAL Experiment, Tech. Rep. CERN-LHCC-2009-006, MoEDAL-TDR-001, CERN, Geneva, Jun, 2009. http://cds.cern.ch/record/1181486.

[108] TOTEM Collaboration, G. Anelli et al., The TOTEM Experiment at the CERN Large Hadron Collider, Journal of Instrumentation 3 (2008) no. 08, S08007.

[109] A. Outreach, ATLAS Fact Sheet: To raise awareness of the ATLAS detector and collaboration on the LHC, 2010. http://cds.cern.ch/record/1457044.

[110] ATLAS Collaboration, G. Aad et al., The ATLAS Inner Detector commissioning and calibration, European Physical Journal C 70 (Dec, 2010) 787–821, arXiv:1004.5293.

[111] ATLAS Collaboration, G. Aad et al., ATLAS pixel detector electronics and sensors, Journal of Instrumentation 3 (2008) no. 07, P07007. http://stacks.iop.org/1748-0221/3/i=07/a=P07007.

[112] A. Ahmad et al., The silicon microstrip sensors of the ATLAS semiconductor tracker, Nuclear Instruments and Methods in Physics Research A 578 (2007) no. 1, 98–118.

[113] B. Dolgoshein, Transition radiation detectors, Nuclear Instruments and Methods in Physics Research A 326 (1993) no. 3, 434–469.

[114] ATLAS Collaboration, Identification of the Hadronic Decays of Tau Leptons in 2012 Data with the ATLAS Detector, Tech. Rep. ATLAS-CONF-2013-064, CERN, Geneva, Jul, 2013. https://cds.cern.ch/record/1562839.

[115] ATLAS Collaboration, Standard Model Production Cross Section Measurements, Mar, 2015. https://atlas.web.cern.ch/Atlas/GROUPS/PHYSICS/CombinedSummaryPlots/SM/ATLAS_b_SMSummary_FiducialXsect/history.html.

[116] ATLAS Collaboration, G. Aad et al., Identification and energy calibration of hadronically decaying tau leptons with the ATLAS experiment in pp collisions at √s = 8 TeV, European Physical Journal C 75 (2015) no. 7, 303, arXiv:1412.7086 [hep-ex].

[117] W. Lampl, S. Laplace, D. Lelas, et al., Calorimeter Clustering Algorithms: Description and Performance, Tech. Rep. ATL-LARG-PUB-2008-002, CERN, Geneva, Apr, 2008. https://cds.cern.ch/record/1099735.

[118] T. Barillari, E. Bergeaas Kuutmann, T. Carli, et al., Local Hadronic Calibration, Tech. Rep. ATL-LARG-PUB-2009-001-2, CERN, Geneva, Jun, 2008. https://cds.cern.ch/record/1112035.

[119] K. Grahn, A. Kiryunin, and G. Pospelov, Tests of Local Hadron Calibration Approaches in ATLAS Combined Beam Tests, Tech. Rep. ATL-LARG-PROC-2010-011, CERN, Geneva, Sep, 2010. https://cds.cern.ch/record/1289717.

[120] M. Cacciari, G. P. Salam, and G. Soyez, The Anti-k(t) jet clustering algorithm, Journal of High Energy Physics 04 (2008) 063, arXiv:0802.1189 [hep-ph].

[121] ATLAS Collaboration, Performance of the Reconstruction and Identification of Hadronic Tau Decays in ATLAS with 2011 Data, Tech. Rep. ATLAS-CONF-2012-142, CERN, Geneva, Oct, 2012. https://cds.cern.ch/record/1485531.

[122] ATLAS Collaboration, Performance of the ATLAS Inner Detector Track and Vertex Reconstruction in the High Pile-Up LHC Environment, Tech. Rep. ATLAS-CONF-2012-042, CERN, Geneva, Mar, 2012. https://cds.cern.ch/record/1435196.

[123] ATLAS Collaboration, Performance of the Substructure Reconstruction of Hadronic Tau Decays with ATLAS, To be published, 2015.

[124] ATLAS Collaboration, G. Aad et al., Electron and photon energy calibration with the ATLAS detector using LHC Run 1 data, European Physical Journal C 74 (2014) no. 10, 3071, arXiv:1407.5063 [hep-ex].

[125] R. Brun and F. Rademakers, ROOT – An Object Oriented Data Analysis Framework, Nuclear Instruments and Methods in Physics Research A 389 (1997) 81–86. https://root.cern.ch/.

[126] A. Hoecker et al., TMVA: Toolkit for Multivariate Data Analysis. Mar, 2007. arXiv:physics/0703039.

[127] S. Hanisch, A. Straessner, and F. Siegert, Optimisation of the Hadronic Tau Identification Based on the Classification of Tau Decay Modes with the ATLAS Detector. PhD thesis, Dresden, Tech. U., Feb, 2015. https://cds.cern.ch/record/2004897. Presented 02 Feb 2015.

[128] D. Berge, J. Haller, A. K. Jr., et al., SFrame – A ROOT data analysis framework, 2012. http://sourceforge.net/projects/sframe/.

[129] T. Sjostrand, S. Mrenna, and P. Z. Skands, A Brief Introduction to PYTHIA 8.1, Comput. Phys. Commun. 178 (2008) 852–867, arXiv:0710.3820 [hep-ph].

[130] GEANT4 Collaboration, J. Allison et al., Geant4 developments and applications, IEEE Transactions on Nuclear Science 53 (Feb, 2006) 270–278.

[131] ATLAS Collaboration, ATLAS Computing: technical design report. Technical Design Report ATLAS. CERN, Geneva, 2005. https://cds.cern.ch/record/837738.

[132] E. Barberio et al., Identification of Hadronic Tau Decays for Summer 2011, Tech. Rep. ATL-PHYS-INT-2011-090, CERN, Geneva, Nov, 2011. https://cds.cern.ch/record/1398585. Internal use only.

[133] A. P. Bradley, The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recognition 30 (1997) no. 7, 1145–1159.

[134] S. Ng, S. Leung, and A. Luk, An integrated algorithm of magnified gradient function and weight evolution for solving local minima problem, in IJCNN, vol. 1, pp. 767–772. May, 2002.

[135] A. Anastasiadis and G. Magoulas, Nonextensive entropy and regularization for adaptive learning, in IJCNN, vol. 2, pp. 1067–1072. July, 2004.

[136] M. Carvalho and T. Ludermir, An Analysis Of PSO Hybrid Algorithms For Feed-Forward Neural Networks Training, in Ninth Brazilian Symposium on Neural Networks, pp. 6–11. Oct, 2006.

[137] C.-C. Cheung, A. Lui, and S. Xu, Solving the local minimum and flat-spot problem by modifying wrong outputs for feed-forward neural networks, in IJCNN, pp. 1–7. Aug, 2013.

[138] J. Morales, A numerical study of limited memory BFGS methods, Applied Mathematics Letters 15 (2002) no. 4, 481–487.

[139] X. Glorot and Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in JMLR Workshop and Conference Proceedings, vol. 9, pp. 249–256. May, 2010. http://jmlr.csail.mit.edu/proceedings/papers/v9/glorot10a.html.

[140] X. Glorot, A. Bordes, and Y. Bengio, Deep Sparse Rectifier Neural Networks, in JMLR Workshop and Conference Proceedings, vol. 15, pp. 315–323. April, 2011. http://jmlr.csail.mit.edu/proceedings/papers/v15/glorot11a.html.

[141] J. Schmidhuber, Deep learning in neural networks: An overview, Neural Networks 61 (2015) 85–117.

[142] S. Fleischmann and K. Desch, Tau lepton reconstruction with energy flow and the search for R-parity violating supersymmetry at the ATLAS experiment. PhD thesis, Bonn U., 2011. http://cds.cern.ch/record/1504815.

[143] K. P. Burnham and D. R. Anderson, Model Selection and Multimodel Inference. Springer New York, 2 ed., 2002.

Acknowledgements

In closing, I would like to thank everyone who supported me during the preparation of this thesis.

First and foremost, I thank Prof. Dr. Arno Straessner for his outstanding support; without him, this thesis would undoubtedly not have been possible. I also thank Lorenz Hauswald, Sebastian Wahrmund, and Dr. Wolfgang Mader for their numerous suggestions and excellent support on technical as well as physics questions. Likewise, I thank the rest of the Dresden tau group, namely David Kirchmeier, Dirk Duschinger, and Stefanie Hanisch, for their support and advice. Special thanks go to Tony Henseleit, whose Bachelor's thesis provided important impulses for this work.

I thank the ATLAS tau working group and its conveners, Attilio Andreazza and Will Davey, for their valuable feedback on the score-flattening algorithm. Furthermore, I thank all members of the IKTP, in particular Prof. Dr. Michael Kobel as head of the institute and Prof. Dr. Kai Zuber as my second referee, for the exceedingly pleasant working atmosphere.

My particular thanks go to my family for always being there for me. I thank Alexander Schubert and Julia Unger for their emotional and moral support. Last but not least, I thank Morgan for enduring an unbearable number of bad puns.


Declaration of Authorship

I hereby declare that I have written the present Master's thesis entitled Identification of Hadronic Tau Lepton Decays at the ATLAS Detector Using Artificial Neural Networks independently and without inadmissible help from third parties. No aids or sources other than those stated in the thesis have been used. Quotations, whether literal or paraphrased, have been marked as such. No further persons were involved in the intellectual preparation of the present work. I am aware that non-compliance with this declaration may lead to the subsequent revocation of the university degree.

Dresden, 12 October 2015

Nico Madysa