Automated Model Building for Drug Target Prediction


Page 1: Automated Model Building for Drug Target Prediction

Submitted by
Peter Karl Ruch, BSc.

Submitted at
Institute for Machine Learning

Supervisor
Univ.-Prof. Dr. Sepp Hochreiter

Co-Supervisor
Dr. Mag. Günter Klambauer

December 2019

JOHANNES KEPLER UNIVERSITY LINZ
Altenbergerstraße 69
4040 Linz, Österreich
www.jku.at
DVR 0093696

Automated Model Building for Drug Target Prediction

Master Thesis

to obtain the academic degree of

Master of Science

in the Master’s Program

Bioinformatics


Declaration of Authorship

I hereby declare under oath that I have written this thesis without the help of third parties and without using any aids other than those indicated; all ideas taken directly or indirectly from external sources are marked as such. This thesis has not previously been submitted in the same or a similar form to any other examination authority and has not yet been published. The present master's thesis is identical to the electronically submitted text document.

Linz, December 2019 Peter Ruch



Abstract

In recent years deep neural networks have been successfully applied to drug target prediction tasks [44] and have been shown to rank among the best methods for this problem [45]. As deep learning methods are known to scale well to large datasets, this raises the question of whether results can be further improved using additional data. The many publicly available chemical databases provide a great opportunity to compile new datasets for drug target prediction tasks, but in practice extracting such a dataset is labour- and time-intensive. There currently exists no framework that facilitates this task and enables experimentation with new datasets.

This work presents a platform for automated model building for drug target prediction. The platform can automatically extract new datasets for structure-activity prediction tasks and allows one to experiment freely with different protocols for extracting a dataset, to combine data from different databases, and to quickly compute a wide range of chemical descriptors. In the past, such experiments were infeasible due to the large amount of work and time involved in compiling a dataset. Compiling a new dataset using the platform now requires only a few days instead of weeks. The resulting datasets can then be used to build new models for drug target prediction tasks. To demonstrate the platform, it is used to extract a dataset from each of the last five releases (21-25) of the ChEMBL [20][23] database. The thesis explores whether fully automatically extracted datasets can be used for building drug target prediction models straight away or whether they require additional post-processing. The resulting datasets are analysed in terms of their assay and compound composition and used to train several multi-task neural networks in order to evaluate how dataset composition affects the performance of the models. The experiments show that models trained on each of the five extracted datasets achieve performance similar to previous results by Mayr et al. [45].



Zusammenfassung

In recent years, deep learning methods have been successfully applied to predicting possible targets for pharmaceutical compounds, and neural networks have been shown to rank among the best methods for this problem [45]. Deep learning methods are known to scale very well with dataset size, which raises the question of whether results can be further improved with additional data. Many publicly available chemical databases make it possible to create new datasets for this application, but in practice this process is very labour- and time-intensive. There is currently no software system that would simplify this task and facilitate new experiments. This thesis presents a platform for the automated building of models for predicting possible targets for pharmaceutical compounds. The platform can be used to generate new datasets for this problem, and it is now possible to experiment with different dataset-generation protocols, to combine data from different databases, and to quickly compute a large number of chemical descriptors. In the past, such experiments were impractical due to the large amount of work and time involved. With the platform, creating a new dataset now takes only a few days instead of weeks. The resulting datasets can then be used to build new models for predicting possible targets for pharmaceutical compounds. As a demonstration, the platform is used to create one new dataset for each of the last five releases (21-25) of the ChEMBL database [20][23]. The datasets are analysed with respect to their composition in terms of chemical compounds and experimental measurements, and used to train several multi-task neural networks in order to investigate the influence of dataset structure on model performance. The experiments show that models trained on the five datasets yield results similar to a previous study by Mayr et al. [45].



Acknowledgement

I would like to thank my supervisor Dr. Mag. Günter Klambauer for the opportunity to work on this topic and for his support throughout this thesis. Developing this platform greatly improved my understanding of many aspects of machine learning, especially the importance of data and the amount of thought and effort that is actually required to build a "good" dataset. Moreover, I would also like to thank Andreas Mayr for answering my questions regarding his work on the LSC paper this platform is based on.

My thanks go to all the members of the Machine Learning Institute for answering my questions throughout my studies and for the opportunity to join internal paper discussions and learn more about machine learning. I hope some of them will find the platform presented in this thesis useful for their research.

Furthermore, I would also like to thank my former colleagues from the visualization group at the Institute for Computer Graphics at Johannes Kepler University Linz for the great time I had working there during my studies. Finally, I want to express my gratitude towards my parents, Franz and Ulrike. Without their continuous support and encouragement to follow my interests I would probably never have come to Linz and would have followed a different path.



Contents

1 Introduction
    1.1 Outline
2 Drug Target Prediction
    2.1 Quantitative Structure-Activity Relationship
    2.2 Chemical Descriptors
        2.2.1 Molecular Descriptors
        2.2.2 Structural Keys
        2.2.3 Fingerprints
    2.3 Supervised Learning
3 Neural Networks
    3.1 Neural Networks
    3.2 Neural Networks for QSAR
        3.2.1 Fully Connected Networks
    3.3 Multi-Task Learning
    3.4 Neural Network Regularization
4 Datasets
    4.1 Chemical Databases
        4.1.1 ChEMBL
        4.1.2 PubChem
        4.1.3 ZINC15
    4.2 Database Overlap
5 Target Prediction Platform
    5.1 Motivation
    5.2 Downloading the databases
    5.3 Extraction and Preprocessing
        5.3.1 Extracting the compound and assay information
        5.3.2 Preprocessing Protocol
        5.3.3 Merging the different databases
    5.4 Computation of Descriptors
        5.4.1 Static Features
        5.4.2 Semisparse Features
        5.4.3 Tox Features and Morgan Fingerprints
        5.4.4 Sparse Features
    5.5 Export of records
    5.6 Workflow Management
    5.7 Training Process
    5.8 Platform Usage
6 Evaluation of effects of dataset composition on model performance
    6.1 Experimental design in the life sciences
        6.1.1 Compound Bias
    6.2 Experiments
        6.2.1 Evaluation Measures
        6.2.2 Architecture
        6.2.3 Hyperparameters
7 Results
    7.1 Results on LSC Fold 1
    7.2 Role of helper tasks
8 Discussion
    8.1 Contributions
Appendices
Appendix A Heatmaps of Assay Co-occurrence


List of Tables

5.1 Affinity thresholds for measurements excluding standard activity comments.
6.1 Hyperparameters used for the experiments.
7.1 Mean AUC scores for semisparse features across different ChEMBL datasets.
7.2 Mean AUC scores for semisparse features across different ChEMBL datasets reduced to a common subset of assays which were used for training.


List of Figures

2.1 Illustration of cross validation folds. For every column a model is fit on the training folds and evaluated on the test fold.
3.1 Illustration of a fully connected network for learning a single task.
3.2 Illustration of a multi-task neural network.
3.3 Loss on the training (train) and validation (val) sets from a training run of a neural network, illustrating the overfitting problem. Validation loss is minimal around epoch 30; with longer training the validation loss starts to increase again while the training loss decreases steadily.
4.1 Size of the different databases and of their intersections, shown with the UpSet [39] visualization technique for sets. The vertical black dotted lines represent intersections of the databases; the bar at the top of each black line displays the number of shared compounds. ChEMBL and PubChem share around 1.5 million compounds. Surprisingly, despite more than 96.5 million compounds in PubChem, both ZINC15 and ChEMBL25 still contain unique compounds.
5.1 Illustration of the raw triple records as stored in a chemical database.
5.2 The affinity values per compound and assay id pair are averaged and an activity label is assigned.
5.3 The assay-activity pairs per compound are collected and molecular information (in SMILES format) for each compound is added to every record.
5.4 Based on the molecular information, different molecular descriptors can be calculated and added to the records. The resulting dataset is ready for training a model.
5.5 Screenshot of the Airflow DAG for the ChEMBL processing pipeline.
6.1 Scatterplot of the number of compounds and assays for the raw ChEMBL databases. The numbers of assays and compounds increase linearly every year, but there is a drop in the number of assays with ChEMBL release 24.
6.2 Number of compounds vs. number of assays for the ChEMBL releases 21 to 25. There are jumps in the numbers of assays and compounds for the database updates from version 21 to 22 as well as from 23 to 24. Despite the drop in assays in the raw databases (see figure 6.1), the number of assays in the derived datasets is increasing.
6.3 UpSet [39] plot of the intersection of assays across the datasets extracted from the different ChEMBL releases. Only 1545 assays appear in all datasets, which is only around 68% of the total number of unique assays across all datasets (2256 unique assays).
6.4 Distribution of the low compound-frequency assays across the different ChEMBL releases. Later ChEMBL releases have more assays with lower compound frequencies.
6.5 The ratio of uncommon assays that are only a member of one of the folds increases over time.
6.6 Shift in the distribution of assays with low compound frequencies between train and test folds, based on LSC clusters and the complete dataset, across all ChEMBL releases. Splitting the folds causes the distributions to shift, as the preprocessing cutoff of at least around 100 compounds per assay is no longer satisfied. For lower ChEMBL releases train and test folds are still relatively close, but the gap widens with releases 24 and 25.
6.7 Diagram of the architecture of the model used for the experiments.
7.1 Scatterplots of AUC scores vs. training set size for the different ChEMBL releases. Splitting the dataset into separate folds results in some assays with fewer than 100 compounds in the training set.
7.2 The AUC assay scores are relatively consistent, but several assays jump across the different ChEMBL releases.
7.3 Conditioning the assay scores on different AUC interval ranges reveals different distributions of the AUC scores. Assays with lower AUC show higher variance than assays with higher AUC.
7.4 Comparison of assay AUC vs. training set size for the experiments using all assays (Assays Full) and only the subset (Assays Subset). Comparing the assay scores shows that AUC scores follow a linear trend (black lines have slope 1 and were not fit to the data). For the restricted subset there are more drops in AUC for high compound-frequency assays.
7.5 AUC scores across the Assays Subset are also consistent. There are again some noisy assays that jump a lot across ChEMBL releases.
A.1 Assay co-occurrence - ChEMBL21
A.2 Assay co-occurrence - ChEMBL22
A.3 Assay co-occurrence - ChEMBL23
A.4 Assay co-occurrence - ChEMBL24
A.5 Assay co-occurrence - ChEMBL25


Chapter 1

Introduction

Recently Mayr et al. [45] published a large-scale comparison of machine learning methods for drug target prediction. The authors extracted a dataset from ChEMBL [20][23] release 20, which was then used for training several machine learning algorithms (SVMs, Random Forests, Neural Networks, ...) in order to compare their performance. In the study, deep learning [38] methods (i.e. neural networks) were found to significantly outperform all other machine learning methods, providing chemoinformatics with a valuable baseline that can be used to guide future research. Given the recent reproducibility crisis in many fields of academic research, it is especially commendable that alongside the paper Mayr et al. also published supplemental material describing the data processing, as well as the code and actual data that were used for the experiments.

Unfortunately, due to long peer-review cycles, the data used for the study is by now somewhat outdated: ChEMBL 20 was released in February 2015 and is more than four years old. In the meantime the ChEMBL database reached version 25, released in March 2019, incorporating a lot of new data from recent studies and experiments. As neural networks are known to scale well to larger datasets, one might ask whether and how training on a new, extended dataset based on a more recent ChEMBL release would impact performance.

Answering this question requires compiling new datasets from the different ChEMBL releases for use in the experiments. But even though the code for creating the dataset used in the large-scale comparison is available online, the process of extracting such a dataset remains challenging. Initial attempts to extract a dataset using the published scripts reveal several problems: the code is not designed to be run frequently, requiring several scripts to be launched in a particular order, where some scripts take a long time to complete or require extensive amounts of resources, especially memory. If datasets need to be extracted often (e.g. for every new ChEMBL release or when experimenting with different preprocessing protocols), having to wait weeks for results while simultaneously blocking a lot of computational resources is not a long-term option. Ideally, extraction of a new dataset would be completely automated, efficiently utilize the available hardware, and complete in a few hours or possibly days.

In this work a platform for automated building of models for drug target prediction is presented, developed to address these concerns. The platform simplifies and speeds up the extraction of new datasets and even allows new datasets to be compiled using data from multiple databases. As a demonstration, the platform is used to compile a dataset from each of the last five ChEMBL releases (21-25). These datasets are then used to train several multi-task neural networks to evaluate the impact of dataset composition on model performance and to determine whether the automatically compiled datasets require any additional (manual) post-processing.

1.1 Outline

The thesis is organized in multiple chapters: chapter Drug Target Prediction briefly discusses chemical descriptors and introduces supervised learning concepts. The chapter Neural Networks focuses on neural networks and their usage for modelling structure-activity relationships. Chapter Datasets presents the databases supported by the platform that can be used for compiling new datasets. The platform itself and its components are presented and discussed in chapter Target Prediction Platform. In chapter Evaluation of effects of dataset composition on model performance, several datasets extracted using the platform are analysed and the experimental setup for evaluating model performance is presented. The results of the experiments are presented in chapter Results. The final chapter Discussion closes the thesis with a short review of the work.


Chapter 2

Drug Target Prediction

Interest in chemical compounds stems in part from their physiological effects and their potential use as drugs to treat a specific condition. The biological activity of a compound, and thereby its capability to be used as a drug, is determined by its chemical structure; this connection is usually termed the Structure-Activity Relationship (SAR). Development of a new drug for a disease first requires the identification of the biological processes, referred to as targets, which are involved in the disease to be treated. The next step entails the development of assay experiments, which are then used to determine compounds that interact with the specific target, and thereby potential drug candidates. The process of testing for this interaction of compounds and assays is also called screening. In the last decade the screening process has been completely automated and extremely parallelized through the use of robots, which is now known as High Throughput Screening (HTS). In HTS experiments, large libraries of thousands of small molecules are tested for their interaction with a specific target. Compounds which are found to be active in such an HTS experiment are termed hits, which are then further evaluated in order to determine so-called leads: compounds that show biological activity but might still not be suitable as drugs due to poor druglikeness (as determined, for example, by Lipinski's Rule of Five [41]). After several years and multiple biological and clinical trials, a drug derived from the initial compound is eventually approved. In reality only a small percentage of all drugs in development reaches the market, with the efficiency of drug research and development having declined over the last 50 years [54].

The large amounts of experimental results generated through HTS are now increasingly being used to build quantitative models of structure-activity relationships [31].

2.1 Quantitative Structure-Activity Relationship

Quantitative Structure-Activity Relationship (QSAR) [18] modelling involves modelling structure-activity relationships using regression or classification models. Regression and classification techniques are usually aggregated under the term supervised learning, which involves learning a function that properly approximates the observed relationship of a set of input and output pairs. For learning such a function, molecules are commonly mapped into a numeric representation that encodes chemical properties or the structure of a molecule as a vector of numbers.


2.2 Chemical Descriptors

A chemical descriptor encodes chemical information as a number. In this thesis the term descriptor will be used interchangeably to refer to molecular descriptors as well as structural descriptors, such as structural keys and fingerprints.

2.2.1 Molecular Descriptors

Molecular descriptors use logical or mathematical procedures to encode chemical information as a number [63]. These descriptors can be based on properties determined in experiments (aqueous solubility, molar refractivity, polarizability, ...) or calculated using mathematical procedures. Molecular descriptors are usually further classified based on the dimensionality of the structure representation, as in [22]:

• 0D: atom counts, bond counts, . . .

• 1D: counts of atom types, fragment counts, . . .

• 2D: topological descriptors: Zagreb Index, Wiener Index, . . .

• 3D: geometrical descriptors: molecular eccentricity, radius of gyration, . . .

Over the years a large number of molecular descriptors have been developed (consult [63] for an extensive overview). The library Mordred Descriptors [46] can be used to compute a large number of molecular descriptors.
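The 0D/1D categories above can be made concrete with a toy computation. The following sketch derives a few counts directly from a SMILES string; it is purely illustrative (a real implementation would use a toolkit such as RDKit or Mordred, which parse the molecular graph properly) and only handles simple SMILES without two-letter elements, ring closures or charges.

```python
# Toy 0D/1D descriptor computation for simple, bracket-free SMILES strings.
# Illustration only: real descriptor computation should use a chemoinformatics
# toolkit (RDKit, Mordred, ...), which handles aromaticity, rings, charges,
# two-letter elements (e.g. Cl, Br) and implicit hydrogens correctly.

def toy_descriptors(smiles: str) -> dict:
    """Count heavy atoms, explicit multiple bonds and a few atom types."""
    heavy_atoms = sum(1 for ch in smiles if ch.isalpha() and ch.isupper())
    # In SMILES, single bonds are implicit; '=' and '#' mark double and
    # triple bonds explicitly.
    return {
        "heavy_atoms": heavy_atoms,      # 0D: atom count
        "n_carbon": smiles.count("C"),   # 1D: counts of atom types
        "n_oxygen": smiles.count("O"),
        "n_nitrogen": smiles.count("N"),
        "double_bonds": smiles.count("="),
        "triple_bonds": smiles.count("#"),
    }

# Ethanol: two carbons, one oxygen, no explicit multiple bonds.
print(toy_descriptors("CCO"))
```

Such counts correspond to the 0D and 1D levels of the classification; 2D and 3D descriptors additionally require the molecular graph or geometry.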

2.2.2 Structural Keys

Structural keys were initially designed for database search. They are based on a fixed dictionary of chemical substructures: the frequency of every substructure is counted and stored as a vector of frequencies. Alternatively, structural keys can encode only the presence (1) or absence (0) of a specific substructure as a bit vector. Examples of structural keys are the PubChem Fingerprint and the MACCS keys. This study uses an additional structural key based on "Tox Features", substructures that have previously been reported as toxicophores, as used in [45].
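The dictionary-based encoding can be sketched as follows. The pattern list and the substring test are illustrative stand-ins: real structural keys such as MACCS perform graph-based substructure matching, not string matching on the SMILES.

```python
# Toy structural key: presence/absence of substructures from a fixed
# dictionary, encoded as a bit vector. Real structural keys (MACCS keys,
# PubChem Fingerprint) match substructures on the molecular graph; the plain
# substring test below only works for these hand-picked patterns.

PATTERNS = [
    "C(=O)O",  # carboxylic acid fragment
    "(=O)",    # carbonyl written as a branch
    "N",       # any aliphatic nitrogen
    "O",       # any oxygen
]

def structural_key(smiles: str) -> list[int]:
    """Bit vector: 1 if the pattern occurs in the SMILES, 0 otherwise."""
    return [1 if pattern in smiles else 0 for pattern in PATTERNS]

print(structural_key("CC(=O)O"))  # acetic acid -> [1, 1, 0, 1]
```

A frequency-based key would simply replace the presence test with `smiles.count(pattern)`.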

2.2.3 Fingerprints

Fingerprints were developed as a more abstract version of structural keys. In contrast to structural keys they do not require the specification of a set of substructures, but compute the relevant substructures from the molecule itself. This significantly increases the number of substructures and provides a better "sampling" of the molecule. As it is generally not feasible to store all substructures separately, they are usually hashed into a bit vector (standard sizes are 1024 or 2048 bits). There are multiple ways to determine these substructures, such as Extended Connectivity Fingerprints (ECFP) [52] or a simple Depth First Search (DFS).
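The hashing step can be illustrated with a toy fingerprint. Here contiguous SMILES substrings act as stand-ins for the enumerated substructures; only the folding of arbitrarily many fragments into a fixed-size bit vector corresponds to what real fingerprints such as ECFP do.

```python
# Toy hashed fingerprint: enumerate "substructures" and hash each into a
# fixed-size bit vector. Real fingerprints (e.g. ECFP) enumerate circular
# atom environments on the molecular graph; contiguous SMILES substrings
# are used here only to demonstrate the hashing/folding step.
import hashlib

def toy_fingerprint(smiles: str, n_bits: int = 1024, max_len: int = 4) -> list[int]:
    bits = [0] * n_bits
    for length in range(1, max_len + 1):
        for start in range(len(smiles) - length + 1):
            fragment = smiles[start:start + length]
            # Stable 32-bit hash of the fragment, folded into the bit vector.
            h = int.from_bytes(hashlib.md5(fragment.encode()).digest()[:4], "big")
            bits[h % n_bits] = 1
    return bits

fp = toy_fingerprint("CCO")
print(sum(fp), "bits set out of", len(fp))
```

Because different fragments may hash to the same position, fingerprints trade exactness for a compact, fixed-size representation.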

Descriptors can be computed using one of several chemoinformatics software toolkits such as RDKit [36], CDK [67], Mordred Descriptors [46] or jCompoundMapper [28], but given the large number of different types of descriptors, structural keys and fingerprints, not every library can be used to compute every chemical descriptor.

Together with the different ways to encode a molecule as a numeric vector, it is possible to model the relationship between the outcomes of assay experiments and compound structure using supervised learning techniques.

2.3 Supervised Learning

In supervised learning, given a set of training samples {z_1, z_2, ..., z_n} where the z_i are input-output pairs (x_i, y_i) of a feature vector x ∈ X and a corresponding label or value y ∈ Y, one wants to find a function or model f : X → Y whose predictions closely match the outputs: ŷ ≈ y, where ŷ = f(x).

One wants to find a model that is not only able to correctly process the samples used for selecting the model, but ideally also handles future data optimally. The error or loss per sample is generally measured using a loss function L(y, ŷ). Commonly used loss functions are the quadratic loss L(y, ŷ) = (y − ŷ)² and the zero-one loss L(y, ŷ) = I(ŷ ≠ y), where I is the indicator function.
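As a concrete reference, the two loss functions read as follows in code (scalar case):

```python
# The two standard per-sample loss functions from the text, for scalars.

def quadratic_loss(y, y_hat):
    """Quadratic loss L(y, y_hat) = (y - y_hat)^2, used for regression."""
    return (y - y_hat) ** 2

def zero_one_loss(y, y_hat):
    """Zero-one loss L(y, y_hat) = I(y_hat != y), used for classification."""
    return int(y != y_hat)

assert quadratic_loss(1.0, 0.5) == 0.25
assert zero_one_loss(1, 0) == 1 and zero_one_loss(1, 1) == 0
```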

The performance of a model on future data is called the generalization error or risk R, and is given by the expected loss on future data:

R(f(·)) = E_z[L(y, f(x))] = ∫_Z L(y, f(x)) p(z) dz    (2.1)

As the data-generating distribution p(z) is generally not known, the risk cannot be computed directly and needs to be approximated in practice. The risk can be estimated by splitting the available data Z = {z_1, z_2, ..., z_n} into a training set Z_train = {z_1, z_2, ..., z_l} and a test set Z_test = {z_{l+1}, z_{l+2}, ..., z_{l+m}}, where the training set is used for selecting (i.e. learning) the model and the test set is only used for evaluation.

R(f_{Z_train}(·)) ≈ (1/m) Σ_{i=l+1}^{l+m} L(y_i, f_{Z_train}(x_i))    (2.2)

If the number of samples |Z| is small, splitting the data into separate training and test sets might be problematic, as less data makes it more difficult to learn a good model. In these cases cross validation, another method for estimating the generalization error, is used.

For cross validation the data Z is split into k equally sized subsets, called folds. For each fold Z_j the model is trained on the k − 1 other folds Z_{\j} and evaluated on the remaining fold Z_j (see figure 2.1 for an illustration).

R_{k-CV}(Z) = (1/n) Σ_{j=1}^{k} Σ_{z_i ∈ Z_j} L(y_i, f_{Z\j}(x_i))    (2.3)
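Equation (2.3) can be sketched in a few lines of code. The names `fit` and `loss` are illustrative placeholders, and the constant-mean model is only used to make the example self-contained.

```python
# Minimal k-fold cross-validation risk estimate following eq. (2.3).
# `data` is a list of (x, y) pairs, `fit(train)` returns a model (a callable
# x -> y_hat), and `loss(y, y_hat)` is a per-sample loss. All names are
# illustrative, not taken from the thesis code.

def cross_validation_risk(data, k, fit, loss):
    n = len(data)
    folds = [data[i::k] for i in range(k)]  # k roughly equal folds
    total = 0.0
    for j in range(k):
        # Train on the k-1 other folds, evaluate on the held-out fold j.
        train = [z for i, fold in enumerate(folds) if i != j for z in fold]
        model = fit(train)
        total += sum(loss(y, model(x)) for x, y in folds[j])
    return total / n  # average loss over all n samples

# Example: a "model" that always predicts the training-set mean of y.
data = [(float(i), float(i)) for i in range(10)]
fit_mean = lambda train: (lambda x, m=sum(y for _, y in train) / len(train): m)
risk = cross_validation_risk(data, k=5, fit=fit_mean, loss=lambda y, p: (y - p) ** 2)
print(round(risk, 3))
```

Each sample is used for evaluation exactly once, which is why the outer average is taken over all n samples.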


Figure 2.1: Illustration of cross validation folds. The complete dataset is split into folds Z1-Z4; for every column a model (Model \Z_j) is fit on the training folds and evaluated on the held-out test fold.

Luntz and Brailovsky [42] showed that cross validation is an almost unbiased estimator of the risk.

There exist multiple ways to model such a relationship, such as Support Vector Machines, Random Forests or Neural Networks, and in practice no learning algorithm is guaranteed to perform well on every problem [68]. For QSAR tasks, neural networks have repeatedly been shown [43][44] to perform very well and to rank among the best performing methods [45].


Chapter 3

Neural Networks

Over the last decades the field of machine learning has gained a lot of popularity. Given the availability of large amounts of data, more computational resources, and free and open access to learning material on machine learning, people started to apply machine learning techniques to a wide range of problems. As a result, machine learning methods are now used in many fields.

In recent years, neural networks in particular gained a lot of attention from the media, academia and industry, after many challenging problems were solved using neural-network-based learning systems. Neural networks are able to model complex relationships: systems like AlphaGo [55] and AlphaStar [65] by Google DeepMind or OpenAI Five [48] by the company OpenAI were able to achieve superhuman performance in games, and modern network variants now power machine translation [69] and speech generation [66][51], and can be used to generate natural images [13].

3.1 Neural Networks

Neural networks are chains of differentiable functions f that map an input x ∈ R^n to an output y ∈ R^m. The functions f are also referred to as layers, with the simplest type being of the form (3.1), called a linear layer, where a is a non-linear (activation) function, and the weights W and biases b are the function parameters.

f_linear(x) = a(W_{m×n} x + b)    (3.1)

Neural networks are usually composed of multiple layers. The activation function a is usually one of several standard functions like the Sigmoid or Tanh functions, Rectified Linear Units (ReLU) [47], Exponential Linear Units (ELU) [19] or Scaled Exponential Linear Units (SELU) [34]. Introducing non-linearities allows the network to model more complicated functions. If a linear activation (i.e. a(x) = x) were used, a chain of linear layers (3.2) would simply collapse to a sequence of matrix-vector multiplications and additions.

F(x) = f_linear^{(3)} ∘ f_linear^{(2)} ∘ f_linear^{(1)}(x) = a(W_{n3×n2} a(W_{n2×n1} a(W_{n1×n0} x + b_0) + b_1) + b_2)   (3.2)
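The chain of layers in equation (3.2) can be sketched in a few lines of NumPy. This is an illustrative toy only: the layer sizes and random weights are made up, and ReLU is chosen as the activation a.

```python
import numpy as np

def relu(x):
    # ReLU activation: a(x) = max(0, x)
    return np.maximum(0.0, x)

def fc_forward(x, params):
    # Chain of linear layers f(x) = a(Wx + b), as in equation (3.2)
    for W, b in params:
        x = relu(W @ x + b)
    return x

rng = np.random.default_rng(0)
n0, n1, n2, n3 = 4, 5, 5, 1          # toy layer sizes: input, two hidden, output
params = [(rng.standard_normal((m, n)), np.zeros(m))
          for n, m in [(n0, n1), (n1, n2), (n2, n3)]]

x = rng.standard_normal(n0)          # a single input vector
y = fc_forward(x, params)            # output vector of shape (1,)
```

Each `(W, b)` pair corresponds to one weight matrix W_{m×n} and bias b from (3.1); composing the layers reproduces the nested form of (3.2).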


In order to find a suitable input-output relationship one usually specifies a loss function L(ŷ, y) that models the error between the network prediction ŷ and the ground-truth value y and that is minimized during learning. Since all functions f in a network are required to be differentiable, one is able to compute the gradients of the parameters, which can then be used to update the parameters. The gradients are computed using the backpropagation algorithm [53] by applying the chain rule of differentiation. In the past, when implementing a neural network program, the backpropagation steps for a model often had to be derived by hand first. Today there exist many software packages that offer automatic differentiation [12], which can be used to automatically derive the gradients of a model.

Neural networks are not limited to linear layers (3.1): there exist many different types of layers which are used for modelling different kinds of data (e.g. convolutional layers) or for altering the learning behaviour and generalization of the network (e.g. dropout [57] or batch normalization [30]). Networks can also include gating functionality for managing the information flow through the network [29] [58]. The chain of functions (3.2) does not need to be a simple sequence, but can also form a complicated computation graph for which the gradients can still be computed via automatic differentiation. This flexibility allows realizing new problem-specific neural network architectures, such as Transformers [64] for processing text, that perform extremely well and are now used extensively in practice.

Neural networks composed of linear layers (3.1) are referred to as densely or fully connected (FC) neural networks and are used to model vector-type data. Networks with convolutional layers are called convolutional neural networks (ConvNets) and are commonly used for modelling spatial data such as images or audio. There also exist network variants for modelling temporal data, such as recurrent neural networks (RNNs) or Long Short-Term Memory (LSTM) [29] networks. Given the popularity of neural networks, there exist many excellent online resources (blog posts, online courses, etc.) and textbooks [25] that can be consulted to get a better overview of all the different network types.

3.2 Neural Networks for QSAR

As already discussed in chapter 2, molecules can be represented using multiple different molecular descriptors. While there exist specialized neural networks for processing (molecular) graphs [24] [70], fully connected networks still find wide application in chemoinformatics and have been shown to perform extremely well on structure activity prediction tasks [45].

3.2.1 Fully Connected Networks

Molecules encoded as numeric feature vectors are used as the input of a neural network in order to model a structure activity relationship. A linear layer (3.1) maps the input feature vector into an intermediate representation by matrix multiplication with the weights W, addition of a bias term b, and application of an activation function. In a classic interpretation the vector elements are also referred to as neurons that are connected to all vector elements (neurons) of the next and previous layer (see figure 3.1), where each connection between neurons corresponds to an element in the weight matrix W.

Figure 3.1: Illustration of a fully connected network for learning a single task, with an input layer, two hidden layers, bias units and a single output neuron.

The single output neuron in the final layer usually has a sigmoid activation function, with binary cross-entropy (3.3) as loss, where ŷ denotes the prediction and y the ground-truth label.

L = −(y log(ŷ) + (1 − y) log(1 − ŷ))   (3.3)

During training a fully connected network is used to predict the outputs for randomly sampled subsets (batches) of the training data; the predictions are then used together with the ground-truth labels to compute the model's loss. From the loss the parameter gradients can be computed via auto-differentiation, and with the gradients the model parameters are then updated using a variant of gradient descent. The prediction step is commonly referred to as the forward pass, the computation of the gradients as the backward pass.
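The forward/backward training loop can be sketched for the minimal case of a single linear layer with sigmoid output. The binary cross-entropy gradient is written out by hand here, whereas in practice a framework's automatic differentiation would compute it; the data and hyperparameters are toy values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)
X = rng.standard_normal((64, 8))        # one batch: 64 samples, 8 features
y = (X[:, 0] > 0).astype(float)         # toy binary labels
W, b = np.zeros(8), 0.0                 # parameters of a single linear layer
lr = 0.5                                # learning rate

for _ in range(200):
    # forward pass: predictions for the batch
    y_hat = sigmoid(X @ W + b)
    # binary cross-entropy (3.3), averaged over the batch (clipped for log(0))
    p = np.clip(y_hat, 1e-7, 1 - 1e-7)
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    # backward pass: gradient of BCE + sigmoid w.r.t. the pre-activation
    grad = y_hat - y
    dW, db = X.T @ grad / len(y), grad.mean()
    # gradient descent update
    W -= lr * dW
    b -= lr * db
```

After a few hundred full-batch updates the loss on this separable toy problem drops close to zero; real training additionally shuffles batches and monitors a validation set.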

When building a QSAR model one might not be interested in modelling only a single target, since many biological targets may be related and possibly interact with the same molecular structures. By training on multiple targets simultaneously one can exploit similarities in molecular structure and also increase the number of samples that can be used for learning.

3.3 Multi-Task Learning

This approach is known as multi-task learning and can easily be realized using neural networks. In the multi-task setting one extends the initial neural network with one additional output (neuron) for every task of interest (see figure 3.2). By doing this one obtains an output vector that represents the predictions for the different tasks. One can then define a new masked loss function (3.4) on these predictions and the vector of ground-truth labels. As one sample might not


Figure 3.2: Illustration of a multi-task neural network: the network from figure 3.1 extended with a second output neuron.

have measurements for every possible task, tasks without labels are usually masked out by element-wise multiplication with a binary mask m.

L = − Σ_{i=1}^{n} m_i (y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i))   (3.4)

Via auto-differentiation one can easily backpropagate through this masked loss and obtain the parameter gradients.
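A minimal NumPy sketch of the masked loss (3.4); the labels, predictions and mask are toy values. The key property is that the placeholder label at a masked position must not influence the loss.

```python
import numpy as np

def masked_bce(y_hat, y, m, eps=1e-7):
    """Masked binary cross-entropy (3.4): tasks without a label
    (mask m == 0) contribute nothing to the loss."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    per_task = y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat)
    return -np.sum(m * per_task)

# one sample with predictions for three tasks; task 1 has no measurement
y_hat = np.array([0.9, 0.4, 0.2])
y     = np.array([1.0, 0.0, 0.0])   # placeholder label at the masked position
m     = np.array([1.0, 0.0, 1.0])   # binary mask: task 1 is masked out

loss = masked_bce(y_hat, y, m)
```

Because the mask multiplies each per-task term element-wise, changing the placeholder label of the masked task leaves the loss (and hence the gradients) unchanged.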

Multi-task learning has been studied in the past [16]: learning different tasks jointly generally amplifies the data that can be used for training. Sharing tasks also helps to select the relevant features (i.e. molecular descriptors) and helps other tasks discover better representations via the sharing of gradient information.

3.4 Neural Network Regularization

One drawback of neural networks is that they are very prone to overfitting. Overfitting refers to the situation where the validation loss increases again while the training loss is still steadily decreasing (see figure 3.3 for an illustration). This effect is caused by networks being heavily overparametrized: neural networks have a very high learning capacity, allowing them to memorize their inputs or even fit random noise [71]. Training neural networks therefore requires careful monitoring, as it is easy to overshoot and train for too long. The loss is generally closely monitored and intermediate model states are checkpointed (stored on disk so that a previous version of the model can be reloaded and training restarted from there).

There exist several techniques to regularize a neural network in order to reduce overfitting, the simplest being early stopping [17], i.e. to simply stop the training as soon as the validation loss increases. Other techniques are dropout [57] or weight decay [35]. Dropout randomly ignores a subset of neurons with probability p during


Figure 3.3: Loss on the training (train) and validation (val) sets from a training run of a neural network, illustrating the overfitting problem. The validation loss is minimal around epoch 30; when training longer, the validation loss starts to increase again while the training loss decreases steadily.

the forward pass. Dropout is only applied during training. Essentially, for every batch a different subnetwork within the original network is used to make the prediction, and during the backward pass only the "active" subset of the parameters is updated. During inference (e.g. for estimating the generalization error) dropout is disabled and an "average" of all subnetworks is used to make the prediction. Dropout can also be applied at the input layer, causing the model to learn using different subsets of input feature elements instead of all features. This is known as input dropout and in practice greatly improves the network's generalization. Weight decay is another way to avoid overfitting: a penalty based on the L1 or L2 norm of the weights is added to the loss. During optimization weight decay keeps the parameters from growing too large, leading to a more robust model. If the weights are large, even small changes in the inputs will cause a large change in the outputs and the predictions will change quickly. Small weights are more robust to changes in the input features and stabilize the model predictions.
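Both techniques can be sketched in NumPy with toy values; real frameworks provide them as built-ins. The dropout variant shown is "inverted" dropout, where the surviving activations are rescaled during training so that no rescaling is needed at inference time.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p, training):
    """Inverted dropout: during training each neuron is dropped with
    probability p; scaling by 1/(1-p) keeps the expected activation
    unchanged, so inference can use the activations as-is."""
    if not training:
        return h                       # dropout disabled during inference
    mask = rng.random(h.shape) >= p    # the 'active' subnetwork for this batch
    return h * mask / (1.0 - p)

def l2_penalty(weights, decay):
    # weight decay: penalty on the squared L2 norm of all weight matrices,
    # added to the loss to keep parameters from growing too large
    return decay * sum(np.sum(W ** 2) for W in weights)

h = np.ones(1000)                              # toy activation vector
h_train = dropout(h, p=0.5, training=True)     # roughly half the units are zero
h_eval  = dropout(h, p=0.5, training=False)    # unchanged at inference
```

Note that `h_train` still has an expected mean of 1.0 thanks to the 1/(1−p) rescaling, while `h_eval` is passed through untouched.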


Chapter 4

Datasets

When learning a model, the quality of the model is directly determined by the availability and quality of the data. In the machine learning field, publicly available datasets such as MNIST [37], MS-COCO [40] or ImageNet [21] provide common standards the community can use to measure progress and guide research. Every year new datasets are released to address issues with old datasets or to introduce new challenges to the field. In chemoinformatics one of these challenges was the Tox21 challenge [4], launched in 2014 by the National Institutes of Health, with the goal of predicting how chemical compounds interfere with biochemical pathways relying only on the chemical structure of the compounds. Together with the challenge a new dataset was released which is now one of the standards used for QSAR studies. By current standards this dataset, with 12,060 training and 647 test samples, is relatively small, with modern datasets used in machine learning usually being much larger. For extending Tox21 or creating entirely new datasets for QSAR studies, one is interested in obtaining new data on chemical compounds and the outcomes of associated bioassay experiments, as generated by High Throughput Screening experiments. For many problems, sourcing data for creating new and possibly large datasets is difficult, either due to a general lack of available data or the cost of annotation required to be able to apply supervised learning techniques. The chemoinformatics field is in the lucky position that there exist multiple freely accessible databases which store large amounts of compound activity relationships that can be mined and used to compile new datasets.

4.1 Chemical Databases

In chemoinformatics there exist many, often specialized, databases. The three major public databases, in which large amounts of data on chemical compounds and their biological activities are collected, are ChEMBL [20] [23], PubChem [33] and ZINC15 [60]. These databases are initiatives led by the European Molecular Biology Laboratory (EMBL) in the case of ChEMBL, the National Institutes of Health (NIH) in the case of PubChem, and other academic organisations in the case of ZINC15. While these databases differ in their scope and focus, they can all be used to extract and derive new datasets for QSAR studies.


4.1.1 ChEMBL

ChEMBL [20] [23] is a "chemogenomic" database that focuses on bioactive molecules with drug-like properties. ChEMBL is curated manually and contains chemical, bioactivity and genomic data. The database is managed by EMBL-EBI and follows a roughly yearly release schedule. The latest release, ChEMBL 25, contains 12,482 targets, more than 1.8 million compounds and 15.5 million activity measurements.

4.1.2 PubChem

PubChem [33] is a freely accessible chemistry database maintained by the NIH. PubChem includes data on chemical compounds and their bioactivity measurements and can be used to compile and extend datasets for drug target prediction. The database currently contains more than 96.5 million compounds and more than 1 million bioassays with 268 million bioactivities.

4.1.3 ZINC15

ZINC15 [60] is a project by the Irwin and Shoichet Laboratories at the Department of Pharmaceutical Chemistry of the University of California. The main focus of ZINC is to provide data for molecular docking, for example to predict the orientation in which a compound binds to an enzyme. Like the other databases, ZINC15 also contains bioassay results that can be used for compiling new datasets.

4.2 Database Overlap

Given millions of stored compounds, one might ask how these databases overlap and whether there are compounds that are only represented in a single database. To evaluate this, for every compound its InChIKey [27], a hash value of the InChI string [27] describing a molecule, was computed and used to identify "identical" compounds across the databases. While there always exists a small probability of a collision (i.e. two distinct molecules having an identical InChIKey) when using a hash function, this probability is very small in theory; Pletnev et al. [50] recently showed that the practical collision rate is similar to the theoretically expected rate.

Investigating ChEMBL, PubChem and ZINC15, the three databases show some overlap, but there are still several compounds that are only represented in a single database, as illustrated by figure 4.1. In practice the large overlap of the databases is beneficial, as it allows increasing the number of assays for common compounds. Having access to this amount of data and being able to compile new datasets is a great opportunity for chemoinformatics research and allows studying how dataset size (adding more compounds) and composition (adding more assays) affect the performance of QSAR models.
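Once each database is reduced to a set of InChIKeys, the UpSet-style intersection counts can be sketched with plain Python sets. The keys below are toy stand-ins, not real 27-character InChIKeys.

```python
from itertools import combinations

# toy stand-ins for the InChIKey sets extracted from each database
db_keys = {
    "ChEMBL":  {"K1", "K2", "K3"},
    "PubChem": {"K2", "K3", "K4", "K5"},
    "ZINC15":  {"K3", "K5", "K6"},
}

# pairwise and three-way intersections, as summarised by an UpSet plot
overlap = {}
for r in (2, 3):
    for names in combinations(sorted(db_keys), r):
        shared = set.intersection(*(db_keys[n] for n in names))
        overlap[names] = len(shared)

# compounds unique to a single database
unique = {name: keys - set.union(*(v for n, v in db_keys.items() if n != name))
          for name, keys in db_keys.items()}
```

With the toy sets above, every database retains at least one unique key, mirroring the observation in figure 4.1 that each database contributes compounds not found anywhere else.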


Figure 4.1: This plot shows the size of the different databases as well as the size of their intersections via the UpSet [39] visualization technique for sets. The vertical black dotted lines represent intersections of the databases; the bar at the top of each black line displays the number of shared compounds. ChEMBL and PubChem share around 1.5 million compounds. Surprisingly, despite more than 96.5 million compounds in PubChem, both ZINC15 and ChEMBL 25 still contain unique compounds.


Chapter 5

Target Prediction Platform

5.1 Motivation

The previous chapter already gave an overview of the different databases that can be used to compile new datasets for learning tasks.

In practice deriving a new dataset from a database involves multiple steps. One first needs to construct the assay label matrix from the affinity or activity measurements for the different assays and compounds. This data is composed of experiment outcomes where a compound is tested on a specific assay and its corresponding binding affinity (e.g. IC50, logPKA, . . . ) or activity (active or inactive) is measured. These records are often referred to as triples, as they consist of three components: compound, assay and affinity/activity value (see figure 5.1 for an illustration). For every compound assay pair a database might contain multiple triples from different experiments.

compound_id   assay_id   affinity
c1            a1         9
c2            a7         4
c1            a20        9
c1            a1         8
c3            a2         12

Figure 5.1: Illustration of the raw triple records as stored in a chemical database.

For building a dataset these triples need to be preprocessed and cleaned. Measurements for identical compound assay pairings are usually averaged to obtain an average affinity/activity score, which is used to assign a label (e.g. activity score < 0.5 is inactive, > 0.5 active; or affinity < 8 is inactive, > 8 is active). The resulting records then contain the compound id and the assigned activity values for every tested assay (see figure 5.2).

compound_id   assay_id   activity   avg_affinity
c1            a1         1          8.5
c2            a7         0          4
c1            a20        1          9
c3            a2         1          9

Figure 5.2: The affinity values per compound and assay id pair are averaged and an activity label is assigned.

The assay and activity values per compound can then be collected as a list and merged with the compound's chemical structure information (e.g. SMILES strings [7]), which is usually stored separately from the triples (figure 5.3).

compound_id   assay_id::activity
c1            a1::1, a20::1
c2            a7::0
c3            a2::1

compound_id   smiles
c1            CN=C=O
c2            CCO
c3            c1ccccc1

Figure 5.3: The assay activity pairs per compound are collected and molecular information (in SMILES format) for each compound is added to every record.

Once the chemical structures are available, various chemical descriptors can be computed and simply added to each record (figure 5.4). A record then contains the chemical descriptors and the label information required for supervised learning methods.
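The averaging and labeling steps can be sketched in plain Python, using the toy triples from figure 5.1; the affinity threshold of 8 follows the example in the text, and the data structures are illustrative, not the platform's actual Spark code.

```python
from collections import defaultdict

# raw (compound, assay, affinity) triples as in figure 5.1
triples = [("c1", "a1", 9), ("c2", "a7", 4), ("c1", "a20", 9),
           ("c1", "a1", 8), ("c3", "a2", 12)]

# 1. collect and average all measurements per (compound, assay) pair
measurements = defaultdict(list)
for compound, assay, affinity in triples:
    measurements[(compound, assay)].append(affinity)
avg = {pair: sum(v) / len(v) for pair, v in measurements.items()}

# 2. threshold the averaged affinity to obtain a binary activity label
#    (affinity > 8 -> active, otherwise inactive)
labels = {pair: int(a > 8) for pair, a in avg.items()}

# 3. collect the assay::activity pairs per compound, as in figure 5.3
per_compound = defaultdict(dict)
for (compound, assay), label in labels.items():
    per_compound[compound][assay] = label
```

The duplicate (c1, a1) measurements of 9 and 8 average to 8.5 and are labeled active, reproducing the corresponding row of figure 5.2.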

The above process may seem simple, but there currently exists no public framework that covers these processing steps. While frameworks such as TensorFlow Extended (TFX) [62] can model a wide range of data input pipelines, the chemoinformatics setting is a somewhat exotic use case, and hardly any of the framework's standard functionality is applicable to this problem. Realizing a pipeline for extracting a new dataset requires a lot of custom code for interfacing with multiple libraries and additional programs and tools. Especially the feature computation step is often challenging, as the various descriptors are not available in every library and one is required to mix and match various chemoinformatics toolkits, implemented in different programming languages, to get an acceptable coverage of the space of chemical descriptors.


compound_id   assay_id::activity   smiles     mol_desc
c1            a1::1, a20::1        CN=C=O     2,3,...
c2            a7::0                CCO        2,2,...
c3            a2::1                c1ccccc1   5,7,...

Figure 5.4: Based on the molecular information different molecular descriptors can be calculated and are also added to the records. The resulting dataset is ready for training a model.

In practice, extraction "pipelines" for exporting a dataset are simply realized via a series of scripts that "glue together" various program calls. These scripts usually result in code that is hard to understand and maintain, and that fails easily. These shortcomings, together with the requirement for robust and fast processing pipelines, call for the development of a new framework that simplifies the extraction of new datasets for QSAR studies. Using the framework it should be possible to automatically extract and merge datasets from multiple public chemical databases and compute a wide range of chemical descriptors. The processing should require only minimal supervision and human intervention (i.e. launching different processing steps by hand).

Implementing such a framework in order to replace the current brittle pipelines is not trivial and requires addressing multiple challenges:

Integration of multiple data sources

As every database comes with its own custom data schema, or is "schema-less" (i.e. uses a dynamically typed schema; records might sometimes have additional measurements which are not available everywhere), standard input functions are not always applicable and custom (database) connectors are required. Furthermore, integration of data from multiple sources requires careful consideration of intermediate data structures and storage formats.

Time to results

When conducting research one of the most limiting factors is time. Extraction of raw records, preprocessing and computation of features can take a significant amount of time, and progress often stalls when people are not able to work interactively and have to wait for results. As an example, using the standard jCompoundMapper [28] program the computation of a single type of chemical feature requires more than a week. Waiting for only a few hours already hinders experimentation, and one quickly gets distracted and loses the train of thought. In other domains such as computer vision it is already known that the choice of encoding and data augmentation can have a big impact on the quality of the results [49]. Having to wait multiple days for one processing run to complete makes it unattractive to investigate the effects of


the choice of processing protocol or different molecular descriptors. Speeding up thewhole preprocessing would allow for ablation studies of these effects, and possiblyalso lead to the development of better drug target prediction models.

Interfacing different libraries and languages

Currently none of the open source chemoinformatics toolkits implements all chemical descriptors; in some cases it is a problem to find a public implementation at all. In practice one usually aims to reuse implementations, as re-implementing an algorithm is often not possible due to the lack of knowledge in (quantum) chemistry required to properly understand the steps involved in the computation of a descriptor. Even supporting only a small set of descriptors requires interfacing with multiple libraries and/or programming languages, which increases the complexity of a pipeline.

Scheduling of pipeline components

Provided that clean implementations of the different processing steps exist, the programs for all processing steps still need to be run in the correct order. Given the complexity of certain processing pipelines (especially if multiple datasets are merged in the process), this cannot easily be handled by wrapping everything in a script. Processing steps might fail due to network issues or misconfigured server environments, and scripts only provide limited means to tolerate failures or to recover from errors. Having a human permanently supervise the pipeline execution and address failures is generally not practical.

Ultimately a platform was developed which addresses all of the above issues, based on Apache Spark [10], a framework for distributed computation and data processing, the chemoinformatics libraries RDKit [36], Mordred Descriptors [46], CDK [67] and jCompoundMapper [28], and Apache Airflow [8], a framework for workflow management. Apache Spark is designed for big data applications and offers an expressive programming interface. Native language support for Python and Java makes it easy to interface with the different chemoinformatics toolkits, which are implemented either in Python (RDKit, Mordred) or Java (CDK, jCompoundMapper). By utilizing Spark it is possible to scale to very large datasets with millions of records, as commonly encountered in the chemoinformatics setting, and to distribute the processing and feature computation across multiple cores on a single machine or potentially also across several separate nodes. Parallelizing the processing greatly shortens the time to results. Airflow makes it possible to schedule and rerun pipelines at specific time intervals, and it also handles the separate processing steps, relaunching them when required. Even more importantly, all executions are automatically logged by Airflow; execution logs are stored persistently on disk and can be reviewed in case of problems.

As already hinted above, a processing pipeline can be split into multiple components, which are described in detail in the following sections:

• Downloading the databases

• Extraction and Preprocessing

• Computation of chemical features


• Export of the resulting records

5.2 Downloading the databases

The databases are downloaded using a collection of bash scripts utilizing the wget command if the database is available via FTP (ChEMBL, PubChem), or via Python scripts when it is required to interact with an online API (ZINC15). To distinguish between different database versions and guarantee a minimum of reproducibility, the platform utilizes two different strategies: if the database follows a release schedule, such as ChEMBL, the database release is used for versioning. If the database is continuously updated, as is the case for PubChem and ZINC15, the latest database version is downloaded and stored in a new folder using the current date as timestamp. The database dumps used for creating a dataset are usually stored alongside the other results, which allows rerunning the processing using the exact same data.
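The two versioning strategies can be sketched as a small helper; the directory layout, root path and function name here are illustrative, not the platform's actual code.

```python
from datetime import date
from pathlib import Path

def dump_dir(root, database, release=None):
    """Directory for a database dump: use the release number when the
    database is versioned (e.g. ChEMBL), otherwise fall back to the
    current date as a timestamp (e.g. PubChem, ZINC15)."""
    version = release if release is not None else date.today().isoformat()
    return Path(root) / database / str(version)

chembl_dir = dump_dir("/data/dumps", "chembl", release=25)
pubchem_dir = dump_dir("/data/dumps", "pubchem")   # date-stamped folder
```

Storing each dump under its release or download date keeps earlier dumps intact, so a dataset can later be rebuilt from exactly the data it was originally extracted from.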

5.3 Extraction and Preprocessing

5.3.1 Extracting the compound and assay information

In the first step the compound, assay, affinity/activity triples are extracted from each database using a custom database connector, and the records are exported to Parquet [9] dataframes, an efficient storage format for tabular data which supports parallel processing and allows specifying a schema for the datasets (similar to a relational database). In the next processing step the chemical data required for the computation of descriptors has to be extracted. This chemical information is usually stored in SDF files (structure data files), which encode the chemical information in text format. While this format preserves a lot of information compared to line notation formats such as SMILES [7] or InChI [27], which both encode the chemicals as ASCII strings, it is not very efficient and cannot easily be processed in parallel. To resolve this issue the SDF files are also exported into Parquet format. Parquet is a great fit for this problem as it automatically shards the original large database dump into multiple smaller files that can be read in parallel, and it enables indexed access for every entry. The SDF files are first parsed using RDKit; the Molblock (a single entry within an SDF file) for each compound is then stored alongside its InChIKey [27] (a chemical hash value for every molecule) as a separate row in the Parquet dataframe. This step can be seen as a sanitation of the data: it is not possible to guarantee that the different databases use the same library for exporting the SDF files, so parsing and re-exporting all compounds via RDKit "normalizes" the data and guarantees that all Molblocks were generated by the same exporter. In case RDKit encounters an error during parsing, the molecule is dropped. The InChIKey field of each row is later used to detect duplicates, as well as a key for merging the assays of "identical" compounds which share the same InChIKey. Having the possibility to match records is especially important if one wants to merge records from databases that use different compound identifiers. If two compounds share the same InChIKey and are merged, the corresponding Molblocks are simply concatenated as a list. It is possible to either review any additional Molblock files per compound or to ignore


Label           Threshold in -log10(M)   Relation Type
active          ≥ 5.5                    "<", "<=", "=", "~"
weak active     > 5.0 and < 5.5          "<", "<=", "=", "~"
inactive        ≤ 4.5                    ">", ">=", "=", "~"
weak inactive   > 4.5 and < 5.0          ">", ">=", "=", "~"
indeterminate   > 4.5 or < 5.5           any

Table 5.1: Affinity thresholds for measurements excluding standard activity comments.

them. Currently only one of the Molblock files is used for computing the descriptors (as all Molblocks result in the same InChIKey, we assume sufficient similarity of the compounds and not necessarily a hash collision). After this process one ends up with a large Parquet dataframe that is later merged with the processed assay activity labels.

5.3.2 Preprocessing Protocol

Having all the activity triples for all databases stored as Parquet dataframes, which support parallel reads, allows processing the records in parallel via Spark SQL. The preprocessing protocol for each database differs slightly. The procedure for the ChEMBL dataset is based on the protocol used by [45]; the protocols for PubChem and ZINC15 follow another protocol used internally at the Institute for Machine Learning at Johannes Kepler University Linz. The biggest issue is that a database might contain conflicting results, and the way these problem cases are handled. The records are usually cleaned in a separate step, after which all compounds are merged with their corresponding activity label vectors and exported as rows in another Parquet dataframe.

The following sections describe the protocols for each database in more detail.

Protocol ChEMBL

This protocol resembles the protocol employed by Mayr et al. [45], in the following referred to as LSC. The protocol assigns the activity labels primarily based on the activity comment field. If the comment is either "inactive", "Not Active" or "Not Active (inhibition < 50% @ 10 uM and thus dose-response curve not measured)", the measurement is considered inactive. If the comment is "Active" or "active", it is labeled as active.

Measurements with no standard value, a standard unit other than nM, or a standard relation not in {">", ">=", "<", "<=", "=", "~"} are dropped. The remaining compounds are labeled using the same thresholds as in the LSC supplemental material [45]. The thresholds used are documented in table 5.1 and are identical to the thresholds in Table S1 of the supplemental material of the LSC study.

In a final cleanup step conflicting measurements are removed (i.e. compounds with both active and inactive measurements for an assay). This protocol also assigns labels to weakly active and weakly inactive compounds; for the current pipeline only strictly active and inactive measurements are used and all other labels are dropped. Furthermore, all assays with less than 100 compounds are removed.

Protocol PubChem

For this protocol all assays with less than 25 measurements are dropped. The labels are based on the activity labels from PubChem. For every compound assay pair the average activity is computed: an average activity smaller than 0.5 is set to 0 (inactive), larger than 0.5 to 1 (active), and average activity values which are exactly 0.5 are dropped. Finally the assays are filtered so that every assay has at least 10 active and 10 inactive samples and at least 25 measurements.

Protocol ZINC15

For the ZINC15 database the average affinity is computed for all compound assay pairs. The activity label is then set to active (1) if the average affinity is larger than 8, otherwise to inactive (0). As a last step the assays are also filtered: every assay must have at least 10 active and 10 inactive samples, and in total at least 25 measurements.

Note

As the processing relies on InChIKeys for determining identical compounds, problems can arise in practice when two compounds with identical InChIKey have conflicting activity measurements for the same assay. In this case the corresponding assay measurements are dropped, which causes some assays in the resulting dataset to have fewer than the expected number of associated measurements (i.e. using this protocol for ChEMBL some assays have less than 100 activity measurements).

5.3.3 Merging the different databases

If a dataset is to be compiled from multiple databases, the results have to be merged. For compounds it is possible to identify identical samples via the InChIKey hash; the mol files of samples with the same InChIKey are simply stored in a list. As it is nearly impossible to identify identical assays across databases, assays originating from different databases are simply concatenated and indexed using an id. The list of assays used in the final dataset is stored together with a numeric identifier in another Parquet dataframe.

For the merging process all database specific compound ids are exported together with the InChIKey of the corresponding molecule as a Parquet dataframe that can be used to determine the lineage of a compound.

5.4 Computation of Descriptors

Descriptors are computed using RDKit [36], Mordred [46], CDK [67] and jCompoundMapper [28]. The platform supports several sets of descriptors whose choice is motivated by the descriptors used in LSC.


5.4.1 Static Features

In LSC the static features were computed using ChemoPy [14]. As the ChemoPy library is no longer actively maintained, the platform relies on the static descriptors computed using Mordred [46], a new chemoinformatics library for Python. The library is able to compute 2D and 3D descriptors. For the pipeline Mordred version 1.2 is used to compute the 1613 supported 2D features. 3D features are currently not used.

5.4.2 Semisparse Features

In LSC the authors also utilize a set of semisparse features, composed of CATS2D and SHED features as well as MACCS, PubChem and Daylight-like fingerprints (RDKit Fingerprints). The PubChem fingerprints are computed using the implementation from CDK [67]. The CATS2D and SHED features are computed using jCompoundMapper [28]. jCompoundMapper was updated to a new version of Java and CDK in order not to rely on the outdated CDK release used by the original jCompoundMapper version. The MACCS structural keys and RDKit Fingerprints were computed using RDKit [36].

5.4.3 Tox Features and Morgan Fingerprints

Another semisparse descriptor setup uses tox features as well as RDKit Morgan fingerprints. The tox features are a set of toxicophore features used in-house at the Institute for Machine Learning. These features were mined from the literature and have previously been reported as toxicophores. The Morgan fingerprints are a type of circular fingerprint implemented in the RDKit toolkit. The Morgan fingerprints are hashed and stored as bitvectors; the tox features are exported as structural keys.

5.4.4 Sparse Features

The framework also supports the computation of path features, namely ECFC features with radius 4 and 6 and DFS features with search depth 8. These features are also computed via the updated version of jCompoundMapper. These sparse features are exported by substructure key and are not hashed to a bitvector. The features are then filtered: only features which appear in at least 10 different compounds are kept. If frequency features are computed they are additionally normalized. In practice this allows the study of relationships between substructures and predictions via attribution methods such as integrated gradients [61] or layer-wise relevance propagation [11] that were adapted to molecules.
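The filtering and normalization of the sparse substructure features can be sketched as follows (an illustrative in-memory version with hypothetical names; the platform performs this on Spark dataframes):

```python
def filter_sparse_features(compound_features, min_compounds=10, normalize=False):
    """Drop substructure features that occur in too few compounds.

    compound_features: {compound_id: {feature_key: count}}. A feature is
    kept if it appears (with non-zero count) in at least `min_compounds`
    different compounds. With normalize=True the remaining frequency
    features of each compound are scaled to sum to one.
    """
    # Count in how many distinct compounds each feature occurs.
    support = {}
    for feats in compound_features.values():
        for key in feats:
            support[key] = support.get(key, 0) + 1
    kept = {k for k, n in support.items() if n >= min_compounds}

    filtered = {}
    for cid, feats in compound_features.items():
        row = {k: v for k, v in feats.items() if k in kept}
        if normalize and row:
            total = sum(row.values())
            row = {k: v / total for k, v in row.items()}
        filtered[cid] = row
    return filtered

# Feature "f2" occurs in only 2 of 12 compounds and is filtered out.
data = {f"c{i}": dict({"f1": 2}, **({"f2": 1} if i < 2 else {}))
        for i in range(12)}
out = filter_sparse_features(data, min_compounds=10)
```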

5.5 Export of records

Up to now the complete processing makes extensive use of Parquet files for the storage of outputs and intermediate results. Unfortunately Parquet, although great for data processing, is not really suited as an input format for training due to its columnar format, which makes it hard to extract single examples. While there are special libraries such as Uber's Petastorm [26] for ingesting Parquet files into PyTorch [59] or TensorFlow [5] training loops, the platform refrains from this and relies on the TFRecords [3] format instead. In order to maximize the utilization of accelerators such as GPUs and TPUs (Tensor Processing Units) [32] the choice of record format is important. Especially architectures with a low compute to data movement ratio, such as fully connected networks, are easily starved for data. Using the wrong format slows down training, and caching all records in RAM is not always possible. In the chemoinformatics setting one is additionally confronted with another problem: the features (i.e. the chemical descriptors) can be very sparse, which does not allow reusing optimized data loading pipelines that are available for dense data such as images, for example NVIDIA DALI [1]. Luckily TFRecords, the custom TensorFlow record format, together with the tf.data module addresses these issues and allows for high performance input pipelines. As the platform is based on Spark and TensorFlow, popular frameworks which are extensively used in industry and research, there are already several libraries available for interfacing the two. The records are directly exported from the Parquet dataframe into the TFRecords format using the Spark-TensorFlow connector [2], an open-source Java library maintained by Google that adds TFRecords support to Spark. The Spark-TensorFlow connector makes it easy to interact with datasets in TFRecord format using the familiar Spark SQL API. Despite using an efficient storage format for training it is still possible to transform the data into another format with only minimal effort.

Page 39: Automated Model Building for Drug Target Prediction

Figure 5.5: Screenshot of the AirFlow DAG for the ChEMBL processing pipeline

During export the dataset is further split into separate training and testing folds based on the precomputed clusters from the LSC paper. As discussed in chapter 6 this is not really recommended, but it allows to get started quickly and build new models for drug target prediction tasks.

5.6 Workflow Management

The number of different processing steps, the reliance on intermediate results and the multiple options for the pipeline add to the complexity of the whole platform. A processing pipeline is composed of multiple steps: separate tasks which are chained and depend on each other. If people only rarely interact with the platform it is hard to remember all the inter-dependencies. Having to review the documentation of all building blocks every time one wants to rerun a pipeline costs a lot of time, especially if the pipeline is run only once a month and details are quickly forgotten. If the pipelines are rerun frequently, the overhead of managing everything by hand becomes substantial. Wrapping everything in scripts will not necessarily help, as users have to deal with errors in the scripts, which are generally not easy to debug.

To solve this problem the platform utilizes Apache Airflow, a modern workflow management system, to take care of scheduling the tasks and monitoring the execution of the pipelines. One can argue that the overhead introduced by Airflow is substantial compared to wrapping everything in a script. However, the toolkit offers several features which would require significant additional engineering effort if they were added to wrapper scripts. Processing workflows in Airflow are specified as code, which makes them very flexible and easy to use and maintain. The system is also very scalable and actively used by many companies. It is therefore easier to utilize Airflow for workflow management than to implement these components from scratch.

Page 40: Automated Model Building for Drug Target Prediction
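A workflow definition (configuration as code) for such a pipeline might look like the following sketch. Task ids, the spark-submit commands and the script names are illustrative assumptions, not the platform's actual layout; the import path matches the Airflow 1.10 series that was current at the time:

```python
# Hypothetical Airflow DAG wiring three pipeline steps together.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG(
    dag_id="chembl_pipeline",
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,  # trigger manually via the Airflow WebUI
) as dag:
    extract = BashOperator(
        task_id="extract_assays",
        bash_command="spark-submit extract_assays.py",
    )
    descriptors = BashOperator(
        task_id="compute_descriptors",
        bash_command="spark-submit compute_descriptors.py",
    )
    export = BashOperator(
        task_id="export_tfrecords",
        bash_command="spark-submit export_tfrecords.py",
    )

    # Declare the inter-task dependencies: extraction feeds descriptor
    # computation, which feeds the TFRecords export.
    extract >> descriptors >> export
```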

5.7 Training Process

In the platform the training code is currently decoupled from the preprocessing pipeline. In machine learning the time spent on training usually exceeds the time spent on preprocessing (assuming the computed features are cached on persistent storage and reused multiple times during hyperparameter optimization). The current training setup is based on the TensorFlow 2.0 release [6] using TFRecords and the tf.data module for data loading. TensorFlow 2.0 uses eager execution by default, in contrast to the graph based approach of the TensorFlow 1.0 release [5]. Not having to deal with compute graphs facilitates experimentation a lot. The code structure is now very similar to PyTorch [59], another deep learning framework popular due to its simplicity and elegance. TensorFlow also has support for sparse features and sparse operations (e.g. sparse matrix multiplication and embeddings) and efficient TFRecords based input pipelines.

More details on the training setup and the settings used can be found in the evaluation chapter 1.

5.8 Platform Usage

The processing pipelines are specified as AirFlow DAGs (see figure 5.5) which chain together the various processing steps. These DAGs can be set up to be automatically scheduled or triggered manually using the AirFlow WebUI. AirFlow will automatically run each processing step and try to restart it if failures are encountered. By default the ChEMBL pipeline will compute all sets of features presented above. Computing all descriptors usually takes several days; running the pipeline for only a subset of the features completes in a few hours.

1see Evaluation of effects of dataset composition on model performance, page 25


Page 41: Automated Model Building for Drug Target Prediction

Chapter 6

Evaluation of effects of dataset composition on model performance

This chapter presents an evaluation of the effects of dataset composition in terms of the number of compounds as well as the number of assays. It is not immediately obvious whether simply applying a preprocessing protocol already produces a dataset that is suitable for learning. It is also not clear how different database releases affect the composition of the derived datasets and whether models are sensitive to the distribution of assays. For this purpose a dataset is extracted for each of the ChEMBL database releases 21 to 25 using the previously presented platform.

[Figure 6.1: scatterplot "Number of Compounds vs Number of Assays" for the raw databases; x-axis: Number of Compounds, y-axis: Number of Assays, points labelled with ChEMBL releases 21-25]

Figure 6.1: This scatterplot shows the number of compounds and assays for the raw ChEMBL databases. The number of assays and compounds increases linearly every year, but there is a drop in the number of assays with ChEMBL release 24.


Page 42: Automated Model Building for Drug Target Prediction

For all database releases the same extraction and processing protocol 1 is used. The raw ChEMBL database releases grow in the number of compounds, but there is a drop in the number of assays with ChEMBL 24 (see figure 6.1). The resulting datasets grow in terms of the number of compounds as well as the number of assays (figure 6.2).

[Figure 6.2: scatterplot "Number of Compounds vs. Number of Assays" for the derived datasets; x-axis: Number of Compounds, y-axis: Number of Assays, points labelled with ChEMBL releases 21-25]

Figure 6.2: Number of compounds vs. number of assays for the ChEMBL releases 21 to 25. There are jumps in the number of assays and compounds for the database updates from version 21 to 22 as well as from 23 to 24. Despite the drop in assays in the raw databases (see figure 6.1) the number of assays in the derived datasets is increasing.

Interestingly, the resulting datasets for earlier ChEMBL releases are not perfect subsets of later releases. The datasets contain different sets of assays (see figure 6.3). This is likely due to the fact that over time not only new compounds and assays but also additional activity measurements for an assay-compound pair are added. New, possibly conflicting, measurements will therefore result in different distributions of assays across the derived datasets.

Some of these "uncommon" assays have over 1000 compound measurements and suddenly disappear from one ChEMBL release to the next (CHEMBL3214825 and CHEMBL3214882 are missing after ChEMBL 24). These assays are then not even part of the initial raw database dumps, which suggests that the cause is structural changes within the ChEMBL database and not bugs in the processing pipeline. On the other hand, assays with more than 36000 compound measurements also appear (CHEMBL3562077), likely due to the inclusion of new HTS results. The datasets for the different ChEMBL releases differ in the distribution of assays with a low compound frequency. Figure 6.4 shows that the latest ChEMBL releases tend to have more assays with lower compound frequencies.

1see Protocol ChEMBL, Page 20


Page 43: Automated Model Building for Drug Target Prediction

[Figure 6.3: UpSet plot of assay set intersections across the datasets ChEMBL21-ChEMBL25; the largest intersection (assays present in all five datasets) has size 1545]

Figure 6.3: This UpSet [39] plot shows the intersection of assays across the datasets extracted from the different ChEMBL releases. Only 1545 assays appear in all datasets, which is only around 68% of the total number of unique assays across all datasets (2256 unique assays).

The heatmaps of assay co-occurrence (see appendix A.1, A.2, A.3, A.4, A.5) further show that the co-occurrence of assays added in later ChEMBL releases with assays already present in earlier releases is very low.


Page 44: Automated Model Building for Drug Target Prediction

It is not obvious whether or how strongly these distribution shifts will affect the results. One option is to exclude all "uncommon" assays which do not appear in all datasets; the other option is to keep them and check how the additional assays (i.e. helper tasks in the multi-task learning setting) impact performance. Removing assays might inadvertently also remove measurements of assays that were only recently merged and were separate assays before. Keeping the assays can still be used to study the effects of additional data.

For the purpose of this work it is not possible to verify all the uncommon assays and trace the data lineage of every assay (mainly due to a lack of knowledge in biochemistry to guarantee a correct assessment of the assays). The additional assays are therefore kept as helper tasks for training.

6.1 Experimental design in the life sciences

When working with biological or chemical data, the proper evaluation of the performance of machine learning methods is often challenging [15].

Practitioners new to the respective problem domain frequently run into problems due to ignoring non-i.i.d. (independent and identically distributed) samples, leading to inflated performance estimates.

6.1.1 Compound Bias

For biological sequences such as DNA or amino acid sequences there usually exist selection effects. Sequence composition determines protein folding, which in turn impacts the fitness of an individual. In the real world we will therefore not observe every possible sequence but only a subset of similar, evolutionarily related sequences. Evaluating a machine learning method on a biological sequence related learning problem using train and test sets that contain very similar sequences would likely overestimate the performance of the method. A common strategy to tackle this problem is to structure the data and assign similar items to the same fold. One possibility to structure sequences are phylogenetic trees, which can for example be generated via UPGMA (unweighted pair group method with arithmetic mean) [56], an approach for hierarchical clustering of biological sequence data. Non-i.i.d. distribution of input data due to selection or batch effects is not a purely biological problem but is also encountered in the chemoinformatics setting, where this phenomenon is usually referred to as compound bias, as described in DeepTox [44] and LSC [45]. In the scope of pharmaceutical development studies, a molecular backbone is usually generated first, which is then augmented using different functional groups with possibly similar chemical properties. Randomly assigning these sometimes very similar molecules to different folds can lead to inflated performance estimates. Akin to the biological setting, the molecular space can be structured via hierarchical clustering. In LSC [45] single linkage clustering is used to assign similar compounds (in terms of Jaccard similarity) to the same fold. In LSC this approach is used to derive three separate clusters that are used in a cross validation setup in order to obtain robust performance estimates across a large set of machine learning methods.
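The idea of grouping similar compounds before splitting can be illustrated with a minimal single-linkage sketch. This is not the LSC implementation: the similarity threshold is a hypothetical value, fingerprints are represented as plain sets of feature keys, and the O(n^2) pairwise loop does not scale to ChEMBL-sized data.

```python
def jaccard(a, b):
    """Jaccard similarity of two substructure-key sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def single_linkage_clusters(fingerprints, threshold=0.6):
    """Cluster compounds so that any pair with Jaccard similarity >=
    threshold ends up in the same cluster (single linkage cut at a
    fixed distance). fingerprints: {compound_id: set of feature keys}.
    Returns {compound_id: cluster_label}.
    """
    parent = {c: c for c in fingerprints}

    def find(x):  # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    ids = list(fingerprints)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if jaccard(fingerprints[a], fingerprints[b]) >= threshold:
                parent[find(a)] = find(b)

    return {c: find(c) for c in fingerprints}

fps = {"c1": {1, 2, 3}, "c2": {1, 2, 3, 4}, "c3": {9, 10}}
clusters = single_linkage_clusters(fps, threshold=0.6)
```

Assigning whole clusters (rather than single compounds) to train or test folds is what prevents near-duplicate molecules from leaking across the split.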


Page 45: Automated Model Building for Drug Target Prediction

[Figure 6.4 panels: pairwise histograms "ChEMBL X vs. ChEMBL Y" for all releases 21 to 25; x-axis: Number of Compounds per Assay, y-axis: Count of Records]

Figure 6.4: The graphic shows the distribution of the low compound frequency assays across the different ChEMBL releases. We can see that the later ChEMBL releases have more assays with lower compound frequencies.

Page 46: Automated Model Building for Drug Target Prediction

Single linkage clustering is very memory intensive and the size of the ChEMBL database has increased significantly over the last few years. Due to time and hardware constraints, reclustering the latest ChEMBL releases is out of the scope of this work. Instead the published precomputed clusters from the LSC study are reused. The fold 1 cluster from LSC is chosen as test set; the other two clusters are merged and augmented with any new compound samples. The test set cluster is only used for computing the final performance metrics after training of the models. As an alternative approach the LSC clusters could also be used as "scaffold" and new samples could be assigned to a cluster based on the ratio of similar items. This approach is also not realized due to computational limitations (although it would require fewer resources than a complete single linkage clustering). Reusing the clustering from LSC only affects the compounds of each fold but does not take the distribution of assays into account. In practice this might be problematic if models are sensitive to the distribution of assays (see figure 6.6). Using the LSC clusters also does not guarantee that all assays are still available in both folds. Splitting the dataset into folds causes many assays to be part of only one of the folds; the ratio of assays which are not part of both folds increases with every ChEMBL release (figure 6.5).

[Figure 6.5: bar chart "Common Assays Train/Test" per ChEMBL release 21-25, split into common and uncommon assays; left axis: Number of Assays, right axis: Ratio Uncommon Assays]

Figure 6.5: The ratio of uncommon assays which are only members of one of the folds increases over time.

6.2 Experiments

Due to computational limitations it is out of scope to run the experiments for all combinations of descriptors and multiple hyperparameter settings. Based on the good performance of the semisparse features in the LSC study, the experiments also use this feature type 2, computed via the platform using jCompoundMapper [28], CDK [67] and RDKit [36].

2Semisparse Features, page 22


Page 47: Automated Model Building for Drug Target Prediction

[Figure 6.6 panels: histograms "ChEMBL N - Train vs. Test", "ChEMBL N - Total vs. Train" and "ChEMBL N - Total vs. Test" for releases 21 to 25; x-axis: Number of Compounds per Assay, y-axis: Count of Records, colored by dataset fold (train, test, total)]

Figure 6.6: Shift in the distribution of assays with low compound frequencies between train and test folds based on LSC clusters and the complete dataset across all ChEMBL releases. Splitting the folds causes the distributions to shift, as the preprocessing cutoff of at least around 100 compounds per assay is no longer satisfied. For lower ChEMBL releases train and test folds are still relatively close, but the gap widens with releases 24 and 25.

Page 48: Automated Model Building for Drug Target Prediction

6.2.1 Evaluation Measures

The scores for every assay in the output layer are mapped to probabilities in the range between 0 and 1 by applying the sigmoid activation function. Probabilities larger than 0.5 are classified as active, smaller than 0.5 as inactive. Using the predicted probabilities and the ground truth labels a receiver operating characteristic (ROC) curve is computed. As ROC curves for many assays are usually hard to interpret, for every curve the area under the curve (AUC) is computed as a summary statistic. The AUC scores are used to track the predictive performance of the model for every assay. In case an assay is only part of the test set it is also excluded from evaluation, as there is no corresponding training data. For comparing the performance across the different ChEMBL datasets, the subset of assays that appear in every ChEMBL release in both training and test folds and were also part of the assays in the LSC study is used. In this setup this results in 1046 assays for which the AUC scores are compared over time.
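The per-assay evaluation can be sketched as follows. In practice one would use a library routine such as scikit-learn's roc_auc_score; here the AUC is computed via the rank-sum identity in plain Python (the O(P*N) pairwise loop is for illustration only), and assays that cannot be evaluated, such as test-only assays with a single class, are skipped:

```python
def roc_auc(labels, scores):
    """ROC AUC via the Mann-Whitney identity: the probability that a
    random active is scored above a random inactive (ties count half).

    labels: 0/1 ground truth, scores: predicted probabilities.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC undefined without both classes")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def per_assay_auc(assay_results):
    """Compute the AUC per assay, skipping assays that cannot be
    evaluated. assay_results: {assay_id: (labels, scores)}."""
    aucs = {}
    for assay, (labels, scores) in assay_results.items():
        try:
            aucs[assay] = roc_auc(labels, scores)
        except ValueError:
            continue  # e.g. only one class present for this assay
    return aucs
```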

6.2.2 Architecture

Given the comparably low sparsity of the data it is possible to cast all semisparse features to dense tensors to speed up training. It was therefore not necessary to deal with sparse matrix multiplication or embeddings of the chemical features.

The architectures used in this work are simple fully connected networks. Every semisparse feature type has its own separate input layer; the outputs of these layers are concatenated at the first dense hidden layer (for an illustration see figure 6.7). Concatenating the outputs of the input layers reduces the number of parameters of the model and is more flexible, as it also allows to combine sparse and dense features. Merging sparse and dense features is not straightforward, as sparse features such as ECFP features cannot easily be cast to dense vectors without severe implications for memory usage and model runtime. Integrating the dense features with the sparse features would require reprocessing the complete dataset every time a different combination of features is used. It is therefore easier to use the concatenation approach, as it allows to combine different feature types without having to deal with custom code for merging the features.

At every input layer dropout is applied to the feature vectors. Each hidden fully connected layer is preceded by a dropout layer. At the output layer there is an output for every assay/task in the dataset. For training the outputs are not reduced to the 1046 common assays. Masked binary cross-entropy is used as loss function, where outputs with no corresponding activity label are masked out.

6.2.3 Hyperparameters

The network is optimized using standard SGD. The optimization process does not use any momentum, which was omitted to reduce the set of hyperparameters. For regularization, (input) dropout is applied before the input and hidden layers. Additionally, L2 weight decay is applied to the weights. The learning rate is reduced every 500000 iterations via exponential learning rate decay. The hyperparameters for the experiment are chosen based on a small grid search over the parameters using the ChEMBL 25 dataset split into a training and validation fold in a 60:40 ratio. For benchmarking the model performance the complete training set is used for learning. All models are trained for 80 epochs, after which performance on the test set that corresponds to fold 1 is evaluated. The hyperparameters of the benchmark runs can be found in table 6.1.

Page 49: Automated Model Building for Drug Target Prediction

[Figure 6.7 diagram: five inputs (SHED, CATS2D, MACCS, PubChem FP, RDKit FP), each followed by a Dropout and a Linear layer, concatenated, then alternating Dropout and Linear hidden layers leading to the Output layer]

Figure 6.7: Diagram of the architecture of the model used for the experiments.
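The exponential learning rate schedule from table 6.1 can be sketched as follows. The staircase form (rate multiplied by 0.96 every 500000 steps) is shown here; whether the thesis used the staircase or the continuous variant is not stated, so this is an assumption:

```python
def learning_rate(step, base_lr=0.1, decay_rate=0.96, decay_steps=500_000):
    """Staircase exponential learning rate decay: the base rate is
    multiplied by decay_rate once every decay_steps iterations."""
    return base_lr * decay_rate ** (step // decay_steps)
```

For example, the rate stays at 0.1 for the first 500000 iterations and then drops to 0.096.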


Page 50: Automated Model Building for Drug Target Prediction

Hidden Inputs
  SHED                8
  CATS2D              32
  MACCS               256
  PubChem FP          1024
  RDKit FP            2048

Hidden Units          2000
Hidden Layers         3
Activation            ReLU
Batch Size            64
Learning Rate         0.1
LR Decay Rate         0.96 (exponential)
LR Decay Steps        500000
Epochs                80
L2 Weight Decay       0.01
Weight Init           Glorot Uniform
Dropout Rate          0.4
Input Dropout Rate    0.2

Table 6.1: Hyperparameters used for the experiments


Page 51: Automated Model Building for Drug Target Prediction

Chapter 7

Results

7.1 Results on LSC Fold 1

All trained models achieve AUC scores larger than 0.70 but show slightly worse mean AUC than the results for fold 1 from the LSC study (see table 7.1). For evaluation, AUC values were only computed for the subset of 1046 common assays which are shared among all datasets and the LSC assays. Figure 7.1 shows that each model trained on a different ChEMBL dataset follows the same trend, where assay AUC increases with the number of training samples. For ChEMBL 24 and ChEMBL 25 there are some drops in assay AUC for several high compound frequency assays. Tracking AUC values across datasets reveals the high variance of the assay AUC scores (see figure 7.2). Stratifying the AUC values into different ranges based on the fold 1 AUC scores shows that most assay AUC scores are stable across datasets (see figure 7.3). The additional uncommon assays, which are supposed to act as "helper tasks", do not seem to improve performance much, likely because most new assays co-occur only infrequently with old assays (see Appendix A).
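The per-assay evaluation above can be reproduced with a small helper that computes the AUC of one assay from its predicted scores (a rank-free pairwise sketch based on the Mann-Whitney interpretation of AUC; in practice a library routine such as scikit-learn's roc_auc_score would be used):

```python
import numpy as np

def assay_auc(scores, labels):
    """AUC for a single assay: the probability that a randomly chosen
    active compound is scored above a randomly chosen inactive one
    (ties count as 0.5)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = scores[labels == 1]                     # scores of actives
    neg = scores[labels == 0]                     # scores of inactives
    greater = (pos[:, None] > neg[None, :]).sum() # active ranked above inactive
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

auc = assay_auc([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0])
```

Computing this value per assay and plotting it against the number of training compounds of that assay yields the scatterplots of figure 7.1.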

7.2 Role of helper tasks

In order to investigate the effect of helper tasks, for every ChEMBL dataset each model was also retrained on the restricted subset of common assays without any additional uncommon assays. These experiments use architectures identical to the "full" experiments: the number of outputs in the output layer was unchanged, and uncommon assays were simply masked out and not used for computing the loss.
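Keeping the full output layer while training only on the common assays amounts to zeroing out the label mask for all uncommon task columns (a minimal sketch; the index set common_idx is a hypothetical example):

```python
import numpy as np

n_tasks = 6
# One compound with labels for five of six tasks (NaN = no measurement).
labels = np.array([[1.0, 0.0, np.nan, 1.0, 0.0, 1.0]])
common_idx = [0, 2, 4]   # hypothetical positions of the common assays

# Start from the usual "label present" mask ...
mask = ~np.isnan(labels)
# ... then additionally switch off every uncommon task column, so those
# outputs never contribute to the loss or receive a gradient signal.
common = np.zeros(n_tasks, dtype=bool)
common[common_idx] = True
mask &= common[None, :]
```

Columns outside common_idx stay masked regardless of whether a label exists, which reproduces the restricted-subset training runs without changing the architecture.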

The models for the ChEMBL 21, 22 and 23 datasets still show a similar trend, but for ChEMBL 24 and 25 there are again drops in AUC scores (see figure 7.4). The mean AUC score for fold 1 from the LSC study is higher than the mean AUC of the models trained on the restricted datasets (see table 7.2).

ChEMBL Release   LSC Fold1   21       22       23       24       25
Mean AUC         0.7356      0.7267   0.7311   0.7287   0.7331   0.7349

Table 7.1: Mean AUC scores for semisparse features across the different ChEMBL datasets.


[Figure: five scatterplots, one per ChEMBL release (21-25), each titled "ChEMBL N AUC Scores vs Trainset Size"; x-axis: Trainset Size (10 to 100,000, log scale); y-axis: AUC Scores (0.0 to 1.0).]

Figure 7.1: Scatterplots of AUC scores vs. training set size for the different ChEMBL releases. Splitting the dataset into separate folds results in some assays with fewer than 100 compounds in the training set.


[Figure: boxplots of assay AUC scores on the test set for the full ChEMBL datasets; x-axis: Datasets (ChEMBL20 Reference Fold1, ChEMBL21-25); y-axis: Assay AUC (0.0 to 1.0).]

Figure 7.2: The assay AUC scores are relatively consistent, but there are several assays which jump across the different ChEMBL releases.


[Figure: six panels titled "AUC Scores for Testset conditioned on LSC AUC range X to Y" for the ranges 0.0-0.5, 0.5-0.6, 0.6-0.7, 0.7-0.8, 0.8-0.9 and 0.9-1.0; x-axis: Datasets (ChEMBL20 Reference Fold1, ChEMBL21-25); y-axis: Assay AUC (0.0 to 1.0).]

Figure 7.3: Conditioning the assay scores on different AUC interval ranges reveals different distributions for the AUC scores. Lower AUC assays show higher variance than assays with higher AUC.


ChEMBL Release   LSC Fold1   21       22       23       24       25
Mean AUC         0.7356      0.7290   0.7292   0.7251   0.7256   0.7296

Table 7.2: Mean AUC scores for semisparse features across the different ChEMBL datasets reduced to the common subset of assays used for training.

As the datasets were restricted to the same set of assays, the drops in assay AUC are likely due to the compound composition of the datasets, but could also be due to the choice of architecture:

Architectural effects

Reusing the same architectures with the same number of outputs made it possible to reuse the datasets by simply masking out the uncommon assays during loss computation. It was later realized that the number of output tasks not only reduces the ratio of "active" weights that receive a gradient signal, but at the same time also alters the initialization of the weights in the output layer due to the choice of initialization scheme. In the experiments the output layers are initialized with TensorFlow's Glorot Uniform initialization, where the weights are randomly drawn from

U(-sqrt(6 / (fan_in + fan_out)), +sqrt(6 / (fan_in + fan_out))).

For wider output layers caused by a larger number of tasks this distribution has a smaller range. As an example: if the output layer is 2000 x 2200 but only 1046 assays are used, the initial values of the weights will lie in the range [-0.0378, 0.0378] instead of [-0.0444, 0.0444] (for a 2000 x 1046 layer). This narrower region of the weight space, combined with the fact that only a subset of the layer is updated (1046 used versus 2200 total assays), might impact the optimization process and ultimately result in models with different minima. In these experiments the effect is stronger because a large part of the weights of the output layers is inactive. But also in the standard setting, where the assay labels are already quite sparse, especially when merging different databases (e.g. ChEMBL, PubChem, ZINC15) and co-occurrence between assays is low, results might be affected by such changes. The drops in AUC scores for high compound frequency assays could be influenced by the slightly different initialization, especially given that the models for the older ChEMBL dataset releases perform better.
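The two weight ranges quoted above follow directly from the Glorot uniform limit (a quick check of the arithmetic; fan_in = 2000 hidden units, fan_out = number of output tasks):

```python
from math import sqrt

def glorot_limit(fan_in, fan_out):
    # Half-width of the Glorot/Xavier uniform range U(-limit, +limit).
    return sqrt(6.0 / (fan_in + fan_out))

wide = glorot_limit(2000, 2200)    # full output layer with 2200 tasks
narrow = glorot_limit(2000, 1046)  # restricted layer with 1046 tasks
```

Rounding to four decimals gives 0.0378 for the 2200-task layer and 0.0444 for the 1046-task layer, i.e. adding tasks shrinks the initial weight range of every output, including the common ones.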

Studying the effects of initialization in combination with the distribution of assays is out of scope for this work, but the role of initialization deserves more attention and is often overlooked. In practice the initialization schemes used are rarely reported and are not handled consistently across frameworks (e.g. TensorFlow and PyTorch use different default initialization schemes for dense layers, or the same initialization scheme introduces an additional gain term depending on the activation function used).


[Figure: for each ChEMBL release (21-25), scatterplots of assay AUC vs. trainset size for the full datasets ("Assay AUC vs. Trainset Size Full") and the restricted subset ("Assay AUC vs. Trainset Size Subset"), plus a comparison panel ("Comparison of Assay Scores Full vs. Subset"); axes: Trainset Size (10 to 100,000, log scale) and Assay AUC (0.0 to 1.0).]

Figure 7.4: Comparison of assay AUC vs. trainset size for the experiments using all assays (Assays Full) and only the subset (Assays Subset). Comparing the assay scores shows that the AUC scores follow a linear trend (black lines have slope 1 and were not fitted to the data). For the restricted subset there are more drops in AUC for high compound frequency assays.


[Figure: boxplots of assay AUC scores on the test set for the ChEMBL subset datasets; x-axis: Datasets (ChEMBL20 Reference Fold1, ChEMBL21-25); y-axis: Assay AUC (0.0 to 1.0).]

Figure 7.5: AUC scores across the Assays Subset are also consistent. There are again some noisy assays that jump a lot across ChEMBL releases.


Chapter 8

Discussion

This thesis introduced a platform for automatic model building for drug target prediction: a set of tools for automatically extracting new datasets from various chemical databases. Compiling a new dataset using the platform's processing pipelines now only requires a few days instead of weeks, greatly improving productivity. Using the platform it is easy to compute a wide range of different features and to study the impact of different combinations of features and the effects of the assay distribution on model performance.

As a demonstration, the platform was used to extract a separate dataset from each of the last five releases of the ChEMBL database. Each dataset was then used to train a multi-task neural network to investigate the impact of dataset composition in terms of the number of compounds and the number of assays. Using the full datasets with additional assays, it was possible to obtain AUC scores similar to the results from the LSC study [45]. Restricting the set of assays to the common assays shared between all datasets also resulted in reasonable mean AUC values. The experiments show that rerunning the same extraction pipeline (with identical parameters) on different database releases and training on the resulting datasets leads to models with similar performance as reported previously by Mayr et al. Nevertheless this problem still requires further study. Based on these limited experiments it is still not clear how the combination of assay distribution and sparsity of the tasks affects the performance of a model. However, using the presented platform, it is now possible to automatically train a large set of models in order to gain a better understanding of the performance of neural networks for drug target prediction.

8.1 Contributions

The code for all platform components is published under the Target Prediction Pipeline organization in the internal GitLab repository of the Institute for Machine Learning at Johannes Kepler University Linz.

The code is split into multiple repositories:

• tpp_airflow: Apache Airflow configuration files and DAGs for the ChEMBL database

• tpp_python: Python programs and PySpark jobs for dataset extraction, processing and descriptor computation using Python libraries (RDKit [36], Mordred Descriptor [46])

• tpp_scala: Spark jobs for descriptor computation using Java libraries (CDK, jCompoundMapper [28])

• tpp_tensorflow: TensorFlow 2.0 code for multi-task networks utilizing TFRecords as input format

• tpp_tutorial: a tutorial explaining the platform and its components

• jCompoundMapper: an updated version of the jCompoundMapper code by Hinselmann et al. [28]


Bibliography

[1] NVIDIA DALI. https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/index.html, 2019. Accessed: 2019-11-09.

[2] Spark TensorFlow connector. https://github.com/tensorflow/ecosystem/tree/master/spark/spark-tensorflow-connector, 2019. Accessed: 2019-11-09.

[3] TFRecords. https://www.tensorflow.org/tutorials/load_data/tfrecord, 2019. Accessed: 2019-11-09.

[4] Tox21. https://ntp.niehs.nih.gov/go/tox21, 2019. Accessed: 2019-11-09.

[5] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.

[6] Akshay Agrawal, Akshay Naresh Modi, Alexandre Passos, Allen Lavoie, Ashish Agarwal, Asim Shankar, Igor Ganichev, Josh Levenberg, Mingsheng Hong, Rajat Monga, et al. TensorFlow Eager: A multi-stage, Python-embedded DSL for machine learning. arXiv preprint arXiv:1903.01855, 2019.

[7] Eric Anderson, Gilman D Veith, and David Weininger. SMILES, a line notation and computerized interpreter for chemical structures. US Environmental Protection Agency, Environmental Research Laboratory, 1987.

[8] Apache Airflow. Apache Airflow. https://airflow.apache.org/, 2019. Accessed: 2019-11-09.

[9] Apache Parquet. Apache Parquet. https://parquet.apache.org/, 2019. Accessed: 2019-11-09.

[10] Apache Spark. Apache Spark. https://spark.apache.org/, 2019. Accessed: 2019-11-09.


[11] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.

[12] Atılım Güneş Baydin, Barak A Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18(153), 2018.

[13] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.

[14] Dong-Sheng Cao, Qing-Song Xu, Qian-Nan Hu, and Yi-Zeng Liang. ChemoPy: freely available python package for computational biology and chemoinformatics. Bioinformatics, 29(8):1092–1094, 2013.

[15] Fan Cao and Melissa J Fullwood. Inflated performance measures in enhancer–promoter interaction-prediction methods. Nature Genetics, 51(8):1196–1198, 2019.

[16] Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.

[17] Rich Caruana, Steve Lawrence, and C Lee Giles. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In Advances in Neural Information Processing Systems, pages 402–408, 2001.

[18] Artem Cherkasov, Eugene N Muratov, Denis Fourches, Alexandre Varnek, Igor I Baskin, Mark Cronin, John Dearden, Paola Gramatica, Yvonne C Martin, Roberto Todeschini, et al. QSAR modeling: where have you been? Where are you going to? Journal of Medicinal Chemistry, 57(12):4977–5010, 2014.

[19] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.

[20] Mark Davies, Michał Nowotka, George Papadatos, Nathan Dedman, Anna Gaulton, Francis Atkinson, Louisa Bellis, and John P Overington. ChEMBL web services: streamlining access to drug discovery data and utilities. Nucleic Acids Research, 43(W1):W612–W620, 2015.

[21] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.

[22] Thomas Engel and Johann Gasteiger. Chemoinformatics: basic concepts and methods. John Wiley & Sons, 2018.

[23] Anna Gaulton, Anne Hersey, Michał Nowotka, A Patrícia Bento, Jon Chambers, David Mendez, Prudence Mutowo, Francis Atkinson, Louisa J Bellis, Elena Cibrián-Uhalte, et al. The ChEMBL database in 2017. Nucleic Acids Research, 45(D1):D945–D954, 2016.


[24] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1263–1272. JMLR.org, 2017.

[25] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

[26] Robbi Gruener, Owen Cheng, and Yevgeni Litvin. Introducing Petastorm: Uber ATG's data access library for deep learning, 2018. URL https://eng.uber.com/petastorm/.

[27] Stephen R Heller, Alan McNaught, Igor Pletnev, Stephen Stein, and Dmitrii Tchekhovskoi. InChI, the IUPAC International Chemical Identifier. Journal of Cheminformatics, 7(1):23, 2015.

[28] Georg Hinselmann, Lars Rosenbaum, Andreas Jahn, Nikolas Fechner, and Andreas Zell. jCompoundMapper: An open source Java library and command-line tool for chemical fingerprints. Journal of Cheminformatics, 3(1):3, 2011.

[29] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[30] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[31] Jeremy L Jenkins, Andreas Bender, and John W Davies. In silico target fishing: Predicting biological targets from chemical structure. Drug Discovery Today: Technologies, 3(4):413–421, 2006.

[32] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pages 1–12. IEEE, 2017.

[33] Sunghwan Kim, Jie Chen, Tiejun Cheng, Asta Gindulyte, Jia He, Siqian He, Qingliang Li, Benjamin A Shoemaker, Paul A Thiessen, Bo Yu, et al. PubChem 2019 update: improved access to chemical data. Nucleic Acids Research, 47(D1):D1102–D1109, 2018.

[34] Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. In Advances in Neural Information Processing Systems, pages 971–980, 2017.

[35] Anders Krogh and John A Hertz. A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems, pages 950–957, 1992.

[36] Greg Landrum et al. RDKit: Open-source cheminformatics software, 2018. URL http://www.rdkit.org/. Software available from https://github.com/rdkit/rdkit.


[37] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[38] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

[39] Alexander Lex, Nils Gehlenborg, Hendrik Strobelt, Romain Vuillemot, and Hanspeter Pfister. UpSet: visualization of intersecting sets. IEEE Transactions on Visualization and Computer Graphics, 20(12):1983–1992, 2014.

[40] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

[41] Christopher A Lipinski, Franco Lombardo, Beryl W Dominy, and Paul J Feeney. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Advanced Drug Delivery Reviews, 23(1-3):3–25, 1997.

[42] A Luntz and V Brailovsky. On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica, 3, 1969.

[43] Junshui Ma, Robert P Sheridan, Andy Liaw, George E Dahl, and Vladimir Svetnik. Deep neural nets as a method for quantitative structure–activity relationships. Journal of Chemical Information and Modeling, 55(2):263–274, 2015.

[44] Andreas Mayr, Günter Klambauer, Thomas Unterthiner, and Sepp Hochreiter. DeepTox: toxicity prediction using deep learning. Frontiers in Environmental Science, 3:80, 2016.

[45] Andreas Mayr, Günter Klambauer, Thomas Unterthiner, Marvin Steijaert, Jörg K Wegner, Hugo Ceulemans, Djork-Arné Clevert, and Sepp Hochreiter. Large-scale comparison of machine learning methods for drug target prediction on ChEMBL. Chemical Science, 9(24):5441–5451, 2018.

[46] Hirotomo Moriwaki, Yu-Shi Tian, Norihito Kawashita, and Tatsuya Takagi. Mordred: a molecular descriptor calculator. Journal of Cheminformatics, 10(1):4, 2018.

[47] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.

[48] OpenAI. OpenAI Five. https://blog.openai.com/openai-five/, 2018.

[49] Luis Perez and Jason Wang. The effectiveness of data augmentation in image classification using deep learning. arXiv preprint arXiv:1712.04621, 2017.

[50] Igor Pletnev, Andrey Erin, Alan McNaught, Kirill Blinov, Dmitrii Tchekhovskoi, and Steve Heller. InChIKey collision resistance: an experimental testing. Journal of Cheminformatics, 4(1):39, 2012.


[51] Ryan Prenger, Rafael Valle, and Bryan Catanzaro. WaveGlow: A flow-based generative network for speech synthesis. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3617–3621. IEEE, 2019.

[52] David Rogers and Mathew Hahn. Extended-connectivity fingerprints. Journal of Chemical Information and Modeling, 50(5):742–754, 2010.

[53] David E Rumelhart, Geoffrey E Hinton, Ronald J Williams, et al. Learning representations by back-propagating errors. Cognitive Modeling, 5(3):1, 1988.

[54] Jack W Scannell, Alex Blanckley, Helen Boldon, and Brian Warrington. Diagnosing the decline in pharmaceutical R&D efficiency. Nature Reviews Drug Discovery, 11(3):191, 2012.

[55] David Silver, Aja Huang, Christopher J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484–503, 2016. URL http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html.

[56] Robert R Sokal. A statistical method for evaluating systematic relationships. Univ. Kansas Sci. Bull., 38:1409–1438, 1958.

[57] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[58] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.

[59] Benoit Steiner, Zachary DeVito, Soumith Chintala, Sam Gross, Adam Paszke, Francisco Massa, Adam Lerer, Gregory Chanan, Zeming Lin, Edward Yang, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.

[60] Teague Sterling and John J Irwin. ZINC 15 - ligand discovery for everyone. Journal of Chemical Information and Modeling, 55(11):2324–2337, 2015.

[61] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 3319–3328. JMLR.org, 2017.

[62] TensorFlow Extended. TensorFlow Extended (TFX). https://www.tensorflow.org/tfx, 2019. Accessed: 2019-11-09.

[63] Roberto Todeschini and Viviana Consonni. Handbook of Molecular Descriptors, volume 11. John Wiley & Sons, 2008.


[64] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

[65] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michael Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, pages 1–5, 2019.

[66] Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, et al. Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135, 2017.

[67] Egon L Willighagen, John W Mayfield, Jonathan Alvarsson, Arvid Berg, Lars Carlsson, Nina Jeliazkova, Stefan Kuhn, Tomáš Pluskal, Miquel Rojas-Chertó, Ola Spjuth, et al. The Chemistry Development Kit (CDK) v2.0: atom typing, depiction, molecular formulas, and substructure searching. Journal of Cheminformatics, 9(1):33, 2017.

[68] David H Wolpert. The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7):1341–1390, 1996.

[69] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144, 2016. URL http://arxiv.org/abs/1609.08144.

[70] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu. A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596, 2019.

[71] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.


Appendices


Appendix A

Heatmaps of Assay Co-occurrence

The following figures show the co-occurrence of the assays across the five ChEMBL releases. Each graphic provides an overview of how frequently compounds have measurements for a pair of assays. ChEMBL 22 (A.2) introduces several new assays which frequently appear together. The new assays in ChEMBL 22-25 (A.2, A.3, A.4, A.5) mainly occur with the latest assays and do not appear together with older assays. With ChEMBL 24 (A.4) several old assays are removed from the dataset. In ChEMBL 25 (A.5) another assay with high co-occurrence is removed and a set of co-occurring assays is added. The graphics are best viewed in the PDF version of the thesis.
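The heatmaps summarize a simple pairwise count (a sketch assuming a binary measurement indicator matrix M with one row per compound and one column per assay; M itself is a hypothetical toy example):

```python
import numpy as np

# Hypothetical indicator matrix: M[i, j] = 1 if compound i has a
# measurement (active or inactive) for assay j, else 0.
M = np.array([[1, 1, 0],
              [1, 0, 1],
              [1, 1, 0]])

# cooc[a, b] = number of compounds measured in both assay a and assay b;
# the diagonal holds the per-assay compound counts.
cooc = M.T @ M
```

Plotting cooc as a heatmap, with assays ordered by release, produces figures of the kind shown in A.1 to A.5.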


Figure A.1: Assay co-occurrence - ChEMBL21


Figure A.2: Assay co-occurrence - ChEMBL22


Figure A.3: Assay co-occurrence - ChEMBL23


Figure A.4: Assay co-occurrence - ChEMBL24


Figure A.5: Assay co-occurrence - ChEMBL25
