
From the Institute of Animal Breeding and Husbandry
of the Faculty of Agricultural and Nutritional Sciences
of Christian-Albrechts-University of Kiel

___________________________________________________________________

ARTIFICIAL NEURAL NETWORKS FOR
GENOME-ENABLED PREDICTION IN CATTLE:
POTENTIAL AND LIMITATIONS

Dissertation
submitted in fulfillment of the requirements for the doctoral degree
of the Faculty of Agricultural and Nutritional Sciences
of Christian-Albrechts-University of Kiel

submitted by
M.Sc. agr. ANITA EHRET
from Hannover, Lower Saxony

Dean: Prof. Dr. Dr. h.c. Rainer Horn
First referee: Prof. Dr. Georg Thaller
Second referee: Prof. Dr. Dr. h.c. mult. Daniel Gianola
Date of oral examination: 2 July 2014

___________________________________________________________________

This dissertation was prepared with the gratefully acknowledged financial support of the BMBF within the AgroClustEr "SYNBREED - Synergistic Plant and Animal Breeding" (Project A3.3).

Publication series of the Institute of Animal Breeding and Husbandry of Christian-Albrechts-University of Kiel; No. 206, 2014
© 2014 Self-published by the Institute of Animal Breeding and Husbandry of Christian-Albrechts-University of Kiel, Olshausenstraße 40, 24098 Kiel
Editor: Prof. Dr. Dr. h.c. mult. E. Kalm
ISSN: 0720-4272
Printed with the permission of the Dean of the Faculty of Agricultural and Nutritional Sciences of Christian-Albrechts-University of Kiel

To my family

ANN: Anitas Neural Network, version 1.4

 (__)              (__)
 (oo)              (oo)
  /------\/  cows  \/------\
 / |     || beware ||     | \
*  /\---/\          /\---/\  *
   ~~  ~~            ~~  ~~

“Another Chapter in the biological Alice-in-Wonderland”

(Hazel 1942)

Table of Contents

CHAPTER 1  General Introduction
CHAPTER 2  Artificial neural networks for genome-enabled prediction in animal and plant breeding: A review
CHAPTER 3  An empirical application of neural networks with back-propagation to genome-enabled predictions of complex traits in Holstein Friesian and German Fleckvieh cattle
CHAPTER 4  Use of genomic and metabolic information as well as milk performance records for prediction of subclinical ketosis risk via artificial neural networks
CHAPTER 5  Extreme Learning Machine: A New Approach for Genomic Prediction of Complex Traits
CHAPTER 6  General Discussion
CHAPTER 7  General Summary
CHAPTER 8  Zusammenfassung (German Summary)


CHAPTER 1


General Introduction

Genome-enabled prediction is increasingly important in animal and plant breeding and has also received attention in human genetics (de los Campos et al., 2013). In the last few years, the rapid development of high-throughput genotyping and sequencing technologies has allowed large-scale genotyping of thousands of genetic markers (e.g. single-nucleotide polymorphisms, SNPs) in the genomes of plant and animal species. Such dense panels of molecular markers allow mapping of genomic regions (i.e. quantitative trait loci, QTL) in individuals and populations through linkage disequilibrium between single markers and gene variants. Most traits are assumed to be determined by large numbers of genes with small effects (Falconer and Mackay, 1996), acting additively to express the phenotype of an individual. Based on this assumption, Meuwissen et al. (2001) derived a ground-breaking approach for genome-based prediction of complex traits: a whole-genome regression model in which phenotypes are linearly regressed on thousands of markers simultaneously. In contrast to pedigree-based predictions, this approach allows accurate prediction of the genetic values of animals early in life. A few years later, Schaeffer (2006) laid the foundations for the great success of this approach in practice. In recent years genome-enabled prediction has gained ground both in animal (VanRaden et al., 2009) and plant breeding (Bernardo and Yu, 2007; Crossa et al., 2010), and genomic selection (GS) has been implemented in several plant and animal breeding programs.

At the same time, the phenomenon of "missing heritability" of complex traits attracted worldwide interest in the scientific community of animal breeders and in related research areas such as human medicine (Manolio et al., 2009). Although genome-enabled studies had already provided valuable insights into the genetic basis of complex traits, they explained only a relatively small proportion of the heritability of most of them (Eichler et al., 2010). Many explanations for this missing (or hidden) heritability have been suggested, and several working groups pointed out that the fundamental problem might be a mismatch between the genetic architecture of complex traits and the statistical techniques applied (Yang et al., 2010; de los Campos et al., 2010). The existence of non-additive effects and possible higher-order non-linearities is therefore an important explanatory approach: genetic mechanisms might involve complex interactions among genes, between genes and environmental conditions, or epigenetic effects that are not fully captured by additive models (de los Campos et al., 2010).

In this context there has been growing interest in using semi- and non-parametric methods for genome-enabled prediction of quantitative traits, to account for non-additive gene effects and higher-order non-linearities as well as genotype-by-environment interactions (Gianola and van Kaam, 2008; Gianola et al., 2006; de los Campos et al., 2010). However, when performing genome-enabled predictions by regressing phenotypes on high-density SNP panels, the number of markers (p) vastly exceeds the number of available samples (n). Fitting this "large p, small n" regression is a statistical problem, so such regressions require some type of variable selection or shrinkage estimation (de los Campos et al., 2013). Because of the potential loss of information through such data- and parameter-reduction methods, pattern recognition-based methods have become promising alternatives for genome-enabled approaches. Pattern recognition-based methods are part of a wide range of machine learning methods that are able to detect and characterize gene interactions expressed as genotype patterns (Musani et al., 2007). In this context, so-called artificial neural networks (ANNs) are particularly interesting tools for genome-enabled prediction, as they can take advantage of the full dimension of the data. ANNs are regarded as universal approximators of complex functions (Hornik et al., 1989). According to Kolmogorov's theorem, these networks can represent arbitrary functions, and thus might be able to approximate non-linear relationships between predictors and responses. This property is achieved through a universal learning ability, which is provided by particular training algorithms. With respect to genome-enabled prediction this concept offers a further advantage: no fixed genetic model has to be assumed, so the genetic architecture of the target trait may remain unknown a priori.

In the following, a more general overview is first provided of the historical background and development of ANNs in the field of artificial intelligence. Research on ANNs has been inspired by the structure and information processing of biological nervous systems, in particular the brain with its densely connected neurons. This natural concept is used to express statistical models in the form of network diagrams (Pereira and Rao, 2009). The first models of ANNs were introduced by Warren S. McCulloch and Walter Pitts in 1943. They described the function of a biological neuron through a simple threshold model based on logical inputs (active=1, inactive=0; McCulloch and Pitts, 1943) and showed that combinations of artificial neurons are able to mimic nearly any logical and arithmetic function. In 1949, the psychologist Donald O. Hebb postulated that such networks can learn through changes in the activation level of neurons that are proportional to changes in the strength of the connections between them (Hebb, 1949). Both the advent of the computer and the learning ability of ANNs led to intense experimentation and further refinement of these basic concepts in the following years. In 1956, ANNs were brought to broad attention when Marvin Minsky, John McCarthy, Nathaniel Rochester and Claude Shannon organized the first international conference on artificial intelligence on the campus of Dartmouth College in Hanover, New Hampshire (Garson, 1998). A few years later, the neuropsychologist Frank Rosenblatt introduced a fundamental principle, the perceptron model, to the artificial intelligence community (Rosenblatt, 1958). The original concept was conceived for classification tasks: the perceptron was able to recognize characters, but it was not yet capable of generalizing its classifications to new patterns. Substantial developments followed in subsequent years. Then, in 1969, Marvin Minsky and Seymour Papert showed that single-layer perceptrons were not able to solve "exclusive or" (XOR) problems (Minsky and Papert, 1969). Hence, ANNs were almost forgotten until, in 1982, the physicist John Hopfield rediscovered their natural possibilities and introduced his Hopfield neural network (Hopfield, 1982). This led ANNs through another period of applications and innovations. During this time, the currently well-known multilayer perceptron was introduced by Paul Werbos (Werbos, 1974). However, it was not until 1986 that ANNs achieved wide popularity, when David Rumelhart introduced a ground-breaking learning rule, the back-propagation of error algorithm, which conferred universal learning ability on the multilayer perceptron (Rumelhart et al., 1986). Many of the advantages of ANNs stem from the great flexibility of the concept, such as innovative architectures and training rules, which are just as important as the availability of powerful computers (Hagan et al., 1996). Developments and applications in the area of ANNs are increasing steadily; ANNs are very flexible machines that have been introduced to many application fields, such as classification, pattern recognition, prediction and forecasting, process control, optimization and decision support (Hanrahan, 2011). But working with this machinery is not trivial, as ANNs do not easily produce stable outputs. Furthermore, computational issues have to be considered when ANNs are used in empirical studies. In the neural network community, there is a general consensus that the performance of ANNs in, e.g., prediction strongly depends on the data structure, the settings of the learning algorithm and the network architecture. Because of their complexity and computational costs, the number of studies reporting the use of ANNs for genome-enabled prediction in animal and plant breeding remains small.

The thesis is organized as follows. Chapter 2 provides a more detailed survey of the methodology and of recent applications to genome-enabled prediction in animal and plant breeding. An application of ANNs to genome-enabled prediction using real data is then presented in Chapter 3. Different data sets, network input data and network configurations were tested to assess the influence of several factors on the predictive performance for milk traits in dairy cattle. The results emphasize that ANNs can be implemented on large-scale SNP data, but were not able to outperform standard linear techniques in animal breeding. Differences in prediction performance were small when dimension-reduced inputs were used, whereas ANNs failed to produce reasonable predictions when the entire SNP-marker information was used as input to the network. In summary, the results of Chapter 3 showed, first, that feature selection methods are recommended when real large-scale data are used in ANNs for genome-enabled prediction. Second, due to their complexity, ANNs reach limits in predicting traits that seem to behave mainly additively (e.g. milk traits) and that can be predicted well with less parameterized linear methods. ANNs might show their predictive power when predicting traits for which non-linearity is evident (e.g. functional traits) and parametric methods fail. Therefore, in Chapter 4 ANNs were applied to the prediction of subclinical ketosis risk, a functional trait for which non-linearity is plausible. In predicting traits where several factors determine the phenotype, ANNs appear to be a sophisticated tool for incorporating different sources of information simultaneously to enhance prediction performance and accuracy. In Chapter 5 a modified ANN approach was introduced for genome-enabled prediction of complex traits in animal breeding. A new learning algorithm was implemented in ANNs to take advantage of the full dimension of the data without losing performance through model complexity. The extreme learning machine (ELM) is a basis-function expansion method introduced to the artificial intelligence community by Huang et al. (2004). The ELM can be used instead of training an ANN with back-propagation or other learning rules. Far fewer parameters are used, and therefore this method is promising in cases where features vastly exceed sample size and the full dimension of the data can be used. Results show that the modified network outperforms back-propagation-trained ANNs when using the entire marker matrix, while reducing computational costs.

References

Bishop, C. M. 2006. Pattern Recognition and Machine Learning. Springer, New York.

Garson, G. D. 1998. Neural Networks: An Introductory Guide for Social Scientists. Sage Publications, Inc.

Gianola, D., H. Okut, K. Weigel, and G. Rosa. 2011. Predicting complex quantitative traits with Bayesian neural networks: a case study with Jersey cows and wheat. BMC Genetics 12:87.

Hagan, M. T., H. B. Demuth, and M. Beale. 1996. Neural Network Design. PWS, Boston.

Hanrahan, G. 2011. Artificial Neural Networks in Biological and Environmental Analysis. CRC Press, Inc.

Hastie, T., R. Tibshirani, J. Friedman, and J. Franklin. 2009. The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer 27:83-85.

Hebb, D. 1949. The Organization of Behavior. Wiley, New York.

Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences 79:2554-2558.

Hornik, K., M. Stinchcombe, and H. White. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2:359-366.

Manolio, T. A., F. S. Collins, N. J. Cox, D. B. Goldstein, L. A. Hindorff, D. J. Hunter, M. I. McCarthy, E. M. Ramos, L. R. Cardon, A. Chakravarti, J. H. Cho, A. E. Guttmacher, A. Kong, L. Kruglyak, E. Mardis, C. N. Rotimi, M. Slatkin, D. Valle, A. S. Whittemore, M. Boehnke, A. G. Clark, E. E. Eichler, G. Gibson, J. L. Haines, T. F. C. Mackay, S. A. McCarroll, and P. M. Visscher. 2009. Finding the missing heritability of complex diseases. Nature 461:747-753.

McCulloch, W. and W. Pitts. 1943. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics 5:115-133.

Minsky, M. and S. Papert. 1969. Perceptrons. M.I.T. Press, Oxford, England.

Musani, S. K., D. Shriner, N. Liu, R. Feng, C. S. Coffey, N. Yi, H. K. Tiwari, and D. B. Allison. 2007. Detection of gene × gene interactions in genome-wide association studies of human population data. Human Heredity 63:67-84.

Pereira, B. D. B. and C. R. Rao. 2009. Data Mining Using Neural Networks: A Guide for Statisticians. http://www.po.ufrj.br/basilio/publicacoes/livros/2009_datamining_using_neural_networks.pdf

Rosenblatt, F. 1958. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review 65:386.

Rumelhart, D. E., G. E. Hinton, and R. J. Williams. 1986. Learning representations by back-propagating errors. Nature 323:533-536.

Widrow, B. and M. E. Hoff. 1960. Adaptive switching circuits. IRE WESCON Convention Record 4:96-104.

Yang, J., B. Benyamin, B. P. McEvoy, S. Gordon, A. K. Henders, D. R. Nyholt, P. A. Madden, A. C. Heath, N. G. Martin, G. W. Montgomery, M. E. Goddard, and P. M. Visscher. 2010. Common SNPs explain a large proportion of the heritability for human height. Nature Genetics 42:565-569.


CHAPTER 2


Artificial neural networks for genome-enabled prediction in animal and plant breeding: A review

Anita Ehret1, Llibertat Tusell2, Daniel Gianola2,3,4 and Georg Thaller1

1 Institute of Animal Breeding and Husbandry, Christian-Albrechts-University, Germany
2 Department of Animal Sciences, University of Wisconsin-Madison, USA
3 Department of Dairy Science, University of Wisconsin, USA
4 Department of Biostatistics and Medical Informatics, University of Wisconsin, USA

Submitted for publication in Theoretical and Applied Genetics

Abstract

In genome-enabled prediction, different regression machines that are non-linear in either features or parameters have recently been proposed and tested with empirical data. Because of their flexibility, artificial neural networks are good candidates for enhancing prediction of complex traits, especially if these are underlain by non-additive gene action such as dominance and epistatic interactions. The aim of this work is to give an overview of artificial neural network methodology and recent applications in animal and plant breeding, with a special focus on genome-enabled prediction of complex traits. Multilayer feed-forward networks and radial basis function networks are the most frequently used neural networks in genome-enabled prediction of complex traits. Successful implementation requires avoiding over-fitting of the training data, to enhance the predictive ability for future records, while keeping computational costs affordable. This is a data-dependent and important model selection task with a crucial influence on the prediction of yet-to-be observed phenotypes. Aspects such as data preprocessing, the choice of a proper network architecture and the setting of parameter values must be assessed carefully. Molecular marker pre-selection, or the use of genomic relationships instead of whole-genome markers as inputs to the network, seems advisable to overcome over-parameterization and over-fitting and to reduce computational costs. Recent studies in animal and plant breeding suggest that the superiority of properly specified artificial neural networks over other prediction machines used in genome-enabled prediction depends strongly on the species, the target trait, the environmental factors and the sample size, as well as on the genetic relatedness between individuals in the training and testing sets.

Keywords: artificial neural networks, plant breeding, animal breeding, genome-enabled prediction

Introduction

Classical prediction models based on additive inheritance, successfully applied in the breeding and genetics of agricultural species, often mismatch the actual genetic architecture of complex traits, which may involve non-additive effects such as various types of gene-by-gene interactions and genotype-by-environment interactions (e.g. Yang et al., 2010). How dominance and other effects can contribute to the understanding of quantitative genetic variation of complex traits is discussed by Wellmann et al. (2011). To relax the additivity assumption in the context of whole-genome-enabled prediction, non-parametric and machine learning methods, such as artificial neural networks (ANNs), have recently been proposed with the expectation of outperforming standard linear whole-genome approaches in predictive performance (e.g. Okut et al., 2011; Gianola et al., 2011). A good overview and comparison of the prediction skills and requirements of the wide range of methods (also addressing ANNs) commonly used in whole-genome prediction in animal and plant breeding is provided by de los Campos et al. (2013).

The technique of ANNs was developed in the field of artificial intelligence to emulate a biological nervous system (neural network) as a mathematical information processing system. In general, ANNs are interconnected artificial neurons organized in several layers; thus they mimic the structure of the human brain (Pereira and Rao, 2009). Algebraically, an ANN can be viewed as a schematic of Kolmogorov's theorem (1957) for the representation of multivariate functions (Kůrková, 1992), such that ANNs can be seen as universal function approximators. A great advantage of ANNs is their "universal learning ability", which is achieved during a training phase, such that they are able to learn properties (patterns) from, e.g., genomic data without the need to explicitly define a genetic model. After successful training, the information is stored in the connection strengths of the ANN and is dynamically adapted in a process that is not centralized but highly parallel and distributed. Hence, an ANN can approximate solutions to problems of the same class as those it was trained on, even when the problem is highly non-linear (Kriesel, 2007). A properly specified ANN can thus be a powerful tool for learning about complex traits affected by cryptic genomic interactions and for predicting future outcomes, such as phenotypes in personalized medicine.

In spite of this great potential, the use of such flexible machinery is not straightforward: ANNs are prone to over-fit the training data and are computationally costly when generalization methods are used. Moreover, due to their great flexibility, results can be variable in cross-validation, i.e., outputs may not be stable.

This work aims to give an overview of ANNs and their current and potential applications in animal and plant breeding, with special emphasis on genome-enabled prediction of complex traits. First, the types of ANN models used for regression analyses in this field are described. Second, specific data setting requirements and evaluation parameters for properly specifying an ANN are discussed. Finally, ANN applications, properties and limitations in the field of genetics are reviewed and appraised critically.

ANN models for genome-enabled prediction

ANNs transfer readily to genome-enabled prediction; in general, they can be implemented in a non-penalized, a penalized, or a Bayesian framework (de los Campos et al., 2013). The two ANN types most used in genome-enabled prediction of complex traits are multilayer feed-forward networks, also called multilayer feed-forward perceptrons (MLP), and radial basis function networks (RBFNN).

Multilayer feed-forward neural networks

The MLP is the most widely used neural network for universal classification and regression approaches. It is an extension of the concept of Rosenblatt's perceptron (Rosenblatt, 1958), in which single artificial neurons are interconnected across several layers. The information always flows in one direction, so that the outputs from one layer form the inputs to the next layer. The network is completely linked if every single neuron is connected with all neurons in the next layer.

The two-layer feed-forward neural network, also called single hidden layer perceptron, covers most of the statistical applications and is able to solve classification problems that are not linearly separable (Hagan et al., 1996). Its architecture consists of one input, one hidden and one output layer, as shown in Figure 1.

Figure 1 Architecture of a two-layer feed-forward neural network with three neurons in the hidden layer. x_ip = network input, e.g. marker genotype p of individual i; w1.p = network weight from the input to a neuron of the hidden layer; w2.s = network weight from the hidden to the output layer for neuron s (s=1,...,5); y_i = network output, e.g. fitted phenotype of individual i; f(.) = activation function of the hidden neurons; g(.) = activation function of the output neuron. Σ denotes the computation of each of the neurons, "bias1" represents an intercept for each of the neurons in the hidden layer, and "bias2" is an intercept in the output layer.

Intuitively, a two-layer feed-forward neural network can be thought of as a two-step regression (Hastie et al., 2009). The central idea is to extract linear combinations of the inputs as derived features in the hidden layer, and then to model the target as a function of these features in the output layer.

In the hidden layer of an ANN for genome-enabled prediction, the genome-derived covariates of an individual i (i=1,...,n), x_ij (j=1,...,p, the number of genomic covariates, e.g. marker genotypes), are linearly combined with a vector of weights, w_j^[t], plus an intercept, or "bias", a_t, where t=1,...,s indexes one of the s hidden neurons. The resultant linear combination is then transformed using an activation function f_t(.), giving the output of hidden neuron t (t=1,...,s) for individual i:

  z_i^[t] = f_t( a_t + Σ_{j=1}^{p} w_j^[t] x_ij )                                (1)

The activation function f_t(.) can be either linear or non-linear and has to be monotonically increasing. A non-linear activation function in the hidden layer gives the neural network greater flexibility than standard linear regression models (MacKay, 2008) because the transformation is adaptive via the w parameters. In the output layer, the s data-derived scores and an output intercept ("bias") b are combined and transformed by the use of another activation function to yield the predicted phenotype:

  y_i = g( b + Σ_{t=1}^{s} w_2t z_i^[t] ) + ε_i                                  (2)

Here, w_21, w_22, ..., w_2s are the weights connecting the s derived features with the output, and g(.) is another activation function, which can be linear or non-linear, although an identity function is used in most cases when y_i is a continuous-valued phenotype.
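To make equations (1) and (2) concrete, the following minimal sketch computes the forward pass of such a two-layer MLP in NumPy. It is an illustration only: the array sizes, the tanh hidden activation and the identity output activation are assumptions, not settings taken from any particular study.

```python
import numpy as np

def mlp_forward(X, W1, a, w2, b):
    """Forward pass of a two-layer MLP, following equations (1) and (2).

    X  : (n, p) matrix of genomic covariates (e.g. SNP codes)
    W1 : (p, s) input-to-hidden weights; column t holds w^[t]
    a  : (s,)   hidden biases a_t
    w2 : (s,)   hidden-to-output weights w_2t
    b  : scalar output bias
    """
    Z = np.tanh(X @ W1 + a)   # eq. (1): z_i^[t] = f_t(a_t + sum_j w_j^[t] x_ij)
    return Z @ w2 + b         # eq. (2) with an identity output activation g

# toy example: n=4 individuals, p=6 markers, s=3 hidden neurons
rng = np.random.default_rng(1)
X = rng.integers(-1, 2, size=(4, 6)).astype(float)   # SNPs coded -1/0/1
W1, a = rng.normal(0, 0.1, (6, 3)), np.zeros(3)
w2, b = rng.normal(0, 0.1, 3), 0.0
print(mlp_forward(X, W1, a, w2, b))                  # fitted phenotypes
```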

The training of a two-layer MLP consists of finding the set of weights and biases of the network that minimizes the difference between the network output (i.e. the predicted phenotype) and the observed value (i.e. the true phenotype), typically measured as a sum of squared differences. The adjustment of the weights is done via an iterative process in which several examples of the task are presented to the network (Gurney, 1997). The set of examples, i.e. genotypes and their corresponding phenotypes, presented to the network during training is known as the training set.

The most widely used supervised learning algorithm for MLPs in a regression framework is the back-propagation of errors algorithm (Rumelhart et al., 1986). This rule is based on minimization of the error function and typically employs the gradient descent method to locate the optimal solution (Werbos, 1994). Algebraically, the back-propagation algorithm is a generalized delta rule (Kriesel, 2007) and can be viewed as a non-linear least-squares regression problem. Intuitively, the algorithm can be decomposed into five steps, as shown in Figure 2.


Figure 2 Schematic representation of the back-propagation algorithm. MLP = multilayer perceptron.

First, the weights are initialized with small random values, and a small positive value is chosen for the learning constant. Second, in the forward propagation step, each data point is fed to the input layer and the output is generated based on the current weights. The predicted output is then compared with the true (observed) output value, and the squared error of each unit is calculated. In a third step, the total squared error is propagated back from the output over the hidden layers to the input units. In a fourth step, error gradients in both the output and the hidden layers are computed and used to calculate the weight corrections. Once the weight changes have been computed for all units, all weights are updated at the same time. The last step consists of going back to the second step and iterating the process until the error function converges.
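The five steps can be condensed into a short gradient-descent loop. The sketch below trains the two-layer MLP of equations (1) and (2) by back-propagation on simulated data; the learning rate, the number of iterations and the tanh activation are illustrative choices, not settings from any study reviewed here.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, s = 50, 10, 3
X = rng.integers(-1, 2, size=(n, p)).astype(float)      # SNPs coded -1/0/1
y = X @ rng.normal(0, 0.3, p) + rng.normal(0, 0.1, n)   # toy phenotypes

# step 1: small random initial weights and a small learning rate
W1, a = rng.normal(0, 0.01, (p, s)), np.zeros(s)
w2, b = rng.normal(0, 0.01, s), 0.0
lr = 0.001

for it in range(2000):
    # step 2: forward propagation; squared-error cost J = 0.5 * sum(e^2)
    Z = np.tanh(X @ W1 + a)                  # hidden outputs, eq. (1)
    e = (Z @ w2 + b) - y                     # residuals of eq. (2)
    # steps 3-4: propagate the error back and compute the gradients
    g_w2, g_b = Z.T @ e, e.sum()
    delta = np.outer(e, w2) * (1.0 - Z**2)   # tanh'(u) = 1 - tanh(u)^2
    g_W1, g_a = X.T @ delta, delta.sum(axis=0)
    # update all weights at the same time (gradient descent step)
    W1 -= lr * g_W1; a -= lr * g_a
    w2 -= lr * g_w2; b -= lr * g_b
    # step 5: in practice, iterate until the error falls below a threshold

print("training MSE:", np.mean(e**2))
```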

The back-propagation algorithm is powerful and simple. From a practical perspective, this learning rule and its variations are flexible enough to cover highly complex interactions between predictor variables. However, it can be slow in reaching convergence and is prone to reaching local minima. In most cases a certain error threshold is used to terminate the algorithm (Hanrahan, 2011). Flexibility reaches a limit when the network architecture is not chosen optimally, or when the number of predictors is very large compared with the number of data cases available. In that case there is a drastic increase in model complexity and, hence, in the number of parameters to estimate. This is prone to occur when MLPs are used for genome-enabled prediction problems where p >> n. It also leads to very slow convergence of the back-propagation algorithm and can cause over-fitting.

For empirical control of model complexity, the distribution of the connection strengths (i.e., weights and biases) of the network can be taken as an indicator of the extent of the regularization attained; typically, weight values decrease as model complexity increases (Gianola et al., 2011). A natural and automated way of dealing with model complexity is to use penalized (typically Bayesian) methods, where a compromise between goodness of fit and network complexity is attained without the need for artificial constraints (Gianola et al., 2011; Okut et al., 2011). This leads to a subgroup of MLPs, called Bayesian regularized artificial neural networks (BRNN), first introduced by Neal (1992) and MacKay (1992a). The main difference from MLPs, where an optimal set of weights is chosen by minimizing a cost function, is that the Bayesian approach involves a prior probability distribution over the network weights. Regularization is obtained by treating the weights as random effects following a specified prior distribution, given the observed data (Lampinen and Vehtari, 2001). Gianola et al. (2011) adapted this methodology to the prediction of complex phenotypes through the use of genomic data.

Denote by ψ all connection strengths of the neural network (including hidden and output biases). Under a BRNN, estimates of the parameters of the network are obtained from the posterior density of ψ given the output data and the residual and connection-strength variance parameters (σ²_ε and σ²_ψ, respectively):

  p(ψ | y, σ²_ε, σ²_ψ) = p(y | ψ, σ²_ε) p(ψ | σ²_ψ) / p(y | σ²_ε, σ²_ψ)          (3)

where p(y | ψ, σ²_ε) is the sampling model with variance σ²_ε, p(ψ | σ²_ψ) is a prior distribution indexed by a variance σ²_ψ, and p(y | σ²_ε, σ²_ψ) is the marginal density of the data. Although this representation implies that biases and connection strengths are assigned the same prior, it is advisable to avoid regularization of the biases.

Samples from this distribution can be obtained via Markov chain Monte Carlo sampling, but this process can be computationally expensive given the large number of non-linearities that must be accounted for. Alternatives are an empirical evaluation of the Bayesian sampling via a hybrid Monte Carlo algorithm (HMC; Neal, 1996), and a procedure known as empirical Bayes, which does not use sampling but can get stuck at local modes. The latter first computes the conditional posterior mode of ψ assuming that the variance components are known, and estimates these in a second step. The first step is equivalent to minimizing the following "penalized" sum of squares:

  F(ψ) = η Σ_{i=1}^{n} e_i² + φ Σ_{i=1}^{m} ψ_i²                                 (4)

where η = 1/(2σ²_e), φ = 1/(2σ²_ψ), m is the number of connection strengths, and e_i is the difference between observed and predicted phenotypes for the fitted model. The inference on variance components in the second step is done via maximization of an approximation to the marginal likelihood of the data, p(y | σ²_ε, σ²_ψ). This is based on a Laplacian approximation that can fail badly if the number of predictors is large. Refer to Gianola et al. (2011) for further details regarding the whole estimation procedure and the tuning of the parameters η and φ.
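As a small numerical illustration of equation (4) (the helper name and the variance values below are assumed for the example, not taken from Gianola et al., 2011):

```python
import numpy as np

def penalized_sse(e, psi, var_e, var_psi):
    """Penalized sum of squares of eq. (4): F = eta*sum(e_i^2) + phi*sum(psi_i^2).

    e       : residuals (observed minus predicted phenotypes)
    psi     : vector of connection strengths (biases excluded from the penalty)
    var_e   : residual variance sigma_e^2
    var_psi : connection-strength variance sigma_psi^2
    """
    eta, phi = 1.0 / (2.0 * var_e), 1.0 / (2.0 * var_psi)
    return eta * np.sum(e**2) + phi * np.sum(psi**2)

# example: a small var_psi (strong shrinkage) penalizes large weights heavily
e = np.array([0.2, -0.1, 0.4])
psi = np.array([0.5, -0.3, 0.8, 0.1])
print(penalized_sse(e, psi, var_e=1.0, var_psi=0.1))
```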

MacKay (1992b) states that one effect of the Bayesian framework should be a reduction in the high cost of the learning process. In comparison with non-regularized ANNs, a Bayesian/MCMC approach gives an effective way of sampling relevant parts of the model space and then averaging predictions over high-probability models (Hastie et al., 2009). In theory, Bayesian model averaging results in better and more robust predictive performance.

Radial basis function neural networks

Radial basis function neural networks (RBFNNs) are another type of neural network for solving regression problems. They were developed later than MLPs and were first proposed by Broomhead and Lowe (1988) and Poggio and Girosi (1990). RBFNNs are single hidden layer, feed-forward and completely linked ANNs with an identity output function and radial basis functions (RBFs) in the hidden layer. The radial basis functions calculated in the hidden neurons are similar to the kernel functions in kernel regression (Kriesel, 2007). The most commonly employed radial basis function is the Gaussian, with form:

  z_i^[t] = exp( −h_t ||x_i − c_t||² )                                           (5)


where ||x_i − c_t|| is the Euclidean norm between the input vector x_i and a center vector c_t for neuron t. The bandwidth h_t of the RBF for neuron t needs to be determined empirically. Other norms, such as the Mahalanobis distance (Ripley, 1996), as well as other activation functions (e.g. thin-plate splines or multiquadrics; Kriesel, 2007), can also be used. In an RBFNN the connections between the input and hidden layers are not weighted, so the input layer can be seen intuitively as not participating in information processing. The output for data example i results from weighting and combining the outputs z_i^[t] of the RBFs of the hidden neurons and a bias, as in equation (2).
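A minimal sketch of equation (5), computing the Gaussian RBF scores of a hidden layer for a batch of individuals; the centers, bandwidths and array shapes are illustrative assumptions:

```python
import numpy as np

def rbf_hidden(X, C, h):
    """Gaussian RBF hidden layer, eq. (5): z_i^[t] = exp(-h_t * ||x_i - c_t||^2).

    X : (n, p) inputs (one row per individual)
    C : (s, p) center vectors c_t (one row per hidden neuron)
    h : (s,)   bandwidths h_t
    """
    # squared Euclidean distances between every input and every center
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)   # shape (n, s)
    return np.exp(-h * d2)

rng = np.random.default_rng(3)
X = rng.integers(-1, 2, size=(5, 8)).astype(float)  # 5 individuals, 8 markers
C = X[:3].copy()                                    # e.g. 3 centers taken from the data
Z = rbf_hidden(X, C, h=np.full(3, 0.1))
print(Z.shape)   # (5, 3): one RBF score per individual and hidden neuron
```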

The training of an RBFNN proceeds in two steps. First, the basis function parameters of the hidden layer (centers, bandwidths) are determined; second, the weights between the hidden and the output layer are inferred from the output of the hidden layer and the target data (Pereira and Rao, 2009). There are several ways of dealing with the centers and the bandwidth parameters of the RBFs in the hidden layer: (1) fixed selection, (2) conditional fixed selection, and (3) selection adaptive to the learning process (Kriesel, 2007). In the latter case the number of neurons in the hidden layer is not constant: a certain number of neurons, centers and bandwidths is selected a priori and then extended or reduced during learning. This method is particularly suited to, and widely used for, function approximation (Fritzke, 1994). After setting the default numbers of neurons and centers, the weights are calculated analytically using, for example, the ordinary least-squares method (Chen et al., 1991), although unsupervised learning algorithms like the k-means algorithm can be employed as well (Pereira and Rao, 2009). Once all errors of the output values have been calculated, the maximum error is sought, and the network grows by placing a new RBF neuron at this maximum error. The effect of the new neuron depends on the bandwidths of the RBFs, and the already existing RBF neurons are adjusted when the new neuron is added. This adjustment is made by moving the centers of the other neurons away from the new neuron and reducing their bandwidths slightly. The current output vector of the network is compared with the teaching input, and the weight vector is improved by means of training. In this way the selected centers maximize the decrease in the mean squared error of the RBFNN, and the algorithm stops when the number of added centers attains a desired precision, or when the number of centers equals the number of input neurons (González-Camacho et al., 2012).
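The two-step scheme can be sketched in its simplest variant, fixed selection: centers are fixed to a random subset of the training inputs, a common assumed bandwidth is used, and the hidden-to-output weights are then obtained by ordinary least squares. All names and settings below are illustrative, and the adaptive growing strategy described above is deliberately not implemented:

```python
import numpy as np

def rbf_hidden(X, C, h):
    """Gaussian RBF scores, eq. (5)."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-h * d2)

def train_rbfnn(X, y, n_centers=10, h=0.05, seed=0):
    """Two-step RBFNN training: (1) fix centers/bandwidth, (2) solve weights by OLS."""
    rng = np.random.default_rng(seed)
    # step 1: fixed selection -- centers are a random subset of the training inputs
    C = X[rng.choice(len(X), size=n_centers, replace=False)]
    Z1 = np.column_stack([np.ones(len(X)), rbf_hidden(X, C, np.full(n_centers, h))])
    # step 2: hidden-to-output weights (with a bias column) by least squares
    w, *_ = np.linalg.lstsq(Z1, y, rcond=None)
    return C, h, w

def predict_rbfnn(X, C, h, w):
    Z1 = np.column_stack([np.ones(len(X)), rbf_hidden(X, C, np.full(len(C), h))])
    return Z1 @ w

rng = np.random.default_rng(4)
X = rng.integers(-1, 2, size=(100, 20)).astype(float)
y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(0, 0.1, 100)  # non-additive toy trait
C, h, w = train_rbfnn(X, y)
print(predict_rbfnn(X[:5], C, h, w))
```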


Construction of ANNs for genome-enabled prediction

The choice of a network architecture and the setting of the specific parameters of an ANN have a crucial influence on predictive ability. This is an important model selection task, required to ensure an optimal implementation of ANNs. In many cases, the only way to tune the parameters of an ANN for a given dataset is to run several models and keep the one whose specifications give the best predictive performance. Here, general guidelines regarding specific settings for ANNs in genome-enabled prediction are provided.

Choosing the network architecture

When using ANNs, a challenge is to keep the network architecture as simple as possible to avoid capturing random error or noise (i.e. to prevent over-fitting), while allowing the network enough flexibility to describe the underlying relationship and solve the mapping problem (Geman et al., 1992).

The numbers of input and output neurons in the network are defined a priori by the data. In genome-enabled prediction, the number of input neurons is given by the number of genomic covariates included in the analyses, and the number of output neurons is commonly determined by the phenotype (e.g. a continuous or a categorical trait).

In the case of MLPs, the number of hidden layers and the number of hidden neurons per layer depend strongly on the mapping problem, and choosing them becomes a model selection task to avoid poor performance of the MLP. Most MLPs have at most two or three layers; four-layer or higher-dimensional networks are rarely employed. In ANNs for genome-enabled prediction, only two-layer MLPs have been used, with up to 10 hidden neurons tested for predictive performance (Gianola et al., 2011; Okut et al., 2011). In general, a compromise should be attained: if too many neurons are kept in the hidden layer, the network is good at memorizing the training data but worse at predicting yet-to-be observed data. On the other hand, a model with few neurons may impede a good approximation of the underlying function of the problem.

The choice of the activation function used in the hidden layer of the MLP is related to the expected underlying function of the problem. Typically, the activation function is smooth, continuous, and increases monotonically with input values, so that its derivative is always positive. The most common ones are the identity, the step or threshold, the logistic (sigmoid) and the hyperbolic tangent functions, as shown in Figure 3. The type of activation function used in the output layer is determined by the type of output; for example, if the output is continuous, a linear function is used.

Figure 3 Activation functions commonly used in artificial neural networks.
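For reference, the four activation functions named above can be written in a few lines (a sketch matching their usual definitions):

```python
import numpy as np

# common ANN activation functions (vectorized)
identity = lambda u: u
step     = lambda u: np.where(u >= 0.0, 1.0, 0.0)   # threshold at zero
sigmoid  = lambda u: 1.0 / (1.0 + np.exp(-u))       # logistic, range (0, 1)
tanh     = np.tanh                                  # hyperbolic tangent, range (-1, 1)

u = np.linspace(-3, 3, 7)
for f in (identity, step, sigmoid, tanh):
    print(np.round(f(u), 2))
```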

Setting the parameters of the back-propagation algorithm

Minimizing the cost function of the back-propagation algorithm in an MLP often leads to suboptimal performance and to a very slow convergence rate. In most cases the error function is non-convex and possesses many local minima. As a result, the final solution obtained depends on the data quality, the network architecture chosen, and the parameters of the algorithm, such as the initialization of the weights and the learning rate adopted. In practice, it is always sensible to try a number of different training configurations and to choose the solution giving the lowest error. To avoid the algorithm getting stuck in local minima, one approach is to modify the initialization of the weights in order to optimize the starting point of the algorithm using a few training iterations. This should be done by using small random values and by modifying their variances. The learning rate defines the step length of the weight corrections and controls the speed and accuracy of the learning procedure (Kriesel, 2007). In general there is no unique optimal learning rate for any ANN model, so it has to be adjusted individually for the given problem. Starting with a larger learning rate and then choosing smaller values is advisable.

Types of genomic information used as input data

Several sources of genomic information have been used as inputs for genome-enabled prediction with ANNs. If marker genotypes are used, e.g. SNPs or DArTs, an input matrix X = {x_ij} of dimension n×p consisting of all marker genotypes of all individuals is required. With the software available at present this is computationally feasible only when a small number of markers is used, because the number of parameters to estimate increases drastically with the number of inputs (Gianola et al., 2011). Up to now, most implementations of ANNs in animal breeding have used only a sub-panel of the original marker matrix, and 41,995 has been the largest number of markers used as inputs in an MLP for genome-enabled prediction (Ehret et al., to be published).

An alternative approach that reduces model complexity and computational costs is the use of genome-derived relationships among individuals, G = {g_ij} (e.g. as suggested by VanRaden, 2008), or pedigree information (e.g. the additive relationship matrix A = {a_ij} computed from the pedigree), instead of the entire marker matrix, as inputs to the network (Pérez-Rodríguez et al., 2013). This is feasible in ANNs because these are standard metrics in animal breeding, and one can think of the infinitesimal model as a regression on pedigrees, or on a genomic relationship matrix in the case of genomic BLUP (de los Campos et al., 2009; Gianola et al., 2011). In these cases, the number of inputs is equal to the number of individuals (n). Recently, Pérez-Rodríguez et al. (2013) presented software that also allows the inclusion of dominance genomic relationships as inputs to the network. Aiming at reducing the number of parameters to be estimated, but without losing much of the information contained in the entire marker matrix X, the principal component scores UD of X can also be used as inputs to the network (Tusell et al., 2013). UD is an n×n matrix obtained from the singular value decomposition of X (Mandel, 1982) as X = UDV'.

Overall, ANNs are flexible and can deal well with any kind of genomic covariates used as inputs. However, given that computational limitations still exist, dimensionality reduction is advisable when the number of inputs is very large.
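These alternative input types can be sketched as follows; the 0/1/2 genotype coding is assumed for the example, the G matrix follows the formula of VanRaden (2008), and the scores are obtained here from the singular value decomposition of the centered marker matrix:

```python
import numpy as np

rng = np.random.default_rng(5)
M = rng.integers(0, 3, size=(50, 500)).astype(float)   # genotypes coded 0/1/2

# genomic relationship matrix after VanRaden (2008): G = ZZ' / (2 * sum p_j(1-p_j))
p = M.mean(axis=0) / 2.0                 # allele frequencies
Z = M - 2.0 * p                          # centered genotypes
G = Z @ Z.T / (2.0 * np.sum(p * (1.0 - p)))

# principal component scores UD from the SVD X = UDV' (Mandel, 1982)
U, d, Vt = np.linalg.svd(Z, full_matrices=False)
UD = U * d                               # n x n score matrix; usable as network inputs

print(G.shape, UD.shape)
```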

Feature scaling

Input and output data need to be adapted to the network's peculiarities for the learning algorithm to work properly. This is known as feature scaling, and it can be achieved by normalizing the data, i.e., putting the input and output data on the same scale (e.g. ranging from -1 to 1). This ensures that all inputs are treated equally in the regularization process, an issue that may have a large influence on the final solution. Any output or input value i of a variable y = {y_i} can easily be normalized as:

  y*_i = (y_i − μ_y) / (Max_y − Min_y)                                           (6)

where μ_y is the sample mean of the variable and Max_y − Min_y is the range of y, although the standard deviation of y can be used instead of the range.

When the input is the matrix X = {x_ij} of marker genotypes for all SNP or DArT genotypes of all individuals, feature scaling can be accomplished by using a proper genotype codification. For example, assuming additive allele effects, SNP genotypes can be coded as x_ij ∈ {−1, 0, 1} for the homozygote for the minor allele, the heterozygote, and the homozygote for the other allele, respectively. Assuming dominance effects, SNP genotypes can be coded as x_ij ∈ {−1, 0} for homozygotes and heterozygotes, respectively. DArTs are binary marker genotypes that can be coded as x_ij ∈ {0, 1}. Subsequently, the codes can be centered and standardized.

Often, phenotypes are averages of several records. If a varying number of records is used to compute the phenotypes, a possible way of accounting for heterogeneous residual variance when using ANNs is to multiply both inputs and phenotypes by b_i / ( (1/n) Σ_{i=1}^{n} b_i ), where b_i stands for the number of records used to compute phenotype i (i=1,...,n). Predictions obtained with the ANN should then be converted back to the original scale.
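A brief sketch of equation (6) and of the genotype codifications described above (the helper names are illustrative):

```python
import numpy as np

def normalize(y):
    """Eq. (6): center by the sample mean and divide by the range."""
    return (y - y.mean()) / (y.max() - y.min())

def code_additive(counts):
    """Additive coding: 0/1/2 allele counts recoded to -1/0/1."""
    return counts - 1.0

def code_dominance(counts):
    """Dominance coding: heterozygote -> 0, homozygotes -> -1."""
    return np.where(counts == 1, 0.0, -1.0)

y = np.array([4.1, 5.3, 6.0, 4.8])
counts = np.array([[0, 1, 2], [2, 2, 0]])
print(normalize(y))
print(code_additive(counts))
print(code_dominance(counts))
```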

Assessment of the ANN predictive ability

The prediction ability of a neural network is assessed in terms of its ability to generalize to an independent data set. To achieve this, the assessment must be done on data observations that were not presented to the ANN during the estimation (training) of its parameters. Usually, the data are partitioned into two groups. One group, the training set, is used for designing the ANN and estimating its parameters. The remaining data, the testing set, are used only to assess prediction performance. In the training set, genotypic as well as phenotypic data are presented as examples to the network, whereas in the testing set only the genotypes are presented to the ANN to obtain the predicted phenotypes. The phenotypes in the testing set, viewed as the yet-to-be observed phenotypes, are used only as reference values in the posterior assessment of predictive ability.


It is also possible to divide the dataset into three groups: a training, a testing and a validation set. In this case, the latter is used to optimize the training by assessing, via some error measure computed in the validation set, how well the model behaves during the training process. Once the weights are estimated, predictive ability is assessed in the testing set. In order to evaluate the predictive ability of an ANN, any measure of fit appropriate for the data, model and problem at hand can be used. When the observations are continuously distributed, Pearson's correlation coefficient between the yet-to-be observed and the predicted phenotypes in the testing set is the most common one:

  r = Σ_{i=1}^{n_test} (y_true,i − ȳ_true)(y_pred,i − ȳ_pred) /
      sqrt( Σ_{i=1}^{n_test} (y_true,i − ȳ_true)² · Σ_{i=1}^{n_test} (y_pred,i − ȳ_pred)² )    (7)

where n_test is the number of records in the testing set, y_true,i and y_pred,i are the yet-to-be observed and the predicted phenotype for the i-th record, respectively, and ȳ_true and ȳ_pred are the corresponding averages. Some studies use the "accuracy" of prediction, defined as r². Another goodness-of-fit criterion used is the predictive mean squared error,

  PMSE = (1/n_test) Σ_{i=1}^{n_test} (y_true,i − y_pred,i)²                      (8)

but the root mean squared error or the median absolute deviation could also be used to summarize the errors. PMSE gives a measure of "closeness" (prediction bias and prediction variance) whereas r measures the degree of scatter. When these measurements are used in the training set, they provide empirical measurements of the degree of over-fitting of the ANN; for example, if r is close to 1 with training data this indicates over-fitting and poor out-of-sample generalization. For binary classification ANNs, each observation can be correctly or incorrectly predicted. In this case, other measurements of goodness of fit can be used, such as the misclassification error rate, the sensitivity or the specificity (Cavero et al., 2008).
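Equations (7) and (8) translate directly into code; a minimal sketch with made-up numbers:

```python
import numpy as np

def pearson_r(y_true, y_pred):
    """Pearson correlation between observed and predicted phenotypes, eq. (7)."""
    yt, yp = y_true - y_true.mean(), y_pred - y_pred.mean()
    return (yt @ yp) / np.sqrt((yt @ yt) * (yp @ yp))

def pmse(y_true, y_pred):
    """Predictive mean squared error, eq. (8)."""
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([1.2, 0.4, -0.3, 2.1, 0.9])
y_pred = np.array([1.0, 0.6, -0.1, 1.8, 1.1])
print(pearson_r(y_true, y_pred), pmse(y_true, y_pred))
```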

To obtain a stable estimate that accounts for the sampling variance originating in the creation of the training and testing sets, the predictive performance of the ANN is commonly evaluated by k-fold cross-validation. Here, the entire dataset D is randomly split into k disjoint subsets, D_1, D_2, ..., D_k, of approximately equal size, called folds. The network is then trained and tested k times: at each time l ∈ {1, 2, ..., k} the ANN is trained on D − D_l and tested on D_l. When k is equal to the number of observations in the entire dataset, this design is called "leave-one-out" cross-validation, and only one observation is left in the testing set at each fold. The overall predictive ability assessed in a k-fold cross-validation is obtained as the average predictive ability over the k testing sets (Kohavi, 1995). The advantage of k-fold cross-validation is that all observations are used for training several times, but each observation is used for validation only once.
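A minimal sketch of the k-fold design (NumPy only; the training and evaluation calls are placeholders for whichever network is being assessed):

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Randomly split n observations into k disjoint folds of near-equal size."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

n, k = 100, 5
for l, test_idx in enumerate(kfold_indices(n, k)):
    train_idx = np.setdiff1d(np.arange(n), test_idx)
    # train the ANN on D - D_l (train_idx) and evaluate on D_l (test_idx);
    # the overall predictive ability is the average over the k testing sets
    print(f"fold {l + 1}: {len(train_idx)} training, {len(test_idx)} testing records")
```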

Another approach, which yields an empirical distribution of the estimates of the predictive ability measurements, is repeated random sub-sampling validation. With this method, random partitions of the dataset into training and testing sets are generated; the model is fitted to the training data of each partition, and predictive ability is assessed using the testing data. The results are then averaged over the partitions. Some whole-genome-enabled predictions with ANNs have used this type of cross-validation design, generating 50 random partitions with 90% of the data in the training and 10% in the testing sets (González-Camacho et al., 2012; Gianola et al., 2011; Pérez et al., 2013). The advantage of this design in comparison with k-fold cross-validation is that the proportion of data in the training/testing split does not depend on the number of folds. However, some observations may never be selected for the testing sets, whereas others may be selected more than once.

Software available for implementation of ANNs in genomic prediction

Few computational implementations of ANNs are available. Common software programs for fitting ANN models include SAS (SAS Institute Inc., 2008), R (R Core Team, 2011) and MATLAB (Beale et al., 2011), and there are also some stand-alone programs such as the fbm software package of Neal (1996) for BRNNs. In SAS, the procedure NEURAL can be used to fit ANN models, but it is not included in the standard version. An alternative is MATLAB, where the function trainbr can be used to fit MLPs and BRNNs, and RBFNN models can be fitted using the newrb function (Beale et al., 2011). However, most of the available software lacks flexibility for dealing with the data and models commonly used in genome-wide enabled prediction or is not publicly available. Pérez-Rodríguez et al. (2013) developed an R package called brnn that implements BRNNs for animal and plant breeding applications. It allows different sources of information, such as additive and dominance genomic relationships as well as pedigree-based relationships, to be fitted together in the same ANN, and it provides a scaling function for data pre-processing (inputs and outputs).


Application of artificial neural networks in animal and plant breeding

The first application of ANNs in animal breeding was by Macrossan et al. (1999), who employed them as a method for optimizing mating allocation to maximize production traits in the Australian dairy industry. A linear and a Bayesian linear regression, as well as a BRNN and a non-regularized MLP model, were compared in their ability to predict first-daughter milk yield phenotypes using dam and sire production and environmental effects as input variables. All models gave similar predictive ability: the correlation between predicted and observed response variables in the testing set was 0.76 for all models. This suggested that there was little or no non-linearity in the dairy traits analyzed. Furthermore, the authors pointed out that the choice of an optimal training set size was important for dealing with the computational costs of using MLPs and BRNNs in prediction. Aiming to improve the reliability of mastitis detection in practical applications, Cavero et al. (2008) used ANNs as a classification method for mastitis detection of cows in automatic milking systems. They used several MLPs with a resilient back-propagation learning rule and tested different network architectures, including a network with two hidden layers. Days in milk, maximal electrical conductivity, relative deviation of electrical conductivity, milk yield and milk flow were used as inputs to the network. The output layer was a single neuron with a continuous output value, which was classified as positive for mastitis if the somatic cell count of the cow exceeded a certain threshold, and negative otherwise. Sensitivity, specificity and error rate were the parameters used to assess the average accuracy of the classification in the testing sets of a 5-fold cross-validation. They concluded that ANNs are generally useful tools for computerized decision support systems and recommended the use of neural networks as a first screening tool to detect groups of cows with a high susceptibility to developing mastitis. However, the performance of the ANNs was not completely satisfactory: specificity ranged from 51.1 to 74.9% and the error rate was between 51.3 and 80.5%, depending on the threshold used to diagnose mastitis and for a block sensitivity of at least 80% in the training set, possibly due to insufficient input information related to mastitis.

Models that are non-linear in either features or parameters have been proposed as an alternative to the classical linear ones to enhance genome-enabled prediction of complex traits in animal and plant breeding (Gianola et al., 2006; Gianola et al., 2011; González-Camacho et al., 2012; Long et al., 2007). Okut et al. (2011) and Gianola et al. (2011) suggested the use of a BRNN as a potential approach to account for possible non-linear relationships between genomic covariates and complex traits. Empirical results showed that the ability to predict body mass index in mice with a BRNN was low irrespective of the network architecture considered, and not different from that attained by a linear model (the correlation between predicted and true phenotype in the testing set ranged from 0.14 to 0.18). However, they concluded that, even when using only a subset of markers from all the SNPs available, BRNNs could reach a predictive performance similar to that obtained with linear regression models, and they found that all the network architectures examined were consistent in flagging SNPs associated with the trait. Gianola et al. (2011) investigated MLPs with Bayesian regularization for genome-enabled prediction of grain yield and milk production traits in wheat and cattle, respectively. The study examined different network structures with up to 6 neurons in the hidden layer and different activation functions. In order to avoid the high dimensionality of the SNP data in the cow dataset, they used as inputs to the network either the additive genomic relationship matrix among individuals (VanRaden, 2008) or the numerator additive genetic relationship matrix among individuals obtained from the pedigree. For the wheat dataset, biallelic markers (DArTs) were used as inputs. They noticed that a network with one output neuron, one hidden neuron and a linear activation function is equivalent to a linear random regression, where the weights are the regression coefficients of a standard linear approach; here, the weights of the neural network can be interpreted as marker effects. This is the only case in which an ANN can be linked to an infinitesimal model. They concluded that non-linear ANN approaches can outperform linear models in predictive ability for complex traits when using genomic information, where the number of inputs is typically much larger than the number of animals in the data sample (p >> n), especially when Bayesian regularization is used. However, as with other predictive models, their superiority depends strongly on the characteristics of the data used.

It has been pointed out that a large source of genetic variability in plant traits comes from non-additive interaction effects, e.g., dominance and epistasis (Holland, 2008). González-Camacho et al. (2012) proposed the use of RBFNN to predict traits of maize planted in different environments using genomic markers. A reproducing kernel Hilbert spaces

model (RKHS; Gianola et al., 2006) and the linear Bayesian LASSO model (Park and

Casella, 2008) were also included in the analyses. RKHS and RBFNN models showed a similar, slight superiority over the linear model in predictive ability for records simulated under gene × gene interaction, possibly because of their better ability to capture complex non-

additive interactions. Further, the authors cautioned that redundant information derived from

inclusion of a large number of markers as covariates can impair predictive ability of non-

linear models. This can be overcome by directly introducing differential weights of the SNPs


in the kernel (Long et al., 2010), using predefined SNP weights, performing a prior SNP

selection, or using marker haplotypes.

Heslot et al. (2012) compared the predictive ability of 11 standard, non-parametric and machine learning models on 8 wheat datasets, including an ANN with a back-propagation algorithm and a linear activation function in the hidden layer. The performance of the models was assessed by Pearson's correlation coefficient between the observed and predicted genomic breeding values (referred to as accuracy). In cross-validation, the average accuracy of the ANN over the datasets (0.55) was similar to that of the best performing model (RKHS), but the ANN performed poorly in terms of the variance of the estimates (mean squared error), suggesting that it produced unstable results.

Pérez-Rodríguez et al. (2012) compared linear and non-parametric regression models for

genome-enabled prediction in wheat, including BRNNs and RBFNNs approaches. Predictive

ability of the models was evaluated by the average Pearson's correlation between predicted

and yet-to-be observed output values and by the predictive mean squared error (PMSE)

obtained in 50 random partitions (90% and 10% of the records assigned to the training and

testing sets, respectively). The study indicated clearly that RKHS, RBFNN and BRNN

models had a markedly better predictive accuracy than the linear Bayesian LASSO, Bayes A

or Bayes B procedures. This may be due to the fact that in wheat, like in maize, additive ×

additive as well as genotype × environment interactions play an important role in the

expression of traits. The authors pointed out the importance of forming training and validation

sets properly, given that the degree of genetic relatedness of the individuals belonging to each

group has an influence on the accuracy of the predicted values. Similarly, Tusell et al. (2013)

compared the performance of different linear and non-linear prediction models in association

with different input covariates for predicting litter size in pigs. They used a pedigree-based

mixed-effects model, Bayesian Ridge Regression, Bayesian LASSO, Genomic BLUP, RKHS,

BRNN and RBFNN. The neural networks employed either the additive genomic relationship

matrix (G) or the principal component scores (UD) of the marker matrix as inputs. The models were compared on 3 large datasets: approximately 2,600 and 1,600 purebred sows and 1,900 crossbred sows genotyped for 60K SNPs. Predictive ability was measured by the

average correlation between realized and predicted litter size in the testing sets (10% of the

data) in a 10-fold cross-validation, and separately for each line. On average, the non-

parametric models (RKHS, BRNN, RBFNN) gave similar predictions to their parametric

counterparts. However, the predictive ability of BRNN varied drastically, giving the best

prediction when G was used in crossbreds (0.31), where non-additive genetic effects are


expected to be important, but the worst when UD was used in one of the purebred lines (0.02).

The predictive ability of RBFNN ranged from 0.16 to 0.23. The authors concluded that the

performance of ANNs was promising, but sensitive to the type of genomic information used

as input and to intrinsic characteristics of the data set, such as population structure and effective sample size, given that predictive ability was higher in the crossbred than in the purebred lines. The latter showed evidence of population substructure and, naturally, a higher degree of inbreeding.

In recent years several studies have used genotype imputation approaches aiming at

increasing predictive accuracy and performing genetic evaluation of individuals genotyped

with low density panels, together with individuals with high density SNP genotypes (Weigel

et al., 2010; Mulder et al., 2012). Under the assumption that imputed genotypes do not add information beyond that already supplied by the observed genotypes, Felipe et al. (unpublished) observed that genotype imputation would have little or no effect on the predictive accuracy of models already capable of extracting more information from the data, such as ANNs. The effect of genotype imputation on the predictive ability of two mouse traits was

evaluated using the Bayesian LASSO, RKHS and BRNN in a cross validation scheme (2/3 of

the data in the training set and 1/3 in the testing set). The BRNN could not deal well with high

imputation errors and performed poorly with the imputed markers, whereas the other two

models gave similar (and better) predictive results. Using the same dataset but more markers than Okut et al. (2011), they obtained a lower prediction performance for the same trait, indicating that BRNN performs better with markers selected on the basis of their effects than with evenly spaced ones. The use of imputation techniques, together with the increasing availability of genomic data, makes feature selection an important tool for reducing the dimensionality of whole-genome prediction approaches.

The usefulness of ANNs as prediction models for the analysis of functions of interest in animal breeding was assessed by Okut et al. (2013). To evaluate the predictive performance

of ANNs in a linear additive system, they predicted expected progeny difference of marbling

score in Angus cattle (a trait computed with BLUP under additive inheritance assumptions).

Two different kinds of ANNs were investigated: BRNN and a scaled conjugate gradient back-

propagation ANN (SCGANN; Møller, 1993; Kriesel, 2007; Okut et al., 2013), which is an

extension of the back-propagation algorithm. They also investigated linear and non-linear

activation functions and different network architectures with 1 to 4 neurons in the hidden

layer. Feature selection was examined as well by using two different datasets, one with a 3K

panel and one with 700 preselected markers. There was no evidence that more complex


networks (more neurons in the hidden layer) produced better predictions than the linear model

used as benchmark, possibly because of the additive pre-smoothing of the trait. Within ANNs,

BRNNs produced more accurate predictions than the SCGANNs. Table 1 summarizes the studies reviewed above.


Table 1 Studies that used artificial neural networks for prediction of complex traits in plant and animal breeding

| Year | Authors | Species | Input data | Target trait | Network architecture | Network model | CV | Other models analyzed |
|---|---|---|---|---|---|---|---|---|
| 1999 | Macrossan et al. | cattle | production + environmental traits | first daughter milk yield | 2-layer, 1-4, 8 neurons | BRNN, MLP with BP | no | BLR and LR |
| 2008 | Cavero et al. | cattle | milk traits | mastitis | 2-3 layer, 10, 20, 30 neurons | MLP with a resilient BP | yes | % |
| 2011 | Gianola et al. | cattle, wheat | A, G, DArT | milk traits, yield | 2-layer, 1-6 neurons | BRNN | yes | % |
| 2011 | Okut et al. | mouse | SNP | BMI | 2-layer, 1-7 neurons | BRNN | yes | % |
| 2012 | González-Camacho et al. | maize | SNP | trait-environment combinations | 2-layer | RBFNN | yes | RKHS, BL |
| 2012 | Pérez-Rodríguez et al. | wheat | DArT | trait-environment combinations | 2-layer | BRNN, RBFNN | yes | BRR, BL, BayesA, BayesB, RKHS |
| 2012 | Heslot et al. | wheat, barley, maize | DArT, SNP, SSR | different yield traits | 2-layer | MLP with BP and weight decay | yes | BL, wBSR, RF, BayesC, SVM, RKHS, elastic net, empirical Bayes |
| 2013 | Tusell et al. | pig | UD, G | litter size | 2-layer, 4 neurons | BRNN, RBFNN | yes | pedigree model, RKHS, BL |
| 2013 | Okut et al. | cattle | SNP | marbling score | 2-layer | BRNN, SCGANN | yes | BayesCπ, BayesCπ=0 |
| unpublished | Felipe et al. | mouse | SNP | BMI, body weight | 2-layer, 5 neurons | BRNN | yes | BL, RKHS |

A = numerator relationship matrix; G = genomic relationship matrix; SNP = single nucleotide polymorphism; DArT = diversity array technology markers; UD = principal component scores of marker matrix; SSR = single sequence repeats; BMI = body mass index; BP = back-propagation; RKHS = reproducing kernel Hilbert spaces regression; BLR = Bayesian linear regression; LR = linear regression; BRR = Bayesian ridge regression; BL = Bayesian LASSO; RF = random forest; SVM = support vector machines; wBSR = weighted Bayesian shrinkage regression; CV = cross-validation.


Potential benefits and pitfalls in using ANNs for genome-enabled prediction

Due to their great flexibility, ANNs have been used in a wide spectrum of different

applications including genome-enabled prediction of complex traits in animal and plant

breeding. ANNs have a great potential as predictive machineries. One of the most appealing

properties is that ANNs do not require most of the prior assumptions that commonly underlie

parametric statistical models because they can be non-linear in either features or

parameters and, if properly specified, they may be able to capture complex signals from the

data and result in a better predictive accuracy. However, the ability of ANNs to accommodate

interactions and non-linear relationships between genomic covariates and phenotypes depends

strongly on the quality of the learning process, the data structure and the network model

employed. Hence, although ANNs for genome-enabled prediction can potentially outperform

predictive accuracy of linear prediction methods, this is not always the case. Further, less computationally expensive semi-parametric methods, such as RKHS, have shown more consistent performance across different datasets (Heslot et al., 2012; Tusell et al., 2013).

The large number of parameters in a richly structured ANN often impairs its predictive power

compared with the one obtained with less parameterized prediction models for a given dataset

size. This is due to the tendency of ANNs to over-fit, that is, to copy both the signal and the noise present

in the training data. Predictive performance would be higher if the number of examples

presented to the network during training were optimal for correctly estimating connection

strengths. Unfortunately, when the dimension of the network increases markedly, as is the case in genome-enabled prediction, the number of parameters grows quickly and ANNs are

prone to over-fit the data as stated above. Therefore, the use of generalization approaches, like

Bayesian methods, or adapted algorithms like “weight decay” (Kriesel, 2007) are advisable.
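To make the weight-decay idea concrete, the following minimal sketch in R shows a training objective augmented by a penalty on the squared connection weights; the function and variable names, and the value of the decay constant, are illustrative assumptions rather than code from any of the studies reviewed.

```r
# Minimal sketch of weight decay: the training objective adds a penalty
# on the squared network weights to the usual error term, shrinking the
# weights towards zero and thereby limiting over-fitting.
# y, y_hat and w are hypothetical phenotypes, predictions and weights.
penalized_error <- function(y, y_hat, w, lambda) {
  sse     <- sum((y - y_hat)^2)  # data-fit term (training error)
  penalty <- lambda * sum(w^2)   # weight-decay penalty
  sse + penalty
}
```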

Up to now, even though computations are fast and highly parallel, ANN analyses are still limited to a certain number of inputs, which may be far below what high-dimensional genomic data demand. This is one of the reasons why, instead of using an

entire marker dataset, SNP selection, genome-derived relationships or other dimensionality

reduction techniques are used to produce ANNs with a reasonable predictive ability.

The studies reviewed show that the performance of ANNs depends strongly on the characteristics of the dataset used. The optimal architecture of an ANN is highly dependent on

intrinsic characteristics of the data, especially when the dataset is small. This is why it is

essential to use a suitable cross-validation scheme for evaluating the results of different

architectures and parameter settings, as well as to attain a sensible measure of uncertainty.


If the researcher is not merely interested in prediction but is aiming at inferring causal

relationships or interactions between genomic covariates and traits, the use of ANNs may not

be advisable. The reason is that ANNs have a 'black box' behavior, meaning that (typically)

there is no direct way to interpret the parameters learned by the network. In this sense,

practical implementation of ANN is limited to cases where no biological interpretation of the

parameters is needed, such as breeding value estimation for selection purposes or finding

association between markers and quantitative traits. Overall, ANNs have great potential predictive ability, but proper knowledge of the network requirements is needed for a successful implementation in genome-enabled prediction of complex traits. Also, prediction

performance of ANNs depends strongly on the peculiar characteristics of the data and the

problem to be solved. Hence, there are no universal recommendations available to ensure

good predictive ability of ANNs that apply across different case studies, and the optimal settings of an ANN depend specifically on each task. To a large extent, capturing the postulated

superiority of ANNs over other prediction methods strongly depends on the knowledge and

experience of the user and on the data structure obtained.


Author’s contributions

AE worked out the object of this study and drafted the manuscript. LT and AE finalized the

manuscript. DG supervised the study and edited the manuscript. GT revised the manuscript.

All authors read and approved the final manuscript.

Acknowledgements

This research was funded by the German Federal Ministry of Education and Research

(BMBF) within the AgroCluster “Synbreed – Synergistic plant and animal breeding”. The

authors thank Paulino Pérez-Rodríguez (Department of Animal Sciences, University of

Wisconsin-Madison, Madison, 53706, USA) for comments and suggestions.

References

Beale, M. H., M. T. Hagan, and H. B. Demuth. 2011. Neural Network Toolbox™ MATLAB

Broomhead, D. and D. Lowe. 1988. Multivariable functional interpolation and adaptive

networks. Complex Syst 2:321-355.

Cavero, D., K. H. Tølle, C. Henze, C. Buxadé, and J. Krieter. 2008. Mastitis detection in dairy

cows by application of neural networks. Livestock Science 114:280-286.

Chen, S., C. F. N. Cowan, and P. M. Grant. 1991. Orthogonal least squares learning algorithm

for radial basis function networks. Neural Networks, IEEE Transactions on 2:302-309.

de los Campos, G., H. Naya, D. Gianola, J. Crossa, A. Legarra, E. Manfredi, K. Weigel,

and J. M. Cotes. 2009. Predicting quantitative traits with regression models for dense

molecular markers and pedigree. Genetics 182:375-385.

de los Campos, G., J. M. Hickey, R. Pong-Wong, H. D. Daetwyler, and M. P. L. Calus. 2013.

Whole-genome regression and prediction methods applied to plant and animal

breeding. Genetics 193:327-345.

Fritzke, B. 1994. Fast learning with incremental RBF networks. Neural Processing Letters

1:2-5.

Geman, S., E. Bienenstock, and R. Doursat. 1992. Neural Networks and the Bias/Variance

Dilemma. Neural Computation 4:1-58.

Gianola, D., R. L. Fernando, and A. Stella. 2006. Genomic-assisted prediction of genetic

value with semiparametric procedures. Genetics 173:1761 - 1776.


Gianola, D., H. Okut, K. Weigel, and G. Rosa. 2011. Predicting complex quantitative traits

with Bayesian neural networks: a case study with Jersey cows and wheat. BMC

Genetics 12:87.

González-Camacho, J. M., G. los Campos, P. Pérez, D. Gianola, J. E. Cairns, G. Mahuku, R.

Babu, and J. Crossa. 2012. Genome-enabled prediction of genetic values using radial

basis function neural networks. Theoretical and Applied Genetics 125:759-771

Gurney, K. 1997. An Introduction to Neural Networks. Taylor& Francis, Inc.

Hagan, M. T., H. B. Demuth, and M. Beale. 1996. Neural network design. PWS, Boston.

Hanrahan, G. 2011. Artificial Neural Networks in Biological and Environmental Analysis.

CRC Press, Inc.

Hastie, T., R. Tibshirani, J. Friedman, and J. Franklin. 2009. The elements of statistical

learning: data mining, inference and prediction. The Mathematical Intelligencer 27:83-

85.

Holland, JB. 2008. Theoretical and biological foundations of plant breeding. In Plant

Breeding: The Arnel R. Hallauer International Symposium: Blackwell Publishing,

pages 127-140.

Hornik, K., M. Stinchcombe, and H. White. 1989. Multilayer feedforward networks are

universal approximators. Neural Networks 2:359-366.

Kohavi, R. 1995. A study of cross-validation and bootstrap for accuracy estimation and model

selection. The International Joint Conference on Artificial Intelligence (IJCAI), pp 1137-

1145

Kolmogorov, A. N. 1957. On the representations of continuous functions of many variables by superpositions of continuous functions of one variable and addition. Doklady Akademii Nauk USSR 114:953-956.

Kriesel, D. 2007. A brief introduction to neural networks. Available at http://www.dkriesel.com

Kůrková, V. 1992. Kolmogorov's theorem and multilayer neural networks. Neural Networks 5:501-506.

Lampinen, J. and A. Vehtari 2001. Bayesian approach for neural networks review and case

studies. Neural Networks 14:257 – 274


Long, N., D. Gianola, G. J. M. Rosa, K. A. Weigel, and S. Avendaño. 2007. Machine learning

classification procedure for selecting SNPs in genomic selection: application to early

mortality in broilers. Journal of Animal Breeding and Genetics 124:377-389.

Long, N., D. Gianola, G. J. M. Rosa, K. A. Weigel, A. Kranis, and O. Gonzalez-Recio. 2010.

Radial basis function regression methods for predicting quantitative traits using SNP

markers. Genetics Research 92:209-225.

MacKay, DJC. 1992a. Bayesian interpolation. Neural Computation 4:415-447.

MacKay, DJC. 1992b. A practical Bayesian framework for backpropagation networks. Neural Computation 4:448-472.

MacKay, DJC. 2008. Information theory, inference and learning algorithms. Cambridge University Press.

Macrossan, P., D. Hand, J. Kok, M. Berthold, H. Abbass, K. Mengersen, M. Towsey, and G.

Finn. 1999. Bayesian neural network learning for prediction in the Australian dairy

industry. Advances in Intelligent Data Analysis. Vol. 1642. Springer Berlin

Heidelberg. Pages 395-406

Mandel, J. 1982. Use of the singular value decomposition in regression analysis. The

American Statistician 36:15-24.

Møller, MF. 1993. A scaled conjugate gradient algorithm for fast supervised learning. Neural

Networks 6:525-533

Mulder, H. A., M. P. L. Calus, T. Druet, and C. Schrooten. 2012. Imputation of genotypes

with low-density chips and its effect on reliability of direct genomic values in Dutch

Holstein cattle. Journal of Dairy Science 95:876-889.

Neal, RM. 1992. Bayesian training of backpropagation networks by the hybrid Monte Carlo

method. Dept of Computer Science, University of Toronto, Tech Rep CRG-TR-92-1

Neal, RM. 1996. Sampling from multimodal distributions using tempered transitions.

Statistics and computing 6:353-366

Okut, H., D. Gianola, G. J. M. Rosa, and K. A. Weigel. 2011. Prediction of body mass index

in mice using dense molecular markers and a regularized neural network. Genetics

Research 93:189 - 201.

Okut, H., X.-L. Wu, G. J. M. Rosa, S. Bauck, B. W. Woodward, R. D. Schnabel, J. F. Taylor,

and D. Gianola. 2013. Predicting expected progeny difference for marbling score in


Angus cattle using artificial neural networks and Bayesian regression models.

Genetics Selection Evolution 45:34.

Park, T. and G. Casella. 2008. The Bayesian Lasso. Journal of the American Statistical

Association 103:681-686.

Pereira, B. D. B. and C. R. Rao. 2009. Data Mining using Neural Networks: A Guide for Statisticians. http://www.po.ufrj.br/basilio/publicacoes/livros/2009_datamining_using_neural_networks.pdf

Pérez-Rodríguez, P., D. Gianola, J. M. González-Camacho, J. Crossa, Y. Manès, and S.

Dreisigacker. 2012. Comparison between linear and non-parametric regression models

for genome-enabled prediction in wheat. G3: Genes|Genomes|Genetics 2:1595-1605.

Pérez-Rodríguez, P., D. Gianola, K. A. Weigel, G. J. M. Rosa, and J. Crossa. 2013. Technical

Note: An R package for fitting Bayesian regularized neural networks with applications

in animal breeding. Journal of Animal Science.

Poggio, T. and F. Girosi. 1990. Networks for approximation and learning. Proceedings of the

IEEE 78:1481-1497.

R Development Core Team. 2011. R: A Language and Environment for Statistical

Computing. R Foundation for Statistical Computing, Vienna, Austria. URL

http://www.R-project.org. R version 2.14.0.

Ripley, BD. 1996. Pattern recognition and neural networks. Cambridge University Press,

Cambridge

Rosenblatt, F. 1958. The perceptron: a probabilistic model for information storage and

organization in the brain. Psychological Review 65:386

Rumelhart, D. E., G. E. Hinton, and R. J. Williams. 1986. Learning representations by back-

propagating errors. Nature 323:533-536.

SAS Institute Inc. 2008. SAS User's Guide, Version 9.2., Cary, NC.

Tusell, L., P. Pérez-Rodríguez, S. Forni, X. L. Wu, and D. Gianola. 2013. Genome-enabled

methods for predicting litter size in pigs: a comparison. animal 7:1739–1749


VanRaden, P. M. 2008. Efficient methods to compute genomic predictions. J Dairy Sci 91:4414-4423.

Weigel, K. A., G. de los Campos, A. I. Vazquez, G. J. M. Rosa, D. Gianola, and C. P. Van

Tassell. 2010. Accuracy of direct genomic values derived from imputed single

nucleotide polymorphism genotypes in Jersey cattle. Journal of Dairy Science

93:5423-5435.

Wellmann, R. and J. Bennewitz. 2011. The contribution of dominance to the understanding of

quantitative genetic variation. Genetics Research 93:139.

Yang, J., B. Benyamin, B. P. McEvoy, S. Gordon, A. K. Henders, D. R. Nyholt, P. A.

Madden, A. C. Heath, N. G. Martin, G. W. Montgomery, M. E. Goddard, and P. M.

Visscher. 2010. Common SNPs explain a large proportion of the heritability for

human height. Nat Genet 42:565-569.


CHAPTER 3


An empirical application of neural networks with back-propagation to genome-enabled predictions of complex traits in Holstein Friesian and German Fleckvieh cattle

Anita Ehret1, David Hochstuhl2, Daniel Gianola3,4,5 and Georg Thaller1

1 Institute of Animal Breeding and Husbandry, Christian-Albrechts-University, Germany
2 Institute for Theoretical Physics and Astrophysics, Christian-Albrechts-University, Germany
3 Department of Animal Sciences, University of Wisconsin-Madison, USA
4 Department of Dairy Science, University of Wisconsin, USA
5 Department of Biostatistics and Medical Informatics, University of Wisconsin, USA

Submitted for publication in Genetics Selection Evolution

Abstract

Background: Recently, artificial neural networks (ANN) have been proposed as promising

machines for molecular marker based genomic predictions of complex traits in animal and

plant breeding. ANNs are universal approximators of complex functions, able to capture

cryptic relationships between SNP-markers and phenotypic values without the need of

explicitly defining a genetic model. This concept is attractive for high dimensional and noisy

data, especially when the genetic architecture of the trait is unknown. However, properties of

ANNs in prediction of future outcomes on real data for genomic selection are not well

characterized and, due to high computational costs, using entire marker sets is difficult. We

examined different non-linear network architectures as well as several genomic covariate

structures as network inputs in order to assess the ability to predict milk traits in three dairy cattle data sets using large-scale SNP data. For training, a regularized back-propagation

algorithm was employed. The average correlation between the observed and the predicted

phenotypes in a 20 times 5-fold cross-validation was used to assess predictive ability. A linear

network model served as benchmark.

Results: Predictive abilities of different ANN models varied markedly, whereas differences

between data sets were small. Dimension reduction methods enhanced prediction performance

in all data sets, while at the same time computational cost was reduced. For the Holstein-

Friesian bull data set, an ANN with 10 neurons in the hidden layer achieved a predictive

correlation of r = 0.47 for milk yield when the entire marker matrix was used. Predictive

ability increased when employing the genomic relationship matrix (r = 0.64) as input and

was best (r = 0.67) when principal component scores of the marker genotypes were used.

Similar results were found for the other traits in all data sets.

Conclusion: ANNs are powerful machines for non-linear genome-enabled predictions in

animal breeding. However, to produce stable and high-quality outputs, variable selection

methods are highly recommended when the number of markers vastly exceeds sample size.

Keywords: Artificial neural networks; genome-based prediction; milk traits; genomic selection


Introduction

In genome-enabled prediction of traits in animal and plant breeding, building appropriate

models can be extremely challenging, especially when the association between predictors and

target variable involves non-additive effects (de los Campos et al., 2009; Wellmann and

Bennewitz, 2009; Gianola et al., 2006; Gianola and van Kaam, 2008). Linear methods

frequently used in genome-enabled predictions typically ignore gene by gene interactions as

well as higher order non-linearities. To meet this challenge, and to take possible non-

linearities into account in prediction, there has been a growing interest in the use of semi- and

non-parametric methods (de los Campos et al., 2013). In this context, machine learning

methods and, in particular, artificial neural networks (ANNs) have been considered to be

promising predictive machineries (Gianola et al., 2011; Tusell et al., 2013; Okut et al., 2013).

Nevertheless, there have been only a few empirical applications of ANNs to genome-enabled

prediction in animal and plant breeding. In what follows, we will first give a short overview

of this method and of the state of the art in the field of animal breeding. For a more detailed

survey on the use of ANNs with respect to genome-enabled predictions, we refer to Ehret et

al. (to be published).

Initially ANNs were developed in the field of artificial intelligence and were first introduced

for image recognition. The central concept was inspired by knowledge of the nervous system,

especially the human brain with its closely connected neurons (Pereira and Rao, 2009). The

idea has been used for defining statistical models in form of neuron diagrams, as shown in

Figure 1.


Figure 1 Schematic representation of an artificial neuron (xi = input value; wj = weights linked to single input values; f(.) = activation function of the artificial neuron; z = output of the artificial neuron; Σ indicates some computation).

In an idealized artificial neuron, all received input information $x_i$ ($i$ = number of inputs, e.g. marker genotypes) is weighted via appropriate elements $w_j$ and summed up. The sum of all weighted inputs is transformed by an activation function $f(.)$ to produce the neuron output $z$ (e.g. the predicted phenotype). The activation function can be either linear or non-linear and its purpose is to restrict the amplitude of the neuron's output.
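As a minimal sketch of this computation (in R; the inputs, weights and the choice of tanh as activation function are illustrative assumptions), a single artificial neuron can be written as:

```r
# A single artificial neuron as in Figure 1: the weighted sum of the
# inputs x is passed through an activation function f to give output z.
neuron <- function(x, w, f = tanh) {
  f(sum(w * x))
}
neuron(x = c(-1, 0, 1), w = c(0.2, -0.1, 0.05))  # e.g. three marker genotypes
```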

To mimic the physical structure of the human nervous system, artificial neurons are

interconnected and organized in networks with several layers which together form the ANN.

In most applications the information flows straightforwardly, thus the output of one neuron forms the input of another one. Algebraically, an ANN can be viewed as a realization of Kolmogorov's theorem (Kolmogorov, 1957) for the representation of complex functions, by which ANNs are proven to be universal function approximators (Kůrková, 1992). Therefore, in the context of genome-enabled prediction (Meuwissen et al., 2001),

ANNs are theoretically able to account for cryptic interactions between genotypes and

phenotypes without the need of explicitly specifying a fixed genetic model (Gianola et al.,

2011), thus the genetic architecture of the trait may remain unknown a priori.

In short, an appealing property of ANNs is that they do not require most of the prior

assumptions that commonly underlie parametric statistical models. ANNs can be non-linear in

both features and parameters and, if properly specified, may capture complex signals from the

data and deliver a better predictive accuracy. This is achieved through a learning phase where,

for example, several pairs of genotype-phenotype combinations are fed into the network.


According to a specific learning rule, the ANN can learn the mapping represented by the training

samples. Learning is an iterative process, where at each iteration the weights (connections

between single artificial neurons) of the ANN are steadily adjusted, in order to minimize the

difference between observed and predicted output (e.g. phenotypes) or training error (Gurney,

1997). The process is stopped when a previously defined threshold of the error function in

training samples is reached. Hence, adjusting the weights properly to a given mapping

problem is an optimization task. The most widespread supervised learning rule for ANNs is

the back-propagation of error algorithm (Rumelhart et al., 1986). It can be seen either as a gradient descent method to locate the optimal solution

(Werbos, 1994), or as a generalization of the delta rule (Kriesel, 2007). The algorithm is based

on minimization of the error function with respect to the weights and in general employs a

least squares solution, so that the procedure can be viewed as a non-linear regression

algorithm. A trained ANN is able to predict unknown future outcomes of the same physical

process that created the training samples.

ANNs are flexible and powerful and can be implemented in various ways (de los Campos et

al., 2013). In animal breeding, ANNs were claimed to have the ability of outperforming

frequently used standard linear regression models in prediction of yet to be observed

phenotypic values through genomic data (Okut et al., 2013; Tusell et al., 2013; Gianola et al.,

2011). However, ANNs are computationally costly, especially when applied to high

dimensional genomic data, where the number of parameters to be estimated typically exceeds

the number of available samples. Therefore, at least in animal breeding, only subsets of

markers have been used to make ANNs computationally feasible (Okut et al., 2011; Okut et al.,

2013).

This study is the first application employing an entire genomic marker set as source of input

information for genome-enabled prediction of milk traits in dairy cattle. The predictive

performance of ANNs will depend on network architecture, training phase and peculiar

characteristics of the data (Gianola et al., 2011; Ehret et al., to be published). In order to

assess the importance of these factors, we applied different ANN structures to prediction of

yet to be observed phenotypic values from large scale SNP data. A linear ANN with one

neuron in the hidden layer and with the genomic relationship matrix (VanRaden, 2008)

used as input information served as a benchmark. Such an ANN produces results

approximately corresponding to those of GBLUP (Pérez-Rodríguez et al., 2013; Gianola et

al., 2011), which is a standard method for genome-enabled prediction in animal breeding.


Material and Methods

Single hidden layer feed-forward ANN with back-propagation

Research on theory and applications of artificial networks is steadily increasing. Several types

of ANNs have been extensively used for various purposes, such as classification, pattern

recognition, prediction and forecasting, process control, optimization and decision support

(Hanrahan, 2011). A frequently used type of ANN for regression and forecasting is the two-

layer feed-forward perceptron (Ehret et al., to be published), also called single hidden layer

feed-forward neural network. These are ANNs where an input layer of source nodes and an

output unit are completely linked, with only one hidden layer between them, as illustrated in

Figure 2.

Figure 2 Architecture of a single hidden layer feed-forward neural network.

xij = network input, e.g. marker genotype j of individual i; w1m = network weight from the input to the hidden layer; w2s = network weight from the hidden to the output layer; yi = network output, e.g. predicted phenotype of individual i; f(.) = activation function at the hidden neurons; g(.) = activation function at the output neuron; Σ indicates some computation.


This kind of network is able to reproduce most mathematical functions fairly well, while

keeping a simple architecture. Moreover, it has good properties when working with high-dimensional data, as is the case in genome-enabled predictions (Ehret et al., to be

published).

Mathematically, the mapping of such an ANN can be viewed as a two-step regression (Hastie et al., 2009). The central concept is to extract in the hidden layer linear combinations of the inputs as basis functions, and then to model the target as a function of these basis functions in the output layer. In terms of genome-enabled prediction, in the hidden layer the genomic covariates $x_{ij}$ (for $j = 1, \ldots, m$, where $m$ denotes the number of genomic covariates) of an individual $i$ (for $i = 1, \ldots, n$) are linearly combined with a vector of weights $w_{1j}^{[t]}$ that is specified in the training phase, plus an intercept $a_t$, in ANN terminology also called "bias", with $t = 1, \ldots, s$ denoting a neuron. The resulting linear score is then transformed using an activation function $f(.)$ to produce the output of the single hidden neuron

$$z_i^{[t]} = f\left( a_t + \sum_{j=1}^{m} w_{1j}^{[t]} x_{ij} \right). \qquad (1)$$

In order to model some non-linear relationship between phenotype and input, the hyperbolic tangent activation function ($\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$) can be employed in the hidden neurons, giving the ANN a greater flexibility than that of standard linear regression models (MacKay, 2003).

In the output layer, the $s$ genotype-derived basis functions resulting from the hidden layer are linearly combined using the weights $w_{21}, w_{22}, \ldots, w_{2s}$ and an output intercept $b$. In the output neuron, the resulting linear score is transformed, this time by use of a linear activation function $g(.)$, to calculate the predicted phenotype of individual $i$ (for $i = 1, \ldots, n$), as

$$y_i = g\left( b + \sum_{t=1}^{s} w_{2t} z_i^{[t]} \right). \qquad (2)$$

The linear activation function is often an identity function.
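To make equations (1) and (2) concrete, the sketch below implements the forward pass of such a network in R. It is an illustration only: the computations in this study were carried out with a C++ program, and all object names here are assumptions.

```r
# Forward pass of a single hidden layer feed-forward network,
# implementing equations (1) and (2) with f = tanh and g = identity.
# X: n x m matrix of genomic covariates; W1: m x s input-to-hidden
# weights; a: s hidden biases; w2: s hidden-to-output weights; b: bias.
ann_forward <- function(X, W1, a, w2, b) {
  Z <- tanh(sweep(X %*% W1, 2, a, `+`))  # hidden neuron outputs, equation (1)
  as.vector(Z %*% w2 + b)                # predicted phenotypes, equation (2)
}

# Tiny usage example with random weights (4 animals, 5 SNPs, s = 3):
set.seed(1)
X  <- matrix(sample(c(-1, 0, 1), 20, replace = TRUE), nrow = 4)
W1 <- matrix(runif(15, -0.1, 0.1), 5, 3)
y_hat <- ann_forward(X, W1, a = rep(0, 3), w2 = runif(3, -0.1, 0.1), b = 0)
```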

At the training phase, the ANN locates the optimal weights by minimization of an error

function of the training set. Here, a back-propagation algorithm was used for training. It is a

convenient and simple iterative algorithm that usually performs well, even with complex data.

Unlike other learning algorithms (like Bayesian learning) it has good computational properties

when dealing with large scale data. To enforce the network generalization ability,


regularization was used during training, since this is not naturally achieved through the learning algorithm. The algorithm was stopped either when the maximum number of 1000 iterations was reached, or when the averaged mean squared error ($aMSE$) between predicted and true phenotypes fell below a certain threshold ($aMSE \leq 10^{-3}$).

Parameters of the learning algorithm were optimally adjusted in the individual data sets in a

pre-processing step, and subsequently were held constant for all following runs. In each

training round, the weights were initialized to small random values (ranging between -0.1 and

0.1). This helps the algorithm to find the optimal solution. Beyond choosing the best training

configuration of the learning algorithm, we examined different artificial neural network

architectures to assess the best predictive ANN. Up to 20 neurons in the hidden layer were

tested for their influence on predictive quality. All ANN calculations were performed using a

C++ program written by the authors, while the data pre-processing was done with the publicly

available statistical software R (R Core Team, 2011).
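The following R sketch outlines such a regularized training loop with the stopping rules described above (at most 1000 iterations, or aMSE <= 1e-3). The plain gradient-descent update and the values of the learning rate (eta) and the weight-decay constant (lambda) are illustrative assumptions; the sketch does not reproduce the authors' C++ implementation.

```r
# Sketch of regularized back-propagation for the network of equations
# (1) and (2); eta (learning rate) and lambda (weight decay) are
# illustrative choices, not the settings used in this study.
train_ann <- function(X, y, s, eta = 0.01, lambda = 1e-4,
                      max_iter = 1000, tol = 1e-3) {
  n <- nrow(X); m <- ncol(X)
  W1 <- matrix(runif(m * s, -0.1, 0.1), m, s)  # small random starting weights
  a  <- runif(s, -0.1, 0.1)
  w2 <- runif(s, -0.1, 0.1)
  b  <- 0
  for (iter in seq_len(max_iter)) {
    Z     <- tanh(sweep(X %*% W1, 2, a, `+`))  # hidden layer, equation (1)
    y_hat <- as.vector(Z %*% w2 + b)           # output layer, equation (2)
    e     <- y_hat - y
    if (mean(e^2) <= tol) break                # early stopping on the aMSE
    d_out <- 2 * e / n                         # error signal at the output
    d_hid <- (d_out %o% w2) * (1 - Z^2)        # back-propagated error signal
    w2 <- w2 - eta * (as.vector(crossprod(Z, d_out)) + lambda * w2)
    b  <- b  - eta * sum(d_out)
    W1 <- W1 - eta * (crossprod(X, d_hid) + lambda * W1)
    a  <- a  - eta * colSums(d_hid)
  }
  list(W1 = W1, a = a, w2 = w2, b = b)
}
```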

Benchmark model

To compare the non-linear ANN models with a standard method used in animal breeding, all

data sets were also evaluated using a quasi GBLUP. Here, the input variable was the genomic

relationship matrix G which was fed into an ANN with one neuron in the hidden layer and

linear activation functions in the hidden as well as in the output layer. The network is similar

to GBLUP, in the sense that the network performs a multiple linear regression, in which the

weights of the hidden layer can be interpreted as regression coefficients. When $\mathbf{G}$ is proportional to $\mathbf{X}\mathbf{X}'$, where $\mathbf{X}$ is the incidence matrix of a linear regression model on markers, this is equivalent to ridge regression (Gianola et al., 2011; Pérez-Rodríguez et al., 2013), as is the case in GBLUP (Whittaker et al., 2000).
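A compact way to express this benchmark is a ridge regression of the phenotypes on the rows of G, as sketched below in R; the explicit shrinkage parameter lambda is an illustrative assumption, standing in for the shrinkage induced by the regularized training of the linear ANN.

```r
# Sketch of the quasi-GBLUP benchmark: a penalized multiple linear
# regression of the phenotype vector y on the rows of the genomic
# relationship matrix G; lambda is an illustrative shrinkage parameter.
ridge_on_G <- function(G, y, lambda = 0.1) {
  beta <- solve(crossprod(G) + lambda * diag(ncol(G)), crossprod(G, y))
  as.vector(G %*% beta)  # fitted (predicted) phenotypes
}
```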

Phenotype and genotype data

To evaluate the impact of data structure, three different data sets were presented separately to

the ANN. Inputs were genomic data on 3,341 German Fleckvieh bulls, 2,303 Holstein-

Friesian bulls, and 777 Holstein-Friesian dams. All animals were genotyped with a 50k SNP-

panel and, after quality control, 39,344, 41,995 and 41,718 SNP markers were used in the

analyses respectively, as shown in Table 1. Quality control included eliminating SNPs with

minor allele frequency <0.05 and missing genotype frequency >0.95. For the remaining loci,

missing genotypes were imputed using the population based imputing algorithm Minimac


(Howie et al., 2012), a computationally efficient extension of MaCH (Li et al., 2010) which

takes pre-phased haplotypes as inputs (Pausch et al., 2013).

Table 1 Data used.

| Dataset | Animals in analysis | Number of markers after quality control | Type of phenotype records |
|---|---|---|---|
| German Fleckvieh bulls | 3341 | 39344 SNPs | DYD of milk traits |
| Holstein Friesian bulls | 2303 | 41995 SNPs | DYD of milk traits |
| Holstein Friesian dams | 777 | 41718 SNPs | YD of milk traits |

SNP = single nucleotide polymorphism; YD = yield deviations; DYD = daughter yield deviations.

As phenotypic records, three milk traits (milk, fat and protein yield) were employed. For the

Holstein-Friesian and German Fleckvieh bulls, daughter yield deviations (DYD) were used as

phenotypes and, for the Holstein-Friesian cows, yield deviation (YD) was the response

variable. A summary of the phenotypes is given in Table 2.

Table 2 Summary statistics of the phenotypes.

| | Mean | Variance | Min | Max |
|---|---|---|---|---|
| German Fleckvieh bulls | | | | |
| Milk yield DYD | 1779.16 | 219257.40 | -852.48 | 3372.67 |
| Protein yield DYD | 59.34 | 214.18 | -23.56 | 108.65 |
| Fat yield DYD | 63.94 | 320.81 | -39.12 | 137.11 |
| Holstein Friesian bulls | | | | |
| Milk yield DYD | 707.44 | 434324.64 | -852.09 | 3706.01 |
| Protein yield DYD | 41.88 | 391.42 | -24.19 | 104.57 |
| Fat yield DYD | 41.14 | 645.42 | -45.81 | 139.74 |
| Holstein Friesian dams | | | | |
| Milk yield YD | 3.26 | 26.13 | -14.36 | 19.37 |
| Protein yield YD | 0.91 | 1.54 | -4.86 | 4.23 |
| Fat yield YD | 0.21 | 0.50 | -4.17 | 1.80 |

DYD = daughter yield deviations; YD = yield deviations.


Feature scaling was applied to the data sets to enhance numerical stability. This is needed, as otherwise the learning algorithm may not work properly (Hastie et al., 2009). Feature scaling ensures that all sources of information are treated equally in the training process, as it often has a large influence on the final solution. In particular, inputs and outputs must be on the same scale, i.e., ranging approximately from -1 to 1. For all phenotypes the following normalization was used

$$y_i^{*} = \frac{y_i - \mu_y}{Max_y}, \qquad (3)$$

where $\mu_y$ is the sample mean of the variable and $Max_y$ is its maximum.
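In R, this normalization can be written directly from equation (3); the sketch below is illustrative and does not reproduce the pre-processing code used in the study.

```r
# Feature scaling of the phenotypes according to equation (3):
# subtract the sample mean and divide by the maximum of the variable.
scale_phenotype <- function(y) {
  (y - mean(y)) / max(y)
}
```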

Genomic information

To test the impact of different genomic inputs on the ability of the ANN to predict yet-to-be-observed phenotypes, three different genomic covariate structures were used for all data sets. First, the raw genomic marker matrix $\mathbf{X} = \{x_{ij}\}$ of all SNPs of all individuals was employed. $\mathbf{X}$ is of dimension $n \times m$, where $n$ is the number of animals and $m$ the number of markers. Here, feature scaling was accomplished by coding SNP genotypes as -1, 0, and 1 for the homozygote for the minor allele, the heterozygote, and the homozygote for the other allele, assuming additive allele effects (Ehret et al., to be published).

Second, aiming to reduce model complexity and computational cost, genome-derived relationships among individuals were also used as inputs. The genomic relationship matrix $\mathbf{G} = \{g_{ij}\}$ was calculated following VanRaden (2008),

$$\mathbf{G} = \frac{(\mathbf{X} - \mathbf{E})(\mathbf{X} - \mathbf{E})'}{2 \sum_{j=1}^{m} q_j (1 - q_j)}. \qquad (4)$$

Here, $\mathbf{G}$ is a standardized genomic relationship matrix, where the genotype codes in $\mathbf{X} = \{x_{ij}\}$ are centered by subtracting their expected frequencies ($q_j$) at each locus. The dimension of the resulting matrix is $n \times n$.

Third, to minimize the loss of information in the original marker matrix while keeping the dimension small, the principal component scores ($\mathbf{UD}$) of $\mathbf{X}$ were used as inputs as well. $\mathbf{UD}$ is obtained from the singular value decomposition of $\mathbf{X}$ (Mandel, 1982),

$$\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}'. \qquad (5)$$

Here, $\mathbf{U}$ is an $n \times n$ orthogonal matrix whose columns are eigenvectors of $\mathbf{X}\mathbf{X}'$, $\mathbf{D}$ is a matrix of dimension $n \times m$ containing the square roots of the non-zero eigenvalues of $\mathbf{X}\mathbf{X}'$ on the diagonal, and the columns of the $m \times m$ matrix $\mathbf{V}$ are the eigenvectors of $\mathbf{X}'\mathbf{X}$ (Tusell et al., 2013).

For feature scaling, the G and UD matrices were linearly transformed using the

normalization function of the package brnn (Pérez-Rodríguez et al., 2013) in the R program

(R Core Team, 2011), so all elements of the resulting input matrices ranged between -1 and 1.
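As an illustration, the three input structures can be built in R as sketched below. Note that centering the -1/0/1 genotype codes by their expected value 2q_j - 1 is an assumption about the matrix E that is not spelled out in the text.

```r
# Construction of the three genomic input structures: the raw marker
# matrix X (coded -1/0/1), the genomic relationship matrix G of
# equation (4) (VanRaden, 2008), and the principal component scores UD
# from the singular value decomposition of X, equation (5).
make_inputs <- function(X, q) {  # q: vector of allele frequencies per locus
  E  <- matrix(2 * q - 1, nrow(X), ncol(X), byrow = TRUE)  # expected codes (assumed)
  Xc <- X - E                                              # centered genotypes
  G  <- tcrossprod(Xc) / (2 * sum(q * (1 - q)))            # equation (4)
  sv <- svd(X)
  UD <- sv$u %*% diag(sv$d)                                # equation (5) scores
  list(X = X, G = G, UD = UD)
}
```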

Model validation

For comparing predictive abilities of the ANN models in the different scenarios, a five-fold

cross validation scheme was applied and repeated 20 times (Kohavi, 1995). The data sets

were randomly divided into five subsets of genotypes and associated phenotypes. One subset

(testing set) was left out for testing the predictive ability of the model, whereas the other four

subsets were used as training samples (training set) to estimate model parameters. During

cross-validation runs, each of the five generated subsets served as testing set in one round,

with missing phenotypes. At every round, Pearson's correlation coefficient (r) between

observed and predicted phenotypes in the testing set was calculated. Since 20 different

randomizations assigning the genotype-phenotype combinations to five folds were used, this

scheme yielded 100 independent cross-validation runs. For the ANNs, different initializations of the back-propagation weights in [-0.1, 0.1] were used, to avoid the algorithm repeatedly getting stuck in a local minimum. The predictive ability of each model reported here was

Pearson's correlation between observed and predicted phenotype values averaged over all 100

individual cross-validation runs.
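The validation scheme can be sketched in R as follows; fit_fun and predict_fun are hypothetical placeholders for model training and prediction, and the random fold assignment mirrors the 20 randomizations described above.

```r
# Sketch of the 20 times 5-fold cross-validation used to measure
# predictive ability as the average Pearson correlation between
# observed and predicted phenotypes in the testing sets.
cv_correlation <- function(X, y, fit_fun, predict_fun, k = 5, repeats = 20) {
  r <- numeric(0)
  for (rep in seq_len(repeats)) {
    fold <- sample(rep_len(seq_len(k), nrow(X)))  # random fold assignment
    for (f in seq_len(k)) {
      test  <- fold == f
      model <- fit_fun(X[!test, , drop = FALSE], y[!test])
      y_hat <- predict_fun(model, X[test, , drop = FALSE])
      r     <- c(r, cor(y[test], y_hat))          # Pearson's r for this run
    }
  }
  mean(r)  # average over all (repeats x k = 100) cross-validation runs
}
```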

Results and Discussion

Figure 3 presents the combined results for all prediction scenarios tested. It includes results

from different data sets in columns, with milk yield, protein yield and fat yield as response

variable in the rows. The single panels (a-h) show the dependency of the average Pearson's

correlation coefficients of cross-validation runs on network architecture (1 to 20 neurons in

the hidden layer) for different genomic covariate structures ( X , G , UD ) used as input to the

ANNs.


Figure 3 Comparison of predictive abilities for all scenarios.

Different data sets are in the columns, in rows milk, protein and fat yield are shown. Panels (a-h) show

the average Pearson’s correlation coefficients over cross-validation runs on the vertical axis, and the

number of hidden neurons tested on the horizontal axis. Results of different genomic covariate

structures used as inputs (X, G, UD) are presented in each panel.

Predictive ability of different ANN architectures and input factors

The ANN models differed little in predictive performance in terms of number of neurons in

the hidden layer when either the G matrix or the UD matrix was used. This was observed in

all data sets for all traits, as shown in Figure 3. Results are consistent with those of Okut et al.

(2011), who showed that predictive ability of ANN models did not depend on network

architecture when sample size was larger than the number of markers used in the analyses.

Our results agreed when the number of features was equal to sample size (G and UD).

Results obtained using the G matrix as input to the network are consistent with those of


Gianola et al. (2011), who employed Bayesian regularized artificial networks to genome-

enabled predictions of milk traits in Jersey cows. Even with much larger data sets and a different training algorithm, we also found that increasing the number of neurons up to 6 yielded slightly better predictions of yet-to-be-observed values than very simple architectures (1 or 2 neurons in the hidden layer), when the G matrix was used as input to the network.

Furthermore, predictive abilities of the ANN models in the bull data sets were slightly better

when using the UD matrix than when using G as inputs to the network. Differences between

these two inputs in the cow data set were negligible. The same result was obtained by Tusell

et al. (2013) when radial basis function networks were used to predict litter size in pigs. This

might be due to the fact that the UD decomposition re-expresses the data taking its variance

into account. This mapping maybe is more accurate and numerically stable for predictions

than using G .

However, when the marker matrix X was fed into the ANNs, so that the number of features vastly exceeded sample size, Pearson's correlation coefficients in the cross-validation runs were strongly dependent on the number of neurons used in the hidden layer of the network. A

similar pattern was obtained in all data sets and traits. Averaged over all three data sets, the

correlations achieved differed by 0.09 across different network architectures, when the X

matrix was used. The maximum range was achieved in the Holstein-Friesian dams and protein

yield (Figure 3h) in which r ranged between [0.13, 0.29] across different architectures. The

minimum range of 0.07 (r = [0.41, 0.48]) across different number of neurons in the hidden

layer of the ANN was obtained in the German Fleckvieh data set for milk and protein yield

(Figure 3d,g).

The results indicate that when using X, so that the number of features is larger than sample size, a relatively simple ANN architecture (2-6 neurons in the hidden layer) is not able to learn the specifics of the data, whereas with complex architectures (over 15 neurons in the hidden layer) the ANN will learn irrelevant details of the data. These phenomena are called under- and over-fitting, respectively. Both types of architectures make prediction of future

target values worse. In such situations less parameterized models, like linear ones, might be

advantageous. Furthermore, prediction performance was substantially worse when using the

whole marker set as input into the network. This is probably due to over-parameterization of

the model, as the number of effective parameters grows quickly with the number of inputs.

The effect was independent of the data set used, and using X as input increased computational cost

markedly. In the Holstein-Friesian dams the pattern described was slightly different, notably


when protein and fat yield were predicted (Figure 3f,i). This might be due to the lower sample

size in this data set. Since the ratio between number of markers and sample size was large,

network complexity grew quickly when the number of neurons increased. Further, in all runs

learning was stopped when the maximum number of iterations was reached, instead of

reaching the optimal aMSE in the training configuration.

Furthermore, when the X matrix was used, the results showed that even an architecture with one neuron in the hidden layer produced good predictions, as it is approximately a multiple

linear regression on marker genotypes. These results confirm the strength of linear methods in

p>>n problems.

Predictive ability across different data sets and traits

The highest average correlations between observed and predicted phenotypes in the testing

sets were obtained with the German Fleckvieh data set, followed by the prediction of DYD in

the Holstein-Friesian bull data set. Prediction of future YD in Holstein-Friesian dams was the

worst. Using YD yielded lower correlation coefficients and the ANNs failed to give accurate

predictions of measured YD values. This is simply because DYD are always based on much

more information than YD on cows. However, this might also be influenced by the lower

number of animals in the cow data set.

Within data sets the predictive ability of the models varied only slightly between traits (Figure

3). This might reflect strong additive action for the traits examined, and the fact that the linear

pre-corrections used for processing the phenotype tend to normalize the distributions. Thus

the traits behaved similarly, as expected, because phenotypic and marker-based genetic

correlations between traits were high, as shown in Table 3.


Table 3 Phenotypic and marker-based genetic correlations between traits within data sets; the diagonal of each panel shows the marker-based heritability, the upper off-diagonal the marker-based genetic correlations and the lower off-diagonal the phenotypic correlations.

German Fleckvieh bulls

| | Milk yield | Protein yield | Fat yield |
|---|---|---|---|
| Milk yield DYD | 0.87 (0.04) | 0.58 (0.03) | 0.73 (0.04) |
| Protein yield DYD | 0.70 (0.01) | 0.79 (0.04) | 0.62 (0.03) |
| Fat yield DYD | 0.89 (0.01) | 0.81 (0.01) | 0.77 (0.04) |

Holstein Friesian bulls

| | Milk yield | Protein yield | Fat yield |
|---|---|---|---|
| Milk yield DYD | 0.67 (0.05) | 0.24 (0.04) | 0.52 (0.04) |
| Protein yield DYD | 0.43 (0.02) | 0.82 (0.05) | 0.42 (0.04) |
| Fat yield DYD | 0.86 (0.01) | 0.63 (0.01) | 0.60 (0.04) |

Holstein Friesian dams

| | Milk yield | Protein yield | Fat yield |
|---|---|---|---|
| Milk yield YD | 0.61 (0.08) | 0.24 (0.06) | 0.51 (0.08) |
| Protein yield YD | 0.48 (0.03) | 0.67 (0.08) | 0.31 (0.07) |
| Fat yield YD | 0.92 (0.01) | 0.60 (0.02) | 0.51 (0.08) |

Standard errors (SE) are shown in brackets; DYD = daughter yield deviation; YD = yield deviation.

Predictive ability compared to a standard linear model

Additionally, we investigated a linear ANN, which performs as a quasi GBLUP, as a

benchmark model. As shown in Table 4, more complex non-linear architectures could not

outperform the linear ANN, i.e. a multiple linear regression on marker relationships.


Table 4 Model comparison of linear and non-linear models with the same architecture and input information (1 neuron in the hidden layer and the G matrix as input to the network), and the best non-linear ANN.

| | linear ANN (r) | non-linear ANN (r) | best non-linear ANN (r) |
|---|---|---|---|
| German Fleckvieh bulls | | | |
| Milk yield DYD | 0.68 (0.0007) | 0.52 (0.0016) | 0.68 (0.0008) |
| Protein yield DYD | 0.68 (0.0006) | 0.53 (0.0011) | 0.67 (0.0005) |
| Fat yield DYD | 0.66 (0.0005) | 0.56 (0.0008) | 0.65 (0.0005) |
| Holstein Friesian bulls | | | |
| Milk yield DYD | 0.60 (0.0006) | 0.53 (0.0011) | 0.58 (0.0008) |
| Protein yield DYD | 0.59 (0.0009) | 0.50 (0.0013) | 0.57 (0.0009) |
| Fat yield DYD | 0.57 (0.0009) | 0.51 (0.0010) | 0.56 (0.0009) |
| Holstein Friesian dams | | | |
| Milk yield YD | 0.47 (0.0031) | 0.44 (0.0040) | 0.47 (0.0027) |
| Protein yield YD | 0.37 (0.0033) | 0.35 (0.0039) | 0.35 (0.0032) |
| Fat yield YD | 0.46 (0.0037) | 0.39 (0.0049) | 0.47 (0.0028) |

DYD = daughter yield deviation; YD = yield deviation; r = average Pearson correlation coefficient over the cross-validation runs; variances are shown in brackets.

Differences between linear ANN and best predictive ANN were very small. Nevertheless, the

linear ANN was superior to a non-linear ANN with the same architecture, even though they

were fed with the same input information (G). This pattern was consistent over all data sets, independently of the trait investigated, and pronounced when no dimension reduction of the input to the network was made. Overall, the results indicate that linear methods are reliable when working with large-scale data, and provide results as good as those of the much more computationally

intensive non-linear ANNs when milk traits are used as response variable in genome-enabled

prediction.

Conclusions

We used several ANN models for genome-enabled prediction using large scale SNP-panels

and investigated the influence of various inputs and architectures with milk traits in different

dairy cattle data sets. The aim was to assess the impact of data structure, type of genomic

information used as input to the network, and network architecture on the predictive

performance.

Our results indicated that dimension reduction yielded more accurate and more consistent predictions of future phenotypes, irrespective of trait and data set used. Thus, we

recommend feature selection methods and regularization in the training phase of an ANN (e.g.

weight decay (Kriesel, 2007)) for genome enabled-predictions on large SNP-panels. In this


context, the large number of parameters in a richly structured ANN impairs its predictive

power, and our results confirm the robustness of linear methods. However, we wish to

underline the potential of ANNs for mapping non-linear relationships between genotype and

phenotype. In particular this method is expected to show its strength in traits were evidence of

non-linearity is present, and where less parameterized linear methods fail. Perhaps ANNs are

maybe more useful for functional traits than for milk traits, which seem to behave additively

and can be predicted well with linear methods. Nevertheless, back-propagation with early

stopping is a useful learning algorithm for ANNs for genome-enabled predictions from large

scale SNP information, in the sense that a regularized back-propagation learning algorithm

keeps computational cost as low as possible, while maintaining good predictive performance,

when feature selection is used.


Author’s contributions

AE conceived the study, performed all computations and drafted the manuscript. DH provided the computation program and supervised the computations. DG helped conceive the study and provided critical insights, and GT supervised the study and revised the manuscript.

Acknowledgements

This research was supported by the German Federal Ministry of Education and Research

(BMBF) within the AgroClustEr “Synbreed – Synergistic plant and animal breeding”. The

authors thank Claas Heuer (Institute of Animal Breeding and Husbandry, Christian-Albrechts-

University, Olshausenstr. 40, 24098 Kiel, Germany) for the comments and suggestions to the

manuscript.

References

de los Campos, G., H. Naya, D. Gianola, J. Crossa, A. Legarra, E. Manfredi, K. Weigel,

and J. M. Cotes. 2009. Predicting quantitative traits with regression models for dense

molecular markers and pedigree. Genetics 182:375-385.

de los Campos, G., J. M. Hickey, R. Pong-Wong, H. D. Daetwyler, and M. P. L. Calus. 2013.

Whole-genome regression and prediction methods applied to plant and animal

breeding. Genetics 193:327-345.

Gianola, D., R. L. Fernando, and A. Stella. 2006. Genomic-assisted prediction of genetic

value with semiparametric procedures. Genetics 173:1761 - 1776.

Gianola, D. and J. van Kaam. 2008. Reproducing kernel hilbert spaces regression methods for

genomic assisted prediction of quantitative traits. Genetics 178:2289 - 2303.

Gianola, D., H. Okut, K. Weigel, and G. Rosa. 2011. Predicting complex quantitative traits

with Bayesian neural networks: a case study with Jersey cows and wheat. BMC

Genetics 12:87.

Gurney, K. 1997. An Introduction to Neural Networks. Taylor & Francis, Inc.

Hanrahan, G. 2011. Artificial neural networks in biological and environmental analysis. CRC

Press

Hastie, T., R. Tibshirani, J. Friedman, and J. Franklin. 2009. The elements of statistical

learning: data mining, inference and prediction. The Mathematical Intelligencer 27:83-

85.


Howie, B., C. Fuchsberger, M. Stephens, J. Marchini, and G. R. Abecasis. 2012. Fast and

accurate genotype imputation in genome-wide association studies through pre-

phasing. Nat Genet 44:955-959.

Kohavi, R. 1995. A study of cross-validation and bootstrap for accuracy estimation and model

selection. The International Joint Conference on Artificial Intelligence (IJCAI), pp 1137-

1145

Kolmogorov, A.N. 1957. On the representation of continuous functions of many variables by

superposition of continuous functions of one variable and addition 114:953–956

Kriesel, D. 2007. A brief introduction to neural networks. Available at

http://www.dkriesel.com

Kurková, V. 1992. Kolmogorov’s theorem and multilayer neural networks. Neural Networks

5:501–506 (1992)

MacKay, D.J. 2003. Information theory, inference and learning algorithms. Cambridge

university press (2003)

Mandel, J. 1982. Use of the singular value decomposition in regression analysis. The

American Statistician 36:15–24

Meuwissen, T. H. E., B. J. Hayes, and M. E. Goddard. 2001. Prediction of total genetic value

using genome-wide dense marker maps. Genetics 157:1819 - 1829.

Okut, H., D. Gianola, G. J. M. Rosa, and K. A. Weigel. 2011. Prediction of body mass index

in mice using dense molecular markers and a regularized neural network. Genetics

Research 93:189 - 201.

Okut, H., X.-L. Wu, G. J. M. Rosa, S. Bauck, B. W. Woodward, R. D. Schnabel, J. F. Taylor,

and D. Gianola. 2013. Predicting expected progeny difference for marbling score in

Angus cattle using artificial neural networks and Bayesian regression models.

Genetics Selection Evolution 45:34.

Pausch, H., B. Aigner, R. Emmerling, C. Edel, K.-U. Gotz, and R. Fries. 2013. Imputation of

high-density genotypes in the Fleckvieh cattle population. Genetics Selection

Evolution 45:3.

Pérez-Rodríguez, P., D. Gianola, K. A. Weigel, G. J. M. Rosa, and J. Crossa. 2013. Technical

Note: An R package for fitting Bayesian regularized neural networks with applications

in animal breeding. Journal of Animal Science 91:3522-3531.


Rumelhart, D. E., G. E. Hinton, and R. J. Williams. 1986. Learning representations by back-

propagating errors. Nature 323:533-536.

R Development Core Team. 2011. R: A Language and Environment for Statistical

Computing. R Foundation for Statistical Computing, Vienna, Austria. URL

http://www.R-project.org. R version 2.14.0.

Tusell, L., P. Pérez-Rodríguez, S. Forni, X. L. Wu, and D. Gianola. 2013. Genome-enabled

methods for predicting litter size in pigs: a comparison. Animal 7:1739-1749.

VanRaden, P. 2008. Efficient methods to compute genomic predictions. Journal of Dairy

Science 91:4414-4423.

Wellmann, R. and J. Bennewitz. 2011. The contribution of dominance to the understanding of

quantitative genetic variation. Genetics Research 93:139-154.

Werbos, P.J. 1994. The Roots of Backpropagation: from Ordered Derivatives to Neural

Networks and Political Forecasting vol. 1. John Wiley & Sons

Whittaker, J. C., R. Thompson, and M. C. Denham. 2000. Marker-assisted selection using

ridge regression. Genetics Research 75:249-252.



CHAPTER 4


Use of genomic and metabolic information as well as milk performance records for prediction of subclinical ketosis risk via

artificial neural networks

A. Ehret1, D. Hochstuhl2, N. Krattenmacher1, J. Tetens1, M. S. Klein3, W. Gronwald4, and G. Thaller1

1Institute of Animal Breeding and Husbandry, Christian-Albrechts-University, Germany

2Institute for Theoretical Physics and Astrophysics, Christian-Albrechts-University, Germany

3Faculty of Kinesiology, University of Calgary, Alberta, Canada

4Institute of Functional Genomics, University of Regensburg, Germany

Submitted for publication in Journal of Dairy Science

Abstract

Subclinical ketosis is one of the most prevalent metabolic disorders in high producing dairy

cows during early lactation. This renders its early detection and prevention important for both

economical and animal welfare reasons. Construction of reliable predictive models is

challenging, because traits like ketosis are commonly affected by multiple factors. In this

context, machine learning methods offer great advantages due to their universal learning

ability and flexibility in integrating various sorts of data. Here, an artificial neural network

approach was applied to investigate the utility of metabolic, genetic and milk performance

data, alone and in various combinations, for the prediction of milk levels of BHBA, an

established marker of subclinical ketosis risk, within and across consecutive weeks

postpartum. Data were collected from 218 dairy cows during their first five weeks in milk. All

animals were genotyped with a 50k SNP-panel and weekly information on the concentrations

of the milk metabolites glycerophosphocholine (GPC) and phosphocholine (PC) as well as

milk composition data (milk yield, fat and protein percentage) was available. The

concentration of beta-hydroxybutyric acid in milk was used as the target variable in all

prediction models. Average correlations between observed and predicted target values of up to

0.643 could be obtained when milk metabolite and routine milk recording data were combined

for prediction at the same day within weeks. The predictive performance of metabolic as well as

milk performance based models was higher than that of models based on genetic information. In

summary, performance of artificial neural networks is encouraging for predicting lowly

heritable traits, such as ketosis.

Keywords: Artificial neural network, prediction, ketosis, milk metabolite


Introduction

In dairy cattle, breeding for an increased milk yield has led to a pronounced energy deficit

postpartum (Veerkamp and Koenen, 1999). Cows experiencing a strong negative energy

balance in early lactation are more prone to metabolic disorders (Collard et al., 2000) and

related diseases. Ketosis or acetonaemia is one of the most prevalent metabolic disorders in

high producing dairy cows showing the highest incidence within the first weeks of lactation

(Duffield et al., 2009; Asl et al., 2011). While clinical cases are comparatively rare,

prevalences of up to 28.3% have been reported for subclinical ketosis in European countries

(Suthar et al., 2013). Subclinical stages of ketosis are associated with decreased milk

production and an increased risk of displaced abomasum, lameness and metritis (Dohoo and

Martin, 1984). Detection of the preclinical disease stages would allow for a timely treatment

of affected animals, which is beneficial from both an economic and an animal welfare

point of view (Asl et al., 2011). Subclinical ketosis (SCK) is characterized by an increased

production of ketone bodies, such as BHBA, acetone and acetoacetate in the absence of

clinical signs of ketosis (Andersson, 1988). Ketone body concentrations can be readily

measured in both blood and milk (Enjalbert et al., 2001), and BHBA levels are well-

established indicators of early stage subclinical ketosis (Dohoo and Martin, 1984; Duffield et

al., 1997; Suthar et al., 2013). A serum BHBA concentration of ≥ 1200-1400 µmol/L is

typically defined as SCK (Geishauser et al., 2000; LeBlanc, 2010). However, even cows with

serum BHBA concentrations exceeding 1,000 µmol/L in the first week of lactation are already

almost 14 times more likely to develop displaced abomasum (Seifi et al., 2011).

Concentrations of BHBA have been reported to be approximately ten times smaller in milk as

compared to plasma (Klein et al., 2013) and equivalently smaller threshold concentrations

have to be applied in ketosis detection (van der Drift et al., 2012a).

Exceeding a certain BHBA threshold indicates an already existing ketotic state, but in order to

reach the goal of ultimately preventing SCK, an early forecast of individual ketosis risk is

required. However, prediction of functional traits which are commonly affected by multiple

factors is extremely challenging. Typically, those traits are characterized by a low heritability

and phenotypic measurements are often costly and not part of routine recording. Thus, it is

reasonable to exploit new sources of information for the detection of preketotic metabolic

conditions. The rapidly evolving field of metabolomics is promising in terms of enabling a

direct readout of biological processes involved in disease progress. The consequences are

twofold: on the one hand decreasing costs of metabolic measurements allow obtaining an

increasing amount of valuable information on the metabolic status of individual animals. On


the other hand, this raises profound questions on how to utilize this sort of information for

disease prevention. Recently, Melzer et al. (2013) proposed the integration of metabolite

profiles in SNP-based prediction of milk traits. However, aiming at a more precise detection

of evolving ketosis, it is of high interest for breeders as well as for developers of processes

and control strategies in dairy production to find ways to integrate multiple sources of

information (e.g. performance records, metabolic and genomic information). For such

purposes, machine learning methods like artificial neural networks (ANN) are promising

tools. They are able to learn specific data patterns, and ANNs in particular are considered

universal approximators of a wide class of functions (Hornik et al., 1989). Therefore,

they have gained considerable popularity in predicting and forecasting complex systems,

including various applications in animal and plant sciences (Gianola et al., 2011). ANNs are

able to capture cryptic interactions between various predictors and response variables without a

need to explicitly specify a fixed model. For the prediction of ketosis risk this approach

allows for the simultaneous incorporation of different types of information to enhance

prediction accuracy and performance.

A recent study on potential predictor variables for ketosis showed that dairy cows with ratios

of milk glycerophosphocholine (GPC) to milk phosphocholine (PC) equal or greater than 2.5

are significantly less prone to ketosis within and over subsequent lactations (Klein et al.,

2012). Raised GPC/PC ratios and GPC concentrations in milk are believed to reflect an

animal’s ability to mobilize fatty acids from the breakdown of blood phosphatidylcholines

rather than from body fat as a source for milk lipid synthesis (Klein et al. 2012). Additionally,

genome-wide association studies revealed marker genotypes associated with ketosis linked

factors (Buitenhuis et al., 2013; Tetens et al., 2013; Gaddis et al., 2014), underscoring the

potential contribution of genetic factors to disease development. Furthermore, changes in milk

composition are an indirect indicator of the health status of a cow (e.g. Hamann and Krömker

1997). The close relationship between metabolic disorders and energy balance allows for the

utilization of energy balance traits (like fat to protein ratio) in the prediction of disease risk

(Buttchereit et al., 2010). Performance records for milk yield and milk composition like fat

and protein percentage, protein to fat ratio, as well as dry matter intake (DMI) are promising

indicators for the susceptibility to metabolic disorders (Duffield et al., 1997).

The main objective of this study was to systematically assess the applicability of different

sources of information in predicting subclinical ketosis in early lactation. Prediction was

accomplished via ANNs for the same day within one week and for consecutive weeks

(forecast). The concentration of BHBA in milk is a direct measure of the ketotic state of a


cow and was used as the target variable. Several combinations of metabolic and genetic

information as well as performance records of cows from the first weeks of lactation were

used alone and in various combinations as predictor variables and their influence on

prediction accuracy was investigated.

Material and Methods

Data collection

Genomic and metabolic data as well as information on milk yield and milk composition of

218 dairy cows in early lactation were available. Data were collected on a weekly basis

between October 2008 and June 2010 on the dairy research farm Karkendamm, which is

affiliated to the Institute of Animal Breeding and Husbandry of Kiel University, Germany.

Milk samples were collected weekly during the first five weeks in milk. Eighteen animals had

full records for five consecutive weeks and 41 animals had complete records for four

consecutive weeks, but the vast majority of cows (N=159) had consecutive records for only

three or fewer weeks. While this precluded the analysis of time series for predicting SCK risk,

611 weekly records of 218 cows were available for the prediction within and across

consecutive weeks as shown in Table 1.

Table 1. Structure of data sets used for the prediction of milk BHBA within and across the

first five weeks in milk.

Lactation week,              Lactation week,            No. of animals with
predictors1 were recorded in BHBA was predicted for     complete records

Within weeks
1                            1                           60
2                            2                          121
3                            3                          149
4                            4                          142
5                            5                          119

Across weeks (forecast)
1                            2                           34
2                            3                           85
3                            4                          101
4                            5                           83

1Milk concentrations of glycerophosphocholine and phosphocholine, milk performance records


The milk levels of GPC and PC as well as their ratio were used as predictor variables, while

the concentration of BHBA in milk (see Figure 1) served as the target variable.

Figure 1 Boxplots of BHBA concentrations in milk samples across the first five weeks in milk.

The three variables had been measured by nuclear magnetic resonance spectroscopy as

described previously (Klein et al., 2012). Additionally, weekly milk performance records of

milk yield and protein and fat percentage were available for all cows.

All animals were genotyped using the Illumina BovineSNP50 BeadChip® (Illumina Inc., San

Diego, CA, USA) comprising a total number of 54,001 single nucleotide polymorphisms

(SNPs). SNPs with a minor allele frequency < 0.05 and missing genotype frequency > 0.05

were discarded from the analyses, resulting in a final set of 41,423 markers. Sporadically

missing genotypes were imputed using the population-based imputation algorithm MaCH (Li

et al., 2010) and its extension Minimac (Howie et al., 2011).


Data pre-processing

Aiming at a reduction of model complexity and computational cost, genomic information was

incorporated in the prediction using the principal component scores, referred to as UD matrix,

of the original genotype matrix X. The UD matrix is of dimension n x n with n being the

number of animals and is obtained from the singular value decomposition of the original

genotype matrix X (Mandel, 1982):

$$X = UDV'.$$

Here, U is an orthogonal matrix whose columns are the eigenvectors of XX', D is a diagonal

matrix containing the square roots of the nonzero eigenvalues of X'X, and the columns of V are the

eigenvectors of X'X (Tusell et al., 2013). The UD decomposition re-expresses the data

taking its variance into account.

Additionally, feature scaling was performed for all metabolites, metabolite ratios and milk

performance records, as well as for the obtained principal component score matrix of

genotypes (UD), to incorporate variables measured on different scales into the

model. Basically, feature scaling is a normalization process in which variables are linearly

transformed. In order to scale all elements to a range between -1 and 1, the

normalization function of Pérez-Rodríguez et al. (2013) was used.
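As an illustration of this pre-processing, the following R sketch computes the UD matrix of a simulated genotype matrix via svd() and then applies a linear scaling of each column to [-1, 1]; the toy dimensions and the scale_11() helper are illustrative assumptions, not the study's actual code.

# Sketch (simulated genotypes): principal component scores (UD) via singular
# value decomposition, then linear scaling of each column to [-1, 1].
set.seed(1)
n <- 100; p <- 1000                        # animals x markers (toy dimensions)
X <- matrix(sample(0:2, n * p, replace = TRUE), n, p)

s  <- svd(X)                               # X = U D V'
UD <- s$u %*% diag(s$d)                    # n x n matrix of PC scores

scale_11 <- function(z) 2 * (z - min(z)) / (max(z) - min(z)) - 1
UD_scaled <- apply(UD, 2, scale_11)        # assumes non-constant columns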

Artificial neural networks

For prediction, a multilayer feed-forward artificial neural network (ANN) was used. ANNs

belong to a wide range of machine learning methods for data mining and were first applied

to image recognition. They are very flexible prediction machines

allowing for the incorporation of several types of data. Given a learning rule, an

ANN can be trained to predict unknown outcomes of the same task as the training samples.

For further information, in particular on applications in animal and plant science, we

refer to Gianola et al. (2011).

We implemented a single hidden layer feed-forward network with five neurons in the hidden

layer and a learning rate equal to 0.02. For further information regarding fitting ANNs we

refer to Kriesel (2007).


Further, we applied the non-linear hyperbolic tangent activation function to the hidden neurons,

aiming at giving the network a greater flexibility for capturing the relation between different

combinations of information sources and milk BHBA concentrations (in the following also

referred to as SCK risk).

In the output layer one neuron with a linear activation was used. For the training process a

regularized back-propagation algorithm was used, which corresponds to a non-linear least

squares optimization. All models were run using a C++ program written by the authors,

whereas data pre-processing was done in R (Version 2.14.1, R Core Team 2011).
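For concreteness, the sketch below shows the forward pass of such a network in R, with five tanh hidden neurons and one linear output neuron; the weights here are random placeholders rather than trained values, since the actual training was done with a regularized back-propagation algorithm in a C++ program.

# Sketch of the forward pass of a single-hidden-layer network with five tanh
# hidden neurons and one linear output neuron. Weights are placeholders only.
forward <- function(x, W1, b1, w2, b2) {
  h <- tanh(W1 %*% x + b1)                 # hidden layer: non-linear activation
  as.numeric(crossprod(w2, h) + b2)        # output layer: linear activation
}

set.seed(2)
p  <- 8                                    # number of scaled input features
W1 <- matrix(rnorm(5 * p), 5, p); b1 <- rnorm(5)
w2 <- rnorm(5);                   b2 <- rnorm(1)
forward(rnorm(p), W1, b1, w2, b2)          # prediction for one animal (scaled)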

Prediction models

We evaluated the predictive performance of several models for SCK risk. In detail, we

investigated different combinations of information to assess their impact on prediction ability

of BHBA concentrations within and across consecutive weeks. An overview of all chosen

prediction models is presented in Table 2.

Table 2. Schematic overview of selected combinations of information sources for the

different prediction models

                GPC/PC1    GPC2+PC3    MPR4    UD5

Within weeks
GPC/PC1            x                            x
GPC2+PC3                      x          x      x
MPR4                          x          x      x
UD5                           x          x      x

Across weeks
GPC/PC1
GPC2+PC3                      x          x      x
MPR4                          x          x      x
UD5                           x          x      x

1GPC/PC = ratio of glycerophosphocholine to phosphocholine, 2GPC = glycerophosphocholine, 3PC = phosphocholine, 4MPR = milk performance records (milk yield, fat and protein percentage), 5UD = principal component scores of genotypes, p.p. = postpartum


Model validation

Taking into account the small sample size, we applied a five-fold cross validation with 100

individual repetitions (Kohavi, 1995) to properly assess the predictive ability of the different

models within and between consecutive weeks. Cross-validation for prediction models was

performed by randomly dividing data sets into five subsets of predictor combinations and

corresponding BHBA status of individual animals, having no missing values for the desired

combination of information. One subset (testing set) was left out to evaluate the predictive

ability of the model, whereas the other four subsets were used as training set to configure the

ANN. This scheme was rotated, so that each subset served as testing set once. In each of the

five rounds, the BHBA concentrations of the testing set were set to missing and Pearson's

correlation between observed and predicted response was calculated. One hundred different

randomizations assigning the different combination of predictors and corresponding BHBA

status to five different subsets were used, yielding 500 independent cross-validation runs per

model. The predictive ability of each model was then calculated as Pearson’s correlation (r)

between the observed and predicted BHBA concentration averaged over all 500 cross-

validation runs. Further, a multiple contrast test (Tukey's Honest Significant Difference test

(HSD)) was applied to cross-validation results, to indicate possible significant differences

between single models (note that in the following, results from the HSD test are not shown

in full). The HSD was applied simultaneously to the set of all pairwise comparisons and

identifies differences between two means that are greater than the expected standard error.
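The following R sketch outlines this validation scheme; fit_ann() and predict_ann() are hypothetical placeholders for the network training and prediction routines, which are not reproduced here.

# Sketch of repeated five-fold cross-validation (100 random fold assignments
# x 5 folds = 500 runs). fit_ann() and predict_ann() are placeholders.
cv_correlation <- function(X, y, fit_ann, predict_ann, K = 5, reps = 100) {
  r <- numeric(0)
  for (rep in seq_len(reps)) {
    fold <- sample(rep_len(seq_len(K), length(y)))  # random fold labels
    for (k in seq_len(K)) {
      test  <- fold == k                            # hold fold k out
      model <- fit_ann(X[!test, , drop = FALSE], y[!test])
      yhat  <- predict_ann(model, X[test, , drop = FALSE])
      r     <- c(r, cor(y[test], yhat))             # Pearson correlation
    }
  }
  mean(r)  # predictive ability averaged over all K * reps runs
}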

Results and Discussion

Prediction of subclinical ketosis risk within weeks

Table 3 shows the average Pearson's correlation coefficient between observed and predicted

BHBA concentrations of 500 independent cross-validation runs for the different models to

predict yet-to-be observed concentrations of BHBA within the same week.


Table 3. Results of predictions at the same day, within weeks. Shown are the average

Pearson's correlation coefficients between observed and predicted BHBA concentrations in

milk over all 500 CV runs (the variance of estimates is given in brackets).

Prediction model1   Week 1    Week 2    Week 3    Week 4    Week 5    Average2
                    p.p.      p.p.      p.p.      p.p.      p.p.

GPC/PC              -0.027     0.005     0.278     0.140     0.193     0.118
                    (0.073)   (0.037)   (0.015)   (0.017)   (0.027)
GPC+PC              -0.127     0.085     0.545     0.322     0.255     0.216
                    (0.066)   (0.040)   (0.065)   (0.129)   (0.050)
MPR                  0.489     0.393     0.335     0.542     0.463     0.444
                    (0.064)   (0.027)   (0.023)   (0.040)   (0.037)
UD                   0.453    -0.011     0.095     0.198     0.043     0.156
                    (0.060)   (0.041)   (0.024)   (0.031)   (0.035)
GPC/PC+UD            0.439    -0.046     0.108     0.212     0.088     0.160
                    (0.061)   (0.036)   (0.026)   (0.030)   (0.034)
GPC+PC+UD            0.408     0.032     0.259     0.274     0.170     0.229
                    (0.067)   (0.035)   (0.026)   (0.040)   (0.035)
MPR+UD               0.528     0.135     0.278     0.484     0.349     0.355
                    (0.059)   (0.033)   (0.020)   (0.020)   (0.033)
MPR+GPC+PC           0.439     0.212     0.519     0.643     0.371     0.437
                    (0.047)   (0.045)   (0.048)   (0.068)   (0.067)
MPR+GPC+PC+UD        0.526     0.186     0.341     0.510     0.368     0.386
                    (0.044)   (0.030)   (0.019)   (0.026)   (0.027)

1GPC/PC = ratio of glycerophosphocholine to phosphocholine, GPC = glycerophosphocholine, PC = phosphocholine, MPR = milk performance records (milk yield, fat and protein percentage), UD = principal component scores of genotypes, p.p. = postpartum. 2Averaged over all five weeks.

Generally, due to the small number of animals with full records for all available sources of

information, the variance in calculated correlations between observed and predicted phenotype in

CV runs was high (Table 3). Further, the very limited number of clinical ketosis cases with

very high BHBA levels (Figure 1) might in particular contribute to the high variance in

correlations obtained from CV runs. Depending on the distribution of the animals with

outstandingly high BHBA levels between training and testing set, predictive performance can

become poor for the upper concentration range. Using a data set with more clinical cases or

restricting the data set to a certain BHBA range, which was precluded by the small sample

size here, might well result in a much better performance.


The results indicate that prediction of SCK risk works sufficiently well in weeks 1, 3, 4 and 5

postpartum. On average and considering all implemented models, Pearson's correlations of

0.348, 0.306, 0.369, and 0.256, respectively, could be obtained for the respective weeks in milk.

In comparison, predictive performance was, for unknown reasons, worse in week 2

postpartum (correlation averaged over all models r = 0.110). Against the background of the

small sample size, however, it is possible that animals with particular patterns of predictor

values might have biased predictions. Overall, the predictive performance of ANNs within

weeks varied strongly with the model used as input to the network, i.e. the different

combinations of predictors.

Impact of metabolic information

As shown in Table 3, the prediction performance was, irrespective of the inclusion of

genetic information, significantly better when using the absolute abundances of GPC and PC

rather than their ratio (GPC/PC) as input to the network. The only exception is lactation

week 1, where the ratio combined with the UD matrix achieved an average predictive

performance of r = 0.439, whereas the single metabolites combined with the genetic

information resulted in a performance of r = 0.408. However, evidence is limited, because the

difference was not significant (HSD). These results suggest that the relationship

between the metabolites is not linear; thus, the ANN is able to capture the

relationship better through the separate information of the single metabolites than through

a fixed combination (the ratio). However, when metabolic information represented the only source of input

information, predictions failed for the first two weeks postpartum. This effect might be due to

the fast changes in the metabolic constitution of animals in the first weeks of lactation. In

week 1, the prediction performance increased significantly when genetic information was

added to the metabolic models. In all other weeks, combining metabolic and genetic information as

predictors achieved no significant improvement, but rather a decline, in predictive ability.

These results indicate that genetic information from 50k SNP-panels has little

informative value as a predictor of BHBA status. This might, however, also be due to

statistical limitations, as the inclusion of genomic information in predictions increases the

number of parameters to be estimated and, thus, increases model complexity.


Overall, the highest prediction performances and lowest variances of cross-validation runs

were obtained with the use of both milk performance and metabolite data. This effect is

independent of the data set (weeks postpartum) used for the analyses.

Impact of genetic information

Predictive ability of the genetic covariates as the only information provided to the networks was best in

the first week and decreased with increasing number of weeks postpartum, which might

indicate a decrease of heritability during lactation. This is in line with Oikonomou et al.

(2008), who reported heritability estimates for serum BHBA concentrations to drop from

0.40 in the first lactation week to 0.08 in week seven. To analyze the trajectory of heritability

throughout lactation, random regression models can be used, which is hardly possible in the

current study owing to the limited amount of data. Moreover, it is known from literature that

the use of random regression for disease data often results in erratic estimates and

computational difficulties, especially when disease frequencies are low (Carlen et al., 2009).

Our results show that the benefit of genetic information to increase predictive ability of the

prediction model is low. The use of the genomic covariates led to a significant increase in

predictive performance only for week 1, but not for any other week.

Generally, adding genomic information to the prediction models led to a significantly higher

predictive performance in 20% of the models. However, 60% of the models performed

significantly worse than the reduced model without genomic information. The results

therefore indicate that the exclusive use of genetic covariate information fails in the prediction

of SCK risk. This is in concordance with the low heritability of ketosis and BHBA

concentrations reported in literature (e.g. Gaddis et al., 2014 [h²=0.04-0.14 for clinical

ketosis], Heringstad et al., 2005 [h²=0.13-0.15 for clinical ketosis], Van der Drift et al., 2012b

[h²=0.16 and 0.17 for milk and serum BHB, respectively]). Furthermore, we suppose that a

higher number of animals would have been needed to make SNP-based predictions valuable.

Finally, there might be an effect of the static nature of genotypes, as they are not able to

capture the changes in BHBA concentrations of animals across weeks. Furthermore, using

the UD matrix might have somewhat blurred the genetic information, as individual marker information was lost

upon transforming genotypes into principal component scores. Nevertheless,

the transformation was needed to make regression with a low sample size feasible.


Impact of milk performance records

The use of weekly milk performance records (milk yield, protein and fat percentage) as input

information to the prediction model consistently led to high prediction performance of BHBA

status of cows in early lactation and, thus, to a good mapping of SCK risk. This was

irrespective of the week that was studied. As mentioned before, the high negative energy

balance, which is determined by milk yield and can be described by milk composition in early

lactation, is one of the most critical precursors for metabolic disorders (e.g. Hamann and

Krömker 1997). Indeed, the milk-related predictors showed an appropriate capability to

predict the BHBA status of cows properly within weeks. Furthermore, a combination of milk

performance records with metabolic information seems to give a stable and well performing

predictive system.

Additionally, it would have been interesting to also include DMI, another variable with a

substantial effect on the energy status of a cow, in the predictive model, but due to technical

issues this information was not available to us. However, under field conditions milk

performance records such as milk yield, protein and fat percentage are routinely available,

whereas this is not the case for DMI.

Prediction of subclinical ketosis risk across weeks

In the second part of the study we investigated the forecast of BHBA status. Table 4 shows

the impact of different prediction models on the performance of ANNs to predict

concentrations of BHBA across consecutive weeks. The average Pearson's correlation

coefficient between observed and predicted BHBA concentration of 500 independent cross-

validation runs is shown. Predictions were achieved with combinations of information

originating from one week, aiming to predict the BHBA concentrations of the following

week for individual cows. Due to the data structure, the number of animals included in the

different datasets was slightly lower than in the previous analyses. Therefore, variances of

calculated correlations in the CV runs were higher than those obtained in runs within weeks

(Tables 3 and 4). Increasing the number of independent CV runs (>500) did not decrease

variation of correlations between observed and predicted phenotype. To improve this aspect, a

larger number of animals would be needed in the single models, which was not available to us.

Additionally, based on the poor performance in the previous predictions, the GPC to PC ratio

was dropped from the analyses (Table 2).


Table 4. Results of predictions across consecutive weeks. Shown are the average Pearson's

correlation coefficients between observed and predicted BHBA concentrations in milk over

all 500 CV runs (the variance of estimates is given in brackets).

Prediction model1   Week 1 to 2   Week 2 to 3   Week 3 to 4   Week 4 to 5   Average2
                    p.p.          p.p.          p.p.          p.p.

GPC+PC              -0.146         0.198         0.324         0.171         0.137
                    (0.140)       (0.077)       (0.094)       (0.051)
MPR                  0.053         0.285         0.262         0.285         0.221
                    (0.132)       (0.063)       (0.032)       (0.087)
UD                  -0.144        -0.105         0.183         0.093         0.007
                    (0.141)       (0.040)       (0.066)       (0.051)
GPC+PC+UD           -0.133        -0.045         0.289         0.099         0.053
                    (0.140)       (0.046)       (0.048)       (0.062)
MPR+UD              -0.063        -0.035         0.253         0.310         0.116
                    (0.159)       (0.038)       (0.046)       (0.058)
MPR+GPC+PC          -0.020         0.046         0.369         0.133         0.132
                    (0.151)       (0.063)       (0.104)       (0.061)
MPR+GPC+PC+UD       -0.119        -0.011         0.270         0.294         0.109
                    (0.160)       (0.041)       (0.051)       (0.055)

1GPC = glycerophosphocholine, PC = phosphocholine, MPR = milk performance records (milk yield, fat and protein percentage), UD = principal component scores of genotypes, p.p. = postpartum. 2Averaged over all weeks.

Results indicate that the prediction of BHBA status across consecutive weeks is generally

applicable from week 3 to 4 and from week 4 to 5. Averaged over the seven implemented

predictive models, Pearson's correlations between observed and predicted BHBA

concentrations of 0.279 and 0.198 could be obtained in the respective data sets. In

comparison, predictive performance was worse from week 2 to 3 (correlation averaged over

all models r = 0.048), and predictions completely failed from the first to the second week

postpartum (correlation averaged over all models r = -0.081). In the latter data set, no

correlation between predicted and true values could be achieved. Due to the rapid changes in

metabolism at the beginning of lactation, milk-related and metabolic measurements can vary

strongly in the first weeks in milk. Our assumption that some factors map the

concentration of BHBA across consecutive weeks might not be appropriate in the first two

weeks postpartum. Thus, a dynamic prediction model, e.g. time-series analysis, might be more

reasonable than a static one. However, this could also be biased by the low effective sample

size as well as the limited number of animals with high milk BHBA concentrations.


Nevertheless, the results show that the forecast of SCK risk based on data from the previous

week is possible with reasonable quality from week 2 to 5 postpartum. The best

combination of predictors, however, varies between weeks (Table 4). Results indicate no

significant advantage of milk performance records over metabolic information as information

sources. However, when milk performance records (milk yield, protein and fat percentage)

were provided as input to the ANN, more consistent results across different data sets were

achieved. Unfortunately, the use of milk performance records in conjunction with milk

concentrations of GPC and PC, both recorded one week before, did not result in significantly

better prediction performance of the network. Moreover, the performance of genomic

covariates, as single models or in combination with other information, was also low for weeks

1 to 2 and 2 to 3, respectively, whereas adding genetic information to the prediction models for

weeks 3 to 4 and 4 to 5 seemed to achieve more consistent prediction

performance than single models.

Conclusions

The performance of incorporating metabolic, genetic, and milk performance data alone and in

various combinations was compared in predicting BHBA concentrations by means of ANNs

in individual dairy cows within and across consecutive weeks postpartum. Predictive abilities

differed across models. Overall, metabolic and performance based models had a higher

predictive ability than models based on genetic information. However, further studies using

more comprehensive data sets are needed to confirm these promising results.

Furthermore, there might be metabolites or metabolite profiles that possess a higher predictive

ability than those used in the current study.


Author’s contributions

AE conceived the study, performed all computations and drafted the manuscript. DH provided

the C++ code and supervised the computations. JT and NK participated in the conception of

the study and revised the manuscript. MK and WG revised the manuscript. GT supervised the

study and revised the manuscript.

Acknowledgements

This research was supported by the German Federal Ministry of Education and Research

(BMBF) in the context of the Fugato-plus MeGA-M framework (grant no. 0315131B) and the

AgroClustEr “Synbreed – Synergistic plant and animal breeding” (grant no. 0315528F). The

genotyping of the cows was financially supported by the German Federal Ministry of

Education and Research (project FUGATO-plus GENOTRACK, grant no. 0315134A), the

KMSH (Kompetenzzentrum Milch – Schleswig-Holstein, Kiel, Germany), and the NOG

Nord-Ost Genetic GmbH & Co. KG (Verden, Germany). The authors would like to thank P.

Oefner for a thorough review of the manuscript and many helpful suggestions.

References

Andersson, L. 1988. Subclinical ketosis in dairy cows. Vet. Clin. North Am. Food Anim.

Pract. 4:233-251.

Asl, A. N., S. Nazifi, A. R. Ghasrodashti, and A. Olyaee. 2011. Prevalence of subclinical

ketosis in dairy cattle in the Southwestern Iran and detection of cutoff point for NEFA

and glucose concentrations for diagnosis of subclinical ketosis. Prev. Vet.

Med.100:38-43.

Buitenhuis, A. J., U. K. Sundekilde, N. A. Poulsen, H. C. Bertram, L. B. Larsen, and P.

Sørensen. 2013. Estimation of genetic parameters and detection of quantitative trait

loci for metabolites in Danish Holstein milk. J. Dairy Sci. 96:3285-3295.

Buttchereit, N., E. Stamer, W. Junge, and G. Thaller. 2010. Relationship of energy balance

and fat protein ratio of milk to disease liability in dairy cattle. In Proc. W.C.G.A.L.P.

2010: 9, 315.

Carlen, M., K. Meletis, C. Goritz, V. Darsalia, E. Evergren, K. Tanigaki, M. Amendola, F.

Barnabe-Heider, M. S. Y. Yeung, L. Naldini, T. Honjo, Z. Kokaia, O. Shupliakov, R.

M. Cassidy, O. Lindvall, and J. Frisen. 2009. Forebrain ependymal cells are Notch-


dependent and generate neuroblasts and astrocytes after stroke. Nat Neurosci. 12:259-

267.

Collard, B. L., P. J. Boettcher, J. C. M. Dekkers, D. Petitclerc, and L. R. Schaeffer. 2000.

Relationships between energy balance and health traits of dairy cattle in early

lactation. J. Dairy Sci. 83:2683-2690.

Dohoo, I. R. and S. W. Martin. 1984. Subclinical ketosis: prevalence and associations with

production and disease. Can. J. Comp. Med. 48:1-5.

Duffield, T. F., D. F. Kelton, K. E. Leslie, K. D. Lissemore, and J. H. Lumsden. 1997. Use of

test day milk fat and milk protein to detect subclinical ketosis in dairy cattle in

Ontario. Can. Vet. J. 38:713-718.

Duffield, T. F., K. D. Lissemore, B. W. McBride, and K. E. Leslie. 2009. Impact of

hyperketonemia in early lactation dairy cows on health and production. J. Dairy Sci.

92:571-580.

Enjalbert, F., M. C. Nicot, C. Bayourthe, and R. Moncoulon. 2001. Ketone bodies in milk and

blood of dairy cows: relationship between concentrations and utilization for detection

of subclinical ketosis. J. Dairy Sci. 84:583-589.

Gaddis, K. L. P., J. B. Cole, J. S. Clay, and C. Maltecca. 2014. Genomic selection for

producer-recorded health event data in US dairy cattle. J. Dairy Sci. (in press)

Geishauser, T., K. Leslie, J. Tenhag, and A. Bashiri. 2000. Evaluation of eight cow-side

ketone tests in milk for detection of subclinical ketosis in dairy cows. J. Dairy Sci.

83:296-299.

Gianola, D., H. Okut, K. A. Weigel, and G. J. M. Rosa. 2011. Predicting complex quantitative

traits with Bayesian neural networks: a case study with Jersey cows and wheat. BMC

Genet. 12:87.

Hamann, J. and V. Krömker. 1997. Potential of specific milk composition variables for cow

health management. Livest. Prod. Sci. 48:201-208.

Heringstad, B., Y. M. Chang, D. Gianola, and G. Klemetsdal. 2005. Genetic analysis of

clinical mastitis, milk fever, ketosis, and retained placenta in three lactations of

Norwegian Red cows. J. Dairy Sci. 88:3273–3281.

Hornik, K., M. Stinchcombe, and H. White. 1989. Multilayer feedforward networks are

universal approximators. Neural Networks 2:359-366.


Howie B., J. Marchini, and M. Stephens. 2011. Genotype imputation with thousands of

genomes. G3 1:457–470.

Klein, M. S., M. F. Almstetter, N. Nürnberger, G. Sigl, W. Gronwald, S. Wiedemann, K.

Dettmer, and P. J. Oefner. 2013. Correlations between milk and plasma levels of

amino and carboxylic acids in dairy cows. J. Proteome Res. 12:5223-5232.

Klein, M. S., N. Buttchereit, S. P. Miemczyk, A.-K. Immervoll, C. Louis, S. Wiedemann, W.

Junge, G. Thaller, P. J. Oefner, and W. Gronwald. 2012. NMR metabolomic analysis

of dairy cows reveals milk glycerophosphocholine to phosphocholine ratio as

prognostic biomarker for risk of ketosis. J. Proteome Res. 11:1373-1381.

Kohavi, R. 1995. A study of cross-validation and bootstrap for accuracy estimation and model

selection. Pages 1137-1145 in the International Joint Conference on Artificial

Intelligence (IJCAI).

Kriesel D. 2007. A brief introduction to neural networks. Available at

http://www.dkriesel.com

LeBlanc, S. 2010. Monitoring metabolic health of dairy cattle in the transition period. J.

Reprod. Dev. 56:S29–S35.

Li, Y., C. J. Willer, J. Ding, P. Scheet, and G. R. Abecasis. 2010. MaCH: using sequence and

genotype data to estimate haplotypes and unobserved genotypes. Gen. Epidem.

34:816-834.

Mandel, J. 1982. Use of the Singular Value Decomposition in Regression Analysis. Am. Stat.

36:15-24.

Melzer, N., D. R. Wittenburg, and D. Repsilber. 2013. Integrating milk metabolite profile

information for the prediction of traditional milk traits based on SNP information for

Holstein Cows. PLoS ONE 8:e70256.

Oikonomou, G., G. E. Valergakis, G. Arsenos, N. Roubies, and G. Banos. 2008. Genetic

profile of body energy and blood metabolic traits across lactation in primiparous

Holstein cows. J. Dairy Sci. 91:2814–2822.

Pérez-Rodríguez, P., D. Gianola, K. A. Weigel, G. J. M. Rosa, and J. Crossa. 2013. Technical

Note: An R package for fitting Bayesian regularized neural networks with applications

in animal breeding. J. Anim Sci. 91:3522-3531.


R Development Core Team. 2010. R: A Language and Environment for Statistical

Computing. R Foundation for Statistical Computing, Vienna, Austria. URL

http://www.R-project.org. R version 2.14.0.

Seifi, H. A., S. J. LeBlanc, K. E. Leslie, and T. F. Duffield. 2011. Metabolic predictors of

post-partum disease and culling risk in dairy cattle. Vet. J. 188:216–220.

Suthar, V. S., J. Canelas-Raposo, A. Deniz, and W. Heuwieser. 2013. Prevalence of

subclinical ketosis and relationships with postpartum diseases in European dairy cows.

J. Dairy Sci. 96:2925-2938.

Tetens, J., T. Seidenspinner, N. Buttchereit, and G. Thaller. 2013. Whole-genome association

study for energy balance and fat/protein ratio in German Holstein bull dams. Anim.

Genet. 44:1-8.

Tusell, L., P. Pérez-Rodríguez, S. Forni, X. L. Wu, and D. Gianola. 2013. Genome-enabled

methods for predicting litter size in pigs: a comparison. Animal 7:1739-1749.

Van der Drift, S.G.A., R. Jorritsma, J.T. Schoneville, H.M. Knijn, and J.A. Stegeman. 2012a.

Routine detection of hyperketonemia in dairy cows using Fourier transform infrared

spectroscopy analysis of ß-hydroxybutyrate and acetone in milk in combination with

test-day information. J. Dairy Sci. 95:4886-4898.

Van der Drift, S.G.A., K.J.E. van Hulzen, T.G. Teweldemedhn, R. Jorritsma, M. Nielen, and

H.C.M. van Heuven. 2012b. Genetic and nongenetic variation in plasma and milk ß-

hydroxybutyrate and milk acetone concentrations of early lactating dairy cows. J.

Dairy Sci. 95:6785-6787.

Veerkamp, R. F. and E. P. C. Koenen. 1999. Genetics of food intake, live weight, condition

score and energy balance. Pages 63-73 in Occasional Publication British Society of

Animal Science, No. 24, Edinburgh


CHAPTER 5


Extreme Learning Machine: A New Approach for Genomic Prediction of Complex Traits

Anita Ehret1, David Hochstuhl2 and Georg Thaller1

1 Institute of Animal Breeding and Husbandry, Christian-Albrechts-University, Germany

2 Institute for Theoretical Physics and Astrophysics, Christian-Albrechts-University, Germany

Published in Proceedings, 10th World Congress of Genetics Applied to Livestock Production

Abstract

A wide range of methods for predicting phenotypes based on genomic data has become

available. Increasingly, the focus is also shifting to machine learning methodology. Despite

this progress, the prediction of complex traits from high density SNP-panels remains an

extremely demanding task, particularly in terms of the computational effort required in

calibration and prediction. We present a fast learning algorithm for artificial neural networks

which was introduced by Huang et al. in 2004. Our experimental results show that this

approach is able to achieve good generalization performance with much less computational

effort while outperforming the traditional gradient-based learning in artificial neural networks,

which is a great advantage when analyzing high dimensional data. We demonstrate the

capabilities of the new approach to genomic predictions in animal and plant breeding.

Keywords: Genomic selection, Extreme learning machine, Complex traits


Introduction

Today, in animal and plant breeding, non- and semi-parametric, as well as machine learning

methods, have become increasingly important in performing genome-enabled predictions of

complex traits. In the current era of high-density panels and sequence information,

prediction becomes challenging in terms of computational costs and efficiency. The number of

markers now vastly exceeds the number of records, with the effect that typical fitting

methods, such as least squares regression, often require some prior variable selection or

shrinkage estimation procedure. When using machine learning approaches, one is usually

facing a large number of parameters, non-linear training procedures and the problem of over-

fitting. Therefore, the computational costs are usually even larger than those of standard

parametric methods, which only require the solution of linear systems. A notorious example is

the back-propagation algorithm for the training of artificial neural networks, which must be

processed iteratively. However, the additional effort is usually compensated by an improved

generalization performance. To overcome this computational challenge, a fast learning

algorithm for single-hidden-layer feed-forward neural networks, called extreme learning

machine (ELM), was proposed by Huang and coworkers (Huang et al., 2004; Huang et al.,

2006).

Since its invention, the ELM has had a tremendous impact in the field of machine learning as

well as on its close relatives, such as data mining and pattern recognition. For a survey on its

various applications see Huang et al. (2011). The reason for its success is twofold. On one

hand, the underlying algorithms are almost trivial when compared to more sophisticated

methods such as support vector machines (SVM) or artificial neural networks trained by back-

propagation (to which we refer here as ANN). On the other hand, despite this simplicity,

several studies have revealed that its predictive performance is generally comparable to that of

the standard methods mentioned (SVM and ANN), with often diminished training times. In

this contribution, we employ the ELM to predict milk traits from dairy cattle data with

molecular marker information, which is to our knowledge the first application of this kind,

and compare the results to an ANN approach with the same input and output configuration.

Material and Methods

The theory of the Extreme Learning Machine may be approached from two directions, starting

either from ANN, or from basis function expansion methods. As the neural network

perspective seems to be common in the literature, we consider the second alternative in the

brief summary that follows.


For this, assume we are given a data set $\{(x_i, y_i)\}_{i=1}^{N}$ of size $N$, where $x_i \in \mathbb{R}^M$ denotes an $M$-dimensional predictor and $y_i \in \mathbb{R}$ is the corresponding response variable. A common method to describe the relationship between the predictors and the observations is to assume a linear expansion into a number of $K$ basis functions $f_k : \mathbb{R}^M \to \mathbb{R}$,

$$f(x; \beta) = \sum_{k=1}^{K} \beta_k f_k(x) \qquad (1)$$

Given this model, the task is then to minimize the residual sum of squares of the data set,

$$\min_{\beta} \sum_{i=1}^{N} \big( y_i - f(x_i; \beta_1, \ldots, \beta_K) \big)^2 \qquad (2)$$

which, through variation with respect to $\beta$, quickly leads to the usually over-determined (since $K \le N$) linear system

$$\begin{pmatrix} f_1(x_1) & \cdots & f_K(x_1) \\ \vdots & \ddots & \vdots \\ f_1(x_N) & \cdots & f_K(x_N) \end{pmatrix} \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_K \end{pmatrix} = \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix} \qquad (3)$$

Denoting the components by $F$, $\beta$, and $y$, the least-squares solution of this equation is given by

$$\beta = F^{+} y, \qquad (4)$$

where $F^{+}$ stands for the Moore-Penrose pseudo-inverse (Golub and Van Loan, 2013). Though there are several methods to calculate this pseudo-inverse, the singular value decomposition is usually a convenient choice due to its good stability properties (Press, 2007). Solving the least-squares problem for the basis expansion model therefore requires an SVD of a matrix of size $N \times K$, which for $K \le N$ scales as $O(NK^2)$.

Note that the predictor dimension $M$ has not entered the derivation so far, and will eventually do so only in the function evaluation step for the construction of the coefficient matrix. Thus, at least in principle, this allows for large dimensions of the predictor space to be treated, as is the case in this work.

The crucial point in model (1) is the selection of an appropriate basis set, and several statistical methods exist which differ only in this respect, such as linear regression, logistic regression, spline interpolation, and many more. The particular choice leading to the ELM is given by

$$f_k(x) = g\left( \sum_{m=1}^{M} w_{km} x_m + b_k \right) \qquad (5)$$


Here, $w_{km}$ and $b_k$ are randomly chosen real numbers (according to any random distribution),

whereas $g$ denotes a function which in the ELM is usually taken to be the sigmoid function.

In combination with model (1), equation (5) allows for its interpretation as a single-hidden-layer

artificial neural network with randomly chosen connections and biases. More precisely,

this network consists of $M$ input neurons, $K$ hidden neurons with sigmoid activation function

and a single output neuron with linear activation function. Here, $w_{km}$ stands for the connection

strength between input neuron $m$ and hidden neuron $k$, $b_k$ for the corresponding bias

(intercept), and the coefficients $\beta_k$ are the connection strengths between the $k$-th hidden

node and the output layer, as shown in Figure 1.

Figure 1 Architecture of a single-hidden layer artificial neural network with ELM approach


With the model being the same as a conventional neural network, the major difference that

makes up the ELM is its training. The standard method proceeds by optimizing all involved

parameters, i.e., $w_{km}$, $b_k$, and $\beta_k$. Due to their non-trivial dependencies, the training needs to

be done layer by layer in a process called back-propagation. In contrast, the ELM randomly

fixes a large number of coefficients in advance, and leaves only the task to determine the

output layer connection strengths. Due to the linear activation function in the output node, this

can be accomplished by the least-squares solution outlined in Eqs. (2) to (4). Further, the

method presented can easily be extended to include a regularization parameter (via

modification of Eq. (2)) or to use kernels such as radial Gaussian functions (via modification

of (5)).
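For illustration, a minimal R sketch of this training scheme is given below, assuming simulated data; sigmoid(), elm_train() and elm_predict() are hypothetical helper names, and the pseudo-inverse is computed with ginv() from the MASS package rather than with the authors' own implementation.

# Minimal ELM sketch following Eqs. (1) to (5): random hidden-layer weights,
# output weights from the Moore-Penrose pseudo-inverse. Data are simulated.
library(MASS)  # provides ginv() for the pseudo-inverse

sigmoid <- function(z) 1 / (1 + exp(-z))

elm_train <- function(X, y, K = 100) {
  M <- ncol(X)
  W <- matrix(runif(K * M, -1, 1), K, M)          # random input weights w_km
  b <- runif(K, -1, 1)                            # random biases b_k
  H <- sigmoid(X %*% t(W) + matrix(b, nrow(X), K, byrow = TRUE))  # N x K
  list(W = W, b = b, beta = ginv(H) %*% y)        # beta = F^+ y, Eq. (4)
}

elm_predict <- function(fit, Xnew) {
  K <- length(fit$b)
  H <- sigmoid(Xnew %*% t(fit$W) + matrix(fit$b, nrow(Xnew), K, byrow = TRUE))
  as.numeric(H %*% fit$beta)
}

# Toy usage on simulated data
set.seed(3)
X <- matrix(rnorm(200 * 20), 200, 20)
y <- rnorm(200)
fit <- elm_train(X, y, K = 50)
head(elm_predict(fit, X))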

Benchmark model

To compare the predictive ability of the new algorithm with a standard network we used a

single-hidden layer artificial neural network with back-propagation algorithm as benchmark.

Here, the connections and biases were chosen optimally using a given training set. To avoid

poor generalization ability, early stopping was implemented during training. Both models, ANN

and ELM, were set up with a sigmoid activation function in the hidden layer and a linear

function in the output layer. For the ANN, an architecture of ten hidden neurons achieved the

best predictive ability when dealing with the entire marker set.

Data

Genotypic and phenotypic information of 3,341 German Fleckvieh bulls was employed in

the analyses. All animals were genotyped with a 50k SNP-panel, and after quality control

39,344 SNP markers remained in the analyses. Quality control included the elimination of

SNPs with minor allele frequency < 0.05 and missing genotype frequency > 0.05. For the

remaining loci, missing genotypes were imputed using the population based imputing

algorithm Minimac (Howie et al., 2012), and haplotypes were inferred using MaCH (Li et al.,

2010).

Daughter yield deviations (DYD) of three milk traits (milk yield, fat yield and protein yield) were used as phenotype

records in the analysis. To account for ANN and ELM peculiarities, feature scaling was

applied to both phenotypic and genomic data; the data were normalized to the [-1, 1] range

to enhance numerical stability.

To assess the predictive performance of the ELM and of the standard ANN, a 10-fold cross-

validation scheme was used. The dataset was randomly split into ten equal folds of


phenotypes and genotypes. Then, nine folds were used for training and one for testing. This

was rotated ten times so that every fold served as test set once. The average correlations

between predicted and true phenotype in the testing sets of all runs then were used to evaluate

predictive ability.

Results and Discussion

The results represent our first application of the ELM and serve as an illustration of

the advantages of this promising approach. Results are given in Table 1, showing the

correlation obtained in the cross-validation runs for an ELM with 1,000 hidden nodes (the

number of hidden nodes needs to be much larger than for the ANN, in order to compensate

for the random initialization).

Table 1 Estimated predictive abilities for three milk traits

Trait            ELM (r)    ANN (r)
Milk yield        0.468      0.469
Fat yield         0.453      0.453
Protein yield     0.450      0.477

r = average Pearson correlation of cross-validation runs, ELM = extreme learning machine with 1,000 hidden neurons, ANN = artificial neural network with 10 hidden neurons and back-propagation

The ELM was able to give predictions with accuracy well comparable to that of the ANN

approach. The crucial point here is that ELM needed significantly less computation time. On

average, for a ten-fold cross-validation, the ELM required 1.3 hours, whereas the ANN with

ten neurons in the hidden layer needed 11.2 hours for the same task. Another advantage is that,

with increasing numbers of hidden neurons, the prediction performance of the ELM tended

to increase as well. As an example, we applied the cross-validation procedure to an

ELM with 10,000 hidden neurons for the trait protein yield. This resulted in a substantially

increased average correlation coefficient of 0.618 between true and predicted phenotype

(computation time for a ten-fold cross-validation: 26.6 hours). As shown in our recent work

(Ehret et al. to be published), it is much more difficult to tune the standard ANN to such

levels of prediction accuracy, when the entire genotype matrix is used for predictions.


Conclusion

With a low number of neurons the ELM matches the performance of the ANN in predicting

milk traits from genomic information, while using the whole marker information but

drastically reducing the computational cost. Increasing the number of hidden neurons

produced promising results, as shown for the trait protein yield. Although the study was based

on limited data, our results suggest that the ELM is likely to be useful for predicting complex

traits using high-dimensional genomic information, a situation where the number of

coefficients that need to be estimated exceeds sample size. The ELM, in the same way as

SVM or ANN, has the ability to capture nonlinearities, but its great advantage is in keeping

the computational cost at a reasonable level. The predictive ability seemed to be enhanced by

using a larger number of neurons in the ELM, which is in contrast to the standard ANN

approach. However, further studies are needed to confirm these results and explore the

capabilities of the ELM.

In summary, the ELM is a promising method for prediction of future phenotypes from high

dimensional genomic data sets, as has been suggested by the presented study. Hence, a

detailed investigation of its capabilities will be an important component in our future work.


Author’s contributions

AE and DH conceived the study and wrote the manuscript. AE performed computations. DH

provided the computation program. GT supervised the study and revised the manuscript.

Acknowledgements

This research was supported by the German Federal Ministry of Education and Research

(BMBF) within the AgroClustEr “Synbreed – Synergistic plant and animal breeding”.

References

Golub, G. H. and C. F. Van Loan. 2013. Matrix Computations. 4th ed. Johns Hopkins

University Press.

Huang, G.-B., Q.-Y. Zhu, and C.-K. Siew. 2004. Extreme learning machine: a new learning

scheme of feedforward neural networks. Pages 985-990 in Proceedings of the 2004

IEEE International Joint Conference on Neural Networks, Vol. 2.

Huang, G.-B., Q.-Y. Zhu, and C.-K. Siew. 2006. Extreme learning machine: Theory and

applications. Neurocomputing 70:489-501.

Huang, G.-B., D. Wang, and Y. Lan. 2011. Extreme learning machines: a survey.

International Journal of Machine Learning and Cybernetics 2:107-122.

Howie, B., C. Fuchsberger, M. Stephens, J. Marchini, and G. R. Abecasis. 2012. Fast and

accurate genotype imputation in genome-wide association studies through pre-

phasing. Nat Genet 44:955-959.

Li, Y., C. J. Willer, J. Ding, P. Scheet, and G. R. Abecasis. 2010. MaCH: using sequence and

genotype data to estimate haplotypes and unobserved genotypes. Genetic

Epidemiology 34:816-834.

Press, W. H. 2007. Numerical recipes 3rd edition: The art of scientific computing. Cambridge

university press.


CHAPTER 6


General Discussion

The present thesis comprises investigations on machine learning, with a focus on artificial

neural networks (ANNs), at the interface between statistics, breeding and genetics, to enhance

accuracy of genome-enabled prediction of complex traits. The purpose was to provide an

overview of published research in this area, insights on ANN modelling issues, applications to

real data problems in cattle and future research directions in the field of animal breeding. In

this chapter, the main findings and additional topics are elucidated and discussed separately.

Firstly, the main properties of ANNs regarding genome-enabled prediction are discussed, and

the method is positioned within the wide range of methods commonly used for this purpose. Secondly,

the advantages and possible pitfalls in the construction of ANNs for genome-enabled

prediction of complex traits using large-scale marker data are evaluated. Thirdly, the impact

of the genetic architecture of traits and data representation on the performance of ANNs in

empirical applications to genome-enabled prediction on real data problems is specified.

Finally, the potential and the limitations of using ANNs for genome-enabled prediction of complex traits on large-scale genomic data are identified, and further developments and prospects for genome-enabled prediction using ANNs are outlined.

Properties of ANNs in genome-enabled prediction of complex traits in animal breeding

Today, several methods exist that allow genome-enabled prediction of complex traits with

reasonable accuracy for yet-to-be-observed phenotypes. Many statistical and mathematical

methods can be adapted for the purpose of prediction. Traditionally, model-based methods

(linear or other parametric models) are used to estimate genetic values of individuals using

genomic data. Here, the basis functions used to regress phenotypes on markers are assigned to

a small class of particular functions a priori, which restricts the type of genetic patterns

(genetic architecture) that these models can capture from data (de los Campos et al., 2013).

The genetic architecture of a complex trait is described by the number of loci affecting trait

variation, mode of inheritance, and distribution of allelic effects (Hayes et al., 2010). Hence, it may include a variety of genetic interactions that are not necessarily linear, which is expected to influence the prediction accuracy for yet-to-be-observed trait values. Usually, the

highest accuracy is achieved with a model which best fits the genetic architecture of the trait

under investigation. Formulating a non-linear parametric model for a particular data set is a very demanding task, since there is a huge number of possible non-linear patterns. It would be difficult to choose suitable non-linear basis functions, and modelling all patterns simultaneously is hardly feasible. Therefore, a pre-specified non-linear model may not be general enough, or may be misspecified, and thus fail to capture all important features of the data. Contrary to

model-based parametric methods, ANNs are data-driven, self-adaptive methods. There is no need for a priori assumptions about the model being studied. ANNs learn from data examples and

capture functional relationships among data even if the underlying relationship is unknown or

hard to describe. ANNs can be considered as multivariate non-linear, non-parametric models

(Chen and Titterington, 1994). This means, that there is a wide class of non-linear basis

functions (used to regress the phenotype) available to be inferred from data by a supervised

learning algorithm (e.g. back-propagation algorithm). Here the basis functions are not fixed a

priori and their particular shape is inferred from the data during learning. This in turn gives the network greater flexibility than standard parametric models. Hence, ANNs are well suited

for problems where a solution is difficult to specify but enough data are available (Zhang et al., 1998). Additionally, ANNs are very flexible machines with regard to their range of application and to extensions of the basic concept. In fact, many of the methods commonly used in animal breeding can be represented by an analogously specified ANN. Table 1 compares important properties of ANNs with those of other methods used for genome-enabled prediction in animal breeding.

Table 1 Comparison of methods used for genome-enabled predictions on marker data (modified according to de los Campos et al., 2013)

Method                                   | linear* | non-linear | penalize-able | parametric | non-parametric
Bayesian ridge regression 1              |    x    |            |       x       |     x      |
Ridge regression BLUP 2 / Genomic BLUP 3 |    x    |            |       x       |     x      |
BayesA 4                                 |    x    |            |               |     x      |
BayesB 5                                 |    x    |            |               |     x      |
BayesC 6                                 |    x    |            |               |     x      |
Bayesian LASSO 7                         |    x    |            |               |     x      |
LASSO 8                                  |    x    |            |       x       |     x      |
Reproducing kernel Hilbert spaces 9      |    x    |     x      |       x       |            |       x
Support vector regression 10             |    x    |     x      |       x       |            |       x
Random forest 11                         |    x    |     x      |               |            |       x
Elastic net 12                           |    x    |            |       x       |     x      |
Partial least squares 13                 |    x    |            |       x       |     x      |     semi
Principal component regression 14        |    x    |            |       x       |     x      |     semi
Artificial neural networks 15            |    x    |     x      |       x       |            |       x

* Model is linear in the parameters (SNP effects)
1 Meuwissen et al. 2001; 2 Moser et al. 2009, Meuwissen et al. 2001, Piepho et al. 2009; 3 VanRaden et al. 2008; 4 Meuwissen et al. 2001; 5 Meuwissen et al. 2001; 6 Habier et al. 2011; 7 Park and Casella 2008, de los Campos et al. 2009, Legarra et al. 2011; 8 Usai et al. 2009; 9 Gianola et al. 2006; 10 Vapnik 1999, Moser et al. 2009; 11 Breiman 2001, Gonzalez-Recio and Forni 2011; 12 Croiseau et al. 2011; 13 Moser et al. 2009; 14 Solberg et al. 2009; 15 Gianola et al. 2011, Okut et al. 2011

Construction of specified ANNs for genome-enabled predictions

Despite the encouraging properties of ANNs, building a neural network predictor for whole-

genome regression on large-scale genomic data is not a trivial task. The field of possible

applications of ANNs is huge and there are many different ways to construct and implement

neural networks for prediction problems. The selection of an appropriate ANN model using

large-scale genomic data in animal breeding is often a question of available computing

resources and infrastructure. This thesis focuses on multilayer feed-forward networks (see

Chapters 3 and 4) and a modification, the extreme learning machine (see Chapter 5), since large-scale data often demand an excessive computational effort (i.e., computing time) if alternative network types, e.g. Bayesian regularized networks or radial basis function networks, are used.

There are no general recommendations available to ensure good predictive ability of ANNs

and the optimal settings of an ANN always depend on the specific task. Hence, fitting ANNs to particular data sets usually becomes a trial-and-error process, in which different training settings are tested and the best-fitting configuration is chosen afterwards to learn the intrinsic structure of the data at hand. However, there is no guarantee of finding the optimal configuration. One critical

task is to determine an appropriate network architecture (number of hidden layers and number

of hidden neurons per layer) to establish a well performing ANN predictor. Several authors

have shown that one hidden layer in a feed-forward network is sufficient for ANNs to

approximate any non-linear function with any desired accuracy (Hornik et al., 1989), given

that the number of neurons is sufficiently large. The drawback is a detrimental effect on generalization ability, which likely arises from over-fitting when the number of hidden neurons is increased. Moreover, in preliminary studies it was found empirically that a higher number of hidden layers only increased computation time, while no significant improvement in the predictive performance of the ANN was achieved.
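For reference, the type of network meant here can be written down compactly (standard notation, with illustrative symbols): a single-hidden-layer feed-forward network with $S$ hidden neurons maps the $p$ marker genotypes $x_1, \dots, x_p$ of an individual to a predicted phenotype as

$$ \hat{y} = w_0 + \sum_{k=1}^{S} w_k \, g\Bigl(b_k + \sum_{j=1}^{p} \beta_{jk} x_j\Bigr), $$

where $g(\cdot)$ is a non-linear activation function, $\beta_{jk}$ and $b_k$ are the hidden-layer weights and biases, and $w_k$ are the output weights. The result of Hornik et al. (1989) states that sums of this form can approximate any continuous function arbitrarily well as $S$ grows, which is precisely the flexibility, and the over-fitting risk, discussed above.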

The number of hidden neurons determines the network's flexibility in approximating arbitrary functions. Unfortunately, there is no simple theoretical rule for selecting this parameter.

The empirical results in this thesis show that a reasonable architecture of an ANN for

genome-enabled prediction purposes particularly depends on the dimensionality of genotypes.

The number of inputs (i.e. genotypes) in a fully connected ANN is directly related to model complexity. Considering the findings presented in Chapter 3, the number of input nodes is probably the most critical variable to be decided in regression-based prediction problems, since it carries the important information about the complex correlation structure in the data and is directly related to model selection tasks. Therefore, an understanding of network configurations and data representations is essential for making efficient use of ANN features; a lack of such understanding can also be the reason why ANNs fail.

Furthermore, the choice of an appropriate training algorithm is also important for the successful implementation of ANNs for genome-enabled prediction. Usually, training is an unconstrained non-linear optimization problem, and no algorithm is available that guarantees the globally optimal solution for this kind of problem in a reasonable amount of time (Zhang et al., 1998). The findings reported in Chapter 3 suggest that back-propagation is an appropriate learning algorithm for ANNs in genome-enabled prediction on large-scale SNP data sets, because it provides reasonable solutions in an appropriate amount of time.
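To make the training loop concrete, the following is an illustrative sketch (not the thesis program, whose data structures are not reproduced here): a single-hidden-layer network with tanh hidden neurons and a linear output, trained by stochastic back-propagation on squared error with a weight-decay penalty; all data sizes and the synthetic phenotype are purely hypothetical.

// Illustrative sketch (not the thesis program): back-propagation with weight decay
// for a single-hidden-layer network, y_hat = w0 + sum_k w_k * tanh(b_k + x' beta_k).
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <vector>

int main() {
    const int n = 8, p = 4, S = 3;           // records, inputs (e.g. SNPs), hidden neurons
    const double eta = 0.01, lambda = 1e-4;  // learning rate and weight-decay strength

    // Toy data: SNP codes in {0,1,2} and a synthetic, mostly additive phenotype.
    std::vector<std::vector<double> > X(n, std::vector<double>(p));
    std::vector<double> y(n);
    std::srand(42);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < p; ++j) X[i][j] = std::rand() % 3;
        y[i] = X[i][0] - 0.5 * X[i][1] + 0.1 * (std::rand() % 3);
    }

    // Parameters: hidden weights beta[j][k] and biases b[k], output weights w[k], intercept w0.
    std::vector<std::vector<double> > beta(p, std::vector<double>(S));
    std::vector<double> b(S), w(S);
    double w0 = 0.0;
    for (int j = 0; j < p; ++j)
        for (int k = 0; k < S; ++k)
            beta[j][k] = 0.1 * (2.0 * std::rand() / RAND_MAX - 1.0);
    for (int k = 0; k < S; ++k) {
        b[k] = 0.1 * (2.0 * std::rand() / RAND_MAX - 1.0);
        w[k] = 0.1 * (2.0 * std::rand() / RAND_MAX - 1.0);
    }

    for (int epoch = 0; epoch < 1000; ++epoch) {
        double sse = 0.0;
        for (int i = 0; i < n; ++i) {         // one gradient step per record
            std::vector<double> h(S);
            double yhat = w0;
            for (int k = 0; k < S; ++k) {     // forward pass through the hidden layer
                double a = b[k];
                for (int j = 0; j < p; ++j) a += beta[j][k] * X[i][j];
                h[k] = std::tanh(a);
                yhat += w[k] * h[k];
            }
            double e = y[i] - yhat;           // prediction error for this record
            sse += e * e;
            // Backward pass: descend the gradient of 0.5*e^2, shrinking the hidden
            // and output weights (not the biases) towards zero via the decay lambda.
            w0 += eta * e;
            for (int k = 0; k < S; ++k) {
                double db = e * w[k] * (1.0 - h[k] * h[k]);   // tanh'(a) = 1 - h^2
                for (int j = 0; j < p; ++j)
                    beta[j][k] += eta * (db * X[i][j] - lambda * beta[j][k]);
                b[k] += eta * db;
                w[k] += eta * (e * h[k] - lambda * w[k]);
            }
        }
        if (epoch % 200 == 0) std::printf("epoch %4d  SSE = %.4f\n", epoch, sse);
    }
    return 0;
}

The decay term shrinks the weights towards zero at every step; the regularized back-propagation used in Chapter 3 plays the same role, although its exact scheme may differ from this sketch.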

In summary, there are a variety of possible configurations and misconfigurations of ANNs

with a huge impact on success of application. The extent between potential and failure of

ANNs used for genome enabled prediction depends to a large extent on the knowledge and

experience of the user as well as on data quality and computational aspects.


Impact of genetic architecture of traits and data representation on performance of

ANNs in genome-enabled prediction on empirical data

There is general agreement in the scientific community of animal breeders that data structure strongly influences prediction performance (4th International Conference on Quantitative Genetics: Understanding Variation in Complex Traits, 2012). Data structure in genome-enabled prediction includes different aspects, such as sample size, sample selection (e.g.

enabled predictions includes different aspects, like sample size, sample selection (e.g.

relatedness between individuals), type of phenotypes (e.g. complex traits), genetic

architecture of traits under investigation, type of genetic information (e.g. type and number of

markers), quality of measurements, as well as form of representation (e.g. pre-selection of

variables, compression methods) and pre-corrections (e.g. use of linear models to form the

phenotype, quality controls of marker data). Most of the traits studied in cattle breeding have a quantitative genetic background, which means that the observed phenotypes are measured on a continuous scale and the observed genetic variance is caused by many genes. It is commonly assumed that the phenotype of an individual is generated by linear processes, as a summation of the effects of thousands of gene variants, each contributing only a small amount to the phenotypic variance. In the last few years, however, a paradigm shift has become apparent. Following the discovery of the "missing heritability" phenomenon, researchers in animal and plant breeding, as well as in human genetics, have suggested that the phenotypic variance of traits under investigation may be affected by various inter- and intra-genetic as well as epigenetic effects, and may therefore be of a more complex nature than a simple linear combination of genotype effects. A problem is that the underlying rules which create a

phenotype in most cases are not evident. This is especially true in complex traits, since

observations of phenotypes and genotypes are additionally masked e.g. by environmental

noise. Thus, an ANN approach is a good choice in principle, but it may lead to over-fitting when ANNs are not properly specified for the given mapping problem. The findings in the

empirical application (see Chapter 3) using several data sets could not confirm the theoretical

superiority of ANNs over classical linear approaches. In detail, the expectations for the

application capability of ANNs were not met when non-linear ANN models were used for prediction of milk traits in cattle breeding. A mismatch between the assumed and the actual genetic architecture may have been the limiting factor behind the poor performance of complex non-linear ANN models in inferring milk traits from genomic data. The great flexibility of ANNs to map any non-linear continuous function comes along with a high number of parameters to be estimated and a tendency to over-fit linear problems. This is related to the so-called "no free lunch" theorem (Wolpert and Macready, 1997), which affects all machine learning algorithms. The


fundamental statement of this theorem is that no machine learns "for free" just by looking at training samples. With respect to ANNs this means that some network models may have such strong biases that they can only approximate certain kinds of functions. For instance, a powerful non-linear network architecture is biased towards learning complex non-linear functions. This might result in better performance than a simple linear architecture if the problem is truly non-linear, but it can also be a weakness in cases where the underlying relationship is in fact simple. This issue may ultimately explain the poor performance of non-linear ANNs in predicting milk traits from genomic data. Additionally, the findings in this thesis suggest that using ANNs for the prediction of traits for which the relationship between the predictor variables and the response is not known but is expected to be complex, as was the case for ketosis prediction (see Chapter 4), is advisable.

Potential and Limitations of using ANNs for genome-enabled prediction of complex

traits on large-scale genomic data

Using ANNs for genome-enabled prediction in animal breeding is promising in terms of their flexibility and their capability to capture non-linear SNP-marker relationships through data-driven learning. However, they also entail some uncertainty, since sound knowledge about the effect of key factors, including data representation, on the prediction performance of ANNs is still lacking. The attractiveness of ANNs comes from

performance of ANNs and data representation issues. The attractiveness of ANNs comes from

the remarkable information processing characteristics of the biological system such as non-

linearity, high parallelism, robustness, fault and failure tolerance, learning and their capability

to generalize (Jain et al., 1996). As addressed in Chapter 2 and discussed extensively above, the great potential of this method for genome-enabled prediction is that it does not require most of the prior assumptions that commonly underlie parametric statistical models, because ANNs can be non-linear in either the features or the parameters. If nothing is known about the underlying relationships between genotypes and phenotypes, ANNs might be a good choice. If ANN training is properly specified, they may be able to capture complex signals from the data and achieve a better predictive accuracy than a standard linear model used in genome-enabled prediction.

Despite this great potential, the findings of the present thesis indicate two main limitations of using ANNs for genome-enabled prediction: first, their 'black box' behavior; second, their model complexity in conjunction with a tendency to over-fit training data.


'Black box' behavior

By definition, the “black box” nature of artificial neural networks hampers the systematic testing of theories and the communication of results into practical settings. With respect to animal breeding, ANNs can mainly be used for predictions of, e.g., future genetic or phenotypic values, rather than for association studies between, e.g., marker genotypes and quantitative traits. If the researcher is not merely interested in prediction but aims at inferring causal relationships or interactions between genomic covariates and traits, the use of ANNs may not be appropriate. Typically, there is no direct way to interpret the parameters learned by a network,

since they cannot be fully translated into a meaningful set of rules of marker contributions to

quantitative traits. The learned weight distribution mimics the function between input and

output data dependent on a certain network constitution, rather than describing a causal

relationship between marker genotypes and phenotypes. In fact, ANNs do not allow

interpreting the physical process which formed the phenotype. As mentioned in Chapters 2

and 3, the only exception is the use of an ANN with one neuron in the hidden layer and a linear activation function in both the hidden and the output layer, as such a network represents a multiple linear regression. Here, the obtained weights can be interpreted as ordinary regression

coefficients estimated by a linear regression of marker genotypes on the given phenotype. In

this sense, practical implementation of ANN is limited to cases where no biological

interpretation of the parameters is needed, such as breeding value estimation for selection

purposes (see Chapters 3, 4 and 5). For example, no causal interpretation of the effects of different input factors on the risk of developing subclinical ketosis is provided (see Chapter 4). However, ANNs are able to incorporate different sources of information simultaneously

without making assumptions about their relationship a priori, while at the same time

providing reasonable prediction accuracy for yet-to-be-observed phenotypes of this

multifactorial trait (see Chapter 4).
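To make the linear special case mentioned above concrete (a one-line derivation with illustrative symbols): with a single hidden neuron and identity activation functions, the network output reduces to

$$ \hat{y} = w_0 + w_1\Bigl(b + \sum_{j=1}^{p} \beta_j x_j\Bigr) = (w_0 + w_1 b) + \sum_{j=1}^{p} (w_1 \beta_j)\, x_j, $$

i.e., an ordinary multiple linear regression of the phenotype on the $p$ marker genotypes, with intercept $w_0 + w_1 b$ and per-SNP regression coefficients $w_1 \beta_j$.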

Model complexity and over-fitting

The findings in this thesis revealed that, in genome-enabled prediction on large-scale SNP marker data, the high number of parameters in a richly structured ANN may reduce its predictive power compared to that obtained with less parameterized prediction models. The results show that, when the entire set of markers is used for prediction, an ANN with back-propagation fails to provide reasonable predictions of yet-to-be-observed phenotypes. Unfortunately, when the input dimension of the network increases markedly, as is the case with an increasing number of marker genotypes (input values), the number of parameters usually

grows quickly. Therefore, as long as the sample size is much smaller than the number of markers, ANN analyses might be limited to a number of inputs that is far below what high-dimensional genomic data demand. In cases where the number of input variables (e.g. marker genotypes) exceeds the sample size, generalization approaches such as Bayesian methods, or adapted algorithms such as "weight decay" (Kriesel, 2007), are preferable. Alternatively, dimension-reduction or variable-selection methods applied to the genotype data improve the performance of ANNs in genome-enabled prediction. Further, the

employed extreme learning machine (ELM), a learning architecture for ANNs that replaces parameter-intensive supervised learning algorithms (i.e., the back-propagation algorithm), yielded reasonable results (Chapter 5) without losing information through dimension-reduction or variable-selection methods. The ELM is a less parameterized predictive machine, in which the hidden-layer weights of an ANN are not adjusted via learning but are generated randomly. This approach is also able to handle model complexity and over-fitting, and it seems a good way forward when dimension-reduction or variable-selection methods are to be avoided so that the full information in the data can be used.
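The following minimal sketch illustrates this construction (it assumes the Eigen linear algebra library and is not the thesis implementation; the ridge-form solve is a common stand-in for the Moore-Penrose pseudoinverse of the original ELM, and all data and sizes are hypothetical):

// Minimal ELM sketch (assumes the Eigen library; not the thesis implementation):
// hidden-layer weights are drawn once at random and never trained; only the
// output weights are obtained, in closed form.
#include <Eigen/Dense>
#include <iostream>

int main() {
    const int n = 100, p = 1000, L = 50;  // records, SNP inputs, hidden neurons
    const double lambda = 0.1;            // ridge parameter keeping the solve well-posed

    // Toy data standing in for centered SNP codes and phenotypes.
    Eigen::MatrixXd X = Eigen::MatrixXd::Random(n, p);
    Eigen::VectorXd y = X.col(0) - 0.5 * X.col(1) + 0.1 * Eigen::VectorXd::Random(n);

    // Step 1: random input weights W and biases b, fixed once and for all.
    Eigen::MatrixXd W = Eigen::MatrixXd::Random(p, L);
    Eigen::RowVectorXd b = Eigen::RowVectorXd::Random(L);

    // Step 2: hidden-layer output matrix H = tanh(X W + 1 b').
    Eigen::MatrixXd H = (((X * W).rowwise() + b).array().tanh()).matrix();

    // Step 3: output weights by regularized least squares,
    // beta = (H'H + lambda I)^(-1) H'y, solved only once.
    Eigen::VectorXd beta = (H.transpose() * H
                            + lambda * Eigen::MatrixXd::Identity(L, L))
                               .ldlt()
                               .solve(H.transpose() * y);

    // Fitted values; a new animal would reuse the same W, b and beta.
    Eigen::VectorXd yhat = H * beta;
    std::cout << "residual sum of squares: " << (y - yhat).squaredNorm() << std::endl;
    return 0;
}

Note that the only expensive operation is a single L x L solve once H has been formed, which is independent of the number of markers p; this is where the savings over iterative back-propagation come from.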

In summary, ANNs offer great potential in terms of predictive ability, but proper knowledge of network peculiarities is required for a successful implementation in genome-enabled

prediction of complex traits. Model complexity and over-fitting are main reasons why, instead

of using an entire marker dataset, SNP selection, genome-derived relationships or other

dimensionality reduction techniques are used to produce ANNs with a reasonable predictive

ability. Also, prediction performance of ANNs depends strongly on the peculiar

characteristics of the data and the problem to be solved. Genome-enabled prediction has long

been the domain of linear statistics (Zhang et al., 1998). Linear models have the advantage that they can be understood and analyzed in great detail, and they are easy to explain and to implement. However, they may be entirely inappropriate if the underlying relationship

between genotype and phenotype is nonlinear. Hence, ANN analysis cannot replace

conventional statistical methods in genome-enabled prediction but may be applicable in

specific cases such as prediction of traits in which complex relationships between different

information sources are evident and no biological interpretation is needed.


Future directions and prospects for the use of ANNs for genome-enabled prediction

In the future, the analysis of whole-genome sequencing data may provide very useful information for genome-enabled prediction, because such data could allow the whole genetic variance to be captured and hitherto hidden genomic effects to be uncovered. Initiatives to sequence large

panels are under way for humans (The 1000 Genomes Project Consortium) and cattle (The

1000 Bull Genomes Project). However, up to now it has not been possible to incorporate all of the desired data into genome-enabled prediction systems using ANNs. Therefore, the requirement for much larger data samples, as well as for powerful computing systems, has to be met. We are currently in the age of "omics", in which many different kinds of data are generated, and ANNs may thus prove valuable at a time when the sheer amount of available data is revolutionizing animal breeding. Especially for functional traits, the use of ANNs will be favourable, since phenotype recording is costly, observations are masked by noise, and the underlying relationships between the various possible predictor variables are hardly tangible.

Ultimately, understanding complex genetic systems and their impact on phenotypic variance requires research at the interface between different disciplines, such as computer science, statistics, bioinformatics, animal breeding and genetics. The development of new technologies and the availability of powerful computers have created a promising platform. Hence, it is necessary that the different academic fields increasingly work together to create more effective solutions for enhancing the performance of genome-enabled prediction.


References

Breiman, L. 2001. Random Forests. Machine Learning 45:5-32.

Croiseau, P., A. S. Legarra, F. O. Guillaume, S. B. Fritz, A. L. Baur, C. Colombani, C. L.

Robert-Granie, D. Boichard, and V. Ducrocq. 2011. Fine tuning genomic evaluations

in dairy cattle through SNP pre-selection with the Elastic-Net algorithm. Genetics

Research 93:409-417.

de los Campos, G., D. Gianola, and G. J. M. Rosa. 2009. Reproducing kernel Hilbert spaces

regression: A general framework for genetic evaluation. Journal of Animal Science

87:1883-1887.

de los Campos, G., J. M. Hickey, R. Pong-Wong, H. D. Daetwyler, and M. P. L. Calus. 2013.

Whole-genome regression and prediction methods applied to plant and animal

breeding. Genetics 193:327-345.

Gianola, D., R. L. Fernando, and A. Stella. 2006. Genomic-assisted prediction of genetic

value with semiparametric procedures. Genetics 173:1761-1776.

Gianola, D., H. Okut, K. Weigel, and G. Rosa. 2011. Predicting complex quantitative traits

with Bayesian neural networks: a case study with Jersey cows and wheat. BMC

Genetics 12:87.

Gonzalez-Recio, O. and S. Forni. 2011. Genome-wide prediction of discrete traits using

bayesian regressions and machine learning. Genetics Selection Evolution 43:1-12.

Habier, D., R. Fernando, K. Kizilkaya, and D. Garrick. 2011. Extension of the bayesian

alphabet for genomic selection. BMC Bioinformatics 12:186.

Hornik, K., M. Stinchcombe, and H. White. 1989. Multilayer feedforward networks are

universal approximators. Neural Networks 2:359-366.

Jain, A. K., M. Jianchang, and K. M. Mohiuddin. 1996. Artificial neural networks: a tutorial.

Computer 29:31-44.

Kriesel, D. 2007. A brief introduction to neural networks. Available at

http://www.dkriesel.com

Legarra, A. S., C. L. Robert-Granie, P. Croiseau, F. O. Guillaume, and S. B. Fritz. 2011.

Improved Lasso for genomic selection. Genetics Research 93:77-87.

Meuwissen, T. H. E., B. J. Hayes, and M. E. Goddard. 2001. Prediction of total genetic value

using genome-wide dense marker maps. Genetics 157:1819-1829.


Okut, H., D. Gianola, G. J. M. Rosa, and K. A. Weigel. 2011. Prediction of body mass index

in mice using dense molecular markers and a regularized neural network. Genetics

Research 93:189-201.

Park, T. and G. Casella. 2008. The Bayesian Lasso. Journal of the American Statistical

Association 103:681-686.

Piepho, H. P. 2009. Ridge regression and extensions for genomewide selection in maize. Crop Science 49:1165-1176.

Solberg, T., A. Sonesson, J. Woolliams, and T. Meuwissen. 2009. Reducing dimensionality

for prediction of genome-wide breeding values. Genetics Selection Evolution 41:29.

Usai, M. G., M. E. Goddard, and B. J. Hayes. 2009. LASSO with cross-validation for

genomic selection. Genetics Research 91:427-436.

VanRaden, P. M. 2008. Efficient methods to compute genomic predictions. Journal of Dairy Science 91:4414-4423.

Vapnik, V. N. 1999. An overview of statistical learning theory. IEEE Transactions on Neural Networks 10:988-999.

Wolpert, D. H. and W. G. Macready. 1997. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation 1:67-82.

Zhang, G., B. Eddy Patuwo, and M. Y. Hu. 1998. Forecasting with artificial neural networks:

The state of the art. International Journal of Forecasting 14:35-62.


CHAPTER 7


General Summary

Genome-enabled predictions have become an important tool in many animal and plant

breeding programs. However, the application to predictions of complex traits based on

thousands of molecular markers faces several statistical and computational challenges.

Enhancing prediction performance and accuracy is demanding, because models have to cope

with the curse of dimensionality, co-linearity between markers and the complexity of

quantitative traits. The aim of the present thesis was to investigate a suitable, well-performing and broadly applicable method that accounts for the various relationships between markers in high-dimensional data for genome-enabled prediction of complex traits. A state-of-the-art survey

and applications of artificial neural networks (ANNs) to real problems of genome-enabled

prediction in animal breeding are presented. ANNs are a part of a wide range of machine

learning methods that are characterized by their general learning ability, especially for non-

linear problems, their generalization capability and noise tolerance.

The focus of Chapters 1 and 2 was to provide an overview of artificial neural network methodology and of recent applications in animal and plant breeding. A comprehensive literature review demonstrates the great potential of ANNs and points to possible problems in their utilization. Several network types and training algorithms were analyzed, and major preconditions that have to be met for proper application were identified. Aspects such as data pre-processing, the choice of a proper network architecture and the setting of the parameter values of the training algorithm were examined carefully. Furthermore, recent studies

employing genome-enabled prediction via ANNs indicate that the superiority of properly

specified ANNs over other prediction machines especially depends on the species, target trait,

environmental factors, sample size, and on the genetic relatedness between individuals in the

training and testing sets.

In Chapter 3, the theoretical considerations from the previous section were implemented in real data applications, aiming to gather empirical evidence for the expected suitability of ANNs for genome-enabled prediction. An important aspect when using ANNs for genome-enabled prediction on large-scale data is the computational cost, which has to be kept feasible. Programs were written and implemented in C++ to keep the computational effort manageable. A

regularized back-propagation algorithm was used to train a multilayer feed-forward network.

In order to assess the major influencing factors on the predictive performance of yet-to-be-

observed phenotypes of three milk traits (milk yield, fat and protein percentage), three dairy

cattle data sets were analyzed. Inputs to the network were data on 3,341 German Fleckvieh

bulls, 2,303 Holstein-Friesian bulls, and 777 Holstein-Friesian dams. All animals were


genotyped with a 50k SNP panel. Different non-linear network architectures as well as the

impact of data structure were evaluated. A linear network model was used as a benchmark.

Results showed that predictive abilities of different ANN models varied markedly, whereas

differences between data sets were small. Dimension reduction methods enhanced prediction

performance and consistency in all data sets. The learning algorithm implemented as the training rule for the ANN, in conjunction with dimension-reduction methods, kept computational cost as

low as possible, while maintaining a good predictive performance. However, the large number of parameters in a richly structured ANN seems to limit its predictive power if the full dimension of the SNP data is used as input to the network. Moreover, when the number of markers vastly exceeds the sample size (the p>>n problem), the results confirm the robustness of linear methods in genome-enabled prediction. ANNs may prove their superiority in traits for which

evidence of non-linearity is present, and where less parameterized linear methods fail. Hence,

ANNs may be more useful for predictions of functional traits than for milk traits, which seem

to be additively inherited and can be well predicted with linear methods.

This issue is addressed in Chapter 4. An ANN approach was used for predicting the risk of developing ketosis, a trait for which a non-linear relationship between predictors and the response variable is plausible. Building reliable predictive models of functional traits is in general

challenging, because such traits are commonly affected by multiple factors. In this study,

several combinations of metabolic, genetic and milk performance records were tested

simultaneously with respect to their influence on prediction performance. Data were collected

from 218 dairy cows during the first five Weeks of lactation. Predictions were computed

within and across Weeks. All animals were genotyped with a 50k SNP-panel. Metabolic

information includes two milk metabolites (GPC and PC) and milk yield, fat and protein

percentage, as milk performance records, were available. The betahydroxybutyric acid

(BHBA) concentration in milk samples was used as target variable in all prediction models,

since it reflects the risk of suffering from subclinical ketosis. Average correlation between

observed and predicted target values of up to 0.643 could be obtained, if combinations of

metabolic information with test-day records were used for prediction within Weeks. The

predictive performance of metabolic as well as of milk performance based models was higher

than of models based on genetic information. In summary, performance of ANNs is

encouraging for traits, which are influenced by several factors.

The aim of Chapter 5 was to find an adequate method that maintains the important advantages of ANNs without losing too much of the full information through, for example, the otherwise required variable pre-selection methods (as shown in Chapter 3). The extreme learning machine (ELM), a modified ANN approach, was implemented for genome-enabled prediction using the entire marker set of large-scale data. It is a fast learning architecture for ANNs, which replaces the parameter-intensive iterative learning. In this Chapter, important properties

of the ELM method are presented, and its relationship to ANNs and to other machine learning methods, such as support vector machines (SVM), was evaluated. The ELM, analogous to SVM or ANN, is able to capture nonlinearities, while its great advantage lies in keeping the computational costs low. Experimental results show that the ELM

approach is able to achieve good generalization performance with much less computational

effort. With a low number of neurons, the ELM matches the results of the ANN in predicting

milk traits from genomic information. Increasing the number of hidden neurons leads to promising results, as shown for the trait protein yield. In summary, the results indicated that

the ELM architecture is able to outperform the traditional gradient-based learning in ANNs

when analyzing the entire information of large-scale genomic data.


CHAPTER 8


Zusammenfassung

Genome-enabled predictions have become an important tool in animal and plant breeding programs and are nowadays partly used routinely to select the parents of the next generation. The corresponding prediction models should be able to cope as well as possible with the large volume of genomic data, the dependencies between markers and the complexity of quantitative traits. To improve the accuracy of genome-enabled prediction, the search for methods that adequately meet these requirements is therefore of particular interest.

The aim of the present thesis was to find a suitable and broadly applicable method for genome-enabled prediction in animal and plant breeding that takes complex genetic relationships into account, and to examine its practical applicability. Artificial neural networks (ANNs) represent a promising method for the prediction of complex traits, since they are able to accommodate a multitude of different gene interactions. ANNs belong to the machine learning methods and are characterized by a universal learning ability, their generalization capability and their tolerance of noise in the data.

The first and second chapters of this thesis provide a comprehensive literature review of the method and its applications in plant and animal breeding. Different network types and topologies, as well as learning algorithms and data requirements, were analyzed and evaluated with respect to their use for genome-enabled prediction in animal breeding. The focus was placed on so-called multilayer networks and on radial basis function networks, since these appear particularly promising for use in breeding value estimation. When predicting unknown phenotypes, aspects such as data pre-processing, the tuning of the parameters of the learning algorithm and the optimal choice of the network topology play a major role in the successful application of ANNs. Previously published studies show that the superiority of ANNs over other methods depends strongly on the particular data structure, the species, the target trait, the environmental factors, the sample size, and the genetic relatedness between the individuals in the training and test data sets.

In the third chapter, the findings from the literature review were put into practice. For this purpose, a dedicated program was written in C++, which also made the analysis of larger data sets possible. A regularized back-propagation algorithm was used as the learning rule for a multilayer ANN. In order to capture the main factors influencing the predictive performance for three milk traits (milk yield, fat percentage, protein percentage), three different data sets were analyzed. Data on 3,341 Fleckvieh bulls, 2,303 Holstein-Friesian bulls and 777 Holstein-Friesian cows were available. All animals were genotyped with a 50k SNP array, and after quality control 39,344, 41,995 and 41,718 SNP markers, respectively, were used for prediction. In the analyses, different non-linear network topologies and different input data structures were examined with respect to predictive performance. The results showed that dimension-reducing measures led to considerably more accurate and more consistent predictions for the traits investigated, and that using the entire genetic information reduced predictive power. With considerably more markers than animals in the analysis, the robustness of linear methods was confirmed. Moreover, the results indicated that the potential of ANNs comes into play especially in the prediction of traits in which non-additive gene effects also play a role (e.g. functional traits). The method is therefore to be preferred above all when genome-enabled prediction with linear, less parameterized methods fails.

Building on these results, the fourth chapter deals with the prediction of ketosis risk in dairy cows, since this risk is multifactorial and prediction with linear methods is only possible to a limited extent. In the analysis, an ANN approach was used to test several independent variables and variable combinations simultaneously for their suitability to predict the trait ketosis risk. Among others, different metabolite concentrations in milk, SNP genotypes and milk performance traits were evaluated with respect to their influence on prediction accuracy. In all prediction models, the beta-hydroxybutyric acid (BHBA) concentration measured in milk samples was used as a biomarker for the risk of a cow developing subclinical ketosis. Average correlations between observed and predicted trait values of up to 0.643 were achieved when combinations of metabolic information and milk performance data were used for prediction. Metabolic as well as milk performance based models achieved a higher prediction accuracy than models that used genetic information. The use of ANNs for the prediction of lowly heritable, complex traits (such as ketosis) is thus promising.

Referring back to the results of the third chapter, the fifth chapter examined the suitability of a so-called extreme learning machine (ELM) for genome-enabled prediction of milk traits. The ELM is a fast learning architecture for ANNs that replaces the very parameter-intensive learning and can thus, in contrast to an iterative learning algorithm, guarantee the use of the entire marker information. Important characteristics of the ELM method were presented, and its relationship to ANNs and to other machine learning methods, such as support vector machines (SVM), was worked out. The ELM, in the same way as SVM or ANN, is able to capture nonlinearities in the data. Experimental results showed that the ELM approach is able to achieve good predictive performance while keeping the computational effort low. With a low number of neurons, the results of the ELM method matched those of the ANN approach with respect to the predictive power for the milk traits, while the entire SNP genotype information was used. Increasing the number of hidden neurons led to promising results for the trait protein yield. The results showed that the ELM is able to replace traditional gradient-based learning in ANNs in the analysis of large-scale genomic data.


Acknowledgements

At this point I would like to thank all those who have contributed in various ways to my

research.

First of all I want to thank my supervisors Prof. Georg Thaller and Prof. Daniel Gianola for

their constant support, their never-ending encouragement and their belief in me. Thank you

for always being available for my questions, even in busy times, and for giving me advice

whenever I requested it. I am grateful for the opportunity to spend four months with

the wonderful animal science group at the University of Wisconsin-Madison to work on my

first paper.

I warmly thank David Hochstuhl for drawing my attention to artificial neural networks and

his steady support in all questions of computations and related fields. Thank you for sharing

the ANN program with me and for all the encouraging discussions.

Thanks to Llibertat Tusell for making my stay in Madison as comfortable as possible and for your steady cross-border support far beyond Madison. Thank you for sharing your scientific

creativity and for your continuous interest in my first manuscript drafts.

I am also indebted to my coauthors and colleagues from my working group for joint work on

my papers and all the lively discussions. Above all, I would like to mention Nina

Krattenmacher, Jens Tetens, Dirk Hinrichs, Jennifer Salau and especially Claas Heuer, being

my number one contact person for all questions related to animal breeding and genetics.

My research was made possible through the financial support I received from the German

Federal Ministry of Education and Research (BMBF) within the AgroClustEr “Synbreed –

Synergistic plant and animal breeding”.

I also want to thank my fellow PhD students for regular lunch times, for quality times on

courses and conferences, for frequent cheering up, for their patient companionship and for

their friendship. My special thanks go to Marvin Gertz, Lukas Roos, Christoph Scheel, Karo

Reckmann, Kathrin Büttner and Danica Sindt.

Finally, I would like to thank my family, who always encouraged me to pursue my aims and

supported me wherever they could. Last but not least this PhD would not have been possible

without Anna Eigen and Anja Feddern. Thank you for your deep friendship and your

assistance in all situations of life. Ultimately, I want to thank Achim Seidel for his love and

his never-ending patience and support during the final phase of the PhD.


Curriculum Vitae

Anita Ehret

Born, December 26th, 1985 in Hannover, Germany

Education

11/2010 - 07/2014  PhD student at the Institute of Animal Breeding and Husbandry, Christian-Albrechts-University Kiel

10/2010  MSc Agricultural Sciences. Master thesis: Ehret, Anita (2010) Analyse von Defensintranskripten im Knochenmark des Pferdes.

01/2009  BSc Agricultural Sciences. Bachelor thesis: Ehret, Anita (2009) Datenbank-basierte Kandidatengensuche für equine Osteochondrose.

10/2005 - 10/2010  Study of Agricultural Sciences at Christian-Albrechts-University Kiel

2005 Higher education entrance qualification at Kronwerk Gymnasium, Rendsburg

Qualifications

08/2012 - 12/2012  Research stay at the University of Wisconsin-Madison, Madison, USA

10/2011 Course: Statistical Learning Methods For DNA-based Prediction Of Complex Traits, Gianola

03/2011 Course: Programming in Animal Breeding, Misztal