Robust ridge regression for highŒdimensional data · 2010. 12. 26. · with the Supplemental...
Transcript of Robust ridge regression for highŒdimensional data · 2010. 12. 26. · with the Supplemental...
Robust ridge regression for high�dimensional
data
Ricardo A. Maronna�
University of La Plata and C.I.C.P.B.A.
Abstract
Ridge regression, being based on the minimization of a quadratic loss
function, is sensitive to outliers. Current proposals for robust ridge re-
gression estimators are sensitive to �bad leverage observations�, cannot
be employed when the number of predictors p is larger than the number
of observations n; and have a low robustness when the ratio p=n is large.
In this paper a ridge regression estimate based on Yohai�s (1987) repeated
M estimation (�MM estimation�) is proposed. It is a penalized regression
MM estimator, in which the quadratic loss is replaced by an average of
� (ri=b�), where ri are the residuals and b� the residual scale from an initial
estimator, which is a penalized S estimator; and � is a bounded function.
The MM estimator can be computed for p > n and is robust for large p=n:
A fast algorithm is proposed. The advantages of the proposed approach
over its competitors are demonstrated through both simulated and real
data.
Key words and phrases: MM estimate; Shrinking; S estimate.
Note: Supplemental materials are available on line.
�Ricardo Maronna (E-mail: [email protected]) is professor, Mathematics Department,University of La Plata, C.C. 172, La Plata 1900, Argentina; and researcher at C.I.C.P.B.A.
1
1 Introduction
Ridge regression, �rst proposed by Hoerl and Kennard (1970), is a useful tool for
improving prediction in regression situations with highly correlated predictors.
For a regression data set (X;y) withX 2 Rn�p and y 2 Rn; the ridge estimator
can be de�ned as
b� = �b�0; b�01�0 = argmin fL (X;y;�) : �0 2 R; �1 2 Rpg (1)
where the loss function L is
L (X;y;�) = kr (�)k2 + � k�1k2 (2)
and
r = (r1; :::; rn)0= y� b�01n �X b�1 (3)
is the residual vector, with 1n = (1; 1; :::; 1)0 where in general A0 denotes the
transpose of A: Here � is a penalty parameter that must be estimated in order
to optimize the prediction accuracy.
Despite more modern approaches such as the Lasso (Tibshirani 1996), Least
Angle Regression (LARS) (Efron et al 2004) and boosting (Bühlmann 2006),
ridge regression (henceforth RR) continues to be useful in many situations, in
particular in chemometrics. It is particularly useful when it can be assumed that
all coe¢ cients have approximately the same order of magnitude (unlike sparse
situations when most coe¢ cients are assumed to be null). Zou and Hastie (2005)
state that �For usual n > p situations, if there are high correlations between
predictors, it has been empirically observed that the prediction performance of
the lasso is dominated by ridge regression�. Frank and Friedman (1993) compare
the performances of several estimators commonly used in chemometrics, such as
2
partial least squares (PLS) and principal components regression (PCR). They
conclude (p. 110) that �In all of the situations studied, RR dominated the other
methods, closely followed by PLS and PCR, in that order�.
Since RR is based on least squares, it is sensitive to atypical observations.
Lawrence and Marsh (1984) proposed a model for predicting fatalities in U.S.
coal mines, based on a data set with n = 64 and p = 6 a¤ected by both atypical
observations and high collinearity. They demonstrate that a robust version of
RR yields results superior to those from ordinary Least Squares (OLS) and from
classical RR, in particular in their compatibility with theoretical expectations.
From today�s perspective however, their approach to robusti�cation could be
greatly improved.
Several other approaches have been proposed to make RR robust towards
outliers, the most recent and interesting ones being the ones by Silvapulle (1991)
and Simpson and Montgomery (1996). These estimators however present two
drawbacks, which will be analyzed in Section 4. The �rst one is that they are
not su¢ ciently robust towards �bad leverage points�especially for large p: The
second is that they require an initial (standard) robust regression estimator.
When p > n this becomes impossible; and when p < n but the ratio p=n is
�large�the initial estimator has a low robustness, as measured by the breakdown
point.
The situation of prediction with p >> n has become quite usual, especially
in chemometrics. One instance of such a situation appears when attempting to
replace the laboratory determination of the amount of a given chemical com-
pound in a sample of material, with methods based on cheaper and faster spec-
trographic measurements. Section 7 of this article deals with one archaeological
instance of such a situation, where the material is glass from ancient vessels,
with n = 180 and p = 486. The analysis indicates that the data contain about
3
7% atypical observations, which a¤ect the predictive power of classical RR, while
the robust method proposed in this article has a clearly lower prediction error
than classical RR, PLS, and a robust version of PLS.
A further area that stimulates the development of robust RR for high�
dimensional data is that of functional regression. Some approaches for functional
regression, such as the one by Crambes et al, (2009) boil down to an expression
of the form (2) with k�1k2 replaced by �01A�1; where A is a positive�de�nite
matrix. i.e., a problem that is similar to RR, generally with p >> n. Therefore,
a reliable robust RR estimator for p >> n is a good basis to obtain a reliable
robust functional regression estimator. This approach has been employed by
Maronna and Yohai (2009) to propose a robust estimator for functional regres-
sion based on splines.
The purpose of this article it to develop ridge estimators that can be used
when p > n; are robust when p=n is large, and are resistant to bad leverage
points. In Section 2 we describe an estimator based on Yohai�s (1987) approach
of MM estimation, to allow controlling both robustness and e¢ ciency. The
MM estimator requires a robust (but possibly ine¢ cient) initial estimator. In
Section 3 we describe an estimator based on the minimization of a penalized
robust residual scale (�S estimator�) to be used as a starting point. Sections
2.4 and 2.5 deal with the choice of �:
In Section 4 we recall the estimators proposed by Silvapulle and by Simpson
and Montgomery, describe their drawbacks, and propose simple modi�cations
of them to improve their robustness. Section 5 describes a simulation study
for the case n > p in which the proposed estimator is seen to outperform its
competitors, and the modi�ed Silvapulle estimator performs well for not too
extreme situations. Section 6 describes a simulation for p > n in which the
MM estimator is seen to outperform a robust version of Partial Least Squares.
4
Section 7 deals with the application of the proposed estimator to a real high-
dimensional data set with p >> n. Section 8 compares the computing times
of the di¤erent estimators, Section 9 describes the di¢ culties in estimating the
estimator�s variability, and �nally Section 10 summarizes the conclusions. An
Appendix containing proofs and details of de�nitions and of calculations is given
with the Supplemental Material.
2 MM estimators for ridge regression
To ensure both robustness and e¢ ciency under the normal model we employ
the approach of MM estimation (Yohai 1987). Start with an initial robust
but possibly ine¢ cient estimator b�ini; from the respective residuals compute
a robust scale estimator b�ini: Then compute an M estimator with �xed scale
b�ini; starting the iterations from b�ini; and using a loss function that ensures thedesired e¢ ciency. Here �e¢ ciency�will be loosely de�ned as �similarity with
the classical RR estimator at the normal model�.
Recall �rst that a scale M estimator (an M�scale for short) of the data vector
r = (r1; :::; rn)0 is de�ned as the solution b� = b� (r) of
1
n
nXi=1
��ri�
�= �; (4)
where � is a bounded ��function in the sense of Maronna et al. (2006) and
� 2 (0; 0:5) determines the breakdown point of b�: Recall that the breakdownpoint (BDP) of an estimator is the maximum proportion of observations that
can be arbitrarily altered with the estimator remaining bounded away from the
border of the parameter set (which can be at in�nity). Note that when � (t) = t2
and � = 1 we have the classical mean squared error (MSE): b�2 = aveifr2i g: We
5
shall henceforth employ for � the bisquare ��function
�bis (t) = minf1; 1��1� t2
�3g: (5)
Let b�ini be an initial estimator. Let r = r�b�ini� and let b�ini be an M�scale
of r :1
n
nXi=1
�0
�rib�ini�= �; (6)
where �0 is a bounded ��function and � is to be chosen. Then the MM estimator
for RR (henceforth RR-MM) is de�ned by (1) with
L (X;y;�) = b�2ini nXi=1
�
�ri (�)b�ini
�+ � k�1k2 ; (7)
where � is another bounded ��function such that � � �0: The factor b�2ini beforethe summation is employed to make the estimator coincide with the classical
one when � (t) = t2:
We shall henceforth use
�0 (t) = �bis
�t
c0
�; � (t) = �bis
�t
c
�(8)
where the constants c0 < c are chosen to control both robustness and e¢ ciency,
as described later in Section 2.3.
The classical estimator (2) will henceforth be denoted by RR�LS for brevity.
Remark: Centering and scaling of the columns of X are two important
issues in ridge regression. Since our estimator contains the intercept term b�0;centering would a¤ect only b�0 and not the vector of slopes b�1; and is thereforenot necessary. However, in the algorithm for RR�MM the data are previously
centered by means of the bisquare location estimator, but this is done only for
the convenience of the numerical procedure. As to scaling, we consider it to be
6
the user�s choice rather than an intrinsic part of the de�nition of the estimator.
In our algorithm, the columns of X are scaled by means of a scale M estimator.
2.1 The �weighted normal equations� and the iterative
algorithm
It is known that the classical estimator RR�LS satis�es the �normal equations�
b�0 = y � x0 b�1; (X 0X + �Ip) b�1 =X 0�y� b�01n� ; (9)
where Ip is the identity matrix and x and y are the averages of X and y
respectively. A similar system of equations is satis�ed by RR�MM. De�ne
(t) = �0 (t) ; W (t) = (t)
t: (10)
Let
ti =rib�ini ; wi = W (ti)
2; w = (w1; :::; wn)
0; W = diag (w) : (11)
Setting the derivatives of (7) with respect to � to zero yields for RR-MM
w0�y� b�01n �X b�1� = 0 (12)
and
(X 0WX + �Ip) b�1 =X 0W�y� b�01n� : (13)
Therefore RR�MM satis�es a weighted version of (9). Since for the chosen
�; W (t) is a decreasing function of jtj; observations with larger residuals will
receive lower weights wi.
As is usual in robust statistics, these �weighted normal equations�suggest an
iterative procedure. Starting with an initial b� : Compute the residual vector r7
and the weights w: Leaving b�0 and w �xed, compute b�1 from (13). Recompute
w, and compute b�0 from (12). Repeat until the change in the residuals is small
enough. Recall that b�ini remains �xed throughout.This procedure may be called �Iterative reweighted RR�. It is shown in the
Supplemental Material that the objective function (7) descends at each iteration.
The initial estimator, which is a penalized S estimator, is described in Section
3.
2.2 Equivalent degrees of freedom
Let w be de�ned by (10)-(11) and
ex = w0X
w01; X =X � 1ex: (14)
Then it follows that the �tted values by =X b�1 + b�01n verifyby = �H +
1w0
10w
�y (15)
where the �hat matrix�H is given by
H =X�X 0WX + �Ip
��1X 0W: (16)
Then the equivalent degrees of freedom (edf) (Hastie et al. 2001) are de�ned by
edf = tr
�H +
1w0
10w
�= bpMM + 1 (17)
with
bpMM = tr (H) ; (18)
8
where tr (:) denotes the trace. When � = 0 we have tr (H) = rank (X) : We
may thus call bpMM the �equivalent number of predictors�.
The edf may be considered as a measure of the degree of �shrinking�of the
estimator.
2.3 Choosing the constants
We now deal with the choice of � in (6) and of c0 and c in (8). For the case � = 0;
i.e., the standard M�estimation of a regression with q = p + 1 parameters, the
natural choice is � = 0:5 (1� q=n) (henceforth a standard estimator will refer
to one with � = 0):
This choice of � is �natural�in that it maximizes the BDP of b�. It can alsobe justi�ed by the fact that the scale can be written as
1
n� q
nXi=1
��ri�
�= 0:5;
and the denominator n � q is the same as the one used for the classical s2 in
OLS.
For � > 0; the natural extension is
� = 0:5
�1� bq
n
�(19)
where bq is a measure of the �actual number of parameters�, which depends on �:We would use bq = edf de�ned in (17), but bpMM cannot be known a priori. For
this reason we shall employ a measure of edf derived from the initial estimator
in a manner analogous to (17), as described in Section 3.
Since � ceases to be �xed at the value 0.5, c0 must now depend on � in order
to ensure consistency of b� for normal data when � = 0. We therefore employ
9
c0 (�) de�ned by
E�
�z
c0 (�)
�= �; z � N(0; 1)
which can be computed numerically.
As to c; it is chosen to attain the desired e¢ ciency of the standard estimator
at the normal model. If p is �xed and n ! 1; the value c = 3:44 yields 0.85
asymptotic e¢ ciency at the normal model when � = 0:
However, for high�dimensional data both b�ini and c require a correction.
Maronna and Yohai (2010) point out two problems that appear when �tting a
standard MM estimator to data with a high ratio p=n : (1) The scale based on
the residuals from the initial regression estimator underestimates the true error
scale, and (2) Even if the scale is correctly estimated, the actual e¢ ciency of
the MM estimator can be much lower than the nominal one.
For this reason, b�ini is (upwards) corrected using formula (9) of their paper,namely as
e�ini = b�ini1� (k1 + k2=n)bq=n with k1 = 1:29; k2 = �6:02:
As to c; it must be also increased: we put c = 4 when bq=n > 0:1:2.4 Choosing � through cross�validation
In order to choose � we must estimate the prediction error of RR�MM for di¤er-
ent values of �: Call CV(K) the K-fold cross validation process, which requires
recomputing the estimate K times. For K = n (�leave�one�out�) we can use an
approximation to avoid recomputing. Call by�i the �t of yi computed withoutusing the i�th observation; i.e., y�i = b�(�i)0 + x0i
b�(�i)1 ; where�b�(�i)0 ; b�(�i)01
�is the RR�MM estimate computed without observation i: Then a �rst�order
10
Taylor approximation of the estimator yields the approximate prediction errors
r�i = yi � by�i � ri
�1 +
W (ti)hi1� hi 0 (ti)
�(20)
with
hi = x0i
nXi=1
0 (ti)xix0i+ 2�I
!�1xi
where W; and ti are de�ned in (10)-(11) and xi is the i-th row of X de�ned
in (14). This approximation does not take the initial estimator into account.
Henceforth CV(n) will refer to the above approximation.
Given the prediction errors r� = (r�1; :::; r�n)0, we compute a robust mean
squared error (MSE) as b� (r�)2 ; where b� is a ���scale�with constant c� = 5
(Yohai and Zamar 1988); and choose the � minimizing this MSE. Initially an
M�scale had been employed, but the resulting �s were somewhat erratic, and
this led to the use of the more e¢ cient ��scale. Note that the choice of c� must
strike a balance between robustness and e¢ ciency; the value 5 was chosen on
the basis of exploratory simulations.
It turns out that b� (r�), as a function of �; may have several local minima(as will be seen in Figure 4), which precludes the use of standard minimization
algorithms to choose the �best��: For this reason we prefer to compute b� (r�)for a selected set of candidate �s in a range [�min; �max]: The de�nition of this
set is given in Section 2.5.
2.5 The set of candidate �s and the edf
It is intuitively clear that the robustness of RR�MM depends on that of the
initial estimator, which, as we shall see, depends in turn on the edf, which are a
function of �. For this reason we shall need an a priori approximation to the
edf corresponding to each �.
11
For the classical estimator RR-LS it is known that the edf are bp (�)+1 withbp (�) = pX
j=1
j j + �=n
(21)
where j are the eigenvalues of C; the sample covariance matrix of X: We use
a robust version of (21) replacing those eigenvalues by the Spherical Principal
Components (SPC) of X (Locantore et al. 1999):
bpSPC (�) = pXj=1
j j + �=n
(22)
where j are the �pseudo eigenvalues�; the details are described in the Supple-
mental Material. Note that bpSPC (�) is a decreasing function. The choice ofSPC was based on the fact that it is very fast, can be computed for p >> n;
and has some interesting theoretical properties (see the Discussion of the article
by Locantore et al. (1999)).
Call "� the BDP of b�ini: It is not di¢ cult to show that if � > 0; the BDP
of RR�MM is � "�: However, since the estimator is not regression�equivariant,
this result is deceptive. As an extreme case, take p = n � 1 and "� = 0:5:
Then the BDP of b� is null for � = 0; but is 0.5 for � = 10�6! Actually, for
a given contamination rate less than 0.5, the maximum bias of the estimator
remains bounded, but the bound tends to in�nity when � tends to zero. This fact
suggests keeping the candidate �s bounded away from 0 when p=n is large. This
idea is more easily expressed through the edf. As will be seen in Section 3, to
control the robustness of the initial estimator it is convenient that the �e¤ective
dimension�be bounded away from n; more precisely, that bpSPC (�) + 1 � 0:5n:We therefore choose [�min; �max] such that
�min = inff� � 0 : bpSPC (�) + 1 � 0:5ng and bpSPC (�max) = 1; (23)
12
so that �min = 0 if p < (n � 1)=2: We take N� values of � between �min and
�max; with N� depending on p; so that the N� values of bpSPC (�) are equispaced.As a compromise between precision and computing time, we take
N� = minfp; n; 20g: (24)
A technical detail appears. The set of candidate �s is chosen so that RR�LS
has a given degree of shrinking. Note however that even if there is no conta-
mination, RR�LS and RR�MM with the same � need not yield approximately
the same estimators. For a given �; we want to �nd �0 such that RR�MM with
penalty �0 yields approximately the same degree of shrinking as RR�LS with
penalty �: There is a simple solution. Note that limk!1k2
3 �bis (r=k) = r2; and
therefore RR�LS with a given � is approximately �matched�to RR-MM with
penalty �0 = 3�=c2 with c in (8).
3 The initial estimator for RR�MM: S estima-
tors for RR
As in the standard case, the initial estimator for RR�MM will be the one based
on the minimization of a (penalized) robust scale (an S estimator). We de�ne an
S estimator for RR (henceforth RR�SE) by replacing in (2) the sum of squared
residuals by a squared robust M�scale b� (4), namelyL (X;y;�) = nb� (r (�))2 + � k�1k2 : (25)
Here the factor n is employed to make (25) coincide with (2) when b�2 (r) =avei
�r2i�; corresponding to � (t) = t2 and � = 1 in (4).
A straightforward calculation shows that the estimator satis�es the �normal
13
equations�(12)-(13) with wi =W (ti) and � replaced by �0 = �n�1Pn
i=1 wit2i :
These equations suggest an iterative algorithm similar to the one at the end
of Section 2.1, with the di¤erence that now the scale changes at each iteration,
i.e. now in the de�nition of ti in (11), b�ini is replaced by b� �r �b���. Althoughit has not been possible to prove that the objective function descends at each
iteration, this fact has been veri�ed in all cases up to now.
Expressing the estimator as a weighted linear combination yields an expres-
sion for the equivalent number of parameters bpSE as in (17).As with RR�MM, it is not di¢ cult to show that if � > 0 the BDP of RR�SEb� coincides with that of b�: Recall that this BDP equals min (�; 1� �) with � in
(4). In the case of standard regression, recall that for X in general position, the
BDP of a regression S estimator with intercept ismin (�; 1� (p+ 1) =n� �) ; and
therefore its maximum (which coincides with the maximum BDP of regression-
equivariant estimators) is attained when � = 0:5 (1� (p+ 1) =n) (see Maronna
et al. 2006). For � > 0; the above facts suggest using
� = 0:5
�1� bp+ 1
n
�; (26)
where bp is a measure of the �actual number of parameters�, which will depend on�: It would be ideal to use bp = bpSE; but this value is unknown before computingthe estimator; for this reason we use bp = bpSPC: Once b�SE computed, we compute� for RR�MM using (19) with bq = bpSE+1, the reason being that bpSE is expectedto give a more accurate measure of the edf than the tentative bpSPC:It follows from (26) that if we want � � 0:25 we get the value of �min in (23).
The technical detail of matching � described at the end of Section 2.5 occurs
also with RR�SE, namely, the same value of � yields di¤erent degrees of shrink-
ing for RR�LS and RR�SE. Since its solution is more involved, it is deferred to
the Supplemental Material.
14
4 Other robust RR estimators for n > p
4.1 The Silvapulle estimator
Silvapulle´s (1991) estimator is based on the following fact. Let X be centered
to zero mean. Call b�LS = (b�0; b�01)0 the ordinary LS (OLS) estimator. Thenthe RR�LS estimator
�b�0; b�01�0 ful�llsb�0 = b�0; b�1 = Z (�) b�1; (27)
where
Z (�) = (X 0X + �I)�1X 0X: (28)
Let now b�M = (b�M;0; b�0M:1) be a standard regression M estimator and b� arobust residual scale. Then Silvapulle proposes to de�ne a RR estimator by
(27)�(28) with b�M instead of b�LS; and to choose � in the same way as Hoerl etal. (1975), namely b� = pb�2
kb�M;1k2 : (29)
Silvapulle uses a simultaneous estimator of � and � of �Huber�s Proposal II�
type (Huber 1964), which has the following drawbacks: (1) The M estimator has
a monotonic = �0, which makes it non�robust towards high leverage outliers.
(2) The scale has a low BDP. (3) The matrix Z in (28) is also sensitive to
leverage points and (4) The centering of X may also be in�uenced by leverage
points.
For this reason the following modi�cations are proposed. Call b� a robustlocation vector of X obtained by applying a robust location estimator to each
column (we use the bisquare M estimator). Call X= X � 1nb�0 the centeredX matrix. Let b�M =
�b�M;0; b�0M;1�0 be a standard MM regression estimator
of (X;y) with initial scale b�: This estimator yields a vector of weights w. Let15
W = diag (w) and
Z (�) =�X 0WX + �I
��1X 0WX:
Finally de�ne the modi�ed Silvapulle estimator by
b�1 = Z (�) b�M;1; b�0 = b�M;0 � b�01b�:4.2 The Simpson-Montgomery estimator
Simpson and Montgomery (1996) employ a Schweppe GM type of estimator (see
Maronna et al 2006, Sec. 5.11). They start by computing a standard regression
S estimator b� = (b�0; b�01)0 with residual scale b�; then � is estimated accordingto Hoerl and Kennard�s (1976) iterative method. Call (t; V ) the robust loca-
tion/covariance M estimator proposed by Krasker and Welsch (1982), and letX
be the centered matrix X=X� 1t0 and let y be robustly centered. De�ne the
Mahalanobis distances d2i = xi0V �1xi and the weights wi = Wmult (ri= (b��i)) ;
where �i = Cmedianj (dj) =di; with the constant C chosen to regulate e¢ ciency;
andWmult is the weight function given by the multivariate M estimator. Finally
their RR estimator is given by
�X 0�X + b�I� b�1 =X 0�y with � = diag (�1; :::; �n) : (30)
The main drawbacks of this approach are: (1) The multivariate estimator
has a very low BDP. (2) The estimation of � depends on X 0X which is not
robust and (3) The Hoerl-Kennard iterations are unstable.
For these reasons the following modi�cations were introduced: (1) (t; V ) is
a multivariate S estimator using the bisquare scale. (2) Call W the matrix of
weights given by the initial regression S-estimator; then the modi�ed estimator is
16
given by (30) with� replaced byW; and (3) Only two Hoerl-Kennard iterations
are employed.
5 Simulations for p < n
The performances of the robust estimators described above were compared for
n > p through a simulation study. For each simulation scenario we generate
Nrep datasets as yi = x0i�true + ei; i = 1; :::; n with xi � Np (0; V ) where
V = [vjk] with vjj = 1 and vjk = � for j 6= k; and ei � N(0; 1). Since
the estimators are neither regression- nor a¢ ne�equivariant, the choice of �true
matters. For a given value of the �signal to noise ratio� (SNR) R de�ned as
R = k�truek2 = (pVar (e)) ; we take �true at random: �true =ppRb where b
has a uniform distribution on the unit spherical surface. To introduce conta-
mination we choose two constants klev and kslo which respectively control the
leverage and slope of the contamination, their e¤ect being to mislead b�1 towards�true (1 + kslo) : For a given contamination rate " 2 (0; 1) let m = [n"] where
[:] denotes the integer part. Then the �rst m observations (xi; yi) are replaced
by xi = x0 and yi = x00�true (1 + kslo) with x0 = kleva=pa0V �1a; where a is
random with a01p = 0: Thus x0 is placed on the directions most in�uential to
the estimator. The values of kslo were taken in a grid aimed at determining the
maximum MSE for each estimator. We took two values for � : 0.5 and 0.9, and
two for klev : 5 and 10.
Before computing the estimators, the columns of X are centered and nor-
malized to unit scale. For RR�LS we use the mean and standard deviation,
and for the robust estimators we use the bisquare location and scale estimators.
Then b� is de�normalized at the end.The set of �s for RR�LS is chosen according to (21), in order to obtain N�
values of � with N� = p; so that the N� values of bp (�) are equispaced between17
1 and p.
To evaluate the performance of a given estimator b� = �b�0; b�01�0 ; recall thatthe conditional prediction mean squared error for a new observation (x0; y0) is
MSE�b�� = E�y0 � b�0 � x00 b�1�2 = 1 +�;
where
� = b�20 + �b�1 � �true�0 V �b�1 � �true� :It must be recalled that the distributions of robust estimators under conta-
mination are themselves heavy-tailed, and it is therefore prudent to evaluate
their performance through robust measures. For this reason, to summarize the
Nrep values of MSE�b�� we employed both their average, and their 10% (upper)
trimmed average. The results given below correspond to this trimmed MSE,
although the ordinary average yields qualitatively similar results.
The estimators employed were:
� RR�LS with N� = p and KCV�fold CV with KCV = n and KCV = 5 and
=10.
� RR-MM with the same N� and KCV:
� Simpson�Montgomery estimator, original and modi�ed versions.
� Silvapulle estimator, original and modi�ed versions.
Table 1 displays the maximum trimmed MSEs for p = 10; n = 50; � = 0:9
and klev = 5 and three values of the SNR. The values for " = 0 correspond
to no contamination, and those for " = 0:1 to the maximum of the trimmed
MSEs. The results for RR�MM and RR�LS with KCV = 10 are omitted since
18
SNR 10 1 0.1" 0 0.1 0. 0.1 0 0.1
RR� LS, CV(n) 1.34 43.35 1.28 5.33 1.13 3.01RR� LS, CV(5) 1.34 44.05 1.30 5.45 1.13 3.06RR�MM, CV(n) 1.38 1.70 1.31 1.58 1.13 1.32RR�MM, CV(5) 1.41 1.74 1.32 1.47 1.13 1.23Simp-Mont. orig. 1.48 36.47 1.37 17.03 1.12 3.27
Simp-Mont. modif. 1.50 2.53 1.40 1.92 1.23 1.55Silv. orig 1.35 40.48 1.29 5.85 1.16 3.40
Silv. modif. 1.35 1.66 1.28 1.57 1.14 1.38
Table 1: Maximum trimmed MSEs of estimators for p = 10; n = 50; � = 0:9
they were not very di¤erent from those with KCV = 5: Both versions of RR�
LS have similar behaviors, as well as both versions of RR�MM. All robust
estimators show a high e¢ ciency for " = 0. The original Simpson�Montgomery
and Silvapulle estimators are clearly not robust; it is curious that under " = 0:1
the former can attain higher MSEs than RR�LS. It must be recalled that the
maximum MSE of RR�LS for all kslo is in�nite, but here the range of kslo was
limited to �nding the maximum MSE of each robust estimator. The modi�ed
Simpson�Montgomery estimator is robust, but for both " = 0 and " = 0:1
it exhibits higher MSEs than the two versions of RR�MM and the modi�ed
Silvapulle estimator.
For these reasons we henceforth drop the original Silvapulle estimator and
both versions of Simpson�Montgomery from the study, as well as RR�LS with
KCV = 5.
The modi�ed Silvapulle estimator appears as a good competitor to RR�MM,
despite its simplicity. We now consider the same scenario but with " = 0:2: The
results are given in Table 2.
Among the three robust estimators, RR�MM with CV(5) appears as the best
and the modi�ed Silvapulle estimator as the worst, with RR�MM with CV(n)
19
SNR 10 1 0.1RR� LS, CV(n) 231.82 24.28 3.41RR�MM, CV(n) 3.13 2.80 2.48RR�MM, CV(5) 3.13 2.14 1.85
Silv. Mod 3.55 3.22 3.18
Table 2: Maximum trimmed MSEs of estimators for p = 10; n = 50; � = 0:9 and� = 0:2
in between. Figure 1 displays the MSEs of the three estimators as a function of
kslo:
0 5 10 15 200
0.5
1
1.5
2
2.5
3
kslo
MS
E
SNR=1
MM(5)
MM(n)
SilMod
0 10 20 300
0.5
1
1.5
2
2.5
3
kslo
SNR=0.1
MM(5)
MM(n)
SilMod
Figure 1: Simulation MSEs for p = 10; n = 50 and " = 0:2 of MM estimatorswith 5-fold and n-fold cross-validation, and the modi�ed Silvapulle estimator,as a function of contamination slope. The values for kslo = 0 are those for " = 0:
We �nally consider a situation with a high p=n ratio, namely p = 10 and
n = 25:
Table 3 and Figure 2 show the results. The e¢ ciency of all robust estimators
20
SNR 10 1 0.1" 0 0.1 0 0.1 0 0.1
RR� LS, CV(n) 2.03 41.93 1.63 5.38 1.22 4.58RR�MM, CV(n) 2.36 2.80 1.82 2.09 1.33 1.54RR�MM, CV(5) 2.32 2.70 1.73 1.81 1.25 1.36
Silv. Mod 2.34 3.09 1.74 2.27 1.43 2.02
Table 3: Maximum trimmed MSEs of estimators for p = 10; n = 25; � = 0:9
is lower than for n = 50; though still larger than 0.85. Again RR�MM, CV(5) is
the best and the modi�ed Silvapulle estimator the worst, with RR�MM, CV(n)
in between.
A plausible explanation for the superiority of KCV = 5 over KCV = n is
that the estimation of the MSE for each � is more robust in the former, which
would lead to a better choice of �.
It is seen in both tables that all MSEs increase with increasing SNR. This
phenomenon appears more clearly in the next Section. A plausible explanation
it that for large SNR, the bias caused by shrinking dominates the variance
reduction, and all estimators �shrink too much�.
All simulations were repeated for � = 0:5 and for klev = 10; yielding the
same qualitative conclusions as above.
6 Simulations for n < p
Since RR estimators are equivariant with respect to orthogonal transformations
of X; the case n < p can always be reduced to one with p � n� 1: To this end,
X is �rst normalized, the principal components of the transformed matrix are
computed, and the data are expressed in the coordinate system of the principal
directions with (numerically) non�null eigenvalues. Note that this implies that
now the edf will be � n:
Many situations with p > n correspond to data such as e.g. spectra, for
21
0 5 10 15 20 25 300
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
2
MS
E
kslo
MM(5)
MM(n)
SilMod
Figure 2: Simulation MSEs for p = 10; n = 25; SNR=0.1 and " = 0:1 of MMestimators with 5-fold and n-fold cross-validation, and the modi�ed Silvapulleestimator, as a function of contamination slope. The values for kslo = 0 arethose for " = 0:
which X usually exhibits a �local� dependence structure. For this reason we
now take xi � Np (0; V ) with vjk = �jj�kj: All other details of the simulation
are like in the former Section. The values chosen for � are 0.9 and 0.5, and those
for klev are 5 and 10. We took n = 30 and p = 40: This is not very realistic,
but it was necessary to limit the computing time. The values of kslo ranged
between 1 and 15 for " = 0:1 and between 1 and 20 for " = 0:2: Again, the
estimators were RR�LS and RR�MM with KCV = n and = 5: Since the Silva-
pulle and the Simpson�Montgomery estimators cannot be computed for n < p;
in order to have competitors against RR we included the robust Partial Least
Squares (PLS) estimator RSIMPLS (Hubert and Vanden Branden 2003), and
also the classical PLS estimator CSIMPLS, both as implemented in the library
22
LIBRA (see http://wis.kuleuven.be/stat/robust/LIBRA.html). The number of
components was chosen in both through cross�validation. The breakdown point
of RSIMPLS was the default value of 0.25; attempts to employ a larger break-
down point such as 0.5 or 0.4 caused the program to crash due to some matrix
becoming singular. Table 4 gives the results for � = 0:9 and klev = 5: The SNR
was taken as 0.1, 0.5 and 1. The case SNR=10 was omitted since all estimators
had very high MSEs.
SNR 1 0.5 0.1" 0 0.1 0.2 0. 0.1 0.2 0 0.1 0.2
RR� LS, CV(n) 4.0 38.1 88.2 3.2 19.9 45.7 2.1 5.7 11.7RR� LS, CV(5) 4.1 22.7 52.3 3.0 12.1 27.0 1.8 3.6 6.8RR�MM, CV(n) 4.6 5.8 9.2 3.1 3.9 6.0 1.8 2.1 3.5RR�MM, CV(5) 4.8 5.9 10.7 3.1 3.8 5.7 1.8 2.1 3.0
RSIMPLS 7.3 10.4 85.7 4.5 6.1 42.6 2.2 2.8 10.2CSIMPLS 4.7 28.2 61.1 3.3 14.8 31.2 2.1 3.7 7.7
Table 4: Maximum trimmed MSEs of estimators for p = 40; n = 30; � = 0:9
It is seen that: both versions of RR�MM have similar performances, and are
rather e¢ cient. RR�LS with KCV = 5 is remarkably better than with KCV = n:
RSIMPLS has a low e¢ ciency, and for " = 0:2 it breaks down, performing even
worse than CSIMPLS; this behavior is puzzling.
7 A real data set with p >> n
We consider a dataset corresponding to electron-probe X ray microanalysis of
archaeological glass vessels (Janssens et al, 1998), where each of n = 180 vessels
is represented by a spectrum on 1920 frequencies. For each vessel the contents of
13 chemical compounds are registered. Since for frequencies below 15 and above
500 the values of xij are almost null, we keep frequencies 15 to 500, so that we
23
have p = 486: Figure 3 shows the residual QQ�plots for compound 13 (PbO).
Both robust estimates exhibit seven clear outliers plus six suspect points, while
RR�LS exhibits one outlier and CSIMPLS none. The dotted lines correspond
to a 95% envelope based on 200 bootstrap samples. It is curious that the
envelopes corresponding to the two non-robust estimators are relatively much
more dispersed than the ones corresponding to the two robust ones.
2 1 0 1 2
0.2
0.1
0
0.1
0.2
0.3
0.4
0.5LS
Normal quantiles
resi
dual
s
2 1 0 1 2
0
1
2
3
4
5
6
7MM
Normal quantiles
resi
dual
s
2 1 0 1 21
0
1
2
3
4
5
6
7RPLS
Normal quantiles
resi
dual
s
2 1 0 1 2
0.4
0.2
0
0.2
0.4
0.6
CPLS
Normal quantiles
resi
dual
s
Figure 3: Vessel data: residual QQ plots for compound 13, with bootstrap 95%envelopes.
The edf for RR�LS and for RR�MM with KCV = n and KCV = 5 were
respectively 108.3, 10.0 and 5.4, and both versions of PLS chose 5 components.
The predictive behavior of the estimators was assessed through 5�fold CV;
the criteria used were the root mean squared error (RMSE) and RMSE with
upper 10% trimming (RMSE(0.9)), considered to be safer in the presence of
outliers. Table 5 shows the results.
24
RMSE RMSE(0.9)RR� LS, CV(n) 0.194 0.133RR� LS, CV(5) 0.186 0.119RR�MM, CV(n) 0.871 0.091RR�MM, CV(5) 0.892 0.088
RSIMPLS 0.890 0.113CSIMPLS 0.510 0.194
Table 5: Vessel data: RMSEs for compound 13
According to RMSE(0.9), both versions of RR�MM have a similar perfor-
mance, which is clearly better than those of the other estimators. However,
according to RMSE they perform much worse than RR�LS. The plausible rea-
son is that the outliers in�ate RMSE, and this demonstrates the importance of
using robust criteria to evaluate robust estimators.
The left panel of Figure 4 shows the ridge trace for coe¢ cients 1, 50,100
and 150 of �1 as a function of � (in logarithmic scale), and the right one shows
the estimated robust MSEs. The latter panel shows that the absolute minimum
occurs at � = 2575; i.e. log10 (�) = 3:41: The coe¢ cients are seen to oscillate
wildly for low � and then to stabilize.
8 Computing times
Table 6 gives the computing times (in minutes) for the robust estimators. The
�rst four rows are the averages of �ve samples generated as in Section 6, and the
last one corresponds to the data in Section 7. The computation was performed
in Matlab on a 3.01 GHz PC with an Intel Core Duo CPU.
It is seen that in the last two rows, the times required by RR�MM with
KCV = 5 are about �ve times those for KCV = n; what is to be expected,
since the latter does not actually perform the cross-validation. But in the other
cases, the di¤erence is about ten�fold, which is surprising. It is also seen that
25
2 3 40.01
0.005
0
0.005
0.01
0.015
log(lambda)
beta
s
2 3 40.011
0.0115
0.012
0.0125
0.013
0.0135
0.014
log(lambda)
MS
E
Figure 4: Vessel data: Ridge trace for coe¢ cients 1 (o), 50 (*), 100 (x) and 150(+) (left) panel) and estimated MSE (right), as a function of log10 (�) :
RSIMPLS is very fast, and the modi�ed Silvapulle estimator is still faster.
9 The problem of estimating variability
An approach to estimate the variability of estimators and �tted values is to
derive an (exact or approximate) expression for their variances. For RR�LS it
is easy to derive a simple formula for the covariance matrix of b�, for a givenn p RR-MM, CV(n) RR-MM, CV(5) RSIMPLS Silv. Mod100 10 0.0203 0.2705 0.1242 0.0019100 20 0.0597 0.6847 0.1251 0.0028100 40 0.1114 1.0261 0.1305 0.0054100 200 0.4040 2.7384 0.1413 �180 486 3.413 15.932 0.517 �
Table 6: Computing times in minutes
26
�: But the fact that actually � depends on the data makes the analysis rather
complicated. I do not know of any theoretical results in this direction.
The situation is even more complex for the robust estimators. Deriving
the asymptotic distributions of robust RR estimators for �xed �; seems feasible
(though not trivial) if n!1 with p �xed. But the result would be irrelevant, for
in practical terms this means that p=n is negligible, and in this case there would
not be much di¤erence between ridge and standard (i.e., with � = 0) regression.
The practically interesting cases are those where p=n is not negligible (or even
>1), and the corresponding asymptotic theory would have p tend to in�nity
with n; with p=n tending to a positive constant. This seems a very di¢ cult
task, even for �xed �; and even more so for an estimated �:
Another approach is to use the bootstrap, which implies recomputing the
estimate at least several hundred times. As Table 6 shows, this is feasible when
p < n for the modi�ed Silvapulle estimator, or for RR-MM if the data set is
small. In the case of large p > n, when only the latter can be employed, the
required time seems too large for practical purposes. Therefore, estimating the
variability of the robust RR estimators for large data sets would require either
a solution to the problem of approximating their asymptotic distribution, or a
way to drastically reduce computing times.
10 Conclusions
The modi�ed Silvapulle estimator is simple and fast. The results of Section
5 for n > p indicate that it has a good performance if p=n is small and the
contamination rate is low. Otherwise, RR�MM with either KCV = n or = 5
perform better, the latter being the best.
When n < p the modi�ed Silvapulle estimator cannot be employed. The
results of Sections 6 and 7 indicate that (at least for the situations considered
27
there) both versions of RR�MM outperform RSIMPLS, with respect to both
e¢ ciency and robustness. Although in general RR�MM with KCV = 5 does
better than with KCV = n; the advantages of the former do not seem so clear
as for n > p: Since the di¤erence between the respective computing times is at
least �ve-fold, KCV = n should be preferred for high�dimensional data.
11 Supplemental material
Appendix Contains details of the description of the estimators and of the
algorithms, and details of derivations (.pdf �le).
RidgeCode Matlab code for the computing of RR�MM (zipped RAR �les).
ACKNOWLEDGEMENTS
The author thanks the editor, the associate editor, and two referees for many
constructive comments and suggestions, which greatly improved the quality of
the article. This research was partially supported by grants X�018 from Univer-
sity of Buenos Aires, PID 5505 from CONICET and PICT 00899 from ANPCyT,
Argentina.
References
Bühlmann, P. (2006). �Boosting for High-Dimensional Linear Models�. The
Annals of Statistics, 34, 559-583.
Crambes, C., Kneip, A. and Sarda, P. (2009). �Smoothing Splines Estima-
tors for Functional Linear Regression�, The Annals of Statistics, 37, 35-72.
Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). �Least Angle
Regression�. The Annals of Statistics, 32, 407�499.
Frank, I.E. and Friedman, J.H (1993). �A Statistical View of Some Chemo-
metrics Regression Tools�, Technometrics, 2, 109�135.
28
Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statis-
tical Learning; Data Mining, Inference and Prediction. New York: Springer.
Hoerl, A.E., and Kennard, R.W. (1970), �Ridge Regression: Biased Estima-
tion for Nonorthogonal Problems,�Technometrics, 8, 27-51.
Hoerl, A.E., Kennard, R.W. and Baldwin, K.F. (1975). �Ridge Regression:
Some Simulations�. Communications in Statistics A (Theory and Methods), 4,
105-123.
Huber, P.J. (1964), �Robust Estimation of a Location Parameter�, The An-
nals of Mathematical Statistics, 35, 73-101.
Hubert, M. and Vanden Branden, K. (2003). �Robust Methods for Partial
Least Squares Regression�, Journal of Chemometrics, 17, 537�549.
Janssens, K., Deraedt, I., Freddy, A. and Veekman, J. (1998). �Composition
of 15-17th Century Archeological Glass Vessels Excavated in Antwerp, Belgium�.
Mikrochimica Acta, 15 (Suppl.), 253-267.
Krasker, W.S. and Welsch, R.E. (1982). �E¢ cient bounded in�uence regres-
sion estimation�. Journal of the American Statistical Association, 77, 595�604.
Lawrence, K.D. and Marsh, L.C. (1984). �Robust Ridge Estimation Meth-
ods for Predicting U.S. Coal Mining Fatalities�, Communications in Statistics,
Theory and Methods, 13, 139-149.
Locantore, N., Marron, J.S., Simpson, D.G., Tripoli, N., Zhang, J.T. and
Cohen, K.L. (1999), �Robust Principal Components for Functional Data�, Test,
8, 1�28.
Maronna, R.A., Martin, R.D. and Yohai, V.J. (2006). "Robust Statistics:
Theory and Methods", John Wiley and Sons, New York.
Maronna, R.A. and Yohai, V.J. (2009). �Robust Functional Linear Regres-
sion Based on Splines�. Available at http://mate.unlp.edu.ar/publicaciones/23112009231134.pdf
Maronna, R.A. and Yohai, V.J. (2010). �Correcting MM Estimates for �Fat�
29
Data Sets.�Computational Statistics & Data Analysis, 54, 3168-3173.
Peña, D. and Yohai, V.J. (1999), �A Fast Procedure for Outlier Diagnostics
in Large Regression Problems�. Journal of the American Statistical Association,
94, 434-445.
Silvapulle, M.J. (1991). �Robust Ridge Regression Based on an M�estimator�.
Australian Journal of Statistics, 33, 319-333.
Simpson, J.R. and Montgomery, D.C. (1996). �A Biased-robust Regression
Technique for the Combined Outlier-multicollinearity Problem�. Journal of
Statistical Computing and Simulation, 56, 1-20.
Tibshirani, R. (1996). �Regression Shrinkage and Selection via the Lasso�.
Journal of the Royal Statistical Society, Series B, 58, 267�288.
Yohai, V.J. (1987). �High Breakdown�point and High E¢ ciency Estimates
for Regression�, The Annals of Statistics 15, 642�65.
Zou, H. and Hastie, T. (2005). �Regularization and Variable Selection via
the Elastic Net�. Journal of the Royal Statistical Society (B), 67, 301-320.
30