Robust ridge regression for highŒdimensional data · 2010. 12. 26. · with the Supplemental...

Robust ridge regression for high�dimensional

data

Ricardo A. Maronna�

University of La Plata and C.I.C.P.B.A.

Abstract

Ridge regression, being based on the minimization of a quadratic loss

function, is sensitive to outliers. Current proposals for robust ridge re-

gression estimators are sensitive to �bad leverage observations�, cannot

be employed when the number of predictors p is larger than the number

of observations n; and have a low robustness when the ratio p=n is large.

In this paper a ridge regression estimate based on Yohai�s (1987) repeated

M estimation (�MM estimation�) is proposed. It is a penalized regression

MM estimator, in which the quadratic loss is replaced by an average of

� (ri=b�), where ri are the residuals and b� the residual scale from an initial

estimator, which is a penalized S estimator; and � is a bounded function.

The MM estimator can be computed for p > n and is robust for large p=n:

A fast algorithm is proposed. The advantages of the proposed approach

over its competitors are demonstrated through both simulated and real

data.

Key words and phrases: MM estimate; Shrinking; S estimate.

Note: Supplemental materials are available on line.

�Ricardo Maronna (E-mail: [email protected]) is professor, Mathematics Department,University of La Plata, C.C. 172, La Plata 1900, Argentina; and researcher at C.I.C.P.B.A.

1

1 Introduction

Ridge regression, �rst proposed by Hoerl and Kennard (1970), is a useful tool for

improving prediction in regression situations with highly correlated predictors.

For a regression data set (X;y) withX 2 Rn�p and y 2 Rn; the ridge estimator

can be de�ned as

b� = �b�0; b�01�0 = argmin fL (X;y;�) : �0 2 R; �1 2 Rpg (1)

where the loss function L is

L (X;y;�) = kr (�)k2 + � k�1k2 (2)

and

r = (r1; :::; rn)0= y� b�01n �X b�1 (3)

is the residual vector, with 1n = (1; 1; :::; 1)0 where in general A0 denotes the

transpose of A: Here � is a penalty parameter that must be estimated in order

to optimize the prediction accuracy.

Despite more modern approaches such as the Lasso (Tibshirani 1996), Least

Angle Regression (LARS) (Efron et al 2004) and boosting (Bühlmann 2006),

ridge regression (henceforth RR) continues to be useful in many situations, in

particular in chemometrics. It is particularly useful when it can be assumed that

all coe¢ cients have approximately the same order of magnitude (unlike sparse

situations when most coe¢ cients are assumed to be null). Zou and Hastie (2005)

state that �For usual n > p situations, if there are high correlations between

predictors, it has been empirically observed that the prediction performance of

the lasso is dominated by ridge regression�. Frank and Friedman (1993) compare

the performances of several estimators commonly used in chemometrics, such as

2

partial least squares (PLS) and principal components regression (PCR). They

conclude (p. 110) that �In all of the situations studied, RR dominated the other

methods, closely followed by PLS and PCR, in that order�.

Since RR is based on least squares, it is sensitive to atypical observations.

Lawrence and Marsh (1984) proposed a model for predicting fatalities in U.S.

coal mines, based on a data set with n = 64 and p = 6 a¤ected by both atypical

observations and high collinearity. They demonstrate that a robust version of

RR yields results superior to those from ordinary Least Squares (OLS) and from

classical RR, in particular in their compatibility with theoretical expectations.

From today�s perspective however, their approach to robusti�cation could be

greatly improved.

Several other approaches have been proposed to make RR robust towards

outliers, the most recent and interesting ones being the ones by Silvapulle (1991)

and Simpson and Montgomery (1996). These estimators however present two

drawbacks, which will be analyzed in Section 4. The �rst one is that they are

not su¢ ciently robust towards �bad leverage points�especially for large p: The

second is that they require an initial (standard) robust regression estimator.

When p > n this becomes impossible; and when p < n but the ratio p=n is

�large�the initial estimator has a low robustness, as measured by the breakdown

point.

The situation of prediction with p >> n has become quite usual, especially

in chemometrics. One instance of such a situation appears when attempting to

replace the laboratory determination of the amount of a given chemical com-

pound in a sample of material, with methods based on cheaper and faster spec-

trographic measurements. Section 7 of this article deals with one archaeological

instance of such a situation, where the material is glass from ancient vessels,

with n = 180 and p = 486. The analysis indicates that the data contain about

3

7% atypical observations, which a¤ect the predictive power of classical RR, while

the robust method proposed in this article has a clearly lower prediction error

than classical RR, PLS, and a robust version of PLS.

A further area that stimulates the development of robust RR for high�

dimensional data is that of functional regression. Some approaches for functional

regression, such as the one by Crambes et al, (2009) boil down to an expression

of the form (2) with k�1k2 replaced by �01A�1; where A is a positive�de�nite

matrix. i.e., a problem that is similar to RR, generally with p >> n. Therefore,

a reliable robust RR estimator for p >> n is a good basis to obtain a reliable

robust functional regression estimator. This approach has been employed by

Maronna and Yohai (2009) to propose a robust estimator for functional regres-

sion based on splines.

The purpose of this article it to develop ridge estimators that can be used

when p > n; are robust when p=n is large, and are resistant to bad leverage

points. In Section 2 we describe an estimator based on Yohai�s (1987) approach

of MM estimation, to allow controlling both robustness and e¢ ciency. The

MM estimator requires a robust (but possibly ine¢ cient) initial estimator. In

Section 3 we describe an estimator based on the minimization of a penalized

robust residual scale (�S estimator�) to be used as a starting point. Sections

2.4 and 2.5 deal with the choice of �:

In Section 4 we recall the estimators proposed by Silvapulle and by Simpson

and Montgomery, describe their drawbacks, and propose simple modi�cations

of them to improve their robustness. Section 5 describes a simulation study

for the case n > p in which the proposed estimator is seen to outperform its

competitors, and the modi�ed Silvapulle estimator performs well for not too

extreme situations. Section 6 describes a simulation for p > n in which the

MM estimator is seen to outperform a robust version of Partial Least Squares.

4

Section 7 deals with the application of the proposed estimator to a real high-

dimensional data set with p >> n. Section 8 compares the computing times

of the di¤erent estimators, Section 9 describes the di¢ culties in estimating the

estimator�s variability, and �nally Section 10 summarizes the conclusions. An

Appendix containing proofs and details of de�nitions and of calculations is given

with the Supplemental Material.

2 MM estimators for ridge regression

To ensure both robustness and e¢ ciency under the normal model we employ

the approach of MM estimation (Yohai 1987). Start with an initial robust

but possibly ine¢ cient estimator b�ini; from the respective residuals compute

a robust scale estimator b�ini: Then compute an M estimator with �xed scale

b�ini; starting the iterations from b�ini; and using a loss function that ensures thedesired e¢ ciency. Here �e¢ ciency�will be loosely de�ned as �similarity with

the classical RR estimator at the normal model�.

Recall �rst that a scale M estimator (an M�scale for short) of the data vector

r = (r1; :::; rn)0 is de�ned as the solution b� = b� (r) of

1

n

nXi=1

��ri�

�= �; (4)

where � is a bounded ��function in the sense of Maronna et al. (2006) and

� 2 (0; 0:5) determines the breakdown point of b�: Recall that the breakdownpoint (BDP) of an estimator is the maximum proportion of observations that

can be arbitrarily altered with the estimator remaining bounded away from the

border of the parameter set (which can be at in�nity). Note that when � (t) = t2

and � = 1 we have the classical mean squared error (MSE): b�2 = aveifr2i g: We

5

shall henceforth employ for � the bisquare ��function

�bis (t) = minf1; 1��1� t2

�3g: (5)

Let b�ini be an initial estimator. Let r = r�b�ini� and let b�ini be an M�scale

of r :1

n

nXi=1

�0

�rib�ini�= �; (6)

where �0 is a bounded ��function and � is to be chosen. Then the MM estimator

for RR (henceforth RR-MM) is de�ned by (1) with

L (X;y;�) = b�2ini nXi=1

�

�ri (�)b�ini

�+ � k�1k2 ; (7)

where � is another bounded ��function such that � � �0: The factor b�2ini beforethe summation is employed to make the estimator coincide with the classical

one when � (t) = t2:

We shall henceforth use

�0 (t) = �bis

�t

c0

�; � (t) = �bis

�t

c

�(8)

where the constants c0 < c are chosen to control both robustness and e¢ ciency,

as described later in Section 2.3.

The classical estimator (2) will henceforth be denoted by RR�LS for brevity.

Remark: Centering and scaling of the columns of X are two important

issues in ridge regression. Since our estimator contains the intercept term b�0;centering would a¤ect only b�0 and not the vector of slopes b�1; and is thereforenot necessary. However, in the algorithm for RR�MM the data are previously

centered by means of the bisquare location estimator, but this is done only for

the convenience of the numerical procedure. As to scaling, we consider it to be

6

the user�s choice rather than an intrinsic part of the de�nition of the estimator.

In our algorithm, the columns of X are scaled by means of a scale M estimator.

2.1 The �weighted normal equations� and the iterative

algorithm

It is known that the classical estimator RR�LS satis�es the �normal equations�

b�0 = y � x0 b�1; (X 0X + �Ip) b�1 =X 0�y� b�01n� ; (9)

where Ip is the identity matrix and x and y are the averages of X and y

respectively. A similar system of equations is satis�ed by RR�MM. De�ne

(t) = �0 (t) ; W (t) = (t)

t: (10)

Let

ti =rib�ini ; wi = W (ti)

2; w = (w1; :::; wn)

0; W = diag (w) : (11)

Setting the derivatives of (7) with respect to � to zero yields for RR-MM

w0�y� b�01n �X b�1� = 0 (12)

and

(X 0WX + �Ip) b�1 =X 0W�y� b�01n� : (13)

Therefore RR�MM satis�es a weighted version of (9). Since for the chosen

�; W (t) is a decreasing function of jtj; observations with larger residuals will

receive lower weights wi.

As is usual in robust statistics, these �weighted normal equations�suggest an

iterative procedure. Starting with an initial b� : Compute the residual vector r7

and the weights w: Leaving b�0 and w �xed, compute b�1 from (13). Recompute

w, and compute b�0 from (12). Repeat until the change in the residuals is small

enough. Recall that b�ini remains �xed throughout.This procedure may be called �Iterative reweighted RR�. It is shown in the

Supplemental Material that the objective function (7) descends at each iteration.

The initial estimator, which is a penalized S estimator, is described in Section

3.

2.2 Equivalent degrees of freedom

Let w be de�ned by (10)-(11) and

ex = w0X

w01; X =X � 1ex: (14)

Then it follows that the �tted values by =X b�1 + b�01n verifyby = �H +

1w0

10w

�y (15)

where the �hat matrix�H is given by

H =X�X 0WX + �Ip

��1X 0W: (16)

Then the equivalent degrees of freedom (edf) (Hastie et al. 2001) are de�ned by

edf = tr

�H +

1w0

10w

�= bpMM + 1 (17)

with

bpMM = tr (H) ; (18)

8

where tr (:) denotes the trace. When � = 0 we have tr (H) = rank (X) : We

may thus call bpMM the �equivalent number of predictors�.

The edf may be considered as a measure of the degree of �shrinking�of the

estimator.

2.3 Choosing the constants

We now deal with the choice of � in (6) and of c0 and c in (8). For the case � = 0;

i.e., the standard M�estimation of a regression with q = p + 1 parameters, the

natural choice is � = 0:5 (1� q=n) (henceforth a standard estimator will refer

to one with � = 0):

This choice of � is �natural�in that it maximizes the BDP of b�. It can alsobe justi�ed by the fact that the scale can be written as

1

n� q

nXi=1

��ri�

�= 0:5;

and the denominator n � q is the same as the one used for the classical s2 in

OLS.

For � > 0; the natural extension is

� = 0:5

�1� bq

n

�(19)

where bq is a measure of the �actual number of parameters�, which depends on �:We would use bq = edf de�ned in (17), but bpMM cannot be known a priori. For

this reason we shall employ a measure of edf derived from the initial estimator

in a manner analogous to (17), as described in Section 3.

Since � ceases to be �xed at the value 0.5, c0 must now depend on � in order

to ensure consistency of b� for normal data when � = 0. We therefore employ

9

c0 (�) de�ned by

E�

�z

c0 (�)

�= �; z � N(0; 1)

which can be computed numerically.

As to c; it is chosen to attain the desired e¢ ciency of the standard estimator

at the normal model. If p is �xed and n ! 1; the value c = 3:44 yields 0.85

asymptotic e¢ ciency at the normal model when � = 0:

However, for high�dimensional data both b�ini and c require a correction.

Maronna and Yohai (2010) point out two problems that appear when �tting a

standard MM estimator to data with a high ratio p=n : (1) The scale based on

the residuals from the initial regression estimator underestimates the true error

scale, and (2) Even if the scale is correctly estimated, the actual e¢ ciency of

the MM estimator can be much lower than the nominal one.

For this reason, b�ini is (upwards) corrected using formula (9) of their paper,namely as

e�ini = b�ini1� (k1 + k2=n)bq=n with k1 = 1:29; k2 = �6:02:

As to c; it must be also increased: we put c = 4 when bq=n > 0:1:2.4 Choosing � through cross�validation

In order to choose � we must estimate the prediction error of RR�MM for di¤er-

ent values of �: Call CV(K) the K-fold cross validation process, which requires

recomputing the estimate K times. For K = n (�leave�one�out�) we can use an

approximation to avoid recomputing. Call by�i the �t of yi computed withoutusing the i�th observation; i.e., y�i = b�(�i)0 + x0i

b�(�i)1 ; where�b�(�i)0 ; b�(�i)01

�is the RR�MM estimate computed without observation i: Then a �rst�order

10

Taylor approximation of the estimator yields the approximate prediction errors

r�i = yi � by�i � ri

�1 +

W (ti)hi1� hi 0 (ti)

�(20)

with

hi = x0i

nXi=1

0 (ti)xix0i+ 2�I

!�1xi

where W; and ti are de�ned in (10)-(11) and xi is the i-th row of X de�ned

in (14). This approximation does not take the initial estimator into account.

Henceforth CV(n) will refer to the above approximation.

Given the prediction errors r� = (r�1; :::; r�n)0, we compute a robust mean

squared error (MSE) as b� (r�)2 ; where b� is a ��scale�with constant c� = 5

(Yohai and Zamar 1988); and choose the � minimizing this MSE. Initially an

M�scale had been employed, but the resulting �s were somewhat erratic, and

this led to the use of the more e¢ cient ��scale. Note that the choice of c� must

strike a balance between robustness and e¢ ciency; the value 5 was chosen on

the basis of exploratory simulations.

It turns out that b� (r�), as a function of �; may have several local minima(as will be seen in Figure 4), which precludes the use of standard minimization

algorithms to choose the �best��: For this reason we prefer to compute b� (r�)for a selected set of candidate �s in a range [�min; �max]: The de�nition of this

set is given in Section 2.5.

2.5 The set of candidate �s and the edf

It is intuitively clear that the robustness of RR�MM depends on that of the

initial estimator, which, as we shall see, depends in turn on the edf, which are a

function of �. For this reason we shall need an a priori approximation to the

edf corresponding to each �.

11

For the classical estimator RR-LS it is known that the edf are bp (�)+1 withbp (�) = pX

j=1

j j + �=n

(21)

where j are the eigenvalues of C; the sample covariance matrix of X: We use

a robust version of (21) replacing those eigenvalues by the Spherical Principal

Components (SPC) of X (Locantore et al. 1999):

bpSPC (�) = pXj=1

j j + �=n

(22)

where j are the �pseudo eigenvalues�; the details are described in the Supple-

mental Material. Note that bpSPC (�) is a decreasing function. The choice ofSPC was based on the fact that it is very fast, can be computed for p >> n;

and has some interesting theoretical properties (see the Discussion of the article

by Locantore et al. (1999)).

Call "� the BDP of b�ini: It is not di¢ cult to show that if � > 0; the BDP

of RR�MM is � "�: However, since the estimator is not regression�equivariant,

this result is deceptive. As an extreme case, take p = n � 1 and "� = 0:5:

Then the BDP of b� is null for � = 0; but is 0.5 for � = 10�6! Actually, for

a given contamination rate less than 0.5, the maximum bias of the estimator

remains bounded, but the bound tends to in�nity when � tends to zero. This fact

suggests keeping the candidate �s bounded away from 0 when p=n is large. This

idea is more easily expressed through the edf. As will be seen in Section 3, to

control the robustness of the initial estimator it is convenient that the �e¤ective

dimension�be bounded away from n; more precisely, that bpSPC (�) + 1 � 0:5n:We therefore choose [�min; �max] such that

�min = inff� � 0 : bpSPC (�) + 1 � 0:5ng and bpSPC (�max) = 1; (23)

12

so that �min = 0 if p < (n � 1)=2: We take N� values of � between �min and

�max; with N� depending on p; so that the N� values of bpSPC (�) are equispaced.As a compromise between precision and computing time, we take

N� = minfp; n; 20g: (24)

A technical detail appears. The set of candidate �s is chosen so that RR�LS

has a given degree of shrinking. Note however that even if there is no conta-

mination, RR�LS and RR�MM with the same � need not yield approximately

the same estimators. For a given �; we want to �nd �0 such that RR�MM with

penalty �0 yields approximately the same degree of shrinking as RR�LS with

penalty �: There is a simple solution. Note that limk!1k2

3 �bis (r=k) = r2; and

therefore RR�LS with a given � is approximately �matched�to RR-MM with

penalty �0 = 3�=c2 with c in (8).

3 The initial estimator for RR�MM: S estima-

tors for RR

As in the standard case, the initial estimator for RR�MM will be the one based

on the minimization of a (penalized) robust scale (an S estimator). We de�ne an

S estimator for RR (henceforth RR�SE) by replacing in (2) the sum of squared

residuals by a squared robust M�scale b� (4), namelyL (X;y;�) = nb� (r (�))2 + � k�1k2 : (25)

Here the factor n is employed to make (25) coincide with (2) when b�2 (r) =avei

�r2i�; corresponding to � (t) = t2 and � = 1 in (4).

A straightforward calculation shows that the estimator satis�es the �normal

13

equations�(12)-(13) with wi =W (ti) and � replaced by �0 = �n�1Pn

i=1 wit2i :

These equations suggest an iterative algorithm similar to the one at the end

of Section 2.1, with the di¤erence that now the scale changes at each iteration,

i.e. now in the de�nition of ti in (11), b�ini is replaced by b� �r �b��. Althoughit has not been possible to prove that the objective function descends at each

iteration, this fact has been veri�ed in all cases up to now.

Expressing the estimator as a weighted linear combination yields an expres-

sion for the equivalent number of parameters bpSE as in (17).As with RR�MM, it is not di¢ cult to show that if � > 0 the BDP of RR�SEb� coincides with that of b�: Recall that this BDP equals min (�; 1� �) with � in

(4). In the case of standard regression, recall that for X in general position, the

BDP of a regression S estimator with intercept ismin (�; 1� (p+ 1) =n� �) ; and

therefore its maximum (which coincides with the maximum BDP of regression-

equivariant estimators) is attained when � = 0:5 (1� (p+ 1) =n) (see Maronna

et al. 2006). For � > 0; the above facts suggest using

� = 0:5

�1� bp+ 1

n

�; (26)

where bp is a measure of the �actual number of parameters�, which will depend on�: It would be ideal to use bp = bpSE; but this value is unknown before computingthe estimator; for this reason we use bp = bpSPC: Once b�SE computed, we compute� for RR�MM using (19) with bq = bpSE+1, the reason being that bpSE is expectedto give a more accurate measure of the edf than the tentative bpSPC:It follows from (26) that if we want � � 0:25 we get the value of �min in (23).

The technical detail of matching � described at the end of Section 2.5 occurs

also with RR�SE, namely, the same value of � yields di¤erent degrees of shrink-

ing for RR�LS and RR�SE. Since its solution is more involved, it is deferred to

the Supplemental Material.

14

4 Other robust RR estimators for n > p

4.1 The Silvapulle estimator

Silvapulle´s (1991) estimator is based on the following fact. Let X be centered

to zero mean. Call b�LS = (b�0; b�01)0 the ordinary LS (OLS) estimator. Thenthe RR�LS estimator

�b�0; b�01�0 ful�llsb�0 = b�0; b�1 = Z (�) b�1; (27)

where

Z (�) = (X 0X + �I)�1X 0X: (28)

Let now b�M = (b�M;0; b�0M:1) be a standard regression M estimator and b� arobust residual scale. Then Silvapulle proposes to de�ne a RR estimator by

(27)�(28) with b�M instead of b�LS; and to choose � in the same way as Hoerl etal. (1975), namely b� = pb�2

kb�M;1k2 : (29)

Silvapulle uses a simultaneous estimator of � and � of �Huber�s Proposal II�

type (Huber 1964), which has the following drawbacks: (1) The M estimator has

a monotonic = �0, which makes it non�robust towards high leverage outliers.

(2) The scale has a low BDP. (3) The matrix Z in (28) is also sensitive to

leverage points and (4) The centering of X may also be in�uenced by leverage

points.

For this reason the following modi�cations are proposed. Call b� a robustlocation vector of X obtained by applying a robust location estimator to each

column (we use the bisquare M estimator). Call X= X � 1nb�0 the centeredX matrix. Let b�M =

�b�M;0; b�0M;1�0 be a standard MM regression estimator

of (X;y) with initial scale b�: This estimator yields a vector of weights w. Let15

W = diag (w) and

Z (�) =�X 0WX + �I

��1X 0WX:

Finally de�ne the modi�ed Silvapulle estimator by

b�1 = Z (�) b�M;1; b�0 = b�M;0 � b�01b�:4.2 The Simpson-Montgomery estimator

Simpson and Montgomery (1996) employ a Schweppe GM type of estimator (see

Maronna et al 2006, Sec. 5.11). They start by computing a standard regression

S estimator b� = (b�0; b�01)0 with residual scale b�; then � is estimated accordingto Hoerl and Kennard�s (1976) iterative method. Call (t; V ) the robust loca-

tion/covariance M estimator proposed by Krasker and Welsch (1982), and letX

be the centered matrix X=X� 1t0 and let y be robustly centered. De�ne the

Mahalanobis distances d2i = xi0V �1xi and the weights wi = Wmult (ri= (b��i)) ;

where �i = Cmedianj (dj) =di; with the constant C chosen to regulate e¢ ciency;

andWmult is the weight function given by the multivariate M estimator. Finally

their RR estimator is given by

�X 0�X + b�I� b�1 =X 0�y with � = diag (�1; :::; �n) : (30)

The main drawbacks of this approach are: (1) The multivariate estimator

has a very low BDP. (2) The estimation of � depends on X 0X which is not

robust and (3) The Hoerl-Kennard iterations are unstable.

For these reasons the following modi�cations were introduced: (1) (t; V ) is

a multivariate S estimator using the bisquare scale. (2) Call W the matrix of

weights given by the initial regression S-estimator; then the modi�ed estimator is

16

given by (30) with� replaced byW; and (3) Only two Hoerl-Kennard iterations

are employed.

5 Simulations for p < n

The performances of the robust estimators described above were compared for

n > p through a simulation study. For each simulation scenario we generate

Nrep datasets as yi = x0i�true + ei; i = 1; :::; n with xi � Np (0; V ) where

V = [vjk] with vjj = 1 and vjk = � for j 6= k; and ei � N(0; 1). Since

the estimators are neither regression- nor a¢ ne�equivariant, the choice of �true

matters. For a given value of the �signal to noise ratio� (SNR) R de�ned as

R = k�truek2 = (pVar (e)) ; we take �true at random: �true =ppRb where b

has a uniform distribution on the unit spherical surface. To introduce conta-

mination we choose two constants klev and kslo which respectively control the

leverage and slope of the contamination, their e¤ect being to mislead b�1 towards�true (1 + kslo) : For a given contamination rate " 2 (0; 1) let m = [n"] where

[:] denotes the integer part. Then the �rst m observations (xi; yi) are replaced

by xi = x0 and yi = x00�true (1 + kslo) with x0 = kleva=pa0V �1a; where a is

random with a01p = 0: Thus x0 is placed on the directions most in�uential to

the estimator. The values of kslo were taken in a grid aimed at determining the

maximum MSE for each estimator. We took two values for � : 0.5 and 0.9, and

two for klev : 5 and 10.

Before computing the estimators, the columns of X are centered and nor-

malized to unit scale. For RR�LS we use the mean and standard deviation,

and for the robust estimators we use the bisquare location and scale estimators.

Then b� is de�normalized at the end.The set of �s for RR�LS is chosen according to (21), in order to obtain N�

values of � with N� = p; so that the N� values of bp (�) are equispaced between17

1 and p.

To evaluate the performance of a given estimator b� = �b�0; b�01�0 ; recall thatthe conditional prediction mean squared error for a new observation (x0; y0) is

MSE�b�� = E�y0 � b�0 � x00 b�1�2 = 1 +�;

where

� = b�20 + �b�1 � �true�0 V �b�1 � �true� :It must be recalled that the distributions of robust estimators under conta-

mination are themselves heavy-tailed, and it is therefore prudent to evaluate

their performance through robust measures. For this reason, to summarize the

Nrep values of MSE�b�� we employed both their average, and their 10% (upper)

trimmed average. The results given below correspond to this trimmed MSE,

although the ordinary average yields qualitatively similar results.

The estimators employed were:

� RR�LS with N� = p and KCV�fold CV with KCV = n and KCV = 5 and

=10.

� RR-MM with the same N� and KCV:

� Simpson�Montgomery estimator, original and modi�ed versions.

� Silvapulle estimator, original and modi�ed versions.

Table 1 displays the maximum trimmed MSEs for p = 10; n = 50; � = 0:9

and klev = 5 and three values of the SNR. The values for " = 0 correspond

to no contamination, and those for " = 0:1 to the maximum of the trimmed

MSEs. The results for RR�MM and RR�LS with KCV = 10 are omitted since

18

SNR 10 1 0.1" 0 0.1 0. 0.1 0 0.1

RR� LS, CV(n) 1.34 43.35 1.28 5.33 1.13 3.01RR� LS, CV(5) 1.34 44.05 1.30 5.45 1.13 3.06RR�MM, CV(n) 1.38 1.70 1.31 1.58 1.13 1.32RR�MM, CV(5) 1.41 1.74 1.32 1.47 1.13 1.23Simp-Mont. orig. 1.48 36.47 1.37 17.03 1.12 3.27

Simp-Mont. modif. 1.50 2.53 1.40 1.92 1.23 1.55Silv. orig 1.35 40.48 1.29 5.85 1.16 3.40

Silv. modif. 1.35 1.66 1.28 1.57 1.14 1.38

Table 1: Maximum trimmed MSEs of estimators for p = 10; n = 50; � = 0:9

they were not very di¤erent from those with KCV = 5: Both versions of RR�

LS have similar behaviors, as well as both versions of RR�MM. All robust

estimators show a high e¢ ciency for " = 0. The original Simpson�Montgomery

and Silvapulle estimators are clearly not robust; it is curious that under " = 0:1

the former can attain higher MSEs than RR�LS. It must be recalled that the

maximum MSE of RR�LS for all kslo is in�nite, but here the range of kslo was

limited to �nding the maximum MSE of each robust estimator. The modi�ed

Simpson�Montgomery estimator is robust, but for both " = 0 and " = 0:1

it exhibits higher MSEs than the two versions of RR�MM and the modi�ed

Silvapulle estimator.

For these reasons we henceforth drop the original Silvapulle estimator and

both versions of Simpson�Montgomery from the study, as well as RR�LS with

KCV = 5.

The modi�ed Silvapulle estimator appears as a good competitor to RR�MM,

despite its simplicity. We now consider the same scenario but with " = 0:2: The

results are given in Table 2.

Among the three robust estimators, RR�MM with CV(5) appears as the best

and the modi�ed Silvapulle estimator as the worst, with RR�MM with CV(n)

19

SNR 10 1 0.1RR� LS, CV(n) 231.82 24.28 3.41RR�MM, CV(n) 3.13 2.80 2.48RR�MM, CV(5) 3.13 2.14 1.85

Silv. Mod 3.55 3.22 3.18

Table 2: Maximum trimmed MSEs of estimators for p = 10; n = 50; � = 0:9 and� = 0:2

in between. Figure 1 displays the MSEs of the three estimators as a function of

kslo:

0 5 10 15 200

0.5

1

1.5

2

2.5

3

kslo

MS

E

SNR=1

MM(5)

MM(n)

SilMod

0 10 20 300

0.5

1

1.5

2

2.5

3

kslo

SNR=0.1

MM(5)

MM(n)

SilMod

Figure 1: Simulation MSEs for p = 10; n = 50 and " = 0:2 of MM estimatorswith 5-fold and n-fold cross-validation, and the modi�ed Silvapulle estimator,as a function of contamination slope. The values for kslo = 0 are those for " = 0:

We �nally consider a situation with a high p=n ratio, namely p = 10 and

n = 25:

Table 3 and Figure 2 show the results. The e¢ ciency of all robust estimators

20

SNR 10 1 0.1" 0 0.1 0 0.1 0 0.1

RR� LS, CV(n) 2.03 41.93 1.63 5.38 1.22 4.58RR�MM, CV(n) 2.36 2.80 1.82 2.09 1.33 1.54RR�MM, CV(5) 2.32 2.70 1.73 1.81 1.25 1.36

Silv. Mod 2.34 3.09 1.74 2.27 1.43 2.02


is lower than for n = 50; though still larger than 0.85. Again RR�MM, CV(5) is

the best and the modi�ed Silvapulle estimator the worst, with RR�MM, CV(n)

in between.

A plausible explanation for the superiority of KCV = 5 over KCV = n is

that the estimation of the MSE for each � is more robust in the former, which

would lead to a better choice of �.

It is seen in both tables that all MSEs increase with increasing SNR. This

phenomenon appears more clearly in the next Section. A plausible explanation

it that for large SNR, the bias caused by shrinking dominates the variance

reduction, and all estimators �shrink too much�.

All simulations were repeated for � = 0:5 and for klev = 10; yielding the

same qualitative conclusions as above.

6 Simulations for n < p

Since RR estimators are equivariant with respect to orthogonal transformations

of X; the case n < p can always be reduced to one with p � n� 1: To this end,

X is �rst normalized, the principal components of the transformed matrix are

computed, and the data are expressed in the coordinate system of the principal

directions with (numerically) non�null eigenvalues. Note that this implies that

now the edf will be � n:

Many situations with p > n correspond to data such as e.g. spectra, for

21

0 5 10 15 20 25 300

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

2

MS

E

kslo

MM(5)

MM(n)

SilMod

Figure 2: Simulation MSEs for p = 10; n = 25; SNR=0.1 and " = 0:1 of MMestimators with 5-fold and n-fold cross-validation, and the modi�ed Silvapulleestimator, as a function of contamination slope. The values for kslo = 0 arethose for " = 0:

which X usually exhibits a �local� dependence structure. For this reason we

now take xi � Np (0; V ) with vjk = �jj�kj: All other details of the simulation

are like in the former Section. The values chosen for � are 0.9 and 0.5, and those

for klev are 5 and 10. We took n = 30 and p = 40: This is not very realistic,

but it was necessary to limit the computing time. The values of kslo ranged

between 1 and 15 for " = 0:1 and between 1 and 20 for " = 0:2: Again, the

estimators were RR�LS and RR�MM with KCV = n and = 5: Since the Silva-

pulle and the Simpson�Montgomery estimators cannot be computed for n < p;

in order to have competitors against RR we included the robust Partial Least

Squares (PLS) estimator RSIMPLS (Hubert and Vanden Branden 2003), and

also the classical PLS estimator CSIMPLS, both as implemented in the library

22

LIBRA (see http://wis.kuleuven.be/stat/robust/LIBRA.html). The number of

components was chosen in both through cross�validation. The breakdown point

of RSIMPLS was the default value of 0.25; attempts to employ a larger break-

down point such as 0.5 or 0.4 caused the program to crash due to some matrix

becoming singular. Table 4 gives the results for � = 0:9 and klev = 5: The SNR

was taken as 0.1, 0.5 and 1. The case SNR=10 was omitted since all estimators

had very high MSEs.

SNR 1 0.5 0.1" 0 0.1 0.2 0. 0.1 0.2 0 0.1 0.2

RR� LS, CV(n) 4.0 38.1 88.2 3.2 19.9 45.7 2.1 5.7 11.7RR� LS, CV(5) 4.1 22.7 52.3 3.0 12.1 27.0 1.8 3.6 6.8RR�MM, CV(n) 4.6 5.8 9.2 3.1 3.9 6.0 1.8 2.1 3.5RR�MM, CV(5) 4.8 5.9 10.7 3.1 3.8 5.7 1.8 2.1 3.0

RSIMPLS 7.3 10.4 85.7 4.5 6.1 42.6 2.2 2.8 10.2CSIMPLS 4.7 28.2 61.1 3.3 14.8 31.2 2.1 3.7 7.7


It is seen that: both versions of RR�MM have similar performances, and are

rather e¢ cient. RR�LS with KCV = 5 is remarkably better than with KCV = n:

RSIMPLS has a low e¢ ciency, and for " = 0:2 it breaks down, performing even

worse than CSIMPLS; this behavior is puzzling.

7 A real data set with p >> n

We consider a dataset corresponding to electron-probe X ray microanalysis of

archaeological glass vessels (Janssens et al, 1998), where each of n = 180 vessels

is represented by a spectrum on 1920 frequencies. For each vessel the contents of

13 chemical compounds are registered. Since for frequencies below 15 and above

500 the values of xij are almost null, we keep frequencies 15 to 500, so that we

23

have p = 486: Figure 3 shows the residual QQ�plots for compound 13 (PbO).

Both robust estimates exhibit seven clear outliers plus six suspect points, while

RR�LS exhibits one outlier and CSIMPLS none. The dotted lines correspond

to a 95% envelope based on 200 bootstrap samples. It is curious that the

envelopes corresponding to the two non-robust estimators are relatively much

more dispersed than the ones corresponding to the two robust ones.

2 1 0 1 2

0.2

0.1

0

0.1

0.2

0.3

0.4

0.5LS

Normal quantiles

resi

dual

s

2 1 0 1 2

0

1

2

3

4

5

6

7MM

Normal quantiles

resi

dual

s

2 1 0 1 21

0

1

2

3

4

5

6

7RPLS

Normal quantiles

resi

dual

s

2 1 0 1 2

0.4

0.2

0

0.2

0.4

0.6

CPLS

Normal quantiles

resi

dual

s

Figure 3: Vessel data: residual QQ plots for compound 13, with bootstrap 95%envelopes.

The edf for RR�LS and for RR�MM with KCV = n and KCV = 5 were

respectively 108.3, 10.0 and 5.4, and both versions of PLS chose 5 components.

The predictive behavior of the estimators was assessed through 5�fold CV;

the criteria used were the root mean squared error (RMSE) and RMSE with

upper 10% trimming (RMSE(0.9)), considered to be safer in the presence of

outliers. Table 5 shows the results.

24

RMSE RMSE(0.9)RR� LS, CV(n) 0.194 0.133RR� LS, CV(5) 0.186 0.119RR�MM, CV(n) 0.871 0.091RR�MM, CV(5) 0.892 0.088

RSIMPLS 0.890 0.113CSIMPLS 0.510 0.194

Table 5: Vessel data: RMSEs for compound 13

According to RMSE(0.9), both versions of RR�MM have a similar perfor-

mance, which is clearly better than those of the other estimators. However,

according to RMSE they perform much worse than RR�LS. The plausible rea-

son is that the outliers in�ate RMSE, and this demonstrates the importance of

using robust criteria to evaluate robust estimators.

The left panel of Figure 4 shows the ridge trace for coe¢ cients 1, 50,100

and 150 of �1 as a function of � (in logarithmic scale), and the right one shows

the estimated robust MSEs. The latter panel shows that the absolute minimum

occurs at � = 2575; i.e. log10 (�) = 3:41: The coe¢ cients are seen to oscillate

wildly for low � and then to stabilize.

8 Computing times

Table 6 gives the computing times (in minutes) for the robust estimators. The

�rst four rows are the averages of �ve samples generated as in Section 6, and the

last one corresponds to the data in Section 7. The computation was performed

in Matlab on a 3.01 GHz PC with an Intel Core Duo CPU.

It is seen that in the last two rows, the times required by RR�MM with

KCV = 5 are about �ve times those for KCV = n; what is to be expected,

since the latter does not actually perform the cross-validation. But in the other

cases, the di¤erence is about ten�fold, which is surprising. It is also seen that

25

2 3 40.01

0.005

0

0.005

0.01

0.015

log(lambda)

beta

s

2 3 40.011

0.0115

0.012

0.0125

0.013

0.0135

0.014

log(lambda)

MS

E

Figure 4: Vessel data: Ridge trace for coe¢ cients 1 (o), 50 (*), 100 (x) and 150(+) (left) panel) and estimated MSE (right), as a function of log10 (�) :

RSIMPLS is very fast, and the modi�ed Silvapulle estimator is still faster.

9 The problem of estimating variability

An approach to estimate the variability of estimators and �tted values is to

derive an (exact or approximate) expression for their variances. For RR�LS it

is easy to derive a simple formula for the covariance matrix of b�, for a givenn p RR-MM, CV(n) RR-MM, CV(5) RSIMPLS Silv. Mod100 10 0.0203 0.2705 0.1242 0.0019100 20 0.0597 0.6847 0.1251 0.0028100 40 0.1114 1.0261 0.1305 0.0054100 200 0.4040 2.7384 0.1413 �180 486 3.413 15.932 0.517 �

Table 6: Computing times in minutes

26

�: But the fact that actually � depends on the data makes the analysis rather

complicated. I do not know of any theoretical results in this direction.

The situation is even more complex for the robust estimators. Deriving

the asymptotic distributions of robust RR estimators for �xed �; seems feasible

(though not trivial) if n!1 with p �xed. But the result would be irrelevant, for

in practical terms this means that p=n is negligible, and in this case there would

not be much di¤erence between ridge and standard (i.e., with � = 0) regression.

The practically interesting cases are those where p=n is not negligible (or even

>1), and the corresponding asymptotic theory would have p tend to in�nity

with n; with p=n tending to a positive constant. This seems a very di¢ cult

task, even for �xed �; and even more so for an estimated �:

Another approach is to use the bootstrap, which implies recomputing the

estimate at least several hundred times. As Table 6 shows, this is feasible when

p < n for the modi�ed Silvapulle estimator, or for RR-MM if the data set is

small. In the case of large p > n, when only the latter can be employed, the

required time seems too large for practical purposes. Therefore, estimating the

variability of the robust RR estimators for large data sets would require either

a solution to the problem of approximating their asymptotic distribution, or a

way to drastically reduce computing times.

10 Conclusions

The modi�ed Silvapulle estimator is simple and fast. The results of Section

5 for n > p indicate that it has a good performance if p=n is small and the

contamination rate is low. Otherwise, RR�MM with either KCV = n or = 5

perform better, the latter being the best.

When n < p the modi�ed Silvapulle estimator cannot be employed. The

results of Sections 6 and 7 indicate that (at least for the situations considered

27

there) both versions of RR�MM outperform RSIMPLS, with respect to both

e¢ ciency and robustness. Although in general RR�MM with KCV = 5 does

better than with KCV = n; the advantages of the former do not seem so clear

as for n > p: Since the di¤erence between the respective computing times is at

least �ve-fold, KCV = n should be preferred for high�dimensional data.

11 Supplemental material

Appendix Contains details of the description of the estimators and of the

algorithms, and details of derivations (.pdf �le).

RidgeCode Matlab code for the computing of RR�MM (zipped RAR �les).

ACKNOWLEDGEMENTS

The author thanks the editor, the associate editor, and two referees for many

constructive comments and suggestions, which greatly improved the quality of

the article. This research was partially supported by grants X�018 from Univer-

sity of Buenos Aires, PID 5505 from CONICET and PICT 00899 from ANPCyT,

Argentina.

References

Bühlmann, P. (2006). �Boosting for High-Dimensional Linear Models�. The

Annals of Statistics, 34, 559-583.

Crambes, C., Kneip, A. and Sarda, P. (2009). �Smoothing Splines Estima-

tors for Functional Linear Regression�, The Annals of Statistics, 37, 35-72.

Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). �Least Angle

Regression�. The Annals of Statistics, 32, 407�499.

Frank, I.E. and Friedman, J.H (1993). �A Statistical View of Some Chemo-

metrics Regression Tools�, Technometrics, 2, 109�135.

28

Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statis-

tical Learning; Data Mining, Inference and Prediction. New York: Springer.

Hoerl, A.E., and Kennard, R.W. (1970), �Ridge Regression: Biased Estima-

tion for Nonorthogonal Problems,�Technometrics, 8, 27-51.

Hoerl, A.E., Kennard, R.W. and Baldwin, K.F. (1975). �Ridge Regression:

Some Simulations�. Communications in Statistics A (Theory and Methods), 4,

105-123.

Huber, P.J. (1964), �Robust Estimation of a Location Parameter�, The An-

nals of Mathematical Statistics, 35, 73-101.

Hubert, M. and Vanden Branden, K. (2003). �Robust Methods for Partial

Least Squares Regression�, Journal of Chemometrics, 17, 537�549.

Janssens, K., Deraedt, I., Freddy, A. and Veekman, J. (1998). �Composition

of 15-17th Century Archeological Glass Vessels Excavated in Antwerp, Belgium�.

Mikrochimica Acta, 15 (Suppl.), 253-267.

Krasker, W.S. and Welsch, R.E. (1982). �E¢ cient bounded in�uence regres-

sion estimation�. Journal of the American Statistical Association, 77, 595�604.

Lawrence, K.D. and Marsh, L.C. (1984). �Robust Ridge Estimation Meth-

ods for Predicting U.S. Coal Mining Fatalities�, Communications in Statistics,

Theory and Methods, 13, 139-149.

Locantore, N., Marron, J.S., Simpson, D.G., Tripoli, N., Zhang, J.T. and

Cohen, K.L. (1999), �Robust Principal Components for Functional Data�, Test,

8, 1�28.

Maronna, R.A., Martin, R.D. and Yohai, V.J. (2006). "Robust Statistics:

Theory and Methods", John Wiley and Sons, New York.

Maronna, R.A. and Yohai, V.J. (2009). �Robust Functional Linear Regres-

sion Based on Splines�. Available at http://mate.unlp.edu.ar/publicaciones/23112009231134.pdf

Maronna, R.A. and Yohai, V.J. (2010). �Correcting MM Estimates for �Fat�

29

Data Sets.�Computational Statistics & Data Analysis, 54, 3168-3173.

Peña, D. and Yohai, V.J. (1999), �A Fast Procedure for Outlier Diagnostics

in Large Regression Problems�. Journal of the American Statistical Association,

94, 434-445.

Silvapulle, M.J. (1991). �Robust Ridge Regression Based on an M�estimator�.

Australian Journal of Statistics, 33, 319-333.

Simpson, J.R. and Montgomery, D.C. (1996). �A Biased-robust Regression

Technique for the Combined Outlier-multicollinearity Problem�. Journal of

Statistical Computing and Simulation, 56, 1-20.

Tibshirani, R. (1996). �Regression Shrinkage and Selection via the Lasso�.

Journal of the Royal Statistical Society, Series B, 58, 267�288.

Yohai, V.J. (1987). �High Breakdown�point and High E¢ ciency Estimates

for Regression�, The Annals of Statistics 15, 642�65.

Zou, H. and Hastie, T. (2005). �Regularization and Variable Selection via

the Elastic Net�. Journal of the Royal Statistical Society (B), 67, 301-320.

30

Robust ridge regression for highŒdimensional data · 2010. 12. 26. · with the Supplemental...

Documents

Transcript of Robust ridge regression for highŒdimensional data · 2010. 12. 26. · with the Supplemental...