
Department of Biostatistics / Department of Statistics and Operations Research

    Refresher course, Summer 2011

    Linear Algebra

    Original Author:

Oleg Mayba (UC Berkeley, 2006)

    Modified By:

Eric Lock (UNC, 2010 & 2011)

Instructor: Eric Lock

    (UNC at Chapel Hill)

Based on the NSF-sponsored (DMS Grant No. 0130526) VIGRE Boot camp lecture notes in the Department of Statistics, University of California, Berkeley

    June 7, 2011


    Contents

1 Introduction

2 Vector Spaces
   2.1 Basic Concepts
   2.2 Orthogonality
   2.3 Gram-Schmidt Process
   Exercises

3 Matrices and Matrix Algebra
   3.1 Matrix Operations
   3.2 Special Matrices
   3.3 Fundamental Spaces
   Exercises

4 Least Squares Estimation
   4.1 Projections
   4.2 Applications to Statistics
   Exercises

5 Differentiation
   5.1 Basics
   5.2 Jacobian and Chain Rule
   Exercises

6 Matrix Decompositions
   6.1 Determinants
   6.2 Eigenvalues and Eigenvectors
   6.3 Complex Matrices and Basic Results
   6.4 SVD and Pseudo-inverse
   Exercises

7 Statistics: Random Variables
   7.1 Expectation, Variance and Covariance
   7.2 Distribution of Functions of Random Variables
   7.3 Derivation of Common Univariate Distributions
   7.4 Random Vectors: Expectation and Variance
   Exercises

8 Further Applications to Statistics: Normal Theory and F-test
   8.1 Bivariate Normal Distribution
   8.2 F-test
   Exercises

9 References


    1 Introduction

These notes are intended for use in the warm-up camp for incoming UNC STOR and Biostatistics graduate students. Welcome to Carolina! We assume that you have taken a linear algebra course before and that most of the material in these notes will be a review of what you already know. If some of the material is unfamiliar, do not be intimidated! We hope you find these notes helpful! If not, you can consult the references listed at the end, or any other textbooks of your choice, for more information or another style of presentation (most of the proofs in the linear algebra part have been adopted from Strang, the proof of the F-test from Montgomery et al., and the proof of the bivariate normal density from Bickel and Doksum). Go Tar Heels!

    2 Vector Spaces

A set $V$ is a vector space over $\mathbb{R}$, and its elements are called vectors, if there are 2 operations defined on it:

1. Vector addition, that assigns to each pair of vectors $v_1, v_2 \in V$ another vector $w \in V$ (we write $v_1 + v_2 = w$)

2. Scalar multiplication, that assigns to each vector $v \in V$ and each scalar $r \in \mathbb{R}$ another vector $w \in V$ (we write $rv = w$)

that satisfy the following 8 conditions $\forall\, v_1, v_2, v_3 \in V$ and $\forall\, r_1, r_2 \in \mathbb{R}$:

1. $v_1 + v_2 = v_2 + v_1$
2. $(v_1 + v_2) + v_3 = v_1 + (v_2 + v_3)$
3. $\exists$ a vector $0 \in V$ s.t. $v + 0 = v$, $\forall\, v \in V$
4. $\forall\, v \in V$ $\exists\, w \in V$ s.t. $v + w = 0$
5. $r_1(r_2 v) = (r_1 r_2)v$, $\forall\, v \in V$
6. $(r_1 + r_2)v = r_1 v + r_2 v$, $\forall\, v \in V$
7. $r(v_1 + v_2) = r v_1 + r v_2$, $\forall\, r \in \mathbb{R}$
8. $1v = v$, $\forall\, v \in V$

Vector spaces over fields other than $\mathbb{R}$ are defined similarly, with the multiplicative identity of the field replacing 1. We won't concern ourselves with those spaces, except for when we'll be needing complex numbers later on. Also, we'll be using the symbol 0 to designate both the number 0 and the vector 0 in $V$, and you should always be able to tell the difference from the context. Sometimes, we'll emphasize that we're dealing with, say, the $n \times 1$ vector 0 by writing $0_{n\times 1}$.

    Examples:


[Figure 1: Vector Addition and Scalar Multiplication (vectors $v$, $w$, $v + w$, and $2w$)]

1. The vector space $\mathbb{R}^n$ with the usual operations of element-wise addition and scalar multiplication. An example of these operations in $\mathbb{R}^2$ is illustrated in Figure 1.

2. The vector space $F_{[-1,1]}$ of all functions defined on the interval $[-1, 1]$, where we define $(f+g)(x) = f(x) + g(x)$ and $(rf)(x) = rf(x)$.

    2.1 Basic Concepts

We say that $S \subseteq V$ is a subspace of $V$ if $S$ is closed under vector addition and scalar multiplication, i.e.

1. $\forall\, s_1, s_2 \in S$, $s_1 + s_2 \in S$
2. $\forall\, s \in S$, $r \in \mathbb{R}$, $rs \in S$

You can verify that if those conditions hold, $S$ is a vector space in its own right (it satisfies the 8 conditions above). Note also that $S$ has to be non-empty; the empty set is not allowed as a subspace.

    Examples:

1. The subset $\{0\}$ is always a subspace of a vector space $V$.

2. Given vectors $v_1, v_2, \ldots, v_n \in V$, the set of all their linear combinations (see below for the definition) is a subspace of $V$.

3. $S = \{(x, y) \in \mathbb{R}^2 : y = 0\}$ is a subspace of $\mathbb{R}^2$ (the x-axis).

4. The set of all continuous functions defined on the interval $[-1, 1]$ is a subspace of $F_{[-1,1]}$.

    For all of the above examples, you should check for yourself that they are in fact subspaces.

Given vectors $v_1, v_2, \ldots, v_n \in V$, we say that $w \in V$ is a linear combination of $v_1, v_2, \ldots, v_n$ if for some $r_1, r_2, \ldots, r_n \in \mathbb{R}$, we have $w = r_1 v_1 + r_2 v_2 + \ldots + r_n v_n$. If every vector in $V$ is a linear combination of $v_1, v_2, \ldots, v_n$, then we say that $v_1, v_2, \ldots, v_n$ span $V$.

Given vectors $v_1, v_2, \ldots, v_n \in V$, we say that $v_1, v_2, \ldots, v_n$ are linearly independent if $r_1 v_1 + r_2 v_2 + \ldots + r_n v_n = 0 \implies r_1 = r_2 = \ldots = r_n = 0$, i.e. the only linear combination of $v_1, v_2, \ldots, v_n$ that produces the 0 vector is the trivial one. We say that $v_1, v_2, \ldots, v_n$ are linearly dependent otherwise.

Now suppose that $v_1, v_2, \ldots, v_n$ span $V$ and that, moreover, they are linearly independent. Then we say that the set $\{v_1, v_2, \ldots, v_n\}$ is a basis for $V$.

Theorem: Let $\{v_1, v_2, \ldots, v_n\}$ be a basis for $V$, and let $\{w_1, w_2, \ldots, w_m\}$ be another basis for $V$. Then $n = m$.

Proof: As $\{v_1, v_2, \ldots, v_n\}$ span $V$, we can write
$$w_i = c_{i1} v_1 + \ldots + c_{in} v_n$$
for each $i = 1, \ldots, m$. Note that
$$0 = r_1 w_1 + r_2 w_2 + \ldots + r_m w_m = \sum_{i=1}^m r_i\,(c_{i1} v_1 + c_{i2} v_2 + \ldots + c_{in} v_n) = \sum_{i=1}^n (r_1 c_{1i} + r_2 c_{2i} + \ldots + r_m c_{mi})\, v_i$$
is satisfied only if
$$r_1 c_{1i} + r_2 c_{2i} + \ldots + r_m c_{mi} = 0 \quad \forall\, i = 1, \ldots, n.$$
This yields $n$ equations in $m$ unknowns $(r_1, \ldots, r_m)$. Hence if $m > n$ there are multiple solutions for $r_1, \ldots, r_m$, contradicting the linear independence of $w_1, \ldots, w_m$. Therefore we must have $m \le n$, and by an analogous argument $n \le m$.

We call the unique number of vectors in a basis for $V$ the dimension of $V$ (denoted $\dim(V)$).

    Examples:

1. $S = \{0\}$ has dimension 0.

2. Any set of vectors that includes the 0 vector is linearly dependent (why?).

3. If $V$ has dimension $n$, and we're given $k < n$ linearly independent vectors in $V$, then we can extend this set of vectors to a basis.

4. Let $v_1, v_2, \ldots, v_n$ be a basis for $V$. Then if $v \in V$, $v = r_1 v_1 + r_2 v_2 + \ldots + r_n v_n$ for some $r_1, r_2, \ldots, r_n \in \mathbb{R}$. Moreover, these coefficients are unique, because if they weren't, we could also write $v = s_1 v_1 + s_2 v_2 + \ldots + s_n v_n$, and subtracting the two expressions we get $0 = v - v = (r_1 - s_1)v_1 + (r_2 - s_2)v_2 + \ldots + (r_n - s_n)v_n$, and since the $v_i$'s form a basis and are therefore linearly independent, we have $r_i = s_i\ \forall\, i$, and the coefficients are indeed unique.


5. $v_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}$ and $v_2 = \begin{pmatrix} 5 \\ 0 \end{pmatrix}$ both span the x-axis, which is a subspace of $\mathbb{R}^2$. Moreover, any one of these two vectors also spans the x-axis by itself (thus a basis is not unique, though the dimension is), and they are not linearly independent since $-5v_1 + 1v_2 = 0$.

6. $e_1 = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}$, $e_2 = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}$, and $e_3 = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}$ form the standard basis for $\mathbb{R}^3$, since every vector $\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}$ in $\mathbb{R}^3$ can be written as $x_1 e_1 + x_2 e_2 + x_3 e_3$, so the three vectors span $\mathbb{R}^3$, and their linear independence is easy to show. In general, $\mathbb{R}^n$ has dimension $n$.

7. Let $\dim(V) = n$, and let $v_1, v_2, \ldots, v_m \in V$ with $m > n$. Then $v_1, v_2, \ldots, v_m$ are linearly dependent.

2.2 Orthogonality

An inner product is a function $f : V \times V \to \mathbb{R}$ (which we denote by $f(v_1, v_2) = \langle v_1, v_2 \rangle$), s.t. $\forall\, v, w, z \in V$ and $r \in \mathbb{R}$:

1. $\langle v, w + rz \rangle = \langle v, w \rangle + r\langle v, z \rangle$
2. $\langle v, w \rangle = \langle w, v \rangle$
3. $\langle v, v \rangle \ge 0$, and $\langle v, v \rangle = 0$ iff $v = 0$

We note here that not all vector spaces have inner products defined on them, but we will only be dealing with the ones that do.

    Examples:

1. Given 2 vectors $x = (x_1, x_2, \ldots, x_n)'$ and $y = (y_1, y_2, \ldots, y_n)'$ in $\mathbb{R}^n$, we define their inner product $x'y = \langle x, y \rangle = \sum_{i=1}^n x_i y_i$. You can check yourself that the 3 properties above are satisfied, and the meaning of the notation $x'y$ will become clear from the next section.

2. Given $f, g \in F_{[-1,1]}$, we define $\langle f, g \rangle = \int_{-1}^{1} f(x)g(x)\,dx$. Once again, verification that this is indeed an inner product is left as an exercise.

We point out here the relationship in $\mathbb{R}^n$ between inner products and the length (or norm) of a vector. The length of a vector $x$ is $\|x\| = \sqrt{x_1^2 + x_2^2 + \ldots + x_n^2} = \sqrt{x'x}$, or $\|x\|^2 = x'x$.


We say that vectors $v, w$ in $V$ are orthogonal if $\langle v, w \rangle = 0$. Notice that the zero vector is the only vector orthogonal to itself (why?).

    Examples:

1. In $\mathbb{R}^n$ the notion of orthogonality agrees with our usual perception of it. If $x$ is orthogonal to $y$, then the Pythagorean theorem tells us that $\|x\|^2 + \|y\|^2 = \|x - y\|^2$. Expanding this in terms of inner products we get:
$$x'x + y'y = (x - y)'(x - y) = x'x - y'x - x'y + y'y, \quad \text{or} \quad 2x'y = 0,$$
and thus $\langle x, y \rangle = x'y = 0$.

2. Nonzero orthogonal vectors are linearly independent. Suppose we have $q_1, q_2, \ldots, q_n$, a set of nonzero mutually orthogonal ($\langle q_i, q_j \rangle = 0$ for $i \ne j$) vectors in $V$, and suppose that $r_1 q_1 + r_2 q_2 + \ldots + r_n q_n = 0$. Then taking the inner product of $q_1$ with both sides, we have $r_1\langle q_1, q_1 \rangle + r_2\langle q_1, q_2 \rangle + \ldots + r_n\langle q_1, q_n \rangle = \langle q_1, 0 \rangle = 0$. That reduces to $r_1\|q_1\|^2 = 0$, and since $q_1 \ne 0$, we conclude that $r_1 = 0$. Similarly, $r_i = 0$ for all $1 \le i \le n$, and we conclude that $q_1, q_2, \ldots, q_n$ are linearly independent.

3. Suppose we have an $n \times 1$ vector of observations $x = (x_1, x_2, \ldots, x_n)'$. Then if we let $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$, we can see that the vector $e = (x_1 - \bar{x},\, x_2 - \bar{x},\, \ldots,\, x_n - \bar{x})'$ is orthogonal to the vector $(\bar{x}, \bar{x}, \ldots, \bar{x})'$, since
$$\sum_{i=1}^n \bar{x}(x_i - \bar{x}) = \bar{x}\sum_{i=1}^n x_i - \bar{x}\sum_{i=1}^n \bar{x} = n\bar{x}^2 - n\bar{x}^2 = 0.$$
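As a quick numerical illustration of example 3 (a sketch, not part of the original notes; the data vector below is made up), the centered vector and the constant mean vector are indeed orthogonal:

```python
import numpy as np

x = np.array([2.0, 5.0, 1.0, 4.0])   # an arbitrary vector of observations
xbar = x.mean()                      # sample mean
e = x - xbar                         # centered vector (x_i - xbar)
m = np.full_like(x, xbar)            # constant vector (xbar, ..., xbar)'

print(np.dot(e, m))                  # inner product; 0 up to rounding error
```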

Suppose $S, T$ are subspaces of $V$. Then we say that they are orthogonal subspaces if every vector in $S$ is orthogonal to every vector in $T$. We say that $S$ is the orthogonal complement of $T$ in $V$ if $S$ contains ALL vectors orthogonal to vectors in $T$, and we write $S = T^\perp$. For example, the x-axis and y-axis are orthogonal subspaces of $\mathbb{R}^3$, but they are not orthogonal complements of each other, since the y-axis does not contain $(0, 0, 1)'$, which is perpendicular to every vector in the x-axis. However, the y-z plane and the x-axis ARE orthogonal complements of each other in $\mathbb{R}^3$. You should prove as an exercise that if $\dim(V) = n$ and $\dim(S) = k$, then $\dim(S^\perp) = n - k$.

    2.3 Gram-Schmidt Process

Suppose we're given linearly independent vectors $v_1, v_2, \ldots, v_n$ in $V$, and there's an inner product defined on $V$. Then we know that $v_1, v_2, \ldots, v_n$ form a basis for the subspace which they span (why?). Then, the Gram-Schmidt process can be used to construct an orthogonal basis for this subspace, as follows:

Let $q_1 = v_1$. Suppose $v_2$ is not orthogonal to $v_1$; then let $rq_1$ be the projection of $v_2$ on $q_1$, i.e. we want to find $r \in \mathbb{R}$ s.t. $q_2 = v_2 - rq_1$ is orthogonal to $q_1$. Well, we should have $\langle q_1, (v_2 - rq_1) \rangle = 0$, and we get $r = \frac{\langle q_1, v_2 \rangle}{\langle q_1, q_1 \rangle}$. Notice that the span of $q_1, q_2$ is the same as the span of $v_1, v_2$, since all we did was to subtract multiples of original vectors from other original vectors. Proceeding in a similar fashion, we obtain
$$q_i = v_i - \frac{\langle q_1, v_i \rangle}{\langle q_1, q_1 \rangle}\,q_1 - \ldots - \frac{\langle q_{i-1}, v_i \rangle}{\langle q_{i-1}, q_{i-1} \rangle}\,q_{i-1},$$
and we thus end up with an orthogonal basis for the subspace. If we furthermore divide each of the resulting vectors $q_1, q_2, \ldots, q_n$ by its length, we are left with an orthonormal basis, i.e. $\langle q_i, q_j \rangle = 0$ for $i \ne j$ and $\langle q_i, q_i \rangle = 1$ for all $i$ (why?). We call these vectors that have length 1 unit vectors.

You can now construct an orthonormal basis for the subspace of $F_{[-1,1]}$ spanned by $f(x) = 1$, $g(x) = x$, and $h(x) = x^2$ (Exercise 2.6 (b)). An important point to take away is that given any basis for a finite-dimensional $V$, if there's an inner product defined on $V$, we can always turn the given basis into an orthonormal basis.
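The recipe above translates directly into code. Below is a minimal sketch of classical Gram-Schmidt for vectors in $\mathbb{R}^n$ with the usual dot product (the function name and the test vectors are our own, not from the notes):

```python
import numpy as np

def gram_schmidt(vectors):
    """Return an orthonormal basis (as matrix columns) for the span of the
    given linearly independent vectors, by subtracting projections."""
    basis = []
    for v in vectors:
        q = v.astype(float)
        for b in basis:
            q = q - np.dot(b, v) * b          # subtract projection of v onto each earlier q
        basis.append(q / np.linalg.norm(q))   # normalize to a unit vector
    return np.column_stack(basis)

V = [np.array([1.0, 1.0, 0.0]), np.array([1.0, 0.0, 1.0]), np.array([0.0, 1.0, 1.0])]
Q = gram_schmidt(V)
print(np.round(Q.T @ Q, 10))                  # should print the 3x3 identity matrix
```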

    Exercises

2.1 Show that the space $F_0$ of all differentiable functions $f : \mathbb{R} \to \mathbb{R}$ with $\frac{df}{dx} = 0$ defines a vector space.

2.2 Verify for yourself that the two conditions for a subspace are independent of each other, by coming up with 2 subsets of $\mathbb{R}^2$: one that is closed under addition and subtraction but NOT under scalar multiplication, and one that is closed under scalar multiplication but NOT under addition/subtraction.

2.3 (Strang, section 3.5 #17b) Let $V$ be the space of all vectors $v = [c_1\ c_2\ c_3\ c_4] \in \mathbb{R}^4$ with components adding to 0: $c_1 + c_2 + c_3 + c_4 = 0$. Find the dimension and give a basis for $V$.

2.4 Let $v_1, v_2, \ldots, v_n$ be a linearly independent set of vectors in $V$. Prove that if $n = \dim(V)$, then $v_1, v_2, \ldots, v_n$ form a basis for $V$.

2.5 If $F_{[-1,1]}$ is the space of all continuous functions defined on the interval $[-1, 1]$, show that $\langle f, g \rangle = \int_{-1}^{1} f(x)g(x)\,dx$ defines an inner product on $F_{[-1,1]}$.

2.6 Parts (a) and (b) concern the space $F_{[-1,1]}$, with inner product $\langle f, g \rangle = \int_{-1}^{1} f(x)g(x)\,dx$.

(a) Show that $f(x) = 1$ and $g(x) = x$ are orthogonal in $F_{[-1,1]}$.

(b) Construct an orthonormal basis for the subspace of $F_{[-1,1]}$ spanned by $f(x) = 1$, $g(x) = x$, and $h(x) = x^2$.

2.7 If a subspace $S$ is contained in a subspace $V$, prove that $S^\perp$ contains $V^\perp$.


    3 Matrices and Matrix Algebra

An $m \times n$ matrix $A$ is a rectangular array of numbers that has $m$ rows and $n$ columns, and we write:
$$A = \begin{pmatrix} a_{11} & a_{12} & \ldots & a_{1n} \\ a_{21} & a_{22} & \ldots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \ldots & a_{mn} \end{pmatrix}$$
For the time being we'll restrict ourselves to real matrices, so for all $1 \le i \le m$ and $1 \le j \le n$, $a_{ij} \in \mathbb{R}$. Notice that a familiar vector $x = (x_1, x_2, \ldots, x_n)' \in \mathbb{R}^n$ is just an $n \times 1$ matrix (we say $x$ is a column vector). A $1 \times n$ matrix is referred to as a row vector. If $m = n$, we say that $A$ is square.

    3.1 Matrix Operations

    Matrix addition

Matrix addition is defined elementwise, i.e. $A + B = C$, where $c_{ij} = a_{ij} + b_{ij}$. Note that this implies that $A + B$ is defined only if $A$ and $B$ have the same dimensions. Also, note that $A + B = B + A$.

    Scalar multiplication

Scalar multiplication is also defined elementwise. If $r \in \mathbb{R}$, then $rA = B$, where $b_{ij} = r a_{ij}$. Any matrix can be multiplied by a scalar. Multiplication by 0 results in the zero matrix, and multiplication by 1 leaves the matrix unchanged, while multiplying $A$ by $-1$ results in the matrix $-A$, s.t. $A + (-A) = A - A = 0_{m \times n}$. You should check at this point that the set of all $m \times n$ matrices is a vector space with the operations of addition and scalar multiplication as defined above.

    Matrix multiplication

Matrix multiplication is trickier. Given an $m \times n$ matrix $A$ and a $p \times q$ matrix $B$, $AB$ is only defined if $n = p$. In that case we have $AB = C$, where $c_{ij} = \sum_{k=1}^n a_{ik} b_{kj}$, i.e. the $i,j$-th element of $AB$ is the inner product of the $i$-th row of $A$ and the $j$-th column of $B$, and the resulting product matrix is $m \times q$. You should at this point come up with your own examples of $A, B$ s.t. both $AB$ and $BA$ are defined, but $AB \ne BA$. Thus matrix multiplication is, in general, non-commutative. Below we list some very useful ways to think about matrix multiplication:

1. Suppose $A$ is an $m \times n$ matrix and $x$ is an $n \times 1$ column vector. Then if we let $a_1, a_2, \ldots, a_n$ denote the respective columns of $A$, and $x_1, x_2, \ldots, x_n$ denote the components of $x$, we get an $m \times 1$ vector $Ax = x_1 a_1 + x_2 a_2 + \ldots + x_n a_n$, a linear combination of the columns of $A$. Thus applying matrix $A$ to a vector always returns a vector in the column space of $A$ (see below for the definition of column space).

2. Now, let $A$ be $m \times n$, and let $x$ be a $1 \times m$ row vector. Let $a_1, a_2, \ldots, a_m$ denote the rows of $A$, and $x_1, x_2, \ldots, x_m$ denote the components of $x$. Then multiplying $A$ on the left by $x$, we obtain a $1 \times n$ row vector $xA = x_1 a_1 + x_2 a_2 + \ldots + x_m a_m$, a linear combination of the rows of $A$. Thus multiplying a matrix on the left by a row vector always returns a vector in the row space of $A$ (see below for the definition of row space).

3. Now let $A$ be $m \times n$ and $B$ be $n \times k$, let $a_1, a_2, \ldots, a_n$ denote the columns of $A$ and $b_1, b_2, \ldots, b_k$ denote the columns of $B$, and let $c_j$ denote the $j$-th column of the $m \times k$ matrix $C = AB$. Then $c_j = Ab_j = b_{1j} a_1 + b_{2j} a_2 + \ldots + b_{nj} a_n$, i.e. we get the columns of the product matrix by applying $A$ to the columns of $B$. Notice that this also implies that every column of the product matrix is a linear combination of the columns of $A$.

4. Once again, consider $m \times n$ $A$ and $n \times k$ $B$, and let $a_1, a_2, \ldots, a_m$ denote the rows of $A$ (they are, of course, just $1 \times n$ row vectors). Then letting $c_i$ denote the $i$-th row of $C = AB$, we have $c_i = a_i B$, i.e. we get the rows of the product matrix by applying the rows of $A$ to $B$. Notice that this means that every row of $C$ is a linear combination of the rows of $B$.

5. Finally, let $A$ be $m \times n$ and $B$ be $n \times k$. Then if we let $a_1, a_2, \ldots, a_n$ denote the columns of $A$ and $b_1, b_2, \ldots, b_n$ denote the rows of $B$, then $AB = a_1 b_1 + a_2 b_2 + \ldots + a_n b_n$, the sum of $n$ matrices, each of which is the product of a column and a row (check this for yourself!). A short numerical check of these views follows below.
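The sketch below (with small matrices of our own making, not from the notes) verifies views 3 and 5 numerically: the product equals both the "apply $A$ to each column of $B$" construction and the sum of column-times-row (outer product) matrices.

```python
import numpy as np

A = np.array([[1., 2., 0.],
              [3., 1., 4.]])            # 2 x 3
B = np.array([[1., 0.],
              [2., 1.],
              [0., 5.]])                # 3 x 2

AB = A @ B                              # ordinary matrix product (2 x 2)

# View 3: the j-th column of AB is A applied to the j-th column of B.
cols = np.column_stack([A @ B[:, j] for j in range(B.shape[1])])

# View 5: AB is the sum of (column of A) times (row of B) outer products.
outer_sum = sum(np.outer(A[:, k], B[k, :]) for k in range(A.shape[1]))

print(np.allclose(AB, cols), np.allclose(AB, outer_sum))   # True True
```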

Let $A$ be $m \times n$; then the transpose of $A$ is the $n \times m$ matrix $A'$, s.t. $a'_{ij} = a_{ji}$. Now the notation we used to define the inner product on $\mathbb{R}^n$ makes sense, since given two $n \times 1$ column vectors $x$ and $y$, their inner product $\langle x, y \rangle$ is just $x'y$ according to matrix multiplication.

Let $I_{n\times n}$ denote the $n \times n$ identity matrix, i.e. the matrix that has 1's down its main diagonal and 0's everywhere else (in the future we might omit the dimensional subscript and just write $I$; the dimension should always be clear from the context). You should check that in that case, $I_{n\times n}A = AI_{n\times n} = A$ for every $n \times n$ $A$. We say that an $n \times n$ matrix $A$ has an $n \times n$ inverse, denoted $A^{-1}$, if $AA^{-1} = A^{-1}A = I_{n\times n}$. If $A$ has an inverse, we say that $A$ is invertible. Not every matrix has an inverse, as you can easily see by considering the $n \times n$ zero matrix. We will assume that you are familiar with the use of elimination to calculate inverses of invertible matrices and will not present this material. The following are some important results about inverses and transposes:

1. $(AB)' = B'A'$

Proof: Can be shown directly through entry-by-entry comparison of $(AB)'$ and $B'A'$.

2. If $A$ is invertible and $B$ is invertible, then $AB$ is invertible, and $(AB)^{-1} = B^{-1}A^{-1}$.

Proof: Exercise 3.1(a).

3. If $A$ is invertible, then $(A^{-1})' = (A')^{-1}$.

Proof: Exercise 3.1(b).


4. $A$ is invertible iff $Ax = 0 \implies x = 0$ (we say that $N(A) = \{0\}$, where $N(A)$ is the nullspace of $A$, to be defined shortly).

Proof: Assume $A^{-1}$ exists. Then,
$$Ax = 0 \implies A^{-1}(Ax) = A^{-1}0 \implies x = 0.$$
Now, assume $Ax = 0$ implies $x = 0$. Then the columns $a_1, \ldots, a_n$ of $A$ are linearly independent and therefore form a basis for $\mathbb{R}^n$ (Exercise 2.4). So, if $e_1 = (1, 0, \ldots, 0)'$, $e_2 = (0, 1, \ldots, 0)'$, ..., $e_n = (0, 0, \ldots, 1)'$, we can write
$$c_{1i} a_1 + c_{2i} a_2 + \ldots + c_{ni} a_n = A\begin{pmatrix} c_{1i} \\ c_{2i} \\ \vdots \\ c_{ni} \end{pmatrix} = e_i$$
for all $i = 1, \ldots, n$. Hence, if $C$ is given by $C_{ij} = c_{ij}$, then
$$AC = [e_1\ e_2\ \ldots\ e_n] = I_n.$$
So, $C = A^{-1}$ and $A$ is invertible.

    3.2 Special Matrices

A square matrix $A$ is said to be symmetric if $A = A'$. If $A$ is symmetric, then $A^{-1}$ is also symmetric (Exercise 3.2). A square matrix $A$ is said to be orthogonal if $A' = A^{-1}$. You should prove that the columns of an orthogonal matrix are orthonormal, and so are the rows. Conversely, any square matrix with orthonormal columns is orthogonal. We note that orthogonal matrices preserve lengths and inner products:
$$\langle Qx, Qy \rangle = x'Q'Qy = x'I_{n\times n}y = x'y.$$
In particular, $\|Qx\| = \sqrt{x'Q'Qx} = \|x\|$. Also, if $A$ and $B$ are orthogonal, then so are $A^{-1}$ and $AB$.

We say that a square matrix $A$ is idempotent if $A^2 = A$. We say that a square matrix $A$ is positive definite if $A$ is symmetric and if for all $n \times 1$ vectors $x \ne 0_{n\times 1}$ we have $x'Ax > 0$. We say that $A$ is positive semi-definite (or non-negative definite) if $A$ is symmetric and for all $n \times 1$ vectors $x \ne 0_{n\times 1}$ we have $x'Ax \ge 0$. You should prove for yourself that every positive definite matrix is invertible (Exercise 3.3). One can also show that if $A$ is positive definite, then so is $A^2$ (more generally, if $A$ is positive semi-definite, then so is $A^2$).


We say that a square matrix $A$ is diagonal if $a_{ij} = 0$ $\forall\, i \ne j$. We say that $A$ is upper triangular if $a_{ij} = 0$ $\forall\, i > j$. Lower triangular matrices are defined similarly.

We also introduce another concept here: for a square matrix $A$, its trace is defined to be the sum of the entries on the main diagonal ($\mathrm{tr}(A) = \sum_{i=1}^n a_{ii}$). For example, $\mathrm{tr}(I_{n\times n}) = n$. You may prove for yourself (by the method of entry-by-entry comparison) that $\mathrm{tr}(AB) = \mathrm{tr}(BA)$, and $\mathrm{tr}(ABC) = \mathrm{tr}(CAB)$. It's also immediately obvious that $\mathrm{tr}(A + B) = \mathrm{tr}(A) + \mathrm{tr}(B)$.
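A quick numerical sanity check of these trace identities (with arbitrary random matrices; an illustration, not a proof):

```python
import numpy as np

rng = np.random.default_rng(0)
A, B, C = rng.standard_normal((3, 4, 4))   # three random 4 x 4 matrices

print(np.isclose(np.trace(A @ B), np.trace(B @ A)))          # tr(AB) = tr(BA)
print(np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B)))  # tr(ABC) = tr(CAB)
print(np.isclose(np.trace(A + B), np.trace(A) + np.trace(B)))
```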

    3.3 Fundamental Spaces

Let $A$ be $m \times n$. We will denote by $col(A)$ the subspace of $\mathbb{R}^m$ that is spanned by the columns of $A$, and we'll call this subspace the column space of $A$. Similarly, we define the row space of $A$ to be the subspace of $\mathbb{R}^n$ spanned by the rows of $A$, and we notice that it is precisely $col(A')$.

Now, let $N(A) = \{x \in \mathbb{R}^n : Ax = 0\}$. You should check for yourself that this set, which we call the kernel or nullspace of $A$, is indeed a subspace of $\mathbb{R}^n$. Similarly, we define the left nullspace of $A$ to be $\{x \in \mathbb{R}^m : x'A = 0\}$, and we notice that this is precisely $N(A')$.

The fundamental theorem of linear algebra states:

1. $\dim(col(A)) = r = \dim(col(A'))$. The dimension of the column space is the same as the dimension of the row space. This dimension is called the rank of $A$.

2. $col(A) = (N(A'))^\perp$ and $N(A) = (col(A'))^\perp$. The column space is the orthogonal complement of the left nullspace in $\mathbb{R}^m$, and the nullspace is the orthogonal complement of the row space in $\mathbb{R}^n$. We also conclude that $\dim(N(A)) = n - r$, and $\dim(N(A')) = m - r$.

We will not present the proof of the theorem here, but we hope you are familiar with these results. If not, you should consider taking a course in linear algebra (math 383).

We can see from the theorem that the columns of $A$ are linearly independent iff the nullspace doesn't contain any vector other than zero. Similarly, the rows are linearly independent iff the left nullspace doesn't contain any vector other than zero.

We now make some remarks about solving equations of the form $Ax = b$, where $A$ is an $m \times n$ matrix, $x$ is an $n \times 1$ vector, and $b$ is an $m \times 1$ vector, and we are trying to solve for $x$. First of all, it should be clear at this point that if $b \notin col(A)$, then a solution doesn't exist. If $b \in col(A)$, but the columns of $A$ are not linearly independent, then the solution will not be unique. That's because there will be many ways to combine the columns of $A$ to produce $b$, resulting in many possible $x$'s. Another way to see this is to notice that if the columns are dependent, the nullspace contains some non-trivial vector $x_0$, and if $x$ is some solution to $Ax = b$, then $x + x_0$ is also a solution. Finally, we notice that if $r = m$ (i.e. if the rows are linearly independent), then the columns MUST span the whole of $\mathbb{R}^m$, and therefore a solution exists for every $b$ (though it may not be unique).

We conclude, then, that if $r = m$, a solution to $Ax = b$ always exists, and if $r = n$, the solution (if it exists) is unique. This leads us to conclude that if $n = r = m$ (i.e. $A$ is a full-rank square matrix), the solution always exists and is unique. A proof based on elimination techniques (which you should be familiar with) then establishes that a square matrix $A$ is full-rank iff it is invertible.

We now give the following results:

1. $\mathrm{rank}(A'A) = \mathrm{rank}(A)$. In particular, if $\mathrm{rank}(A) = n$ (columns are linearly independent), then $A'A$ is invertible. Similarly, $\mathrm{rank}(AA') = \mathrm{rank}(A)$, and if the rows are linearly independent, $AA'$ is invertible.

Proof: Exercise 3.5.

2. $N(AB) \supseteq N(B)$.

Proof: Let $x \in N(B)$. Then,
$$(AB)x = A(Bx) = A0 = 0,$$
so $x \in N(AB)$.

3. $col(AB) \subseteq col(A)$: the column space of the product is a subspace of the column space of $A$.

Proof: Note that
$$col(AB)^\perp = N((AB)') = N(B'A') \supseteq N(A') = col(A)^\perp,$$
and taking orthogonal complements gives $col(AB) \subseteq col(A)$.

4. $col((AB)') \subseteq col(B')$: the row space of the product is a subspace of the row space of $B$.

Proof: Similar to (3).
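These rank and subspace facts are easy to probe numerically. The sketch below (random matrices of our own choosing) checks $\mathrm{rank}(A'A) = \mathrm{rank}(A)$ and $col(AB) \subseteq col(A)$ by comparing matrix ranks:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4)) @ rng.standard_normal((4, 5))   # 6 x 5, rank <= 4
B = rng.standard_normal((5, 3))

rank = np.linalg.matrix_rank
print(rank(A.T @ A) == rank(A))          # rank(A'A) = rank(A)
print(rank(A @ A.T) == rank(A))          # rank(AA') = rank(A)

# col(AB) is a subspace of col(A): appending the columns of AB to those of A
# should not increase the rank beyond rank(A).
print(rank(np.hstack([A, A @ B])) == rank(A))
```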

    Exercises

3.1 Prove the following results:

(a) If $A$ is invertible and $B$ is invertible, then $AB$ is invertible, and $(AB)^{-1} = B^{-1}A^{-1}$.

(b) If $A$ is invertible, then $(A^{-1})' = (A')^{-1}$.

3.2 Show that if $A$ is symmetric, then $A^{-1}$ is also symmetric.

3.3 Show that any positive definite matrix $A$ is invertible (think about nullspaces).

3.4 (Horn & Johnson 1.2.2) For $A: n \times n$ and invertible $S: n \times n$, show that $\mathrm{tr}(S^{-1}AS) = \mathrm{tr}(A)$. The matrix $S^{-1}AS$ is called a similarity of $A$.

3.5 Show that $\mathrm{rank}(A'A) = \mathrm{rank}(A)$. In particular, if $\mathrm{rank}(A) = n$ (columns are linearly independent), then $A'A$ is invertible. Similarly, show that $\mathrm{rank}(AA') = \mathrm{rank}(A)$, and if the rows are linearly independent, $AA'$ is invertible. (Hint: show that the nullspaces of the two matrices are the same.)


    4 Least Squares Estimation

    4.1 Projections

Suppose we have $n$ linearly independent vectors $a_1, a_2, \ldots, a_n$ in $\mathbb{R}^m$, and we want to find the projection of a vector $b$ in $\mathbb{R}^m$ onto the space spanned by $a_1, a_2, \ldots, a_n$, i.e. to find some linear combination $x_1 a_1 + x_2 a_2 + \ldots + x_n a_n = \hat{b}$ s.t. $b = \hat{b} + (b - \hat{b})$, with $(b - \hat{b})$ orthogonal to the space spanned by $a_1, a_2, \ldots, a_n$. It's clear that if $b$ is already in the span of $a_1, a_2, \ldots, a_n$, then $\hat{b} = b$ (the vector just projects to itself), and if $b$ is perpendicular to the space spanned by $a_1, a_2, \ldots, a_n$, then $\hat{b} = 0$ (the vector projects to the zero vector).

We can now re-write the above situation in matrix terms. Let $a_i$ now be the $i$-th column of the $m \times n$ matrix $A$. Then we want to find $x \in \mathbb{R}^n$ s.t. $(b - Ax) \perp col(A)$, or in other words $A'(b - Ax) = 0_{n\times 1}$. We now have $A'b = A'Ax$, or $x = (A'A)^{-1}A'b$ (why is $A'A$ invertible?). Then for every vector $b$ in $\mathbb{R}^m$, its projection onto the column space of $A$ is $Ax = A(A'A)^{-1}A'b$. We call the matrix $P = A(A'A)^{-1}A'$, which takes a vector in $\mathbb{R}^m$ and returns its projection onto $col(A)$, the projection matrix. We follow up with some properties of projection matrices:

1. $P$ is symmetric and idempotent (what should happen to a vector if you project it and then project it again?).

Proof: Exercise 4.1(a).

2. $I - P$ is the projection onto the orthogonal complement of $col(A)$ (i.e. the left nullspace of $A$).

Proof: Exercise 4.1(b).

3. Given any vector $b \in \mathbb{R}^m$ and any subspace $S$ of $\mathbb{R}^m$, $b$ can be written (uniquely) as the sum of its projections onto $S$ and $S^\perp$.

Proof: Assume $\dim(S) = q$, so $\dim(S^\perp) = m - q$. Let $A_S = [a_1\ a_2\ \ldots\ a_q]$ and $A_{S^\perp} = [a_{q+1}\ \ldots\ a_m]$ be such that $a_1, \ldots, a_q$ form a basis for $S$ and $a_{q+1}, \ldots, a_m$ form a basis for $S^\perp$. By property 2, if $P_S$ is the projection onto $col(A_S)$ and $P_{S^\perp}$ is the projection onto $col(A_{S^\perp})$, then $\forall\, b \in \mathbb{R}^m$
$$P_S(b) + P_{S^\perp}(b) = P_S(b) + (I - P_S)b = b.$$
As the columns of $A_S$ and $A_{S^\perp}$ are linearly independent, the vectors $a_1, a_2, \ldots, a_m$ form a basis of $\mathbb{R}^m$. Hence,
$$b = P_S(b) + P_{S^\perp}(b) = c_1 a_1 + \ldots + c_q a_q + c_{q+1} a_{q+1} + \ldots + c_m a_m$$
for unique $c_1, \ldots, c_m$.

4. $P(I - P) = (I - P)P = 0$ (what should happen to a vector when it's first projected onto $S$ and then onto $S^\perp$?).

Proof: Exercise 4.1(c).
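A minimal numerical sketch of the projection matrix and of properties 1, 2 and 4, using an arbitrary full-column-rank matrix $A$ of our own choosing (purely illustrative, not from the notes):

```python
import numpy as np

A = np.array([[1., 0.],
              [1., 1.],
              [1., 2.]])                     # 3 x 2, linearly independent columns

P = A @ np.linalg.inv(A.T @ A) @ A.T         # projection onto col(A)

print(np.allclose(P, P.T))                   # symmetric
print(np.allclose(P @ P, P))                 # idempotent
print(np.allclose(P @ (np.eye(3) - P), 0))   # P(I - P) = 0

b = np.array([1., 2., 7.])
b_hat = P @ b                                # projection of b onto col(A)
print(np.allclose(A.T @ (b - b_hat), 0))     # residual is orthogonal to col(A)
```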


4.2 Applications to Statistics

... to find out how to carry out such a differentiation) and set it to zero (since if $\hat{\beta}$ minimizes the expression, the derivative at $\hat{\beta}$ should be 0) to get $-X'Y - X'Y + 2X'X\beta = 0$, or once again $\hat{\beta} = (X'X)^{-1}X'Y$. The projected values $\hat{Y} = X(X'X)^{-1}X'Y$ are known as fitted values, and the portion $e = Y - \hat{Y}$ of $Y$ (which is orthogonal to the column space of $X$) is known as the residuals.

Finally, suppose there's a column $x_j$ in $X$ that is perpendicular to all other columns. Then because of the results on the separation of projections ($x_j$ spans the orthogonal complement in $col(X)$ of the space spanned by the rest of the columns), we can project $b$ onto the line spanned by $x_j$, then project $b$ onto the space spanned by the rest of the columns of $X$, and add the two projections together to get the overall projected value. What that means is that if we throw away the column $x_j$, the values of the coefficients in $\hat{\beta}$ corresponding to the other columns will not change. Thus inserting or deleting from $X$ columns orthogonal to the rest of the column space has no effect on the estimated coefficients in $\hat{\beta}$ corresponding to the rest of the columns.
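In code, the normal-equations solution, fitted values and residuals look as follows. This is a sketch with made-up $X$ and $Y$; in practice one would use a numerically stabler routine such as np.linalg.lstsq rather than forming $(X'X)^{-1}$ explicitly.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])  # design matrix with intercept
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.standard_normal(n)                          # response with noise

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y   # least squares estimate (X'X)^{-1} X'Y
Y_hat = X @ beta_hat                          # fitted values: projection of Y onto col(X)
e = Y - Y_hat                                 # residuals

print(beta_hat)
print(np.allclose(X.T @ e, 0))                # residuals orthogonal to col(X)
```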

    Exercises

    4.1 Prove the following properties of projection matrices:

(a) $P$ is symmetric and idempotent.

(b) $I - P$ is the projection onto the orthogonal complement of $col(A)$ (i.e. the left nullspace of $A$).

(c) $P(I - P) = (I - P)P = 0$

(d) $col(P) = col(A)$

(e) If $A$ is a matrix of rank $r$ and $P$ is the projection onto $col(A)$, then $\mathrm{tr}(P) = r$.


    5 Differentiation

    5.1 Basics

Here we just list the results on taking derivatives of expressions with respect to a vector of variables (as opposed to a single variable). We start out by defining what that actually means. Let $x = (x_1, x_2, \ldots, x_k)'$ be a vector of variables, and let $f$ be some real-valued function of $x$ (for example $f(x) = \sin(x_2) + x_4$ or $f(x) = x_1 x_7 + x_{11}\log(x_3)$). Then we define
$$\frac{\partial f}{\partial x} = \begin{pmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_k} \end{pmatrix}.$$

Below are the extensions:

1. Let $a \in \mathbb{R}^k$, and let $y = a'x = a_1 x_1 + a_2 x_2 + \ldots + a_k x_k$. Then $\frac{\partial y}{\partial x} = a$.

Proof: Follows immediately from the definition.

2. Let $y = x'x$; then $\frac{\partial y}{\partial x} = 2x$.

Proof: Exercise 5.1(a).

3. Let $A$ be $k \times k$, let $a$ be $k \times 1$, and let $y = a'Ax$. Then $\frac{\partial y}{\partial x} = A'a$.

Proof: Note that $a'A$ is $1 \times k$. Writing $y = a'Ax = (A'a)'x$, it is then clear from 1 that $\frac{\partial y}{\partial x} = A'a$.

4. Let $y = x'Ax$; then $\frac{\partial y}{\partial x} = Ax + A'x$, and if $A$ is symmetric, $\frac{\partial y}{\partial x} = 2Ax$. We call the expression $x'Ax = \sum_{i=1}^k \sum_{j=1}^k a_{ij} x_i x_j$ a quadratic form with corresponding matrix $A$.

Proof: Exercise 5.1(b).

    5.2 Jacobian and Chain Rule

A function $f : \mathbb{R}^n \to \mathbb{R}^m$ is said to be differentiable at $\bar{x}$ if there exists a linear function $L : \mathbb{R}^n \to \mathbb{R}^m$ such that
$$\lim_{x \to \bar{x},\, x \ne \bar{x}} \frac{\|f(x) - f(\bar{x}) - L(x - \bar{x})\|}{\|x - \bar{x}\|} = 0.$$
It is not hard to see that such a linear function $L$, if any, is uniquely defined by the above equation. It is called the differential of $f$ at $\bar{x}$. Moreover, if $f$ is differentiable at $\bar{x}$, then all of its partial derivatives exist, and we write the Jacobian matrix of $f$ at $\bar{x}$ by arranging its partial derivatives into an $m \times n$ matrix,
$$Df(\bar{x}) = \begin{pmatrix} \frac{\partial f_1}{\partial x_1}(\bar{x}) & \cdots & \frac{\partial f_1}{\partial x_n}(\bar{x}) \\ \vdots & & \vdots \\ \frac{\partial f_m}{\partial x_1}(\bar{x}) & \cdots & \frac{\partial f_m}{\partial x_n}(\bar{x}) \end{pmatrix}.$$
It is not hard to see that the differential $L$ is exactly represented by the Jacobian matrix $Df(\bar{x})$. Hence,
$$\lim_{x \to \bar{x},\, x \ne \bar{x}} \frac{\|f(x) - f(\bar{x}) - Df(\bar{x})(x - \bar{x})\|}{\|x - \bar{x}\|} = 0$$
whenever $f$ is differentiable at $\bar{x}$. In particular, if $f$ is of the form $f(x) = Mx + b$, then $Df(x) \equiv M$.

Now consider the case where $f$ is a function from $\mathbb{R}^n$ to $\mathbb{R}$. The Jacobian matrix $Df(x)$ is an $n$-dimensional row vector, whose transpose is the gradient. That is, $Df(x) = \nabla f(x)^T$. Moreover, if $f$ is twice differentiable and we define $g(x) = \nabla f(x)$, then the Jacobian matrix of $g$ is the Hessian matrix of $f$. That is,
$$Dg(x) = \nabla^2 f(x).$$

Suppose that $f : \mathbb{R}^n \to \mathbb{R}^m$ and $h : \mathbb{R}^k \to \mathbb{R}^n$ are two differentiable functions. The chain rule of differentiability says that the function $g$ defined by $g(x) = f(h(x))$ is also differentiable, with
$$Dg(x) = Df(h(x))\,Dh(x).$$
For the case $k = m = 1$, where $h$ is from $\mathbb{R}$ to $\mathbb{R}^n$ and $f$ is from $\mathbb{R}^n$ to $\mathbb{R}$, the equation above becomes
$$g'(x) = Df(h(x))\,Dh(x) = \langle \nabla f(h(x)), Dh(x) \rangle = \sum_{i=1}^n \partial_i f(h(x))\, h_i'(x),$$
where $\partial_i f(h(x))$ is the $i$th partial derivative of $f$ at $h(x)$ and $h_i'(x)$ is the derivative of the $i$th component of $h$ at $x$.

Finally, suppose that $f : \mathbb{R}^n \to \mathbb{R}^m$ and $h : \mathbb{R}^n \to \mathbb{R}^m$ are two differentiable functions. Then the function $g$ defined by $g(x) = \langle f(x), h(x) \rangle$ is also differentiable, with
$$Dg(x) = f(x)^T Dh(x) + h(x)^T Df(x).$$
Taking transposes on both sides, we get
$$\nabla g(x) = Dh(x)^T f(x) + Df(x)^T h(x).$$

Example 1. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a differentiable function. Let $x \in \mathbb{R}^n$ and $d \in \mathbb{R}^n$ be fixed. Define a function $g : \mathbb{R} \to \mathbb{R}$ by $g(t) = f(x + td)$. If we write $h(t) = x + td$, then $g(t) = f(h(t))$. We have
$$g'(t) = \langle \nabla f(x + td), Dh(t) \rangle = \langle \nabla f(x + td), d \rangle.$$
In particular, $g'(0) = \langle \nabla f(x), d \rangle$.

Suppose in addition that $f$ is twice differentiable. Write $F(x) = \nabla f(x)$. Then $g'(t) = \langle d, F(x + td) \rangle = \langle d, F(h(t)) \rangle = d^T F(h(t))$. We have
$$g''(t) = d^T DF(h(t))\,Dh(t) = d^T \nabla^2 f(h(t))\,d = \langle d, \nabla^2 f(x + td)\,d \rangle.$$
In particular,
$$g''(0) = \langle d, \nabla^2 f(x)\,d \rangle.$$

Example 2. Let $M$ be an $n \times n$ matrix and let $b \in \mathbb{R}^n$, and define a function $f : \mathbb{R}^n \to \mathbb{R}$ by $f(x) = x^T M x + b^T x$. Because $f(x) = \langle x, Mx \rangle + \langle b, x \rangle$, we have
$$\nabla f(x) = M^T x + M x + b = (M^T + M)x + b,$$
and $\nabla^2 f(x) = M^T + M$.

In particular, if $M$ is symmetric then $\nabla f(x) = 2Mx + b$ and $\nabla^2 f(x) = 2M$.
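Example 2 is easy to verify numerically with finite differences (a sketch; the matrix $M$, vector $b$ and test point are arbitrary, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
M = rng.standard_normal((n, n))
b = rng.standard_normal(n)
f = lambda x: x @ M @ x + b @ x          # f(x) = x'Mx + b'x

x0 = rng.standard_normal(n)
h = 1e-6
# central-difference approximation of the gradient at x0
grad_fd = np.array([(f(x0 + h * e) - f(x0 - h * e)) / (2 * h) for e in np.eye(n)])
grad_exact = (M.T + M) @ x0 + b          # the formula derived above

print(np.allclose(grad_fd, grad_exact, atol=1e-4))   # True
```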

    Exercises

5.1 Prove the following properties of vector derivatives:

(a) Let $y = x'x$; then $\frac{\partial y}{\partial x} = 2x$.

(b) Let $y = x'Ax$; then $\frac{\partial y}{\partial x} = Ax + A'x$, and if $A$ is symmetric, $\frac{\partial y}{\partial x} = 2Ax$.

5.2 The inverse function theorem states that for a function $f : \mathbb{R}^n \to \mathbb{R}^n$, the inverse of the Jacobian matrix for $f$ is the Jacobian of $f^{-1}$:
$$(Df)^{-1} = D(f^{-1}).$$
Now consider the function $f : \mathbb{R}^2 \to \mathbb{R}^2$ that maps from polar $(r, \theta)$ to cartesian coordinates $(x, y)$:
$$f(r, \theta) = \begin{pmatrix} r\cos(\theta) \\ r\sin(\theta) \end{pmatrix} = \begin{pmatrix} x \\ y \end{pmatrix}.$$
Find $Df$, then invert the two-by-two matrix to find $\frac{\partial r}{\partial x}$, $\frac{\partial r}{\partial y}$, $\frac{\partial \theta}{\partial x}$, and $\frac{\partial \theta}{\partial y}$.


    6 Matrix Decompositions

We will assume that you are familiar with LU and QR matrix decompositions. If you are not, you should look them up; they are easy to master. We will in this section restrict ourselves to eigenvalue-preserving decompositions.

    6.1 Determinants

We will assume that you are familiar with the idea of determinants, and specifically calculating determinants by the method of cofactor expansion along a row or a column of a square matrix. Below we list the properties of determinants of real square matrices. The first 3 properties are defining, and the rest are established from those 3.

1. $\det(A)$ depends linearly on the first row:
$$\det\begin{pmatrix} a_{11}+a'_{11} & a_{12}+a'_{12} & \ldots & a_{1n}+a'_{1n} \\ a_{21} & a_{22} & \ldots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \ldots & a_{nn} \end{pmatrix} = \det\begin{pmatrix} a_{11} & a_{12} & \ldots & a_{1n} \\ a_{21} & a_{22} & \ldots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \ldots & a_{nn} \end{pmatrix} + \det\begin{pmatrix} a'_{11} & a'_{12} & \ldots & a'_{1n} \\ a_{21} & a_{22} & \ldots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \ldots & a_{nn} \end{pmatrix},$$
$$\det\begin{pmatrix} ra_{11} & ra_{12} & \ldots & ra_{1n} \\ a_{21} & a_{22} & \ldots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \ldots & a_{nn} \end{pmatrix} = r\det\begin{pmatrix} a_{11} & a_{12} & \ldots & a_{1n} \\ a_{21} & a_{22} & \ldots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \ldots & a_{nn} \end{pmatrix}.$$

2. The determinant changes sign when two rows are exchanged. This also implies that the determinant depends linearly on EVERY row, since we can exchange row $i$ with row 1, split the determinant, and exchange the rows back, restoring the original sign.

3. $\det(I) = 1$.

4. If two rows of $A$ are equal, $\det(A) = 0$ (why?).

5. Subtracting a multiple of one row from another leaves the determinant unchanged.

Proof: Suppose instead of row $i$ we now have row$_i - r\cdot$row$_j$. Then splitting the determinant of the new matrix along this row, we have det(original) + det(original matrix with $-r\cdot$row$_j$ in place of row $i$). That last determinant is just $-r$ times the determinant of the original matrix with row $j$ in place of row $i$, and since that matrix has two equal rows, its determinant is 0. So the determinant of the new matrix has to be equal to the determinant of the original.

6. If a matrix has a zero row, its determinant is 0. (why?)


7. If a matrix is triangular, its determinant is the product of the entries on the main diagonal.

    Proof: Exercise 6.1.

8. $\det(A) = 0$ iff $A$ is not invertible (the proof involves ideas of elimination).

9. $\det(AB) = \det(A)\det(B)$. In particular, $\det(A^{-1}) = \frac{1}{\det(A)}$.

Proof: Suppose $\det(B) = 0$. Then $B$ is not invertible, so $AB$ is not invertible (recall $(AB)^{-1} = B^{-1}A^{-1}$), and therefore $\det(AB) = 0$. If $\det(B) \ne 0$, let $d(A) = \frac{\det(AB)}{\det(B)}$. Then:

(1) For $[\tilde{a}_{11}\ \tilde{a}_{12}\ \ldots\ \tilde{a}_{1n}] \in \mathbb{R}^n$, let $\tilde{A}$ be $A$ with the first row replaced by $[\tilde{a}_{11}\ \tilde{a}_{12}\ \ldots\ \tilde{a}_{1n}]$. Then
$$d\begin{pmatrix} a_{11}+\tilde{a}_{11} & \ldots & a_{1n}+\tilde{a}_{1n} \\ a_{21} & \ldots & a_{2n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \ldots & a_{nn} \end{pmatrix} = \frac{\det(AB) + \det(\tilde{A}B)}{\det(B)} = d(A) + d(\tilde{A}),$$
since the first row of the product is the first row of $AB$ plus the first row of $\tilde{A}B$ while the remaining rows agree with those of $AB$ (and of $\tilde{A}B$), and $\det$ is linear in the first row. Similarly, for $r \in \mathbb{R}$, multiplying the first row of $A$ by $r$ multiplies the first row of $AB$ by $r$, so
$$d\begin{pmatrix} ra_{11} & ra_{12} & \ldots & ra_{1n} \\ a_{21} & a_{22} & \ldots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \ldots & a_{nn} \end{pmatrix} = \frac{r\det(AB)}{\det(B)} = r\,d(A).$$
So $d(\cdot)$ depends linearly on the first row.

(2) WLOG assume rows 1 and 2 of $A$ are interchanged. Interchanging rows 1 and 2 of $A$ interchanges rows 1 and 2 of $AB$, so
$$d\begin{pmatrix} a_{21} & a_{22} & \ldots & a_{2n} \\ a_{11} & a_{12} & \ldots & a_{1n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \ldots & a_{nn} \end{pmatrix} = \frac{-\det(AB)}{\det(B)} = -d(A).$$

(3) $d(I) = \det(IB)/\det(B) = \det(B)/\det(B) = 1$.

So conditions 1-3 are satisfied and therefore $d(A) = \det(A)$.

10. $\det(A') = \det(A)$. This is true since expanding along a row of $A'$ is the same as expanding along the corresponding column of $A$.

    6.2 Eigenvalues and Eigenvectors

Given a square $n \times n$ matrix $A$, we say that $\lambda$ is an eigenvalue of $A$ if for some non-zero $x \in \mathbb{R}^n$ we have $Ax = \lambda x$. We then say that $x$ is an eigenvector of $A$, with corresponding eigenvalue $\lambda$. For small $n$, we find eigenvalues by noticing that
$$Ax = \lambda x \iff (A - \lambda I)x = 0 \iff A - \lambda I \text{ is not invertible} \iff \det(A - \lambda I) = 0.$$
We then write out the formula for the determinant (which will be a polynomial of degree $n$ in $\lambda$) and solve it. Every $n \times n$ matrix $A$ then has $n$ eigenvalues (possibly repeated and/or complex), since every polynomial of degree $n$ has $n$ roots. Eigenvectors for a specific value of $\lambda$ are found by calculating a basis for the nullspace of $A - \lambda I$ via standard elimination techniques. If $n \ge 5$, there's a theorem in algebra that states that no formulaic expression for the roots of a polynomial of degree $n$ exists, so other techniques are used, which we will not be covering. Also, you should be able to see that the eigenvalues of $A$ and $A'$ are the same (why? Do the eigenvectors have to be the same?), and that if $x$ is an eigenvector of $A$ ($Ax = \lambda x$), then so is every multiple $rx$ of $x$, with the same eigenvalue ($A(rx) = \lambda rx$). In particular, a unit vector in the direction of $x$ is an eigenvector.


Theorem: Eigenvectors corresponding to distinct eigenvalues are linearly independent.

Proof: Suppose that there are only two distinct eigenvalues ($A$ could be $2 \times 2$ or it could have repeated eigenvalues), and let $r_1 x_1 + r_2 x_2 = 0$. Applying $A$ to both sides we have $r_1 A x_1 + r_2 A x_2 = A0 = 0 \implies \lambda_1 r_1 x_1 + \lambda_2 r_2 x_2 = 0$. Multiplying the first equation by $\lambda_1$ and subtracting it from the second, we get $\lambda_1 r_1 x_1 + \lambda_2 r_2 x_2 - (\lambda_1 r_1 x_1 + \lambda_1 r_2 x_2) = 0 - 0 = 0 \implies r_2(\lambda_2 - \lambda_1)x_2 = 0$, and since $x_2 \ne 0$ and $\lambda_1 \ne \lambda_2$, we conclude that $r_2 = 0$. Similarly, $r_1 = 0$ as well, and we conclude that $x_1$ and $x_2$ are in fact linearly independent. The proof extends to more than 2 eigenvalues by induction.

We say that an $n \times n$ matrix $A$ is diagonalizable if it has $n$ linearly independent eigenvectors. Certainly, every matrix that has $n$ DISTINCT eigenvalues is diagonalizable (by the proof above), but some matrices that fail to have $n$ distinct eigenvalues may still be diagonalizable, as we'll see in a moment. The reasoning behind the term is as follows: let $s_1, s_2, \ldots, s_n \in \mathbb{R}^n$ be the set of linearly independent eigenvectors of $A$, let $\lambda_1, \lambda_2, \ldots, \lambda_n$ be the corresponding eigenvalues (note that they need not be distinct), and let $S$ be the $n \times n$ matrix the $j$-th column of which is $s_j$. Then if we let $\Lambda$ be the $n \times n$ diagonal matrix s.t. the $ii$-th entry on the main diagonal is $\lambda_i$, then from the familiar rules of matrix multiplication we can see that $AS = S\Lambda$, and since $S$ is invertible (why?) we have $S^{-1}AS = \Lambda$ (Exercise 6.2). Now suppose that we have $n \times n$ $A$ and for some $S$, we have $S^{-1}AS = \Lambda$, a diagonal matrix. Then you can easily see for yourself that the columns of $S$ are eigenvectors of $A$ and the diagonal entries of $\Lambda$ are the corresponding eigenvalues. So the matrices that can be made into a diagonal matrix by pre-multiplying by $S^{-1}$ and post-multiplying by $S$ for some invertible $S$ are precisely those that have $n$ linearly independent eigenvectors (which are, of course, the columns of $S$). Clearly, $I$ is diagonalizable ($S^{-1}IS = I$ for every invertible $S$), but $I$ only has the single eigenvalue 1. So we have an example of a matrix that has a repeated eigenvalue but nonetheless has $n$ independent eigenvectors.

If $A$ is diagonalizable, calculation of powers of $A$ becomes very easy, since we can see that $A^k = S\Lambda^k S^{-1}$, and taking powers of a diagonal matrix is about as easy as it can get. This is often a very helpful identity when solving recurrence relations.

Example: A classical example is the Fibonacci sequence $1, 1, 2, 3, 5, 8, \ldots$, where each term (starting with the 3rd one) is the sum of the preceding two: $F_{n+2} = F_n + F_{n+1}$. We want to find an explicit formula for the $n$-th Fibonacci number, so we start by writing
$$\begin{pmatrix} F_{n+1} \\ F_n \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}\begin{pmatrix} F_n \\ F_{n-1} \end{pmatrix}$$
or $u_n = Au_{n-1}$, which becomes $u_n = A^n u_0$, where $A = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}$ and $u_0 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}$. Diagonalizing $A$ we find that
$$S = \begin{pmatrix} \frac{1+\sqrt{5}}{2} & \frac{1-\sqrt{5}}{2} \\ 1 & 1 \end{pmatrix} \quad \text{and} \quad \Lambda = \begin{pmatrix} \frac{1+\sqrt{5}}{2} & 0 \\ 0 & \frac{1-\sqrt{5}}{2} \end{pmatrix},$$
and identifying $F_n$ with the second component of $u_n = A^n u_0 = S\Lambda^n S^{-1} u_0$, we obtain
$$F_n = \frac{1}{\sqrt{5}}\left[\left(\frac{1+\sqrt{5}}{2}\right)^n - \left(\frac{1-\sqrt{5}}{2}\right)^n\right].$$
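The same computation in code (a sketch using NumPy's eigendecomposition rather than the hand-derived $S$ and $\Lambda$):

```python
import numpy as np

A = np.array([[1., 1.],
              [1., 0.]])
lam, S = np.linalg.eig(A)                # eigenvalues (1 +- sqrt(5))/2 and eigenvectors
u0 = np.array([1., 0.])                  # u0 = (F_1, F_0)'

n = 10
un = S @ np.diag(lam**n) @ np.linalg.inv(S) @ u0   # u_n = S Lambda^n S^{-1} u_0
print(round(un[1]))                                # second component is F_10 = 55

phi, psi = (1 + 5**0.5) / 2, (1 - 5**0.5) / 2
print(round((phi**n - psi**n) / 5**0.5))           # closed-form formula, also 55
```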


We finally note that there's no relationship between being diagonalizable and being invertible. $\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ is both invertible and diagonalizable, $\begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}$ is diagonalizable (it's already diagonal) but not invertible, $\begin{pmatrix} 3 & 1 \\ 0 & 3 \end{pmatrix}$ is invertible but not diagonalizable (check this!), and $\begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}$ is neither invertible nor diagonalizable (check this too).

    6.3 Complex Matrices and Basic Results

We now allow complex entries in vectors and matrices. Scalar multiplication now also allows multiplication by complex numbers, so we're going to be dealing with vectors in $\mathbb{C}^n$, and you should check for yourself that $\dim(\mathbb{C}^n) = \dim(\mathbb{R}^n) = n$ (Is $\mathbb{R}^n$ a subspace of $\mathbb{C}^n$?). We also note that we need to tweak a bit the earlier definition of transpose to account for the fact that if $x = \begin{pmatrix} 1 \\ i \end{pmatrix} \in \mathbb{C}^2$, then $x'x = 1 + i^2 = 0 \ne 2 = \|x\|^2$. We note that in the complex case $\|x\|^2 = (\bar{x})'x$, where $\bar{x}$ is the complex conjugate of $x$, and we introduce the notation $x^H$ to denote the conjugate transpose $(\bar{x})'$ (thus we have $x^H x = \|x\|^2$). You can easily see for yourself that if $x \in \mathbb{R}^n$, then $x^H = x'$. $A^H = (\bar{A})'$ for an $n \times n$ matrix $A$ is defined similarly, and we call $A^H$ the Hermitian transpose of $A$. You should check that $(A^H)^H = A$ and that $(AB)^H = B^H A^H$ (you might want to use the fact that for complex numbers $x, y \in \mathbb{C}$, $\overline{x+y} = \bar{x} + \bar{y}$ and $\overline{xy} = \bar{x}\bar{y}$). We say that $x$ and $y$ in $\mathbb{C}^n$ are orthogonal if $x^H y = 0$ (note that this implies that $y^H x = 0$, although it is NOT true in general that $x^H y = y^H x$).

We say that an $n \times n$ matrix $A$ is Hermitian if $A = A^H$. We say that an $n \times n$ matrix $A$ is unitary if $A^H A = A A^H = I$ (i.e. $A^H = A^{-1}$). You should check for yourself that every symmetric real matrix is Hermitian, and every orthogonal real matrix is unitary. We say that a square matrix $A$ is normal if it commutes with its Hermitian transpose: $A^H A = A A^H$. You should check for yourself that Hermitian (and therefore symmetric) and unitary (and therefore orthogonal) matrices are normal. We next present some very important results about Hermitian and unitary matrices (which also include as special cases symmetric and orthogonal matrices respectively):

1. If $A$ is Hermitian, then $\forall\, x \in \mathbb{C}^n$, $y = x^H A x \in \mathbb{R}$.

Proof: Taking the Hermitian transpose we have $y^H = x^H A^H x = x^H A x = y$, and the only scalars in $\mathbb{C}$ that are equal to their own conjugates are the reals.

2. If $A$ is Hermitian, and $\lambda$ is an eigenvalue of $A$, then $\lambda \in \mathbb{R}$. In particular, all eigenvalues of a symmetric real matrix are real (and so are the eigenvectors, since they are found by elimination on $A - \lambda I$, a real matrix).

Proof: Suppose $Ax = \lambda x$ for some nonzero $x$. Then pre-multiplying both sides by $x^H$, we get $x^H A x = x^H \lambda x = \lambda x^H x = \lambda\|x\|^2$, and since the left-hand side is real, and $\|x\|^2$ is real and positive, we conclude that $\lambda \in \mathbb{R}$.


3. If $A$ is positive definite, and $\lambda$ is an eigenvalue of $A$, then $\lambda > 0$ (note that since $A$ is symmetric, we know that $\lambda \in \mathbb{R}$).

Proof: Let nonzero $x$ be an eigenvector corresponding to $\lambda$. Then since $A$ is positive definite, we have $x'Ax > 0 \implies x'(\lambda x) > 0 \implies \lambda\|x\|^2 > 0 \implies \lambda > 0$.

4. If $A$ is Hermitian, and $x, y$ are eigenvectors of $A$ corresponding to different eigenvalues ($Ax = \lambda_1 x$, $Ay = \lambda_2 y$), then $x^H y = 0$.

Proof: $\lambda_1 x^H y = (\lambda_1 x)^H y$ (since $\lambda_1$ is real) $= (Ax)^H y = x^H(A^H y) = x^H(Ay) = x^H(\lambda_2 y) = \lambda_2 x^H y$, and we get $(\lambda_1 - \lambda_2)x^H y = 0$. Since $\lambda_1 \ne \lambda_2$, we conclude that $x^H y = 0$.

5. The above result means that if a real symmetric $n \times n$ matrix $A$ has $n$ distinct eigenvalues, then the eigenvectors of $A$ are mutually orthogonal, and if we restrict ourselves to unit eigenvectors, we can decompose $A$ as $Q\Lambda Q^{-1}$, where $Q$ is orthogonal (why?), and therefore $A = Q\Lambda Q'$. We will later present the result that shows that this is true of EVERY symmetric matrix $A$ (whether or not it has $n$ distinct eigenvalues).

6. Unitary matrices preserve inner products and lengths.

Proof: Let $U$ be unitary. Then $(Ux)^H(Uy) = x^H U^H U y = x^H I y = x^H y$. In particular $\|Ux\| = \|x\|$.

7. Let $U$ be unitary, and let $\lambda$ be an eigenvalue of $U$. Then $|\lambda| = 1$ (note that $\lambda$ could be complex, for example $i$, or $\frac{1+i}{\sqrt{2}}$).

Proof: Suppose $Ux = \lambda x$ for some nonzero $x$. Then $\|x\| = \|Ux\| = \|\lambda x\| = |\lambda|\|x\|$, and since $\|x\| > 0$, we have $|\lambda| = 1$.

8. Let $U$ be unitary, and let $x, y$ be eigenvectors of $U$, corresponding to different eigenvalues ($Ux = \lambda_1 x$, $Uy = \lambda_2 y$). Then $x^H y = 0$.

Proof: $x^H y = x^H I y = x^H U^H U y = (Ux)^H(Uy) = (\lambda_1 x)^H(\lambda_2 y) = \bar{\lambda}_1\lambda_2\, x^H y$ (since $\lambda_1$ is a scalar). Suppose now that $x^H y \ne 0$; then $\bar{\lambda}_1\lambda_2 = 1$. But $|\lambda_1| = 1 \implies \bar{\lambda}_1\lambda_1 = 1$, and we conclude that $\lambda_1 = \lambda_2$, a contradiction. Therefore, $x^H y = 0$.

9. For EVERY square matrix $A$, there exists some unitary matrix $U$ s.t. $U^{-1}AU = U^H A U = T$, where $T$ is upper triangular. We will not prove this result, but the proof can be found, for example, in section 5.6 of G. Strang's Linear Algebra and Its Applications (3rd ed.). This is a very important result which we're going to use in just a moment to prove the so-called Spectral Theorem.

10. If $A$ is normal, and $U$ is unitary, then $B = U^{-1}AU$ is normal.

Proof: $BB^H = (U^H A U)(U^H A U)^H = U^H A U U^H A^H U = U^H A A^H U = U^H A^H A U$ (since $A$ is normal) $= U^H A^H U U^H A U = (U^H A U)^H(U^H A U) = B^H B$.

11. If an $n \times n$ matrix $A$ is normal, then $\forall\, x \in \mathbb{C}^n$ we have $\|Ax\| = \|A^H x\|$.

Proof: $\|Ax\|^2 = (Ax)^H Ax = x^H A^H A x = x^H A A^H x = (A^H x)^H(A^H x) = \|A^H x\|^2$. And since $\|Ax\| \ge 0$ and $\|A^H x\| \ge 0$, we have $\|Ax\| = \|A^H x\|$.


12. If $A$ is normal and $A$ is upper triangular, then $A$ is diagonal.

Proof: Consider the first row of $A$. In the preceding result, let $x = (1, 0, \ldots, 0)'$. Then $\|Ax\|^2 = |a_{11}|^2$ (since the only non-zero entry in the first column of $A$ is $a_{11}$) and $\|A^H x\|^2 = |a_{11}|^2 + |a_{12}|^2 + \ldots + |a_{1n}|^2$. It follows immediately from the preceding result that $a_{12} = a_{13} = \ldots = a_{1n} = 0$, and the only non-zero entry in the first row of $A$ is $a_{11}$. You can easily supply the proof that the only non-zero entry in the $i$-th row of $A$ is $a_{ii}$, and we conclude that $A$ is diagonal.

13. We have just succeeded in proving the Spectral Theorem: if $A$ is an $n \times n$ symmetric matrix, then we can write it as $A = Q\Lambda Q'$. We know that if $A$ is symmetric, then it's normal, and we know that we can find some unitary $U$ s.t. $U^{-1}AU = T$, where $T$ is upper triangular. But we know that $T$ is also normal, and being upper triangular, it is then diagonal. So $A$ is diagonalizable, and by the discussion above, the entries of $T = \Lambda$ are the eigenvalues of $A$ (and therefore real) and the columns of $U$ are the corresponding unit eigenvectors of $A$ (and therefore real), so $U$ is a real orthogonal matrix.

    14. More generally, we have shown that every normal matrix is diagonalizable.

15. If $A$ is positive definite, it has a square root $B$, s.t. $B^2 = A$.

Proof: We know that we can write $A = Q\Lambda Q'$, where all diagonal entries of $\Lambda$ are positive. Let $B = Q\Lambda^{1/2}Q'$, where $\Lambda^{1/2}$ is the diagonal matrix that has the square roots of the main diagonal elements of $\Lambda$ along its main diagonal, and calculate $B^2$ (more generally, if $A$ is positive semi-definite, it has a square root). You should now prove for yourself that $A^{-1}$ is also positive definite and therefore $A^{-1/2}$ also exists.

16. If $A$ is symmetric and idempotent, and $\lambda$ is an eigenvalue of $A$, then $\lambda = 1$ or $\lambda = 0$.

Proof: Exercise 6.4.

There is another way to think about the result of the Spectral Theorem. Let $x \in \mathbb{R}^n$ and consider $Ax = Q\Lambda Q'x$. Then (do it as an exercise!) carrying out the matrix multiplication in $Q\Lambda Q'$ and letting $q_1, q_2, \ldots, q_n$ denote the columns of $Q$ and $\lambda_1, \lambda_2, \ldots, \lambda_n$ denote the diagonal entries of $\Lambda$, we have
$$Q\Lambda Q' = \lambda_1 q_1 q_1' + \lambda_2 q_2 q_2' + \ldots + \lambda_n q_n q_n'$$
and so $Ax = \lambda_1 q_1 q_1' x + \lambda_2 q_2 q_2' x + \ldots + \lambda_n q_n q_n' x$. We recognize $q_i q_i'$ as the projection matrix onto the line spanned by $q_i$, and thus every $n \times n$ symmetric matrix is the sum of $n$ 1-dimensional projections. That should come as no surprise: we have an orthonormal basis $q_1, q_2, \ldots, q_n$ for $\mathbb{R}^n$, therefore we can write every $x \in \mathbb{R}^n$ as a unique combination $c_1 q_1 + c_2 q_2 + \ldots + c_n q_n$, where $c_1 q_1$ is precisely the projection of $x$ onto the line through $q_1$. Then applying $A$ to the expression we have $Ax = \lambda_1 c_1 q_1 + \lambda_2 c_2 q_2 + \ldots + \lambda_n c_n q_n$, which of course is just the same thing as we have above.
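A short numerical check of the Spectral Theorem and of the sum-of-projections view (with an arbitrary symmetric matrix of our own making):

```python
import numpy as np

rng = np.random.default_rng(4)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                        # an arbitrary symmetric matrix

lam, Q = np.linalg.eigh(A)               # real eigenvalues, orthonormal eigenvectors
print(np.allclose(Q @ np.diag(lam) @ Q.T, A))        # A = Q Lambda Q'

# A as a weighted sum of 1-dimensional projections q_i q_i'
proj_sum = sum(lam[i] * np.outer(Q[:, i], Q[:, i]) for i in range(4))
print(np.allclose(proj_sum, A))
```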


    6.4 SVD and Pseudo-inverse

Theorem: Every $m \times n$ matrix $A$ can be written as $A = Q_1 \Sigma Q_2'$, where $Q_1$ is $m \times m$ orthogonal, $\Sigma$ is $m \times n$ pseudo-diagonal (meaning that the first $r$ diagonal entries $\sigma_{ii}$ are non-zero and the rest of the matrix entries are zero, where $r = \mathrm{rank}(A)$), and $Q_2$ is $n \times n$ orthogonal. Moreover, the first $r$ columns of $Q_1$ form an orthonormal basis for $col(A)$, the last $m - r$ columns of $Q_1$ form an orthonormal basis for $N(A')$, the first $r$ columns of $Q_2$ form an orthonormal basis for $col(A')$, the last $n - r$ columns of $Q_2$ form an orthonormal basis for $N(A)$, and the non-zero entries of $\Sigma$ are the square roots of the non-zero eigenvalues of both $A'A$ and $AA'$. (It is a good exercise at this point for you to prove that $A'A$ and $AA'$ do in fact have the same non-zero eigenvalues. What is the relationship between eigenvectors?) This is known as the Singular Value Decomposition or SVD.

Proof: $A'A$ is $n \times n$ symmetric and therefore has a set of $n$ real orthonormal eigenvectors. Since $\mathrm{rank}(A'A) = \mathrm{rank}(A) = r$, we can see that $A'A$ has $r$ non-zero (possibly repeated) eigenvalues (Exercise 6.3). Arrange the eigenvectors $x_1, x_2, \ldots, x_n$ in such a way that the first $x_1, x_2, \ldots, x_r$ correspond to the non-zero eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_r$, and put $x_1, x_2, \ldots, x_n$ as the columns of $Q_2$. Note that $x_{r+1}, x_{r+2}, \ldots, x_n$ form a basis for $N(A)$ (by Exercise 2.4, as they are linearly independent, $\dim(N(A)) = n - r$, and $x_i \in N(A)$ for $i = r+1, \ldots, n$). Therefore $x_1, x_2, \ldots, x_r$ form a basis for the row space of $A$. Now set $\sigma_{ii} = \sqrt{\lambda_i}$ for $1 \le i \le r$, and let the rest of the entries of the $m \times n$ matrix $\Sigma$ be 0. Finally, for $1 \le i \le r$, let $q_i = \frac{Ax_i}{\sigma_{ii}}$. You should verify for yourself that the $q_i$'s are orthonormal ($q_i'q_j = 0$ if $i \ne j$, and $q_i'q_i = 1$). By Gram-Schmidt, we can extend the set $q_1, q_2, \ldots, q_r$ to a complete orthonormal basis for $\mathbb{R}^m$: $q_1, q_2, \ldots, q_r, q_{r+1}, \ldots, q_m$; these are the columns of $Q_1$. As $q_1, q_2, \ldots, q_r$ are each in the column space of $A$ and linearly independent, they form an orthonormal basis for the column space of $A$, and therefore $q_{r+1}, q_{r+2}, \ldots, q_m$ form an orthonormal basis for the left nullspace of $A$. We now verify that $A = Q_1 \Sigma Q_2'$ by checking that $Q_1' A Q_2 = \Sigma$. Consider the $ij$-th entry of $Q_1' A Q_2$. It is equal to $q_i' A x_j$. For $j > r$, $Ax_j = 0$ (why?), and for $j \le r$ the expression becomes $q_i' \sigma_{jj} q_j = \sigma_{jj}\, q_i' q_j = 0$ (if $i \ne j$) or $\sigma_{jj}$ (if $i = j$). Therefore $Q_1' A Q_2 = \Sigma$, as claimed.

One important application of this decomposition is in estimating $\beta$ in the system we had before when the columns of $X$ are linearly dependent. Then $X'X$ is not invertible, and more than one value of $\beta$ will result in $X'(Y - X\beta) = 0$. By convention, in cases like this, we choose the $\beta$ that has the smallest length. For example, if both $(1, 1, 1)'$ and $(1, 1, 0)'$ satisfy the normal equations, then we'll choose the latter and not the former. This optimal value of $\beta$ is given by $\hat{\beta} = X^+ Y$, where $X^+$ is a $p \times n$ matrix defined as follows: suppose $X$ has rank $r < p$ and it has SVD $Q_1 \Sigma Q_2'$. Then $X^+ = Q_2 \Sigma^+ Q_1'$, where $\Sigma^+$ is the $p \times n$ matrix s.t. for $1 \le i \le r$ we let $\sigma^+_{ii} = 1/\sigma_{ii}$ and $\sigma^+_{ij} = 0$ otherwise. We will not prove this fact, but the proof can be found (among other places) in Appendix 1 of Strang's book.


    Exercises

6.1 Show that if a matrix is triangular, its determinant is the product of the entries on the main diagonal.

6.2 Let $s_1, s_2, \ldots, s_n \in \mathbb{R}^n$ be the set of linearly independent eigenvectors of $A$, let $\lambda_1, \lambda_2, \ldots, \lambda_n$ be the corresponding eigenvalues (note that they need not be distinct), and let $S$ be the $n \times n$ matrix whose $j$-th column is $s_j$. Show that if $\Lambda$ is the $n \times n$ diagonal matrix s.t. the $ii$-th entry on the main diagonal is $\lambda_i$, then $AS = S\Lambda$, and since $S$ is invertible (why?) we have $S^{-1}AS = \Lambda$.

6.3 Show that if $\mathrm{rank}(A) = r$, then $A'A$ has $r$ non-zero eigenvalues.

6.4 Show that if $A$ is symmetric and idempotent, and $\lambda$ is an eigenvalue of $A$, then $\lambda = 1$ or $\lambda = 0$.


    7 Statistics: Random Variables

This section covers some basic properties of random variables. While this material is not necessarily tied directly to linear algebra, it is essential background for graduate level Statistics, O.R., and Biostatistics. For further review of these concepts, see Casella and Berger, sections 2.1, 2.2, 2.3, 3.1, 3.2, 3.3, 4.1, 4.2, 4.5, and 4.6. Much of this section is gratefully adapted from Andrew Nobel's lecture notes.

    7.1 Expectation, Variance and Covariance

The expected value of a continuous random variable $X$, with probability density function $f$, is defined by
$$EX = \int_{-\infty}^{\infty} x f(x)\,dx.$$
The expected value of a discrete random variable $X$, with probability mass function $p$, is defined by
$$EX = \sum_{x \in \mathbb{R},\, p(x) \ne 0} x\, p(x).$$
The expected value is well-defined if $E|X| < \infty$.

We now list some basic properties of $E(\cdot)$:

1. $X \le Y$ implies $EX \le EY$.

Proof: Follows directly from the properties of $\int$ and $\sum$.

2. For $a, b \in \mathbb{R}$, $E(aX + bY) = aEX + bEY$.

Proof: Follows directly from the properties of $\int$ and $\sum$.

3. $|EX| \le E|X|$.

Proof: Note that $X, -X \le |X|$. Hence, $EX, -EX \le E|X|$ and therefore $|EX| \le E|X|$.

4. If $X$ and $Y$ are independent ($X \perp Y$), then $E(XY) = EX \cdot EY$.

Proof: See Theorem 4.2.10 in Casella and Berger.

5. If $X \ge 0$, then $EX = \int_0^\infty P(X > t)\,dt = \int_0^\infty (1 - F(t))\,dt$.


Proof: Suppose $X \sim f$. Then,
$$\begin{aligned}
\int_0^\infty P(X > t)\,dt &= \int_0^\infty \left[\int_t^\infty f(x)\,dx\right]dt \\
&= \int_0^\infty \left[\int_0^\infty f(x)\,I(x > t)\,dx\right]dt \\
&= \int_0^\infty \int_0^\infty f(x)\,I(x > t)\,dt\,dx \quad \text{(Fubini)} \\
&= \int_0^\infty f(x)\left[\int_0^\infty I(x > t)\,dt\right]dx \\
&= \int_0^\infty x f(x)\,dx = EX.
\end{aligned}$$

6. If X ∼ f then E g(X) = ∫ g(x) f(x) dx. If X ∼ p then E g(X) = Σ_x g(x) p(x).
Proof: Follows from the definition of E g(X).

The variance of a random variable X is defined by

Var(X) = E(X − EX)² = EX² − (EX)².

Note that Var(X) is finite (and therefore well-defined) if EX² < ∞. The covariance of two random variables X and Y is defined by

Cov(X, Y) = E[(X − EX)(Y − EY)] = E(XY) − EX · EY.

Note that Cov(X, Y) is finite if EX², EY² < ∞.

We now list some general properties that follow from the definitions of variance and covariance:

1. Var(X) ≥ 0, with Var(X) = 0 iff X = EX with probability 1.

2. For a, b ∈ R, Var(aX + b) = a²Var(X).

3. If X ⊥ Y, then Cov(X, Y) = 0. The converse, however, is not true in general.

4. Cov(aX + b, cY + d) = ac Cov(X, Y).

5. If X1, ..., Xn satisfy EXi² < ∞, then (see the numerical check below)

Var(Σ_{i=1}^n Xi) = Σ_{i=1}^n Var(Xi) + 2 Σ_{i<j} Cov(Xi, Xj).
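As a quick numerical sketch (not part of the notes; Python with numpy assumed, names ours), property 5 with n = 2 can be checked by simulating two dependent variables:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 200_000
    X1 = rng.standard_normal(n)
    X2 = 0.5 * X1 + rng.standard_normal(n)   # X2 is correlated with X1

    lhs = np.var(X1 + X2)
    rhs = np.var(X1) + np.var(X2) + 2 * np.cov(X1, X2)[0, 1]
    print(lhs, rhs)   # the two values agree up to sampling error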


    7.2 Distribution of Functions of Random Variables

Here we describe various methods to calculate the distribution of a function of one or more random variables. For the single variable case, given X ∼ fX and g: R → R, we would like to find the density of Y = g(X), if it exists. A straightforward approach is the CDF method: find FY in terms of FX, then differentiate FY to get fY.

Example 1: Location and scale. Let X ∼ fX and Y = aX + b, with a > 0. Then,

FY(y) = P(Y ≤ y) = P(aX + b ≤ y) = P(X ≤ (y − b)/a) = FX((y − b)/a).

Thus, fY(y) = FY'(y) = a⁻¹ fX((y − b)/a).

If a < 0, the same argument (with the inequality reversed) gives fY(y) = |a|⁻¹ fX((y − b)/a), so in general fY(y) = |a|⁻¹ fX((y − b)/a) for a ≠ 0.
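A small simulation sketch (ours, not from the notes; numpy assumed) comparing an empirical histogram of Y = aX + b against the formula a⁻¹ fX((y − b)/a), with X standard normal:

    import numpy as np

    rng = np.random.default_rng(2)
    a, b = 2.0, 1.0
    x = rng.standard_normal(500_000)
    y = a * x + b

    f_X = lambda z: np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)   # standard normal density
    hist, edges = np.histogram(y, bins=200, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    formula = f_X((centers - b) / a) / a
    print(np.max(np.abs(hist - formula)))   # small: the histogram matches the formula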


The convolution of two densities f1 and f2 is defined by (f1 ∗ f2)(v) = ∫_{−∞}^{∞} f1(v − y) f2(y) dy, and f = f1 ∗ f2 is itself a density (it is non-negative and integrates to 1, by Fubini's theorem).

Theorem: If X ∼ fX, Y ∼ fY, and X and Y are independent, then X + Y ∼ fX ∗ fY.

Proof: Note that

P(X + Y ≤ v) = ∫∫ fX(x) fY(y) I{(x, y): x + y ≤ v} dx dy
= ∫_{−∞}^{∞} ∫_{−∞}^{v−y} fX(x) fY(y) dx dy
= ∫_{−∞}^{∞} [∫_{−∞}^{v−y} fX(x) dx] fY(y) dy
= ∫_{−∞}^{∞} [∫_{−∞}^{v} fX(u − y) du] fY(y) dy   (u = y + x)
= ∫_{−∞}^{v} [∫_{−∞}^{∞} fX(u − y) fY(y) dy] du
= ∫_{−∞}^{v} (fX ∗ fY)(u) du.

Corollary: Convolutions are commutative and associative. If f1, f2, f3 are densities, then

f1 ∗ f2 = f2 ∗ f1,   (f1 ∗ f2) ∗ f3 = f1 ∗ (f2 ∗ f3).
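As an illustrative sketch of the theorem (ours, not from the notes; numpy assumed), the density of the sum of two independent Uniform(0, 1) variables can be obtained both by simulation and by discretizing f1 ∗ f2; both agree with the triangular density min(v, 2 − v):

    import numpy as np

    rng = np.random.default_rng(3)
    s = rng.uniform(size=1_000_000) + rng.uniform(size=1_000_000)   # X + Y

    # Discretized convolution of the two Uniform(0,1) densities
    dx = 0.001
    grid = np.arange(0.0, 1.0, dx)
    f1 = np.ones_like(grid)
    f2 = np.ones_like(grid)
    conv = np.convolve(f1, f2) * dx          # approximate density of X + Y on [0, 2)

    v = 1.25
    emp = np.mean(np.abs(s - v) < 0.01) / 0.02
    print(conv[int(v / dx)], emp, min(v, 2 - v))   # all approximately 0.75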

    Change of Variables

We now consider functions of more than one random variable. In particular, let U, V be open subsets of R^k, and H: U → V. Then, if x is a vector in U,

H(x) = (h1(x), ..., hk(x))'

is a vector in V. The functions h1(·), ..., hk(·) are the coordinate functions of H. If X is a continuous random vector, we would like to find the density of H(X). First, some further assumptions:

(A1) H: U → V is one-to-one and onto.
(A2) H is continuous.
(A3) For every 1 ≤ i, j ≤ k, the partial derivatives hij ≡ ∂hi/∂xj exist and are continuous.


Let DH(x) be the matrix of partial derivatives of H:

DH(x) = [hij(x): 1 ≤ i, j ≤ k].

Then, the Jacobian (or Jacobian determinant¹) of H at x is the determinant of DH(x):

JH(x) = det(DH(x)).

The assumptions A1-A3 imply that H⁻¹: V → U exists and is differentiable on V with

JH⁻¹(y) = (JH(H⁻¹(y)))⁻¹.

Theorem: Suppose JH(x) ≠ 0 on U. If X ∼ fX is a k-dimensional random vector such that P(X ∈ U) = 1, then Y = H(X) has density

fY(y) = fX(H⁻¹(y)) |JH⁻¹(y)| = fX(H⁻¹(y)) |JH(H⁻¹(y))|⁻¹.

Example: Suppose X1, X2 are jointly continuous with density fX1,X2. Let Y1 = X1 + X2, Y2 = X1 − X2, and find fY1,Y2. Here

y1 = h1(x1, x2) = x1 + x2
y2 = h2(x1, x2) = x1 − x2
x1 = g1(y1, y2) = (1/2)(y1 + y2)
x2 = g2(y1, y2) = (1/2)(y1 − y2),

and

JH(x1, x2) = det [ ∂h1/∂x1  ∂h1/∂x2 ; ∂h2/∂x1  ∂h2/∂x2 ] = det [ 1  1 ; 1  −1 ] = −2 ≠ 0.

So, applying the theorem, we get

fY1,Y2(y1, y2) = (1/2) fX1,X2((y1 + y2)/2, (y1 − y2)/2).

As a special case, assume X1, X2 are N(0, 1) and independent. Then,

fY1,Y2(y1, y2) = (1/2) φ((y1 + y2)/2) φ((y1 − y2)/2)
= (1/(4π)) exp{ −(y1 + y2)²/8 − (y1 − y2)²/8 }
= (1/(4π)) exp{ −(2y1² + 2y2²)/8 }
= [ (4π)^{−1/2} exp(−y1²/4) ] [ (4π)^{−1/2} exp(−y2²/4) ].

¹The partial derivative matrix DH is sometimes called the Jacobian matrix (see Section 5.2).


So, both Y1 and Y2 are N(0, 2), and they are independent!
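A brief simulation sketch of this special case (ours, not from the notes; numpy assumed); since Y1, Y2 are jointly normal, zero correlation is equivalent to independence here:

    import numpy as np

    rng = np.random.default_rng(4)
    x1 = rng.standard_normal(1_000_000)
    x2 = rng.standard_normal(1_000_000)
    y1, y2 = x1 + x2, x1 - x2

    print(np.var(y1), np.var(y2))       # both approximately 2
    print(np.corrcoef(y1, y2)[0, 1])    # approximately 0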

    7.3 Derivation of Common Univariate Distributions

    Double Exponential

If X1, X2 ∼ Exp(λ) and X1 ⊥ X2, then X1 − X2 has a double exponential (or Laplace) distribution: X1 − X2 ∼ DE(λ). The density of DE(λ),

f(x) = (λ/2) e^{−λ|x|},   −∞ < x < ∞,

can be derived through the convolution formula.
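A small check by simulation (ours, not from the notes; numpy assumed), comparing an empirical density estimate of X1 − X2 with (λ/2)e^{−λ|x|}:

    import numpy as np

    rng = np.random.default_rng(5)
    lam, n = 1.5, 1_000_000
    d = rng.exponential(scale=1/lam, size=n) - rng.exponential(scale=1/lam, size=n)

    for x in (-1.0, 0.5, 2.0):
        emp = np.mean(np.abs(d - x) < 0.01) / 0.02       # empirical density near x
        print(x, emp, lam / 2 * np.exp(-lam * abs(x)))   # matches the DE(lambda) density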

    Gamma and Beta Distributions

The gamma function, a component in several probability distributions, is defined by

Γ(t) = ∫_0^∞ x^{t−1} e^{−x} dx,   t > 0.

Here are some basic properties of Γ(·):

1. Γ(t) is well-defined for t > 0.
Proof: For t > 0,
0 ≤ Γ(t) ≤ ∫_0^1 x^{t−1} dx + ∫_1^∞ x^{t−1} e^{−x} dx < ∞.

2. Γ(1) = 1.
Proof: Clear.

3. For all x > 0, Γ(x + 1) = xΓ(x).
Proof: Exercise 7.4.

4. Γ(n + 1) = n! for n = 0, 1, 2, ...
Proof: Follows from 2 and 3.

5. log Γ(t) is convex on (0, ∞).

The gamma distribution with parameters α, λ > 0 (written Γ(α, λ)) has density

g_{α,λ}(x) = λ^α x^{α−1} e^{−λx} / Γ(α),   x > 0.

Note: A basic change of variables shows that for s > 0,

X ∼ Γ(α, λ)  ⟹  sX ∼ Γ(α, λ/s).

So, λ acts as a scale parameter of the Γ(α, λ) family. The parameter α controls shape:


If 0 < α ≤ 1, then g_{α,λ}(·) is strictly decreasing on (0, ∞); if α > 1, then g_{α,λ}(·) is unimodal, with maximum at x = (α − 1)/λ.

If X ∼ Γ(α, λ), then EX = α/λ and Var(X) = α/λ². We now use convolutions to show that if X ∼ Γ(α1, λ) and Y ∼ Γ(α2, λ) are independent, then X + Y ∼ Γ(α1 + α2, λ):

Theorem: The family of distributions {Γ(α, λ)} is closed under convolutions. In particular, Γ(α1, λ) ∗ Γ(α2, λ) = Γ(α1 + α2, λ).

Proof: For x > 0,

f(x) = (g_{α1,λ} ∗ g_{α2,λ})(x)
= ∫_0^x g_{α1,λ}(x − u) g_{α2,λ}(u) du
= [λ^{α1+α2} / (Γ(α1)Γ(α2))] e^{−λx} ∫_0^x (x − u)^{α1−1} u^{α2−1} du   (1)
= const · e^{−λx} x^{α1+α2−1}   (substitute u = xv in the integral).

Thus, f(x) and g_{α1+α2,λ}(x) agree up to constants. As both integrate to 1, they are the same function.

Corollary: Note that if α = 1, then Γ(1, λ) = Exp(λ). Hence, if X1, ..., Xn are iid Exp(λ), then

Y = X1 + ... + Xn ∼ Γ(n, λ),

with density

fY(y) = λ^n y^{n−1} e^{−λy} / (n − 1)!.

This is also known as an Erlang distribution with parameters n and λ.
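A simulation sketch of the corollary (ours, not from the notes; numpy assumed): the sum of n iid Exp(λ) draws should have mean n/λ and variance n/λ², the Γ(n, λ) moments:

    import numpy as np

    rng = np.random.default_rng(6)
    lam, n_terms, n_rep = 2.0, 5, 500_000
    y = rng.exponential(scale=1/lam, size=(n_rep, n_terms)).sum(axis=1)

    print(y.mean(), n_terms / lam)      # ~2.5, the Gamma(n, lambda) mean
    print(y.var(), n_terms / lam**2)    # ~1.25, the Gamma(n, lambda) variance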

It follows from equation (1), with x = 1, that

[λ^{α1+α2} / (Γ(α1)Γ(α2))] e^{−λ} ∫_0^1 (1 − u)^{α1−1} u^{α2−1} du = g_{α1+α2,λ}(1) = λ^{α1+α2} e^{−λ} / Γ(α1 + α2).

Rearranging terms shows that for r, s > 0,

Γ(r)Γ(s) / Γ(r + s) = ∫_0^1 (1 − u)^{r−1} u^{s−1} du.

Here B(r, s) = Γ(r)Γ(s)/Γ(r + s) is known as the beta function with parameters r, s. The beta distribution β(r, s) has density

b_{r,s}(x) = B(r, s)⁻¹ x^{r−1} (1 − x)^{s−1},   0 < x < 1.


The parameters r, s play symmetric roles. If r = s then β(r, s) is symmetric about 1/2; β(r, r) is u-shaped if r < 1, uniform if r = 1, and unimodal (bell shaped) if r > 1. If r > s > 0 then β(r, s) is skewed to the left, and if 0 < r < s then β(r, s) is skewed to the right. The random variable X ∼ β(r, s) has expectation and variance

EX = r/(r + s),   Var(X) = rs / [(r + s)²(r + s + 1)].

    Chi-square distributions

Fix an integer k ≥ 1. Then, the chi-square distribution with k degrees of freedom, written χ²_k, is Γ(k/2, 1/2). Thus, χ²_k has density

f_k(x) = [1 / (2^{k/2} Γ(k/2))] x^{k/2 − 1} e^{−x/2},   x > 0.

Theorem: If X1, ..., Xk are iid N(0, 1), then X1² + ... + Xk² ∼ χ²_k.

Proof: Recall that if X ∼ N(0, 1) then X² has density f(x) = (2πx)^{−1/2} e^{−x/2}, x > 0, which is the Γ(1/2, 1/2) density. Thus, X² ∼ χ²_1. Furthermore, by the convolution theorem for gamma distributions,

X1² + ... + Xk² ∼ Γ(k(1/2), 1/2) = χ²_k.
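A simulation sketch of this theorem (ours, not from the notes; numpy assumed), comparing an empirical density estimate of the sum of k squared standard normals with the χ²_k density given above:

    import numpy as np
    from math import gamma as gamma_fn

    rng = np.random.default_rng(7)
    k, n = 4, 1_000_000
    s = (rng.standard_normal((n, k)) ** 2).sum(axis=1)

    # chi-square_k density: x^{k/2-1} e^{-x/2} / (2^{k/2} Gamma(k/2))
    f_k = lambda x: x**(k/2 - 1) * np.exp(-x/2) / (2**(k/2) * gamma_fn(k/2))
    for x in (1.0, 4.0, 9.0):
        emp = np.mean(np.abs(s - x) < 0.02) / 0.04
        print(x, emp, f_k(x))   # empirical and theoretical values agree closely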

The above theorem makes calculating the expectation and variance of χ²_k easy:

Eχ²_k = E(X1² + ... + Xk²) = k EX1² = k,

Var(χ²_k) = k Var(X1²) = k(EX1⁴ − (EX1²)²) = k(3 − 1) = 2k.

F and t-distributions

The F-distribution with m, n degrees of freedom, F(m, n), is the distribution of the ratio

(X/m) / (Y/n),

where X ∼ χ²_m, Y ∼ χ²_n, and X ⊥ Y.

Fact: If X ∼ fX and Y ∼ fY with Y > 0 and X ⊥ Y, then R = X/Y has density fR(r) = ∫_0^∞ y fX(ry) fY(y) dy.
Proof: Use the CDF method.

By the fact above, F(m, n) has density

f(x) = B(m/2, n/2)⁻¹ (m/n)^{m/2} x^{m/2 − 1} (1 + (m/n)x)^{−(m+n)/2},   x > 0.


The t-distribution with n degrees of freedom, tn, is the distribution of the ratio

X / √(Y/n),

where X ∼ N(0, 1) and Y ∼ χ²_n are independent. Equivalently, if T ∼ tn then T² ∼ F(1, n). The density of tn is

fn(y) = [1/√(πn)] [Γ((n + 1)/2) / Γ(n/2)] (1 + y²/n)^{−(n+1)/2}.

Some other properties of the t-distribution:

1. t1 is the Cauchy distribution.

2. If X ∼ tn then EX = 0 for n ≥ 2 and is undefined for n = 1; Var(X) = n/(n − 2) for n ≥ 3 and is undefined for n = 1, 2.

3. The density fn(x) converges to the density of a standard normal, φ(x), as n → ∞.
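A simulation sketch (ours, not from the notes; numpy assumed) constructing tn as X/√(Y/n) and checking both the variance n/(n − 2) and the density formula at one point:

    import numpy as np
    from math import gamma as gamma_fn, pi, sqrt

    rng = np.random.default_rng(8)
    n_df, n_rep = 6, 1_000_000
    x = rng.standard_normal(n_rep)
    chi2 = (rng.standard_normal((n_rep, n_df)) ** 2).sum(axis=1)   # chi-square_{n_df}
    t = x / np.sqrt(chi2 / n_df)

    print(t.var(), n_df / (n_df - 2))   # ~1.5

    f_t = lambda v: (gamma_fn((n_df + 1) / 2) / (sqrt(pi * n_df) * gamma_fn(n_df / 2))
                     * (1 + v**2 / n_df) ** (-(n_df + 1) / 2))
    emp = np.mean(np.abs(t - 1.0) < 0.01) / 0.02
    print(emp, f_t(1.0))                # both approximately 0.22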

    7.4 Random Vectors: Expectation and Variance

A random vector is a vector X = (X1, X2, ..., Xn)' whose components X1, X2, ..., Xn are real-valued random variables defined on the same probability space. The expectation of a random vector, E(X), if it exists, is given by the expected value of each component:

E(X) = (EX1, EX2, ..., EXn)'.

The covariance matrix of a random vector, cov(X), is given by

cov(X) = E[(X − EX)(X − EX)'].

We now give some general results on expectations and variances. We supply reasoning for some of them, and you should verify the rest (usually by the method of entry-by-entry comparison). We assume in what follows that the k × k matrix A and the k × 1 vector a are constant, and we let the k × 1 vector μ = E(x) and the k × k matrix V = cov(x) (so vij = cov(xi, xj)):

1. E(Ax) = AE(x).
Proof: Exercise 7.5(a).

2. Var(a'x) = a'Va.
Proof: Note that

Var(a'x) = Var(a1x1 + a2x2 + ... + akxk)
= Σ_{i=1}^k Σ_{j=1}^k ai aj cov(xi, xj)
= Σ_{i=1}^k Σ_{j=1}^k vij ai aj = a'Va.


3. Var(Ax) = AVA'. (A numerical check of properties 1 and 3 appears after this list.)
Proof: Exercise 7.5(b).

4. E(x'Ax) = tr(AV) + μ'Aμ.

5. The covariance matrix V is positive semi-definite.
Proof: y'Vy = Var(y'x) ≥ 0 for every y. Since V is symmetric (why?), it follows that V has a symmetric square root, V^{1/2} = (V^{1/2})'.

6. Cov(a'x, b'x) = a'Vb.
Proof: Exercise 7.5(c).

7. If x, y are two k × 1 vectors of random variables, we define their cross-covariance matrix C as follows: cij = cov(xi, yj). Notice that, unlike usual covariance matrices, a cross-covariance matrix is not (usually) symmetric. We still use the notation cov(x, y), and the meaning should be clear from the context. Now, suppose A, B are k × k. Then

cov(Ax, Bx) = AVB'.
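As a quick numerical sketch of properties 1 and 3 (ours, not from the notes; numpy assumed), drawing x with a known mean μ and covariance V and checking E(Ax) = Aμ and Var(Ax) = AVA':

    import numpy as np

    rng = np.random.default_rng(9)
    k, n = 3, 500_000
    A = rng.standard_normal((k, k))
    mu = np.array([1.0, -2.0, 0.5])
    L = np.array([[1.0, 0.0, 0.0],
                  [0.5, 1.0, 0.0],
                  [-0.3, 0.2, 1.0]])
    V = L @ L.T                                   # a valid covariance matrix

    x = mu + rng.standard_normal((n, k)) @ L.T    # rows have mean mu and covariance V
    Ax = x @ A.T                                  # rows are A times each draw

    print(np.allclose(Ax.mean(axis=0), A @ mu, atol=0.02))    # E(Ax) = A mu
    print(np.allclose(np.cov(Ax.T), A @ V @ A.T, atol=0.05))  # Var(Ax) = A V A'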

    Exercises

7.1 Show that if X ∼ f and g(·) is non-negative, then E g(X) = ∫ g(x) f(x) dx.
[Hint: Recall that EX = ∫_0^∞ P(X > t) dt if X ≥ 0.]

7.2 Let X be a continuous random variable with density fX. Find the density of Y = |X| in terms of fX.

7.3 Let X1 ∼ Γ(α1, 1) and X2 ∼ Γ(α2, 1) be independent. Use the two-dimensional change of variables formula to show that Y1 = X1 + X2 and Y2 = X1/(X1 + X2) are independent, with Y1 ∼ Γ(α1 + α2, 1) and Y2 ∼ β(α1, α2).

7.4 Using integration by parts, show that the gamma function Γ(t) = ∫_0^∞ x^{t−1} e^{−x} dx satisfies the relation Γ(t + 1) = tΓ(t) for t > 0.

7.5 Prove the following results about vector expectations and variances:
(a) E(Ax) = AE(x)
(b) Var(Ax) = AVA'
(c) Cov(a'x, b'x) = a'Vb


8 Further Applications to Statistics: Normal Theory and F-test

    8.1 Bivariate Normal Distribution

Suppose X is a vector of continuous random variables and Y = AX + c, where A is an invertible matrix. If X has probability density function pX, then the probability density function of Y is given by

pY(y) = |det(A)|⁻¹ pX(A⁻¹(y − c)).

The proof of this result can be found in Appendix B.2.1 of Bickel and Doksum.

We say that the 2 × 1 vector X = (X1, X2)' has a bivariate normal distribution if there exist Z1, Z2 iid N(0, 1) such that X = AZ + μ. In what follows we will moreover assume that A is invertible. You should check at this point for yourself that X1 ∼ N(μ1, σ1²) and X2 ∼ N(μ2, σ2²), where σ1² = a11² + a12² and σ2² = a21² + a22², and that cov(X1, X2) = a11a21 + a12a22. We then say that X ∼ N(μ, Σ), where

Σ = AA' = [ σ1²  ρσ1σ2 ; ρσ1σ2  σ2² ]

and ρ = cov(X1, X2)/(σ1σ2) (you should verify that the entries of Σ = AA' are as we claim). The meaning behind this definition is made explicit by the following theorem:

Theorem: Suppose μ1 = 0 = μ2 and |ρ| < 1. Then X has joint density

fX(x1, x2) = [2πσ1σ2√(1 − ρ²)]⁻¹ exp{ −[x1²/σ1² − 2ρx1x2/(σ1σ2) + x2²/σ2²] / [2(1 − ρ²)] }.

Proof sketch: Applying the density-transformation result above to X = AZ, with pZ(z) = (2π)⁻¹ e^{−z'z/2}, gives fX(x) = (2π)⁻¹ |det(A)|⁻¹ exp{−x'(AA')⁻¹x/2} = (2π)⁻¹ det(Σ)^{−1/2} exp{−x'Σ⁻¹x/2}. Writing out Σ⁻¹ and det(Σ) = σ1²σ2²(1 − ρ²) in terms of σ1, σ2, ρ yields exactly the displayed density,


which proves the theorem. The symmetric matrix Σ is the covariance matrix of X.

You should prove for yourself (Exercise 8.1) that if X has a bivariate normal distribution N(μ, V) and B is invertible, then Y = BX + d has a bivariate normal distribution N(Bμ + d, BVB').

These results generalize to more than two variables and lead to multivariate normal distributions. You can familiarize yourself with some of the extensions in Appendix B.6 of Bickel and Doksum. In particular, we note here that if x is a k × 1 vector of iid N(0, σ²) random variables, then Ax is distributed as a multivariate N(0, σ²AA') random vector.
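A short simulation sketch of the definition (ours, not from the notes; numpy assumed), generating X = AZ + μ and checking that the sample covariance matrix and correlation match Σ = AA' and ρ:

    import numpy as np

    rng = np.random.default_rng(10)
    A = np.array([[2.0, 0.0],
                  [1.0, 1.5]])
    mu = np.array([1.0, -1.0])

    Z = rng.standard_normal((500_000, 2))
    X = Z @ A.T + mu                      # rows are draws of X = A Z + mu

    Sigma = A @ A.T
    print(Sigma)                           # [[4, 2], [2, 3.25]]
    print(np.cov(X.T))                     # close to Sigma
    rho = Sigma[0, 1] / np.sqrt(Sigma[0, 0] * Sigma[1, 1])
    print(rho, np.corrcoef(X.T)[0, 1])     # both approximately 0.555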

    8.2 F-test

    We will need a couple more results about quadratic forms:

1. Suppose the k × k matrix A is symmetric and idempotent and the k × 1 vector x ∼ N(0_{k×1}, σ²I_{k×k}). Then x'Ax/σ² ∼ χ²_r, where r = rank(A).
Proof: Writing the spectral decomposition A = QΛQ', we have x'Ax/σ² = (Q'x)'Λ(Q'x)/σ², and we note that Q'x/σ ∼ N(0, (1/σ²)σ²Q'Q) = N(0, I), i.e. Q'x/σ is a vector of iid N(0, 1) random variables. We also know that Λ is diagonal and its main diagonal consists of r 1's and k − r 0's, where r = rank(A). You can then easily see from matrix multiplication that x'QΛQ'x/σ² = z1² + z2² + ... + zr², where the zi's are iid N(0, 1). Therefore x'Ax/σ² ∼ χ²_r.

2. The above result generalizes further: suppose the k × 1 vector x ∼ N(0, V), and the k × k symmetric matrix A is such that V is positive definite and either AV or VA is idempotent. Then x'Ax ∼ χ²_r, where r = rank(AV) or rank(VA), respectively.
Proof: We will prove it for the case of idempotent AV; the proof for idempotent VA is essentially the same. We know that x has the same distribution as V^{1/2}z, where z ∼ N(0, I_{k×k}), and we know that V^{1/2} = (V^{1/2})', so we have x'Ax = z'(V^{1/2})'AV^{1/2}z = z'V^{1/2}AV^{1/2}z. Consider B = V^{1/2}AV^{1/2}. B is symmetric, and B² = V^{1/2}AV^{1/2}V^{1/2}AV^{1/2} = V^{1/2}AVAVV^{−1/2} = V^{1/2}AVV^{−1/2} = V^{1/2}AV^{1/2} = B, so B is also idempotent. Then from the previous result (with σ = 1), we have z'Bz ∼ χ²_r, and therefore x'Ax ∼ χ²_r, where r = rank(B) = rank(V^{1/2}AV^{1/2}). It is a good exercise (Exercise 8.2) to show that rank(B) = rank(AV).

3. Let U = x'Ax and W = x'Bx. Then the two quadratic forms are independent (in the probabilistic sense of the word) if AVB = 0. We will not prove this result, but we will use it.

Recall (Section 4.2) that we had a model Y = Xβ + ε, where Y is an n × 1 vector of observations, X is an n × p matrix of explanatory variables (with linearly independent columns), β is a p × 1 vector of coefficients that we're interested in estimating, and ε is an n × 1 vector of error terms with E(ε) = 0. Recall that we estimate β̂ = (X'X)⁻¹X'Y, and we denote the fitted values Ŷ = Xβ̂ = PY, where P = X(X'X)⁻¹X' is the projection matrix onto the columns of X, and e = Y − Ŷ = (I − P)Y is the vector of residuals. Recall also that X'e = 0. Suppose now that ε ∼ N(0, σ²I), i.e. the errors are iid N(0, σ²) random variables. Then we can derive some very useful distributional results:


1. Ŷ ∼ N(Xβ, σ²P).
Proof: Clearly, Y ∼ N(Xβ, σ²I), and Ŷ = PY, so Ŷ ∼ N(PXβ, P(σ²I)P') = N(X(X'X)⁻¹X'Xβ, σ²PP') = N(Xβ, σ²P).

2. e ∼ N(0, σ²(I − P)).
Proof: Analogous to 1.

3. Ŷ and e are independent (in the probabilistic sense of the word).
Proof: cov(Ŷ, e) = cov(PY, (I − P)Y) = P var(Y) (I − P)' = P(σ²I)(I − P) = σ²P(I − P) = 0. And since both vectors are normally distributed, zero correlation implies independence. Notice that cov above referred to the cross-covariance matrix.

4. ||e||²/σ² ∼ χ²_{n−p}.
Proof: First notice that e = (I − P)Y = (I − P)(Xβ + ε) = (I − P)ε (why?). Now,

||e||²/σ² = e'e/σ² = ε'(I − P)'(I − P)ε/σ² = ε'(I − P)ε/σ².

Since (I − P) is symmetric and idempotent, and ε ∼ N(0, σ²I), by one of the above results we have ε'(I − P)ε/σ² ∼ χ²_r, where r = rank(I − P). But we know (why?) that rank(I − P) = tr(I − P) = tr(I − X(X'X)⁻¹X') = tr(I) − tr(X(X'X)⁻¹X') = n − tr(X'X(X'X)⁻¹) = n − tr(I_{p×p}) = n − p. So we have ||e||²/σ² ∼ χ²_{n−p}, and in particular E(||e||²/(n − p)) = σ².
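A simulation sketch of result 4 (ours, not from the notes; numpy assumed): with Gaussian errors, ||e||²/σ² should have mean n − p and variance 2(n − p), and ||e||²/(n − p) should be unbiased for σ²:

    import numpy as np

    rng = np.random.default_rng(11)
    n, p, sigma = 30, 4, 2.0
    X = rng.standard_normal((n, p))
    beta = rng.standard_normal(p)
    P = X @ np.linalg.inv(X.T @ X) @ X.T           # projection onto the columns of X

    vals = []
    for _ in range(20_000):
        eps = sigma * rng.standard_normal(n)
        Y = X @ beta + eps
        e = Y - P @ Y                               # residuals (I - P)Y
        vals.append(e @ e / sigma**2)
    vals = np.array(vals)

    print(vals.mean(), n - p)                       # ~26
    print(vals.var(), 2 * (n - p))                  # ~52
    print((vals * sigma**2 / (n - p)).mean(), sigma**2)   # ~4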

Before we introduce the F-test, we are going to establish one fact about partitioned matrices. Suppose we partition X = [X1 X2]. Then [X1 X2] = X = X(X'X)⁻¹X'[X1 X2], so X1 = X(X'X)⁻¹X'X1 and X2 = X(X'X)⁻¹X'X2 (by straightforward matrix multiplication), i.e. PX1 = X1 and PX2 = X2. Taking transposes we also obtain X1' = X1'X(X'X)⁻¹X' and X2' = X2'X(X'X)⁻¹X'. Now suppose we want to test a theory that the last p2 coefficients of β are actually zero (note that if we're interested in coefficients scattered throughout β, we can just re-arrange the columns of X). In other words, splitting our system into Y = X1β1 + X2β2 + ε, with X1 of size n × p1 and X2 of size n × p2 (p1 + p2 = p), we want to see if β2 = 0.

We consider the test statistic

(||Ŷf||² − ||Ŷr||²)/σ² = Y'(X(X'X)⁻¹X' − X1(X1'X1)⁻¹X1')Y / σ²,

where Ŷf is the vector of fitted values when we regress with respect to all columns of X (full system), and Ŷr is the vector of fitted values when we regress with respect to only the first p1 columns of X (restricted system). Under the null hypothesis (β2 = 0), we have Y = X1β1 + ε, and expanding the numerator of the expression above, we get

Y'(X(X'X)⁻¹X' − X1(X1'X1)⁻¹X1')Y
= ε'(X(X'X)⁻¹X' − X1(X1'X1)⁻¹X1')ε + 2β1'X1'(X(X'X)⁻¹X' − X1(X1'X1)⁻¹X1')ε + β1'X1'(X(X'X)⁻¹X' − X1(X1'X1)⁻¹X1')X1β1.

We recognize the last summand as

(β1'X1'X(X'X)⁻¹X' − β1'X1'X1(X1'X1)⁻¹X1')X1β1 = (β1'X1' − β1'X1')X1β1 = 0,

and by the same partitioned-matrix identities the middle (cross) summand is also 0.
