Maschinelles Lernen II: PCA
uni-potsdam.de › ml › teaching › ss13 › ml2 › PCA.pdf

Transcript of the lecture slides (47 pages).

Page 1:

Universität Potsdam Institut für Informatik

Lehrstuhl Maschinelles Lernen

Maschinelles Lernen II

PCA

Christoph Sawade/Niels Landwehr/Blaine Nelson

Tobias Scheffer

Page 2:

Overview

Principal Component Analysis (PCA)

Optimization problem

Kernel-PCA

Adaptation for high-dimensional data

Fisher Linear Discriminant Analysis

Directions of Maximum Covariance

Canonical Correlation Analysis

Page 3:

Part I: Principal Component Analysis

Page 4:

PCA Motivation

Data compression

Preprocessing (Feature Selection / Noisy Features)

Data visualization

Page 5:

PCA Example

Representation of digits as an 𝑚 × 𝑚 pixel matrix.

The actual number of degrees of freedom is significantly smaller, because many features are meaningless or are composites of several pixels.

Goal: reduce to a 𝑑-dimensional subspace.

Page 6:

PCA Example

Representation of faces as an 𝑚 × 𝑚 pixel matrix.

The actual number of degrees of freedom is significantly smaller, because many features are meaningless or are composites of several pixels.

Goal: reduce to a 𝑑-dimensional subspace.

Page 7:

PCA Projection

A projection is an idempotent linear transformation.

[Figure: data points 𝐱𝑖 and their projections 𝑦1(𝐱𝑖) onto the direction 𝐮1, with 𝑦1(𝐱) = 𝐮1T𝐱.]

Center point: 𝐱̄ = (1/𝑛) ∑𝑖=1..𝑛 𝐱𝑖

Covariance: 𝚺 = (1/𝑛) ∑𝑖=1..𝑛 (𝐱𝑖 − 𝐱̄)(𝐱𝑖 − 𝐱̄)T

Page 8:

PCA Projection

A projection is an idempotent linear transformation.

Let 𝐮1 ∈ ℝ𝑚 with 𝐮1T𝐮1 = 1.

𝑦1(𝐱) = 𝐮1T𝐱 constitutes a projection onto a one-dimensional subspace.

For the projected data it follows that:

Center (mean): 𝑦̄1 = 𝐮1T𝐱̄

Variance: (1/𝑛) ∑𝑖=1..𝑛 (𝐮1T𝐱𝑖 − 𝐮1T𝐱̄)² = 𝐮1T𝚺𝐮1
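As a quick numerical check (a small NumPy sketch of mine, not part of the slides), the projected mean and variance can be compared against 𝐮1T𝐱̄ and 𝐮1T𝚺𝐮1 on random data:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                # n = 200 samples, m = 5 features
u1 = rng.normal(size=5)
u1 /= np.linalg.norm(u1)                     # normalize so that u1' u1 = 1

y = X @ u1                                   # projections y1(x_i) = u1' x_i
Sigma = np.cov(X, rowvar=False, bias=True)   # (1/n) sum (x_i - mean)(x_i - mean)'

print(np.mean(y), u1 @ X.mean(axis=0))       # projected mean equals u1' x_bar
print(np.var(y), u1 @ Sigma @ u1)            # projected variance equals u1' Sigma u1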


Page 9:

PCA Directions of Maximum Variance

Find a direction 𝐰 that maximizes the projected variance.

Consider a random variable 𝐱 ~ 𝑃𝑋 (assume zero mean). The variance of the projection onto a (normalized) direction 𝐮1 is

E[(proj𝐮1 𝐱)²] = E[𝐮1T𝐱𝐱T𝐮1] = 𝐮1T E[𝐱𝐱T] 𝐮1 = 𝐮1T𝚺𝑥𝑥𝐮1

The empirical covariance matrix (of centered data) is 𝚺𝑥𝑥 = (1/𝑛)𝐗𝐗T.

How can we find the direction 𝐮1 that maximizes 𝐮1T𝚺𝑥𝑥𝐮1?
How can we kernelize it?

Page 10:

PCA Optimization Problem

Goal: the variance of the projected data, 𝐮1T𝚺𝑥𝑥𝐮1, should not be lost.

Maximize 𝐮1T𝚺𝑥𝑥𝐮1 w.r.t. 𝐮1, such that 𝐮1T𝐮1 = 1

Lagrangian: 𝐮1T𝚺𝑥𝑥𝐮1 + 𝜆1(1 − 𝐮1T𝐮1)

Page 11:

PCA Optimization Problem

Goal: the variance of the projected data, 𝐮1T𝚺𝑥𝑥𝐮1, should not be lost.

Maximize 𝐮1T𝚺𝑥𝑥𝐮1 w.r.t. 𝐮1, such that 𝐮1T𝐮1 = 1

Lagrangian: 𝐮1T𝚺𝑥𝑥𝐮1 + 𝜆1(1 − 𝐮1T𝐮1)

Taking its derivative and setting it to 0: 𝚺𝑥𝑥𝐮1 = 𝜆1𝐮1

⟹ The solution 𝐮1 must be an eigenvector of 𝚺𝑥𝑥
⟹ The variance is the corresponding eigenvalue

This reduces PCA to determining the largest eigenvalue.
The eigenvector with the largest eigenvalue is the 1st principal component.
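A small NumPy check (my own sketch, not from the slides) that the eigenvector with the largest eigenvalue of the empirical covariance is the first principal component, and that the projected variance equals that eigenvalue:

import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0.0, 0.0], [[3.0, 1.0], [1.0, 1.0]], size=1000)
Sigma = np.cov(X, rowvar=False, bias=True)     # empirical covariance (1/n normalization)

eigvals, eigvecs = np.linalg.eigh(Sigma)       # eigenvalues in ascending order
u1 = eigvecs[:, -1]                            # eigenvector of the largest eigenvalue

print(eigvals[-1])                             # lambda_1
print(np.var(X @ u1))                          # projected variance: equals lambda_1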

Page 12:

PCA

Projection of 𝐱 onto the eigenspace:

𝑦1(𝐱) = 𝐮1T𝐱,  𝐲(𝐱) = 𝐔T𝐱 with 𝐔 = [𝐮1 ⋯ 𝐮𝑚]

The eigenvector with the largest eigenvalue is the 1st principal component.
The remaining principal components are orthogonal directions which maximize the residual variance.
The 𝑑 principal components are the eigenvectors of the 𝑑 largest eigenvalues.

Page 13:

PCA Reverse Projection

Observation: the 𝐮𝑗 form a basis for ℝ𝑚, and the 𝑦𝑗(𝐱) are the coordinates of 𝐱 in that basis.

The data 𝐱𝑖 can thus be reconstructed in that basis:

𝐱𝑖 = ∑𝑗=1..𝑚 (𝐱𝑖T𝐮𝑗) 𝐮𝑗  or  𝐗 = 𝐔𝐔T𝐗

If the data lies (mostly) in a 𝑑-dimensional principal subspace, we can also reconstruct the data there:

𝐱̂𝑖 = ∑𝑗=1..𝑑 (𝐱𝑖T𝐮𝑗) 𝐮𝑗  or  𝐗̂ = 𝐔1:𝑑𝐔1:𝑑T𝐗

where 𝐔1:𝑑 is the matrix of the first 𝑑 eigenvectors.
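A short NumPy illustration of the reconstruction 𝐗̂ = 𝐔1:𝑑𝐔1:𝑑T𝐗 (my own sketch; the synthetic data are chosen so that the variance concentrates in two directions):

import numpy as np

rng = np.random.default_rng(0)
scales = np.array([[5.0], [1.0], [0.1]])              # variance concentrated in 2 directions
X = scales * rng.normal(size=(3, 100))                # columns are samples (m = 3, n = 100)
X -= X.mean(axis=1, keepdims=True)                    # center

Sigma = X @ X.T / X.shape[1]
_, U = np.linalg.eigh(Sigma)                          # ascending eigenvalues
U_d = U[:, ::-1][:, :2]                               # first d = 2 eigenvectors

X_hat = U_d @ (U_d.T @ X)                             # X_hat = U_{1:d} U_{1:d}^T X
print(np.linalg.norm(X - X_hat) / np.linalg.norm(X))  # small relative residual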

Page 14:

PCA Algorithm

PCA finds the dataset's principal components, which maximize the projected variance.

Algorithm:
1. Compute the data's mean: 𝛍 = (1/𝑛) ∑𝑖=1..𝑛 𝐱𝑖
2. Compute the data's covariance: 𝚺𝑥𝑥 = (1/𝑛) ∑𝑖=1..𝑛 (𝐱𝑖 − 𝛍)(𝐱𝑖 − 𝛍)T
3. Find the principal axes: 𝐔, 𝐕 = eig(𝚺𝑥𝑥)
4. Project the data onto the first 𝑑 eigenvectors: 𝐱̂𝑖 ← 𝐔1:𝑑T(𝐱𝑖 − 𝛍)
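A minimal NumPy sketch of these four steps (my own code, not the lecturers'; note that numpy.linalg.eigh returns eigenvalues in ascending order, hence the reordering):

import numpy as np

def pca_project(X, d):
    """X: (n, m) data matrix with samples in rows; returns the (n, d) projections."""
    mu = X.mean(axis=0)                      # step 1: mean
    Xc = X - mu
    Sigma = Xc.T @ Xc / X.shape[0]           # step 2: covariance (1/n normalization)
    eigvals, U = np.linalg.eigh(Sigma)       # step 3: principal axes (ascending order)
    U_d = U[:, ::-1][:, :d]                  # the d eigenvectors with the largest eigenvalues
    return Xc @ U_d                          # step 4: project the centered data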

Page 15:

PCA Example

Page 16:

Part II: Kernel PCA

Page 17:

PCA Motivation: high-dimensional data

The computation of 𝑑 eigenvectors for 𝑚-dimensional data is 𝑂(𝑑𝑚²).
Not tractable for large 𝑚.

Idea: the data points span a linear subspace of at most min(𝑚, 𝑛) dimensions.

Let 𝐱̄ = 𝟎. Then, with the data matrix 𝐗 ∈ ℝ𝑚×𝑛 and 𝐯1 = 𝐗T𝐮1:

𝚺𝑥𝑥𝐮1 = 𝜆1𝐮1  ⟹  𝐗𝐯1 = 𝑛𝜆1𝐮1,  𝐗T𝐗𝐯1 = 𝑛𝜆1𝐯1

The computation is 𝑂(𝑑𝑛²) instead of 𝑂(𝑑𝑚²).
𝐗T𝐗 has the same eigen-solutions (𝑛 − 1 of them, except for eigenvalues 0): 𝐮𝑖 = (1/(𝑛𝜆𝑖)) 𝐗𝐯𝑖

(Here 𝚺𝑥𝑥 = (1/𝑛)𝐗𝐗T, and 𝐗T𝐗 is the kernel matrix 𝐊𝑥𝑥.)
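A NumPy sketch of this adaptation (my own; note that with unit-norm eigenvectors 𝐯𝑖 of 𝐗T𝐗, the rescaling becomes 𝐮𝑖 = 𝐗𝐯𝑖 / √(𝑛𝜆𝑖)):

import numpy as np

rng = np.random.default_rng(0)
m, n = 1000, 50                                 # many features, few samples
X = rng.normal(size=(m, n))                     # columns are samples (as on the slide)
X -= X.mean(axis=1, keepdims=True)              # center: mean is zero

mu, V = np.linalg.eigh(X.T @ X)                 # (X^T X) v = (n * lambda) v
idx = np.argsort(mu)[::-1][:5]                  # the 5 largest eigenvalues
lam = mu[idx] / n                               # eigenvalues of Sigma_xx = (1/n) X X^T
U = X @ V[:, idx] / np.sqrt(mu[idx])            # normalized u_i = X v_i / sqrt(n lambda_i)

Sigma = X @ X.T / n                             # only built here to check the claim
print(np.allclose(Sigma @ U[:, 0], lam[0] * U[:, 0]))   # True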

Page 18:

Kernel-PCA

Requirement: the data interact only through inner products.

PCA can only capture linear subspaces.
More complex features can capture non-linearity.
We want to use PCA in high-dimensional feature spaces.

Page 19:

Kernel PCA Ring Data Example

Page 20:

Kernel PCA Ring Data Example

PCA fails to capture the data's two-ring structure: the rings are not separated in the first 2 components.

Page 21:

Kernel-PCA Recap: Kernels

Linear classifiers: often adequate, but not always.

Idea: the data are implicitly mapped into another space, in which they are linearly classifiable.

Feature mapping: 𝐱 ↦ 𝜙(𝐱)
Associated kernel: 𝜅(𝐱𝑖, 𝐱𝑗) = 𝜙(𝐱𝑖)T𝜙(𝐱𝑗)

Kernel = inner product = similarity of examples.

[Figure: positive (+) and negative (−) examples that are not linearly separable in the original space become linearly separable after the mapping 𝜙.]

Page 22:

Kernel-PCA

For ∑𝑖=1..𝑛 𝜙(𝐱𝑖) = 𝟎, the eigenvector problem is equivalently transformed:

𝚺𝐮𝑖 = 𝜆𝑖𝐮𝑖  ⟺  𝐊𝛂𝑖 = 𝑛𝜆𝑖𝛂𝑖

Projection: 𝑦𝑖(𝐱) = 𝜙(𝐱)T𝐯𝑖 = ∑𝑗=1..𝑛 𝛼𝑖,𝑗 𝜅(𝐱, 𝐱𝑗)

Alternative derivation via the Mercer-Map…

Page 23:

Kernel-PCA Algorithm

Kernel-PCA finds the dataset's principal components in an implicitly defined feature space.

Algorithm:
1. Compute the kernel matrix 𝐊: 𝐾𝑖𝑗 = 𝜅(𝐱𝑖, 𝐱𝑗)
2. Center the kernel matrix: 𝐊̃ = 𝐊 − (1/𝑛)𝟏𝟏T𝐊 − (1/𝑛)𝐊𝟏𝟏T + (𝟏T𝐊𝟏/𝑛²)𝟏𝟏T
3. Find its eigenvectors: 𝐔, 𝐕 = eig(𝐊̃)
4. Find the dual vectors: 𝛂𝑘 = 𝜆𝑘−1/2 𝐮𝑘
5. Project the data onto the subspace: 𝐱̂𝑗 ← (∑𝑖=1..𝑛 𝛼𝑘,𝑖 𝐾̃𝑖𝑗)𝑘=1..𝑑 = (𝛂𝑘T𝐊̃∗,𝑗)𝑘=1..𝑑
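A minimal NumPy sketch of these five steps on ring-shaped toy data (my own code; the RBF kernel, its bandwidth, and the data generator are illustrative choices, not taken from the slides):

import numpy as np

def rbf_kernel(X, gamma=1.0):
    # kappa(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
n = 200
# two noisy concentric rings (radii chosen arbitrarily)
angles = rng.uniform(0, 2 * np.pi, n)
radii = np.where(rng.random(n) < 0.5, 1.0, 3.0) + 0.05 * rng.normal(size=n)
X = np.c_[radii * np.cos(angles), radii * np.sin(angles)]

K = rbf_kernel(X, gamma=2.0)                       # step 1: kernel matrix
one = np.ones((n, n)) / n
Kc = K - one @ K - K @ one + one @ K @ one         # step 2: centering
lam, U = np.linalg.eigh(Kc)                        # step 3: eigenvectors (ascending)
lam, U = lam[::-1], U[:, ::-1]                     # sort descending
d = 2
alphas = U[:, :d] / np.sqrt(lam[:d])               # step 4: dual vectors alpha_k
Z = Kc @ alphas                                    # step 5: (n, d) projections

The columns of Z are the projections onto the first d kernel principal components; for this kind of data they should separate the two rings, as the next slide illustrates.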

Page 24:

Kernel PCA Ring Data Example

Kernel PCA (with an RBF kernel) does capture the data's structure, and the resulting projections separate the 2 rings.

Page 25:

Kernel-PCA Mercer Map

Observation: any symmetric matrix 𝐊 can be factorized as follows (eigenvalue decomposition):

𝐊 = 𝐔𝐕𝐔−1,  with 𝐔 = [𝐮1 ⋯ 𝐮𝑚] and 𝐕 = diag(𝜆1, …, 𝜆𝑚)

If 𝐊 is positive semi-definite, then 𝜆𝑖 ≥ 0 for all 𝑖.
If the eigenvectors are normalized (𝐮𝑖T𝐮𝑖 = 1), then 𝐔T = 𝐔−1.
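A quick NumPy check of these facts on a random positive semi-definite matrix (my own sketch):

import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))
K = M @ M.T                                     # symmetric, positive semi-definite
lam, U = np.linalg.eigh(K)                      # K = U V U^T with orthonormal U
print(np.all(lam >= -1e-10))                    # eigenvalues are non-negative
print(np.allclose(K, U @ np.diag(lam) @ U.T))   # True
print(np.allclose(U.T, np.linalg.inv(U)))       # True: U^T = U^{-1}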

Page 26:

Kernel-PCA Mercer Map

Eigenvalue decomposition of 𝐊 with diagonal 𝐕 (𝑉𝑖𝑖 = 𝜆𝑖) and 𝐔T = 𝐔−1:

𝐊 = 𝐔𝐕𝐔T
  = 𝐔𝐕1/2 𝐕1/2 𝐔T
  = (𝐔𝐕1/2)(𝐔𝐕1/2)T
  = 𝚽(𝐗)T𝚽(𝐗)  with  𝚽(𝐗) ≔ [𝜙(𝐱1) ⋯ 𝜙(𝐱𝑛)]

An explicit feature mapping is given by

𝐊𝑥,𝑛𝑒𝑤 = 𝚽(𝐗)T𝚽(𝐗𝑛𝑒𝑤) = 𝐔𝐕1/2 𝚽(𝐗𝑛𝑒𝑤)
𝚽(𝐗𝑛𝑒𝑤) = (𝐔𝐕1/2)−1 𝐊𝑥,𝑛𝑒𝑤 = 𝐕−1/2 𝐔T𝐊𝑥,𝑛𝑒𝑤

Page 27:

Kernel-PCA Mercer Map

The explicit feature mapping is given by

𝚽(𝐗𝑛𝑒𝑤) = 𝐕−1/2 𝐔T𝐊𝑥,𝑛𝑒𝑤

Observation: reduction to 𝑑 principal components is equivalent to

𝚽𝑟𝑒𝑑(𝐗𝑛𝑒𝑤) = 𝐕̄−1/2 𝐔T𝐊𝑥,𝑛𝑒𝑤,  where 𝐕̄ = diag(𝜆1, …, 𝜆𝑑, 0, …)
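A sketch of the resulting feature extractor, assuming the kernel matrices K (training vs. training) and K_new (training vs. new points) have already been computed; the function and argument names are mine:

import numpy as np

def mercer_features(K, K_new, d):
    """K: (n, n) training kernel matrix; K_new: (n, n_new) values kappa(x_i, x_new_j).
    Returns the (d, n_new) reduced features V^{-1/2} U^T K_new (assumes d is below the rank of K)."""
    lam, U = np.linalg.eigh(K)
    lam, U = lam[::-1], U[:, ::-1]                  # sort eigenvalues descending
    return (U[:, :d] / np.sqrt(lam[:d])).T @ K_new  # keep only the d leading components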

Page 28:

Part III: Fisher Discriminant Analysis

Page 29:

Fisher-Discriminant Analysis (FDA)

The subspace induced by PCA maximally captures the variance of all the data.
This is not the right criterion for classification…

29

-5 -4 -3 -2 -1 0 1 2 3 4 5-40

-30

-20

-10

0

10

20

30Original Space

x1

x2

-1 -0.5 0 0.5 1-40

-30

-20

-10

0

10

20

30PCA Subspace

x1

x2

𝐗T𝐮𝑃𝐶𝐴

𝚺𝐮𝑃𝐶𝐴 = 𝜆𝑃𝐶𝐴𝐮𝑃𝐶𝐴

Page 30:

Fisher-Discriminant Analysis (FDA)

Optimization criterion of PCA:
Maximize the data's variance in the subspace.

max𝐮 𝐮T𝚺𝐮, where 𝐮T𝐮 = 1

Optimization criterion of FDA:
Maximize the between-class variance and minimize the within-class variance within the subspace.

max𝐮 (𝐮T𝚺𝑏𝐮) / (𝐮T𝚺𝑤𝐮),  where
𝚺𝑤 = 𝚺+1 + 𝚺−1  (the variance per class)
𝚺𝑏 = (𝐱̄+1 − 𝐱̄−1)(𝐱̄+1 − 𝐱̄−1)T

Already introduced as a classifier in ML 1.
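For the two-class case the optimal direction has the familiar closed form 𝚺𝑤−1(𝐱̄+1 − 𝐱̄−1); a minimal NumPy sketch of mine (the input arrays X_pos and X_neg are hypothetical):

import numpy as np

def fisher_direction(X_pos, X_neg):
    """X_pos, X_neg: (n_+, m) and (n_-, m) arrays with samples in rows."""
    mu_p, mu_n = X_pos.mean(axis=0), X_neg.mean(axis=0)
    Sw = (np.cov(X_pos, rowvar=False, bias=True)
          + np.cov(X_neg, rowvar=False, bias=True))   # within-class variance
    u = np.linalg.solve(Sw, mu_p - mu_n)               # Sw^{-1} (mean_+ - mean_-)
    return u / np.linalg.norm(u)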

Page 31:

Fisher-Discriminant Analysis (FDA)

Optimization criterion of FDA for 𝑘 classes:
Maximize the between-class variance and minimize the within-class variance within the subspace.

max𝐮 (𝐮T𝚺𝑏𝐮) / (𝐮T𝚺𝑤𝐮),  where
𝚺𝑤 = 𝚺1 + ⋯ + 𝚺𝑘
𝚺𝑏 = ∑𝑖=1..𝑘 𝑛𝑖 (𝐱̄𝑖 − 𝐱̄)(𝐱̄𝑖 − 𝐱̄)T  (𝑛𝑖 = number of samples in class 𝑖)

This leads to the generalized eigenvalue problem 𝚺𝑏𝐮 = 𝜆𝚺𝑤𝐮, which has 𝑘 − 1 distinct solutions.
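A sketch of the multi-class case, assuming SciPy is available for the generalized symmetric eigenproblem (function and variable names are mine; a small ridge on 𝚺𝑤 may be needed if it is singular):

import numpy as np
from scipy.linalg import eigh            # generalized symmetric eigensolver

def fda_directions(X, y):
    """X: (n, m) samples in rows, y: (n,) class labels; returns the k-1 directions."""
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    m = X.shape[1]
    Sw, Sb = np.zeros((m, m)), np.zeros((m, m))
    for c in classes:
        Xc = X[y == c]
        Sw += np.cov(Xc, rowvar=False, bias=True)        # Sigma_c
        diff = Xc.mean(axis=0) - mean_all
        Sb += len(Xc) * np.outer(diff, diff)             # n_c (mean_c - mean)(mean_c - mean)^T
    lam, U = eigh(Sb, Sw)                 # solves Sb u = lambda Sw u (ascending eigenvalues)
    return U[:, ::-1][:, :len(classes) - 1]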

Page 32:

Fisher-Discriminant Analysis (FDA)

The subspace induced by PCA maximally captures the variance of all the data.
This is not the right criterion for classification…

[Figure: the original space (x1, x2), the PCA subspace 𝐗T𝐮𝑃𝐶𝐴 (where 𝚺𝐮𝑃𝐶𝐴 = 𝜆𝑃𝐶𝐴𝐮𝑃𝐶𝐴), and the Fisher subspace 𝐗T𝐮𝐹𝐷𝐴 (where 𝚺𝑏𝐮𝐹𝐷𝐴 = 𝜆𝐹𝐷𝐴𝚺𝑤𝐮𝐹𝐷𝐴).]

Page 33:

Part IV: Maximum Covariance Analysis

Page 34:

Maximum Covariance Analysis (MCA) Motivation: Explanatory Directions

Consider data (𝐱, 𝐲): input and output, with (𝐱, 𝐲) ~ 𝑃(𝑋, 𝑌).

Find covariance directions 𝐮𝑋 ∈ 𝑋 and 𝐮𝑌 ∈ 𝑌 such that changes along 𝐮𝑋 correspond to changes along 𝐮𝑌.

Assuming mean-centered data, the covariance of the projections onto (normalized) 𝐮𝑋 and 𝐮𝑌 is again

E[(𝐮𝑋T𝐱)(𝐮𝑌T𝐲)] = 𝐮𝑋T𝚺𝑥𝑦𝐮𝑌

The empirical covariance matrix (of centered data) is 𝚺𝑥𝑦 = (1/𝑛)𝐗𝐘T.

How can we maximize 𝐮𝑋T𝚺𝑥𝑦𝐮𝑌 for a non-square 𝚺𝑥𝑦?
How can we kernelize it in the space 𝑋?

Page 35:

Maximum Covariance Analysis (MCA) Solution: Singular Value Decomposition

MCA can be cast as finding a pair of unit directions that maximize the covariance:

max‖𝐮𝑋‖=‖𝐮𝑌‖=1 (1/𝑛) 𝐮𝑋T𝐗𝐘T𝐮𝑌
≡ max‖𝐮𝑋‖=1 (1/𝑛²) 𝐮𝑋T(𝐗𝐘T𝐘𝐗T)𝐮𝑋
≡ max‖𝐮𝑌‖=1 (1/𝑛²) 𝐮𝑌T(𝐘𝐗T𝐗𝐘T)𝐮𝑌

Solve for 𝐮𝑋 and 𝐮𝑌 as eigenvalue problems.
The solution is a singular value decomposition (SVD).
It produces triplets (𝜎𝑖, 𝐮𝑋,𝑖, 𝐮𝑌,𝑖): a singular value and the corresponding vectors; 𝜎𝑖 is the covariance captured.
In general we obtain the SVD 𝚺𝑥𝑦 = 𝐔𝑋𝐕𝐔𝑌T.
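A small NumPy sketch (not from the slides; the toy data are my own) that computes the covariance directions via the SVD of 𝚺𝑥𝑦:

import numpy as np

rng = np.random.default_rng(0)
n = 500
Z = rng.normal(size=(1, n))                       # shared latent signal
X = np.vstack([Z, rng.normal(size=(2, n))])       # inputs  (m_x = 3, columns are samples)
Y = np.vstack([2 * Z, rng.normal(size=(1, n))])   # outputs (m_y = 2)
X -= X.mean(axis=1, keepdims=True)                # center
Y -= Y.mean(axis=1, keepdims=True)

Sigma_xy = X @ Y.T / n
U_X, s, U_Y_T = np.linalg.svd(Sigma_xy)           # Sigma_xy = U_X diag(s) U_Y^T
u_x, u_y = U_X[:, 0], U_Y_T[0, :]                 # top covariance directions
print(s[0], u_x @ Sigma_xy @ u_y)                 # sigma_1 equals the covariance captured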

Page 36:

Kernelized MCA

MCA can also be kernelized by mapping 𝐱 ↦ 𝜙(𝐱).

Eigen-analysis of 𝚺𝑥𝑦𝚺𝑥𝑦T yields 𝐔𝑋, and eigen-analysis of 𝚺𝑥𝑦T𝚺𝑥𝑦 yields 𝐔𝑌, of the SVD of 𝚺𝑥𝑦; in fact

𝚺𝑥𝑦T𝚺𝑥𝑦 = (1/𝑛²) 𝐘𝐊𝑥𝑥𝐘T

The relationship between 𝐮𝑋,𝑖 and 𝐮𝑌,𝑖 is then

𝐮𝑋,𝑖 = (1/𝜎𝑖) 𝚺𝑥𝑦𝐮𝑌,𝑖

This gives projections onto 𝐮𝑋,𝑖 when 𝑋 is kernelized:

proj𝐮𝑋,𝑘 𝜙(𝐱) = ∑𝑗=1..𝑛 𝛼𝑘,𝑗 𝜅(𝐱𝑗, 𝐱),  with 𝛂𝑘 = (1/(𝑛𝜎𝑘)) 𝐘T𝐮𝑌,𝑘

Page 37:

Part V: Canonical Correlation Analysis (CCA)

Page 38:

Canonical Correlation Analysis (CCA) Motivation

We have 2 different representations of the same data 𝐱:
𝐱𝑎 ← 𝜓𝑎(𝐱) and 𝐱𝑏 ← 𝜓𝑏(𝐱)

Find directions 𝐮𝑎 ∈ 𝑋𝑎 and 𝐮𝑏 ∈ 𝑋𝑏 such that changes along 𝐮𝑎 correspond to changes along 𝐮𝑏 (correlated directions).

Examples of related datasets:
Climate data: spatial measurements of different quantities (pressure/temperature) may be correlated due to a single underlying phenomenon (El Niño).
Multi-lingual text: parallel texts written in 2 different languages that represent the same ideas.

Page 39:

Canonical Correlation Analysis (CCA) Example

Climate prediction: researchers have used CCA techniques to find correlations between sea-level pressure and sea-surface temperature.

Page 40:

Canonical Correlation Analysis (CCA) Motivation

We have 2 different representations of the same data 𝐱:
𝐱𝑎 ← 𝜓𝑎(𝐱) and 𝐱𝑏 ← 𝜓𝑏(𝐱)

Find directions 𝐮𝑎 ∈ 𝑋𝑎 and 𝐮𝑏 ∈ 𝑋𝑏 such that changes along 𝐮𝑎 correspond to changes along 𝐮𝑏 (correlated directions).

For mean-centered data, the correlation of the projections onto (normalized) 𝐮𝑎 and 𝐮𝑏 is

𝜌𝑎𝑏 = E[(𝐮𝑎T𝐱𝑎)(𝐮𝑏T𝐱𝑏)] / √( E[(𝐮𝑎T𝐱𝑎)(𝐮𝑎T𝐱𝑎)] · E[(𝐮𝑏T𝐱𝑏)(𝐮𝑏T𝐱𝑏)] )

Page 41:

Canonical Correlation Analysis (CCA) Motivation

We have 2 different representations of the same data 𝐱:
𝐱𝑎 ← 𝜓𝑎(𝐱) and 𝐱𝑏 ← 𝜓𝑏(𝐱)

Find directions 𝐮𝑎 ∈ 𝑋𝑎 and 𝐮𝑏 ∈ 𝑋𝑏 such that changes along 𝐮𝑎 correspond to changes along 𝐮𝑏 (correlated directions).

For mean-centered data, the correlation of the projections onto (normalized) 𝐮𝑎 and 𝐮𝑏 is

𝜌𝑎𝑏 = 𝐮𝑎T𝚺𝑎𝑏𝐮𝑏 / √( (𝐮𝑎T𝚺𝑎𝑎𝐮𝑎) · (𝐮𝑏T𝚺𝑏𝑏𝐮𝑏) )

How can we find directions that maximize 𝜌𝑎𝑏?
How can we kernelize it in the spaces 𝑋𝑎 and 𝑋𝑏?

Page 42:

Canonical Correlation Analysis (CCA) Equivalent Optimization Problem

CCA is equivalent to finding a pair of directions, normalized to unit variance in each view, that maximize the covariance:

max 𝐮𝑎T𝚺𝑎𝑏𝐮𝑏  subject to  𝐮𝑎T𝚺𝑎𝑎𝐮𝑎 = 𝐮𝑏T𝚺𝑏𝑏𝐮𝑏 = 1

The CCA program has the Lagrangian:

ℒ = 𝐮𝑎T𝚺𝑎𝑏𝐮𝑏 − (𝜆𝑎/2)(𝐮𝑎T𝚺𝑎𝑎𝐮𝑎 − 1) − (𝜆𝑏/2)(𝐮𝑏T𝚺𝑏𝑏𝐮𝑏 − 1)

Page 43:

Canonical Correlation Analysis (CCA) Equivalent Optimization Problem

The CCA program has the Lagrangian:

ℒ = 𝐮𝑎T𝚺𝑎𝑏𝐮𝑏 − (𝜆𝑎/2)(𝐮𝑎T𝚺𝑎𝑎𝐮𝑎 − 1) − (𝜆𝑏/2)(𝐮𝑏T𝚺𝑏𝑏𝐮𝑏 − 1)

Setting the derivatives of ℒ to 0 gives the conditions

𝚺𝑎𝑏𝐮𝑏 − 𝜆𝑎𝚺𝑎𝑎𝐮𝑎 = 𝟎   and   𝚺𝑏𝑎𝐮𝑎 − 𝜆𝑏𝚺𝑏𝑏𝐮𝑏 = 𝟎

One can show that 𝜆𝑎 = 𝜆𝑏 = 𝜆, and we thus have

[𝟎 𝚺𝑎𝑏; 𝚺𝑏𝑎 𝟎] [𝐮𝑎; 𝐮𝑏] = 𝜆 [𝚺𝑎𝑎 𝟎; 𝟎 𝚺𝑏𝑏] [𝐮𝑎; 𝐮𝑏],  i.e.,  𝐀𝐰 = 𝜆𝐁𝐰

This is a generalized eigenvalue problem

Page 44:

Canonical Correlation Analysis (CCA) Solving as a Generalized Eigenvalue Problem

Generalized eigenvalue problem: 𝐀𝐰 = 𝜆𝐁𝐰

Since 𝐁 ≻ 0, we could simply invert it: 𝐁−1𝐀𝐰 = 𝜆𝐰

This is now an ordinary eigenvalue problem… are we done?
No: 𝐁−1𝐀 is not symmetric.

Page 45:

Canonical Correlation Analysis (CCA) Solving as a Generalized Eigenvalue Problem

Generalized eigenvalue problem: 𝐀𝐰 = 𝜆𝐁𝐰

If 𝐁 ≻ 0, it can be decomposed as 𝐁 = 𝐁1/2𝐁1/2, and by letting 𝐰 = 𝐁−1/2𝐯 we obtain the (symmetric) eigenvalue problem

𝐁−1/2𝐀𝐁−1/2𝐯 = 𝜆𝐯

Solutions (𝜆𝑖, 𝐯𝑖) give solutions to the original problem: (𝐮𝑎,𝑖; 𝐮𝑏,𝑖) = 𝐰𝑖 = 𝐁−1/2𝐯𝑖

Each eigenvalue is a correlation coefficient: 𝜆𝑖 ∈ [−1, 1]
The directions 𝐰𝑖 are in general not orthogonal; they are only conjugate in the space defined by 𝐁.
However, there are 2𝑛 solutions; why?
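A minimal NumPy sketch (my own) that solves 𝐀𝐰 = 𝜆𝐁𝐰 by whitening with 𝐁−1/2; the small ridge term is an addition for numerical stability, in the spirit of the regularization discussed for the kernelized case on the next slide:

import numpy as np

def cca_directions(Xa, Xb, reg=1e-6):
    """Xa: (m_a, n), Xb: (m_b, n), columns are mean-centered samples."""
    n = Xa.shape[1]
    Saa = Xa @ Xa.T / n + reg * np.eye(Xa.shape[0])   # small ridge: my addition
    Sbb = Xb @ Xb.T / n + reg * np.eye(Xb.shape[0])
    Sab = Xa @ Xb.T / n
    ma, mb = Xa.shape[0], Xb.shape[0]
    A = np.zeros((ma + mb, ma + mb))
    A[:ma, ma:], A[ma:, :ma] = Sab, Sab.T
    B = np.zeros_like(A)
    B[:ma, :ma], B[ma:, ma:] = Saa, Sbb
    lamB, Q = np.linalg.eigh(B)                       # B = Q diag(lamB) Q^T, lamB > 0
    B_inv_sqrt = Q @ np.diag(1.0 / np.sqrt(lamB)) @ Q.T
    lam, V = np.linalg.eigh(B_inv_sqrt @ A @ B_inv_sqrt)   # symmetric eigenproblem
    W = B_inv_sqrt @ V                                # w_i = B^{-1/2} v_i
    order = np.argsort(lam)[::-1]
    return lam[order], W[:, order]                    # correlations and stacked (u_a; u_b)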

Page 46:

Canonical Correlation Analysis (CCA) Kernelizing CCA

CCA is kernelized by taking 𝐮𝑎 = 𝐗𝑎𝛂𝑎 and 𝐮𝑏 = 𝐗𝑏𝛂𝑏.
For both representations, build the kernel matrices 𝐊𝑎 and 𝐊𝑏.
We can then replace 𝚺𝑎𝑎 → 𝐊𝑎𝐊𝑎, 𝚺𝑏𝑏 → 𝐊𝑏𝐊𝑏, and 𝚺𝑎𝑏 → 𝐊𝑎𝐊𝑏; applying CCA, we arrive at

𝐊𝑎𝐊𝑏𝛂𝑏 − 𝜆𝐊𝑎𝐊𝑎𝛂𝑎 = 𝟎   and   𝐊𝑏𝐊𝑎𝛂𝑎 − 𝜆𝐊𝑏𝐊𝑏𝛂𝑏 = 𝟎

Problem: in high-dimensional spaces where 𝑚 ≫ 𝑛, one can always find perfect correlations, an instance of the curse of dimensionality.
What can we do?
Answer: regularize the directions 𝐮𝑎 and 𝐮𝑏.
The solution is beyond the scope of this lecture, but it, too, is obtained from a generalized eigenvalue problem.

Page 47:

Summary

Goal: reduction / compression of data into its essential components.

Maximization of variance leads to an eigenvalue problem for principal component analysis (PCA).
It is applicable to high-dimensional data and non-linear components (kernel PCA).
Class-dependent variance minimization leads to Fisher discriminant analysis (FDA).
Covariance maximization also yields a singular value problem (MCA).
Maximum correlation between 2 different representations leads to canonical correlation analysis (CCA).