Transcript of "Linear Regression" (staff.utia.cas.cz/smidl/files/hbm2020/prezentace03_20.pdf), 36 pages.

Page 1:

Linear Regression

Václav Šmídl

March 3, 2020

Page 2:

Linear regression and OLS

Fit by a linear function:

y₁ = a·x₁ + b + e₁,
y₂ = a·x₂ + b + e₂,
⋮

In matrix notation, with θ = [a, b]ᵀ:

y = Xθ + e.

Minimize ∑ᵢ eᵢ² = eᵀe = ||y − Xθ||₂²:

d(eᵀe)/dθ = 0
d/dθ ((y − Xθ)ᵀ(y − Xθ)) = 0
d/dθ (yᵀy − θᵀXᵀy − yᵀXθ + θᵀXᵀXθ) = 0
−Xᵀy + XᵀXθ = 0

[Figure: scatter plot of the data, x ∈ [0, 10], y ∈ [0, 14].]

Solution:

θ = (XᵀX)⁻¹Xᵀy.
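A minimal numerical check of this solution (not from the slides; a NumPy sketch on synthetic data, so the coefficients a = 1.2, b = 0.7 are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 1.2 * x + 0.7 + rng.normal(scale=0.5, size=50)   # yᵢ = a·xᵢ + b + eᵢ

X = np.column_stack([x, np.ones_like(x)])            # regressors for θ = [a, b]ᵀ
theta = np.linalg.solve(X.T @ X, X.T @ y)            # θ = (XᵀX)⁻¹Xᵀy
print(theta)                                         # ≈ [1.2, 0.7]
```

In practice np.linalg.lstsq(X, y, rcond=None) is preferred over forming XᵀX explicitly, for exactly the conditioning reasons discussed below.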


Page 5:

Example: The European Tracer Experiment

- Release of 340 kg of PMCH over 12 hours.
- Direct modeling: compute concentrations from the known meteorology and release rates.
- Inverse problem: find the release profile, i.e. how much was released and when.

Page 6:

Least squares solution:

1. Define a time window of possible releases, t ∈ [t₁, …, tₙ], with magnitudes θᵢ.
2. Compute the concentrations c_{j,i} at all measurement locations j ∈ [1, …, m] for a release of 1 unit of tracer at time tᵢ, ∀i.
3. Solve the linear problem

   ⎡ y₁ ⎤   ⎡ c₁,₁  c₁,₂  ⋯  c₁,ₙ ⎤ ⎡ θ₁ ⎤
   ⎢ y₂ ⎥ = ⎢ c₂,₁  c₂,₂      ⋮  ⎥ ⎢ θ₂ ⎥
   ⎢ ⋮  ⎥   ⎢ ⋮          ⋱      ⎥ ⎢ ⋮  ⎥
   ⎣ yₘ ⎦   ⎣ cₘ,₁       ⋯  cₘ,ₙ ⎦ ⎣ θₙ ⎦

   for the vector of measurements y and the vector of unknowns θ.

⇒ Fails: the system is poorly conditioned.
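Why this fails can be reproduced with any design matrix whose columns are nearly collinear (a sketch with synthetic data; the matrix below is a hypothetical stand-in, not the actual ETEX source-receptor matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
base = rng.normal(size=(100, 1))
# five almost identical columns => XᵀX is nearly singular
X = np.hstack([base + 1e-6 * rng.normal(size=(100, 1)) for _ in range(5)])
y = X @ np.ones(5) + rng.normal(scale=0.1, size=100)

print(np.linalg.cond(X.T @ X))              # enormous condition number
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(np.abs(theta).max())                  # coefficients blow up
```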

Page 7:

Ridge regression

Ordinary least squares,

θ = (XᵀX)⁻¹Xᵀy,

is problematic for numerical stability: as min eig(XᵀX) → 0, θ → ∞. Remedies:

- Replace the inverse by a pseudo-inverse (thresholding small eigenvalues).
- Increase the eigenvalues of XᵀX:

  θ = (XᵀX + αI)⁻¹Xᵀy.   (1)

  The minimal eigenvalue is then at least α.
- Add a penalization for large values:

  θ = arg min_θ ( ||y − Xθ||₂² + α||θ||₂² ),

  where α is a suitably chosen coefficient; this yields (1).
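Equation (1) in code (a sketch; ridge in closed form, where `ridge` is a hypothetical helper reusing the ill-conditioned X, y from the previous snippet):

```python
import numpy as np

def ridge(X, y, alpha):
    """θ = (XᵀX + αI)⁻¹Xᵀy, i.e. eq. (1)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

# even a tiny α tames the solution where plain OLS blew up:
# print(ridge(X, y, alpha=1e-6))
```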


Page 11:

Ridge regression – selection of α

Selection of α:

1. Cross-validation (analytical solution for α).
2. L-curve: plot ||θ||₂² versus ||y − Xθ||₂².
3. Joint estimation of α and θ:
   - Bayes.

[Figure: histogram of the resulting tuning-parameter settings for ETEX ERA-40 B, on a log scale from 10⁻⁵ to 10⁰.]

Probability model:

p(y, θ | X, α) = p(y | θ, X, ω) p(θ | α)
              = N(Xθ, ωI) · N(0, α⁻¹I)
              ∝ exp{ −(1/(2ω)) ||y − Xθ||₂² − (α/2) ||θ||₂² }


Page 13:

Ridge regression – selection of α

Probability model:

p(y, θ | X, α) = p(y | θ, X, ω) p(θ | α)
              = N(Xθ, ωI) · N(0, α⁻¹I)
              ∝ exp{ −(1/(2ω)) ||y − Xθ||₂² − (α/2) ||θ||₂² }

Introduce a prior

p(α) = G(δ₀, γ₀),

and compute

p(α | y, X)

or

p(θ | X, y, ω) = ∫ p(θ, α | X, y, ω) dα.

[Graphical model: hyperparameters γ, δ → α → θ; X, ω and θ → y.]

Page 14:

Bayesian estimation of α

Joint likelihood:

p(y, θ, α | X) = p(y | θ, X, ω) p(θ | α) p(α)
              = N(Xθ, ωI) · N(0, α⁻¹I) · G(δ₀, γ₀),

p(θ, α | y, X) ∝ α^(δ₀ + d/2 − 1) exp{ −(1/(2ω)) ||y − Xθ||₂² − (α/2) ||θ||₂² − γ₀α }.

Conditional for θ:

p(θ | α, y, X) = N(θ̂, (XᵀX + αI)⁻¹),   θ̂ = (XᵀX + αI)⁻¹Xᵀy.

Conditional for α:

p(α | θ, y, X) = G(δ₀ + d/2, γ₀ + ||θ||₂²/2).

Solution: EM, VB, Gibbs, …
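The two conditionals are exactly what a Gibbs sampler alternates. A minimal sketch (assumes NumPy and a known noise variance ω; `gibbs_ridge` is a hypothetical helper name):

```python
import numpy as np

def gibbs_ridge(X, y, omega=1.0, delta0=1e-8, gamma0=1e-8, n_samples=2000):
    n, d = X.shape
    rng = np.random.default_rng(0)
    alpha, samples = 1.0, []
    for _ in range(n_samples):
        # p(θ | α, y, X): Gaussian with precision XᵀX/ω + αI
        Sigma = np.linalg.inv(X.T @ X / omega + alpha * np.eye(d))
        theta = rng.multivariate_normal(Sigma @ X.T @ y / omega, Sigma)
        # p(α | θ): Gamma(δ₀ + d/2, rate γ₀ + ||θ||²/2); numpy takes scale = 1/rate
        alpha = rng.gamma(delta0 + d / 2, 1.0 / (gamma0 + theta @ theta / 2))
        samples.append((theta, alpha))
    return samples
```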

Page 15:

Ridge regression – polynomial case

Define a degenerate case:

- true model: y = x² + e,
- observations at x = [1, 2],
- fit a polynomial of 5th order,
- use αI to restore the rank of the covariance matrix.

[Figures: (a) data generated from y = 0 + 0·x + 1·x² + 0.05e; (b) L-curve, norm(θ) versus norm(y − Xθ), for α = exp(−10:2); (c) coefficient paths of the basis functions 1, x, x², x³, x⁴, x⁵ as α ranges from 10⁻⁵ to 10¹⁰.]
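The sweep behind panels (b) and (c) is easy to reproduce (a sketch; the grid matches the slide's α = exp(−10:2), the particular noise draw is an assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.array([1.0, 2.0])                      # two observations only
y = x**2 + 0.05 * rng.normal(size=2)          # true model y = x² + e
X = np.vander(x, 6, increasing=True)          # basis 1, x, x², ..., x⁵

for alpha in np.exp(np.arange(-10.0, 3.0)):   # α = exp(-10:2)
    theta = np.linalg.solve(X.T @ X + alpha * np.eye(6), X.T @ y)
    print(f"α={alpha:9.2e}  ||θ||={np.linalg.norm(theta):8.4f}  "
          f"||y−Xθ||={np.linalg.norm(y - X @ theta):8.4f}")
```

Plotting ||θ||₂² against ||y − Xθ||₂² over this grid gives the L-curve; its corner is a common heuristic for choosing α.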


Page 18:

Sparse Linear Regression

The prior has a peak at zero and a heavy tail.

Spike-and-slab prior:

p(θ) = λ·N(0, σ₀) + (1 − λ)·N(0, σ₁).

Laplace prior:

p(θ) = (2b)⁻¹ exp( −|θ|/(2b) ),

with joint likelihood (LASSO)

p(y, θ | X, b) ∝ exp( −(1/2) ||y − Xθ||₂² − (1/(2b)) ||θ||₁ ).

Horseshoe prior:

p(θ) = N(0, λ),  p(λ) = C⁺(0, τ),  p(τ) = C⁺(0, 1),

where C⁺ denotes the half-Cauchy distribution.

ARD prior:

p(θ) = ∫ p(θ | α) p(α) dα = St(0, σ, ν),

with two possible hidden-variable formulations.
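The LASSO objective implied by the Laplace prior, ½||y − Xθ||₂² + λ||θ||₁ with λ = 1/(2b), can be minimized by iterative soft-thresholding (ISTA); a sketch, with `lasso_ista` a hypothetical helper name:

```python
import numpy as np

def lasso_ista(X, y, lam, iters=5000):
    """Minimize ½||y − Xθ||₂² + lam·||θ||₁ by iterative soft-thresholding."""
    L = np.linalg.norm(X, 2) ** 2              # Lipschitz constant of the gradient
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        g = theta - X.T @ (X @ theta - y) / L  # gradient step on the quadratic
        theta = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)  # soft-threshold
    return theta
```

Unlike ridge, the soft-threshold sets small coefficients exactly to zero, which is the sparsity the heavy-tailed prior promises.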

Page 19:

Automatic relevance determination (ARD)

Probability model:

p(y, θ | X, α) = p(y | θ, X) p(θ | α)
              = N(Xθ, I) · N(0, diag[α₁, …, αₚ]⁻¹)
              ∝ exp{ −(1/2) ||y − Xθ||₂² − (1/2) ∑ᵢ αᵢθᵢ² }

Introduce a prior

p(αᵢ) = G(δ₀, γ₀),   p(α) = ∏ᵢ p(αᵢ),

and compute

p(α | y, X)

or

p(θ | X, y) = ∫ p(θ, α | X, y) dα.

[Graphical model: γ₀, δ₀ → αᵢ → θᵢ → y, with X and ω, in a plate over i = 1, …, p.]

Page 20:

Variational Bayes for Automatic relevance determination

Probability model:

p(y, θ, α | X) = N(Xθ, I) · N(0, diag[α₁, …, αₚ]⁻¹) · ∏ᵢ G(δ₀, γ₀).

Posterior factors:

p(αᵢ | y, X) = G(δ₀ + 1/2, γᵢ),   γᵢ = γ₀ + (1/2)⟨θᵢ²⟩,

p(θ | y, X) = N(θ̂, Σ_θ),
θ̂ = (XᵀX + diag⟨α⟩)⁻¹Xᵀy,
Σ_θ = (XᵀX + diag⟨α⟩)⁻¹.

Iterated least squares.
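These updates translate directly into a fixed-point loop (a sketch of the VB iteration above; assumes NumPy and unit noise precision, matching N(Xθ, I); `vb_ard` is a hypothetical name):

```python
import numpy as np

def vb_ard(X, y, delta0=1e-8, gamma0=1e-8, iters=500):
    p = X.shape[1]
    e_alpha = np.ones(p)                        # ⟨α⟩, initial guess
    for _ in range(iters):
        Sigma = np.linalg.inv(X.T @ X + np.diag(e_alpha))     # Σ_θ
        theta = Sigma @ X.T @ y                 # θ̂ = (XᵀX + diag⟨α⟩)⁻¹Xᵀy
        e_theta2 = theta**2 + np.diag(Sigma)    # ⟨θᵢ²⟩ = θ̂ᵢ² + (Σ_θ)ᵢᵢ
        e_alpha = (delta0 + 0.5) / (gamma0 + 0.5 * e_theta2)  # Gamma mean = shape/rate
    return theta, Sigma, e_alpha
```

On data like the degenerate polynomial example, ⟨αᵢ⟩ grows for the irrelevant basis functions and shrinks their coefficients toward zero; this selection effect is what the iteration trace on the next page illustrates.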

Page 21:

Variational Bayes for Automatic relevance determination

[Figure: traces of the coefficient estimates θ for the basis functions 1, x, x², x³, x⁴, x⁵ over 500 VB iterations.]

Page 22:

Revisiting toy example:

[Graphical model: μ = 0 and φ → m; m and s = 1 → xᵢ, plate i = 1, 2; hyperparameters α ≈ 0, β ≈ 0 on φ.]

p(xᵢ | m) = N(m, 1),
p(m | φ) = N(0, φ),
p(φ) = iG(γ₀, δ₀).

Solution using:
- Variational Bayes,
- numerical evaluation on a grid.
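The grid evaluation is a few lines (a sketch; it assumes the shape-scale parametrization iG(γ₀, δ₀) ∝ φ^(−γ₀−1)·e^(−δ₀/φ) and borrows the data x₁ = 10, x₂ = 12 from the next slide):

```python
import numpy as np

x = np.array([10.0, 12.0])                     # data from the next slide
gamma0 = delta0 = 1e-8                         # vague iG hyperparameters

m = np.linspace(0.0, 20.0, 400)[:, None]       # grid over m
phi = np.linspace(1e-3, 400.0, 400)[None, :]   # grid over φ

log_post = (
    -0.5 * ((x[0] - m) ** 2 + (x[1] - m) ** 2)      # log p(x₁, x₂ | m), up to consts
    - 0.5 * np.log(phi) - m**2 / (2 * phi)          # log p(m | φ) = log N(0, φ)
    + (-gamma0 - 1) * np.log(phi) - delta0 / phi    # log p(φ) = log iG(γ₀, δ₀)
)
post = np.exp(log_post - log_post.max())            # unnormalized p(m, φ | x₁, x₂)
p_m, p_phi = post.sum(axis=1), post.sum(axis=0)     # grid marginals
```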

Page 23:

Revisiting toy example:

[Figure: joint p(m, φ | x₁ = 10, x₂ = 12) with γ = δ = 10⁻⁸, plus its marginals: the exact p(φ | x₁, x₂) and p(m | x₁, x₂) versus the VB approximations q(φ | x₁, x₂) and q(m | x₁, x₂).]

Page 24:

Revisiting toy example:

[Figure: the same comparison for x₁ = 3, x₂ = 2.7.]

Page 25:

Revisiting toy example:

[Figure: the same comparison for x₁ = 2, x₂ = 2.]

Page 26:

Revisiting toy example:

[Figure: the same comparison for x₁ = 3, x₂ = 0.]

Page 27:

Outliers

[Figure: two panels of data generated from y = 0 + 0·x + 1·x² + 0.05e, one containing an outlier; overlaid OLS fits of 5th and 2nd order.]

- How to minimize the effect of an outlier?
- Outlier detection, robust statistics, etc.
- Hierarchical model?


Page 30:

Outliers: hierarchical model

Probability model:

p(y, θ | X, α) = p(y | θ, X) p(θ | α)
              = N(Xθ, β⁻¹I) · N(0, α⁻¹I).

Is the variance of the noise homogeneous? New model:

p(y, θ | X, α) = p(y | θ, X) p(θ | α)
              = N(Xθ, diag[β₁, …, βₙ]⁻¹) · N(0, α⁻¹I).

Prior:

p(βᵢ) = G(δ, γ),   p(β) = ∏ᵢ p(βᵢ).
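A VB sketch for this heteroscedastic model (an illustration under the assumptions that α is kept fixed and the βᵢ are per-sample noise precisions with Gamma posteriors; `vb_robust_ols` is a hypothetical name):

```python
import numpy as np

def vb_robust_ols(X, y, alpha=1e-4, delta0=1e-8, gamma0=1e-8, iters=100):
    n, p = X.shape
    e_beta = np.ones(n)                         # ⟨βᵢ⟩, per-sample noise precisions
    for _ in range(iters):
        W = X.T * e_beta                        # Xᵀ diag(⟨β⟩)
        Sigma = np.linalg.inv(W @ X + alpha * np.eye(p))
        theta = Sigma @ W @ y                   # weighted ridge estimate
        # ⟨(yᵢ − xᵢᵀθ)²⟩ = residualᵢ² + xᵢᵀ Σ_θ xᵢ
        r2 = (y - X @ theta) ** 2 + np.einsum('ij,jk,ik->i', X, Sigma, X)
        e_beta = (delta0 + 0.5) / (gamma0 + 0.5 * r2)
    return theta, e_beta
```

Samples with large residuals get small ⟨βᵢ⟩ and are effectively down-weighted, which is how the outlier loses its influence in the VB-OLS fit shown on page 33.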


Page 33:

Outliers

[Figure: left, data from y = 0 + 0·x + 1·x² + 0.05e with an outlier and OLS fits of 5th and 2nd order; right, the same data with the VB-OLS fit.]

Page 34:

Outliers, both diagonal α and β

[Figure: left, traces of the coefficients θ for the basis functions 1, x, x², x³ over 100 iterations; right, trace of β over the same iterations.]

- Local minima; the iteration is unstable when both α and β are estimated.

Page 35:

Conclusion

- Linear regression is solved by OLS.
- When the data are not informative, we need to regularize.
- Different prior assumptions yield different results:
  - ridge regression minimizes the magnitude of the coefficients,
  - a sparsity prior minimizes the number of non-zero coefficients.
- Non-Gaussian residuals:
  - Student-t residuals,
  - mixture residuals, etc.

Page 36:

ETEX data challenge

Estimate the source term of the ETEX experiment,

y = Xθ,

where θ is assumed to be sparse, piece-wise linear, and with non-negative elements.

Tasks (points):
- θ using a prior (e.g. ARD): 10
- Outlier detection (e.g. ARD): 10