Transcript of: Machine Learning - Support Vector Machine (SVM)
Source: ml.informatik.uni-freiburg.de/_media/teaching/ss13/ml/sperschneider/svm.pdf
(For a more detailed presentation use Martin Riedmiller's slides: Riedmiller_svm)

Page 1:

Machine Learning

Support Vector Machine (SVM)

Prof. Dr. Volker Sperschneider

AG Maschinelles Lernen und Natürlichsprachliche Systeme

Institut für Informatik Technische Fakultät

Albert-Ludwigs-Universität Freiburg

[email protected]

Page 2:

SVM

I.   Large margin linear separability
II.  Optimization theory
III. Maximum margin classifier at work
IV.  Kernel functions and kernel trick
V.   SVM learnability theory
VI.  Extension to soft margin

Page 3:

SVM

I.  Large margin linear separability

Page 4:

Architecture

Inputs $x_1, \dots, x_n$ with weights $w_1, \dots, w_n$ and bias b:

$$x_1, \dots, x_n, w_1, \dots, w_n, b \in \mathbb{R}$$

$$net(w, b, x) = \sum_{i=1}^{n} w_i x_i + b$$

$$y = \begin{cases} +1 & \text{if } net(w, b, x) \ge 0 \\ -1 & \text{if } net(w, b, x) < 0 \end{cases}$$
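The decision rule above can be written directly in code. A minimal sketch (weights, bias and input are made-up values for illustration, not from the slides):

```python
# net(w, b, x) = sum_{i=1}^{n} w_i x_i + b, classified by the sign of the net input
import numpy as np

def net(w, b, x):
    return np.dot(w, x) + b

def classify(w, b, x):
    return 1 if net(w, b, x) >= 0 else -1

w, b = np.array([0.5, -1.0, 2.0]), 0.3   # hypothetical parameters
x = np.array([1.0, 0.0, -0.5])           # hypothetical input
print(classify(w, b, x))                 # -1, since net = 0.5 - 1.0 + 0.3 = -0.2
```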

Page 5:

Training set

$$(x_1, d_1), (x_2, d_2), \dots, (x_l, d_l), \qquad x_1, \dots, x_l \in \mathbb{R}^n, \qquad d_1, \dots, d_l \in \{-1, +1\}$$

Set of l labelled (classified) vectors. We assume that both positive and negative training vectors are present.

Page 6:

Hyperplanes and Halfspaces

$$H(w, b) = \{\, x \in \mathbb{R}^n \mid w^T x + b = 0 \,\}$$
$$H^+(w, b) = \{\, x \in \mathbb{R}^n \mid w^T x + b > 0 \,\}$$
$$H^-(w, b) = \{\, x \in \mathbb{R}^n \mid w^T x + b < 0 \,\}$$

Page 7:

Linear Separability

(Figure: hyperplane H(w,b) with halfspaces $H^+(w,b)$ and $H^-(w,b)$; the point $-\frac{b}{\|w\|^2}\, w$ lies on the hyperplane, since)

$$w^T\!\left(-\frac{b}{\|w\|^2}\, w\right) + b = -\frac{b\, w^T w}{\|w\|^2} + b = -b + b = 0$$

Page 8:

Distance of an arbitrary vector z to the hyperplane is

$$\frac{|w^T z + b|}{\|w\|}$$

Signed distance (> 0 for vectors in the positive halfspace, < 0 for vectors in the negative halfspace) of an arbitrary vector z to the hyperplane is

$$\frac{w^T z + b}{\|w\|}$$
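Both formulas are easy to evaluate; a small sketch with an assumed hyperplane and test vector:

```python
# (signed) distance of a vector z to the hyperplane H(w, b)
import numpy as np

def signed_distance(w, b, z):
    return (np.dot(w, z) + b) / np.linalg.norm(w)

w, b = np.array([3.0, 4.0]), -5.0        # hypothetical hyperplane 3x + 4y - 5 = 0
z = np.array([2.0, 2.0])
print(signed_distance(w, b, z))          # (6 + 8 - 5) / 5 = 1.8  (positive halfspace)
print(abs(signed_distance(w, b, z)))     # unsigned distance
```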

Page 9:

Take two arbitrary vectors x, y on the hyperplane:

$$w^T x + b = 0 = w^T y + b$$

The difference vector is perpendicular to w:

$$w^T(x - y) = w^T x - w^T y = (-b) - (-b) = 0$$

Thus the distance of the origin to the hyperplane is

$$\frac{|b|}{\|w\|}$$

Page 10:

(Figure: z in the positive halfspace, decomposed as z = u + v with u on the hyperplane, e.g. $u = -\frac{b}{\|w\|^2} w$, and v parallel to w.)

$$w^T z + b = w^T(u + v) + b = (w^T u + b) + w^T v = w^T v = \|w\| \cdot \|v\|$$

Page 11:

(Figure: z in the negative halfspace, decomposed as z = u - v with u on the hyperplane and v parallel to w.)

$$w^T z + b = w^T(u - v) + b = (w^T u + b) - w^T v = -w^T v = -\|w\| \cdot \|v\|$$

Page 12:

Unfavourable separating lines

Page 13:

Favourable separating line

Page 14:

due to large margin

Page 15:

Maximum Margin Separation

Given training set $T = \{(x_1, d_1), (x_2, d_2), \dots, (x_l, d_l)\}$, find a weight vector w and threshold b that maximize the margin of hyperplane H(w,b) w.r.t. T:

$$\mu(w, b) = \min_{k=1,\dots,l} \frac{|w^T x_k + b|}{\|w\|}$$

$$(w^*, b^*) = \arg\max_{w,b} \mu(w, b) = \arg\max_{w,b} \min_{k=1,\dots,l} \frac{|w^T x_k + b|}{\|w\|}$$

Page 16:

Normal form of maximum margin

The double occurrence of w in the definition of the margin, in numerator and denominator, can be avoided. Simply scale w, b with a suitable factor λ > 0. The scaled parameters define the same hyperplane and halfspaces as before. Use scaled w, b such that

$$\min_{k=1,\dots,l} \left| w^T x_k + b \right| = 1$$

Page 17:

Normal form of maximum margin

Constraints after scaling:

$$d_k = +1 \;\Rightarrow\; w^T x_k + b \ge +1 \quad \forall k = 1,\dots,l$$
$$d_k = -1 \;\Rightarrow\; w^T x_k + b \le -1 \quad \forall k = 1,\dots,l$$

Training vectors $x_k$ with $\left| w^T x_k + b \right| = 1$ are called support vectors.

Page 18:

Positive support vectors are separated from negative support vectors by a corridor of width $\frac{2}{\|w\|}$. Exercise: Prove this! Why is the statement not completely trivial? The term above is to be maximized under the normalized constraints. Alternatively, one can minimize $\tfrac{1}{2}\|w\|^2$ under the normalized constraints.

Page 19:

Constraints can be transformed into a uniform format:

$$d_k = +1 \;\Rightarrow\; w^T x_k + b \ge +1, \qquad d_k = -1 \;\Rightarrow\; w^T x_k + b \le -1 \quad \forall k = 1,\dots,l$$

$$d_k = +1 \;\Rightarrow\; -(w^T x_k + b) \le -1, \qquad d_k = -1 \;\Rightarrow\; (w^T x_k + b) \le -1 \quad \forall k = 1,\dots,l$$

$$d_k = +1 \;\Rightarrow\; -(w^T x_k + b) + 1 \le 0, \qquad d_k = -1 \;\Rightarrow\; (w^T x_k + b) + 1 \le 0 \quad \forall k = 1,\dots,l$$

$$-d_k(w^T x_k + b) + 1 \le 0 \quad \forall k = 1,\dots,l$$

Page 20:

Normal form of maximum margin

$$\tfrac{1}{2}\|w\|^2 \to \min$$

under constraints

$$-d_k(w^T x_k + b) + 1 \le 0 \quad \forall k = 1,\dots,l$$

Parameter b has vanished from the function to be minimized. Does this cause a problem? Does it make the optimization senseless? Why can we not simply let the norm of w tend to infinity?
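For a small, linearly separable toy set this quadratic program can be handed to a convex solver as it stands. A minimal sketch, assuming the cvxpy library and made-up data (neither is part of the slides):

```python
# Hard-margin primal problem: minimize 1/2 ||w||^2  s.t.  d_k (w^T x_k + b) >= 1
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])   # assumed training vectors
d = np.array([1.0, 1.0, -1.0, -1.0])                              # labels in {-1, +1}

w = cp.Variable(2)
b = cp.Variable()
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                     [cp.multiply(d, X @ w + b) >= 1])
problem.solve()
print(w.value, b.value)   # maximum-margin separating hyperplane for the toy data
```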

Page 21:

SVM

II.  Optimization theory

Optimization theory is presented here only insofar (and without proofs) as it is required for an understanding of support vector machines. For a more detailed presentation, use Martin Riedmiller's slides: Riedmiller_svm

Page 22:

Convexity makes life easier

A subset $\Omega \subseteq \mathbb{R}^n$ is convex if the following holds:

$$\forall x \in \Omega \; \forall y \in \Omega \; \forall \lambda \in [0,1]: \quad x + \lambda (y - x) \in \Omega$$

(Figure: line segment between x and y with the point $x + \lambda(y - x)$.)

Page 23:

Convexity makes life easier

A function f is convex if the following holds:

$$f(x + \lambda(y - x)) \le f(x) + \lambda (f(y) - f(x)) \quad \forall \lambda \in [0,1]$$

(Figure: graph of f over the segment between x and y with the point $x + \lambda(y - x)$.)

Page 24:

Convexity makes life easier

Consider a convex function on a convex domain: $f : \Omega \to \mathbb{R}$

A local minimum is a vector $x \in \Omega$ such that:

$$\exists r > 0 \; \forall y \, \big(\|y - x\| \le r \wedge y \in \Omega \;\Rightarrow\; f(x) \le f(y)\big)$$

A global minimum is a vector $x \in \Omega$ such that:

$$\forall y \, \big(y \in \Omega \;\Rightarrow\; f(x) \le f(y)\big)$$

Page 25:

Convexity makes life easier

Consider a convex function on a convex domain: $f : \Omega \to \mathbb{R}$

Theorem: Every local minimum is a global minimum.

Proof:
•  Let $x \in \Omega$ be a local minimum and $y \in \Omega$ arbitrary.
•  Choose $\lambda > 0,\ \lambda \le 1$ small enough such that

$$f(x) \le f(x + \lambda(y - x))$$

Page 26:

Convexity makes life easier

Using convexity we conclude:

$$f(x) \le f(x + \lambda(y - x)) \le f(x) + \lambda(f(y) - f(x))$$
$$0 \le \lambda(f(y) - f(x))$$
$$0 \le f(y) - f(x)$$
$$f(x) \le f(y)$$

Page 27:

Convexity makes life easier

Examples of convex functions:
•  linear functions - trivial
•  affine functions (= linear + constant) - trivial
•  square function (1-dimensional) - proof follows
•  sum of convex functions
•  squared Euclidean norm (n-dimensional) - follows from the results above
•  convex function scaled with a positive factor - easy

Page 28:

Convexity makes life easier

The square function is convex. Consider $x \ne y$ and $0 < \lambda < 1$:

$$(x + \lambda(y - x))^2 \le x^2 + \lambda(y^2 - x^2)$$
$$\Leftrightarrow\; x^2 + 2\lambda x(y - x) + \lambda^2 (y - x)^2 \le x^2 + \lambda(y - x)(y + x)$$
$$\Leftrightarrow\; 2\lambda x(y - x) + \lambda^2 (y - x)^2 \le \lambda(y - x)(y + x)$$
$$\Leftrightarrow\; 2 x(y - x) + \lambda (y - x)^2 \le (y - x)(y + x)$$
$$\Leftrightarrow\; \lambda (y - x)^2 \le (y - x)(y + x) - 2x(y - x) = (y - x)^2$$
$$\Leftrightarrow\; \lambda \le 1$$
$$\Leftrightarrow\; \text{true}$$
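The chain of equivalences can also be spot-checked numerically; a tiny sketch with randomly drawn x, y:

```python
# Numerical spot check of convexity of f(t) = t^2:
# (x + lam*(y - x))^2 <= x^2 + lam*(y^2 - x^2) for 0 <= lam <= 1
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.normal(size=2)
for lam in np.linspace(0.0, 1.0, 11):
    lhs = (x + lam * (y - x)) ** 2
    rhs = x ** 2 + lam * (y ** 2 - x ** 2)
    assert lhs <= rhs + 1e-12
print("inequality holds for all sampled lambda")
```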

Page 29:

Minimization under equalities

Differentiable function to be minimized: $f : \Omega \to \mathbb{R}$

Equality constraints: $h_p(x) = 0 \quad \forall p = 1,\dots,l$

Lagrange function:

$$L(x, \alpha_1, \dots, \alpha_l) = f(x) + \sum_{p=1}^{l} \alpha_p h_p(x)$$

Page 30:

Minimization under equalities

A necessary condition for a minimum x under the constraints is the existence of Lagrange multipliers $\alpha_1, \dots, \alpha_l$ with:

$$\nabla_x L(x, \alpha_1, \dots, \alpha_l) = \nabla_x f(x) + \sum_{p=1}^{l} \alpha_p \nabla_x h_p(x) = 0$$

$$h_p(x) = 0 \quad \forall p = 1,\dots,l$$

Under certain conditions this is also sufficient.

Page 31:

Minimization under equalities

In explicit terms:

$$\frac{\partial f}{\partial x_i}(x) + \sum_{p=1}^{l} \alpha_p \frac{\partial h_p}{\partial x_i}(x) = 0$$

$$h_p(x) = 0 \quad \forall p = 1,\dots,l$$

Page 32:

Example 1: max area rectangle

•  Find the rectangle with side lengths x and y, fixed circumference 2x + 2y = c, and maximum area xy.
•  Function to be minimized:

$$f(x, y) = -xy$$

•  Equality constraint:

$$2x + 2y - c = 0$$

Page 33:

Solution by a square

$$-y + 2\alpha = 0, \qquad -x + 2\alpha = 0, \qquad 2x + 2y - c = 0$$

$$x = 2\alpha, \qquad y = 2\alpha, \qquad 8\alpha = c$$

$$x = \frac{c}{4}, \qquad y = \frac{c}{4}, \qquad \alpha = \frac{c}{8}$$
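The stationary point can be checked numerically; a minimal sketch using scipy's constrained minimizer with an assumed circumference c (values are illustrative only):

```python
# Maximize the area x*y under 2x + 2y = c by minimizing f(x, y) = -xy
import numpy as np
from scipy.optimize import minimize

c = 20.0   # assumed circumference, not from the slides

res = minimize(
    lambda v: -v[0] * v[1],                                      # f(x, y) = -xy
    x0=np.array([1.0, 9.0]),                                     # arbitrary starting point
    constraints=[{"type": "eq", "fun": lambda v: 2 * v[0] + 2 * v[1] - c}],
)
print(res.x)   # expected to be close to [c/4, c/4] = [5, 5]
```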

Page 34:

Example 2: Entropy maximization

•  Function to be maximized:

$$f : [0,1]^n \to \mathbb{R}, \qquad f(x_1, \dots, x_n) = -\sum_{k=1}^{n} x_k \log x_k$$

•  Equality constraint:

$$\sum_{k=1}^{n} x_k - 1 = 0$$

Page 35:

Solution

$$L(x_1, \dots, x_n, \alpha) = -\sum_{k=1}^{n} x_k \log x_k + \alpha \left(1 - \sum_{k=1}^{n} x_k\right)$$

$$\frac{\partial L}{\partial x_1}(x_1, \dots, x_n, \alpha) = -\log x_1 - \log e - \alpha = 0$$
$$\vdots$$
$$\frac{\partial L}{\partial x_n}(x_1, \dots, x_n, \alpha) = -\log x_n - \log e - \alpha = 0$$

$$\sum_{k=1}^{n} x_k = 1$$

Page 36:

$$\log x_1 + \log e + \alpha = 0 = \dots = \log x_n + \log e + \alpha \;\Rightarrow\; x_1 = \dots = x_n$$

$$x_1 = \dots = x_n, \qquad \sum_{k=1}^{n} x_k = 1$$

$$\Rightarrow\; x_1 = \dots = x_n = \frac{1}{n}$$

Solution by the uniform probability distribution.

Page 37:

Example 3: Likelihood maximization

•  A random process with k independent possible events is observed, with numbers of occurrences $n_1, \dots, n_k$ for the events.
•  If the probabilities of the events were known to be $p_1, \dots, p_k$,
•  then the likelihood of this probability model under the observations above is defined by the likelihood function (next slide).
•  Equality constraint: the probabilities sum to 1.

Page 38:

•  The likelihood function is to be maximized:

$$f : [0,1]^k \to \mathbb{R}, \qquad f(p_1, \dots, p_k) = L(n_1, \dots, n_k, p_1, \dots, p_k) = \prod_{i=1}^{k} p_i^{n_i}$$

•  Equality constraint:

$$\sum_{i=1}^{k} p_i - 1 = 0$$

Page 39:

Exercise: Show that the empirical relative frequencies give the most likely probability model:

$$p_i = \frac{n_i}{n_1 + \dots + n_k} \quad \forall i = 1,\dots,k$$

The calculations are a little bit more complicated than in the examples before.

Page 40:

Minimization under inequalities

Differentiable function to be minimized: $f : \Omega \to \mathbb{R}$

Inequality constraints: $g_p(x) \le 0 \quad \forall p = 1,\dots,l$

Lagrange function:

$$L(x, \alpha_1, \dots, \alpha_l) = f(x) + \sum_{p=1}^{l} \alpha_p g_p(x)$$

Page 41:

Minimization under inequalities

A necessary condition for a minimum x under inequality constraints is the existence of Lagrange multipliers $\alpha_1, \dots, \alpha_l$ which fulfill the following KKT constraints (Karush, Kuhn, Tucker):

Page 42:

Minimization under inequalities

Karush-Kuhn-Tucker constraints:

$$\alpha_1, \dots, \alpha_l \ge 0$$

$$\nabla_x L(x, \alpha_1, \dots, \alpha_l) = \nabla_x f(x) + \sum_{p=1}^{l} \alpha_p \nabla_x g_p(x) = 0$$

$$g_p(x) \le 0 \quad \forall p = 1,\dots,l$$

$$\alpha_p\, g_p(x) = 0 \quad \forall p = 1,\dots,l$$

Note that there are as many equations as variables.

Page 43:

Duality

Primal problem:

$$f : \Omega \to \mathbb{R}, \quad \Omega \subseteq \mathbb{R}^n, \qquad f(x) \to \min_{x \in \Omega}$$

subject to the requirements

$$g_p(x) \le 0 \quad \forall p = 1,\dots,l$$

Page 44:

Duality

Lagrange function:

$$L(x, \alpha_1, \dots, \alpha_l) = f(x) + \sum_{p=1}^{l} \alpha_p g_p(x), \qquad \alpha_1, \dots, \alpha_l \ge 0$$

More compact:

$$L(x, \alpha) = f(x) + \sum_{p=1}^{l} \alpha_p g_p(x), \qquad \alpha \ge 0$$

The KKT conditions force us to solve equations under inequality constraints. This is often uncomfortable.

Page 45:

Duality

The dual problem ignores all inequality constraints and, for fixed α, minimizes the Lagrange function over x. This defines a lower bound for the primal problem:

$$Q(\alpha) = \inf_{x} L(x, \alpha)$$

Lemma 1: For arbitrary α ≥ 0:

$$Q(\alpha) \le \inf_{x \text{ with } g(x) \le 0} f(x)$$

Page 46:

Duality

Proof: Consider an arbitrary y with

$$g_p(y) \le 0 \quad \forall p = 1,\dots,l$$

Then:

$$Q(\alpha) = \inf_{x} L(x, \alpha) \le L(y, \alpha) = f(y) + \sum_{p=1}^{l} \alpha_p g_p(y) \le f(y)$$

Since this holds for all such y we conclude:

$$Q(\alpha) \le \inf_{x \text{ with } g(x) \le 0} f(x)$$

Page 47:

Duality

Having ignored the requirements is compensated in a second step by taking the greatest lower bound, that is, by maximizing over all Lagrange multipliers:

$$Q(\alpha) \to \max_{\alpha \ge 0}$$

Lemma 2: Provided the infima and the max exist,

$$\max_{\alpha \ge 0} Q(\alpha) \le \inf_{x \text{ with } g(x) \le 0} f(x)$$

Page 48:

Computing max inf L means finding a saddle point of the function L.

Page 49:

Duality

Lemma 3: Assume you found β ≥ 0 and y with

$$g_p(y) \le 0 \quad \forall p = 1,\dots,l, \qquad Q(\beta) = f(y)$$

For short: "Dual value meets primal value." Then

$$Q(\beta) = \max_{\alpha \ge 0} Q(\alpha) = \inf_{x \text{ with } g(x) \le 0} f(x) = f(y) = L(y, \beta)$$

For short: "Optimal dual = optimal primal = solution of KKT has been obtained."

Page 50:

Duality

Proof:

$$g_p(y) \le 0 \quad \forall p = 1,\dots,l$$

was used in Lemma 2 to conclude:

$$\max_{\alpha \ge 0} Q(\alpha) \le \inf_{x \text{ with } g(x) \le 0} f(x)$$

Using $Q(\beta) = f(y)$ we obtain

$$f(y) = Q(\beta) \le \max_{\alpha \ge 0} Q(\alpha) \le \inf_{x \text{ with } g(x) \le 0} f(x) \le f(y)$$

Page 51:

Duality

Proof continued: In the proof of Lemma 1 we showed:

$$Q(\beta) \le L(y, \beta) \le f(y)$$

Thus

$$f(y) = Q(\beta) = L(y, \beta)$$

So far, things were rather simple. The non-trivial part is (without proofs):

Page 52:

Duality

Lemma 4: Under certain conditions (that are fulfilled for the margin maximization problem: quadratic function, linear constraints, compact domains) equality holds:

$$\max_{\alpha \ge 0} Q(\alpha) = \inf_{x \text{ with } g(x) \le 0} f(x)$$

In particular, this means that max and min both exist.

Page 53:

Duality

Thus we have the choice:
•  Solve the primal problem
•  Solve the dual problem
•  Solve the KKT conditions

Page 54:

Minimization under equality and inequality constraints

Exercise:

Combine the formulas for the case of equality constraints and the case of inequality constraints into formulas for the combination of equality and inequality constraints.

Page 55:

Margin maximization: Primal form

$$\tfrac{1}{2}\|w\|^2 \to \min$$

$$-d_p(w^T x_p + b) + 1 \le 0 \quad \forall p = 1,\dots,l$$

Page 56:

Lagrange function

$$L(w, b, \alpha_1, \dots, \alpha_l) = \tfrac{1}{2}\|w\|^2 + \sum_{p=1}^{l} \alpha_p \left(-d_p(w^T x_p + b) + 1\right) = \tfrac{1}{2}\sum_{i=1}^{n} w_i^2 - \sum_{p=1}^{l} \alpha_p \left(d_p(w^T x_p + b) - 1\right)$$

Page 57:

Partial derivatives:

$$\frac{\partial L}{\partial w_i}(w, b, \alpha_1, \dots, \alpha_l) = w_i - \sum_{p=1}^{l} \alpha_p d_p x_{p,i}$$

$$\frac{\partial L}{\partial b}(w, b, \alpha_1, \dots, \alpha_l) = -\sum_{p=1}^{l} \alpha_p d_p$$

Page 58:

Karush-Kuhn-Tucker conditions

$$(1)\quad \alpha_1, \dots, \alpha_l \ge 0$$
$$(2)\quad w = \sum_{p=1}^{l} \alpha_p d_p x_p$$
$$(3)\quad \sum_{p=1}^{l} \alpha_p d_p = 0$$
$$(4)\quad -d_p(w^T x_p + b) + 1 \le 0 \quad \forall p = 1,\dots,l$$
$$(5)\quad \alpha_p \left(d_p(w^T x_p + b) - 1\right) = 0 \quad \forall p = 1,\dots,l$$

Page 59:

Margin maximization: Dual form

Consider the dual function Q and maximize it:

$$Q(\alpha_1, \dots, \alpha_l) = \inf_{w, b} L(w, b, \alpha_1, \dots, \alpha_l)$$

$$Q(\alpha_1, \dots, \alpha_l) \to \max_{\alpha_1, \dots, \alpha_l \ge 0}$$

Page 60:

Lagrange function:

$$L(w, b, \alpha_1, \dots, \alpha_l) = \tfrac{1}{2}\|w\|^2 + \sum_{p=1}^{l} \alpha_p \left(-d_p(w^T x_p + b) + 1\right)$$
$$= \tfrac{1}{2}\sum_{i=1}^{n} w_i^2 - \sum_{p=1}^{l} \alpha_p \left(d_p(w^T x_p + b) - 1\right)$$
$$= \tfrac{1}{2}\|w\|^2 - \sum_{p=1}^{l} \alpha_p d_p\, w^T x_p - b \sum_{p=1}^{l} \alpha_p d_p + \sum_{p=1}^{l} \alpha_p$$

Page 61:

If

$$\sum_{p=1}^{l} \alpha_p d_p \ne 0$$

then, due to the subterm $-b \sum_{p=1}^{l} \alpha_p d_p$ in the function L and the fact that the minimization of L also runs over b, we conclude that

$$\inf_{w, b} L(w, b, \alpha_1, \dots, \alpha_l) = -\infty$$

Thus this case does not participate in the maximization of inf L in the definition of Q. So we may assume:

$$\sum_{p=1}^{l} \alpha_p d_p = 0$$

Page 62:

The Lagrange function reduces to:

$$L(w, b, \alpha_1, \dots, \alpha_l) = \tfrac{1}{2}\|w\|^2 + \sum_{p=1}^{l} \alpha_p \left(-d_p(w^T x_p + b) + 1\right) = \tfrac{1}{2}\|w\|^2 - \sum_{p=1}^{l} \alpha_p d_p\, w^T x_p + \sum_{p=1}^{l} \alpha_p$$

For every fixed α, try to explain why:

$$\inf_{w} L(w, b, \alpha) \ne -\infty$$

Page 63:

Existence of the infimum fixes w by setting the gradient of L w.r.t. w to zero:

$$\nabla_w L(w, b, \alpha_1, \dots, \alpha_l) = w - \sum_{p=1}^{l} \alpha_p d_p x_p = 0$$

Thus:

$$w = \sum_{p=1}^{l} \alpha_p d_p x_p$$

This allows a further simplification of the function Q:

Page 64:

$$Q(\alpha_1, \dots, \alpha_l) = \tfrac{1}{2}\, w^T w - \sum_{p=1}^{l} \alpha_p d_p\, w^T x_p + \sum_{p=1}^{l} \alpha_p$$
$$= \tfrac{1}{2}\, w^T \sum_{p=1}^{l} \alpha_p d_p x_p - \sum_{p=1}^{l} \alpha_p d_p\, w^T x_p + \sum_{p=1}^{l} \alpha_p$$
$$= \tfrac{1}{2} \sum_{p=1}^{l} \alpha_p d_p\, w^T x_p - \sum_{p=1}^{l} \alpha_p d_p\, w^T x_p + \sum_{p=1}^{l} \alpha_p$$
$$= -\tfrac{1}{2} \sum_{p=1}^{l} \alpha_p d_p\, w^T x_p + \sum_{p=1}^{l} \alpha_p$$
$$= -\tfrac{1}{2} \sum_{p=1}^{l} \alpha_p d_p \left(\sum_{q=1}^{l} \alpha_q d_q x_q\right)^T x_p + \sum_{p=1}^{l} \alpha_p$$
$$= -\tfrac{1}{2} \sum_{p=1}^{l} \sum_{q=1}^{l} \alpha_p \alpha_q d_p d_q\, x_p^T x_q + \sum_{p=1}^{l} \alpha_p$$

Page 65:

Summary

$$Q(\alpha_1, \dots, \alpha_l) = -\tfrac{1}{2} \sum_{p=1}^{l} \sum_{q=1}^{l} \alpha_p \alpha_q d_p d_q\, x_p^T x_q + \sum_{p=1}^{l} \alpha_p$$

$$Q(\alpha_1, \dots, \alpha_l) \to \max_{\alpha_1, \dots, \alpha_l \ge 0}$$

Page 66:

The solution of the primal problem is obtained from this:

$$w = \sum_{p=1}^{l} \alpha_p d_p x_p$$

Bias b is determined as follows: Select a non-zero Lagrange multiplier $\alpha_p$ - why must it exist? Now use

$$\alpha_p \left(d_p(w^T x_p + b) - 1\right) = 0$$

and solve for b.
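Putting the last few slides together: a minimal sketch that maximizes the dual function Q(α) for a tiny, assumed 2-D training set with scipy's SLSQP solver, then recovers w and b as described above (data and solver choice are assumptions, not part of the slides):

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])  # assumed training vectors
d = np.array([1.0, 1.0, -1.0, -1.0])                             # labels
l = len(d)

G = (d[:, None] * d[None, :]) * (X @ X.T)        # G_pq = d_p d_q x_p^T x_q

def neg_Q(a):                                    # maximize Q by minimizing -Q
    return 0.5 * a @ G @ a - a.sum()

res = minimize(
    neg_Q,
    x0=np.zeros(l),
    bounds=[(0.0, None)] * l,                               # alpha_p >= 0
    constraints=[{"type": "eq", "fun": lambda a: a @ d}],   # sum_p alpha_p d_p = 0
)
alpha = res.x
w = (alpha * d) @ X                              # w = sum_p alpha_p d_p x_p
p = np.argmax(alpha)                             # index of a support vector (alpha_p > 0)
b = d[p] - w @ X[p]                              # from d_p (w^T x_p + b) = 1
print(w, b)
```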

Page 67:

SVM

III.  Maximum margin classifier at work

Page 68:

Generalization

•  Let a fresh vector z be presented to the net.
•  The inner product with the weight vector w is computed:

$$z^T w = z^T \sum_{p=1}^{l} \alpha_p d_p x_p = \sum_{p=1}^{l} \alpha_p d_p\, (z^T x_p)$$

•  The inner product of z with a support vector $x_p$ measures the similarity between these vectors: parallel vectors give a large inner product, orthogonal vectors give a zero inner product.

Page 69:

Generalization

•  The term $d_p$ gives the inner product the correct sign.
•  The term $\alpha_p$ weights the terms above appropriately (hopefully) to compute the net input.
•  The net input can alternatively be seen as computed by the following architecture, with a hidden linear neuron for each support vector whose weight vector is that support vector.

Page 70:

$$z^T w = z^T \sum_{p=1}^{l} \alpha_p d_p x_p = \sum_{p=1}^{l} \alpha_p d_p\, (z^T x_p) = net(z, w)$$

(Figure: two-layer architecture. Inputs $z_1, \dots, z_n$ feed a hidden linear neuron for each support vector $x_k$, whose weights are the components $x_{k,1}, \dots, x_{k,n}$ of that support vector; the output neuron weights the hidden activations with $\alpha_k d_k$, computes the net input shown above, and applies a step function.)

Page 71:

We obtain a quite natural and intuitive architecture:
•  Among the training vectors the support vectors are determined; only these play a role; they are the most representative among the training vectors.
•  Similarity with each support vector is computed. This is some sort of "case based reasoning": support vectors = cases.
•  Lagrange multipliers weight the similarities.

Page 72:

Main disadvantage - plan for a solution
•  Linear separability is seldom the case.
•  Embedding vectors of the low-dimensional input space into a higher-dimensional feature space may help: create additional features, whatever you think could be relevant to the problem.
•  Keep the additional complexity limited.

Page 73:

SVM

IV.  Kernel functions and kernel trick

Page 74:

Kernel functions

Use extra features that appear to be relevant for the problem solution, formally described as a function from the (lower-dimensional) input space to the (higher-dimensional) feature space:

$$\Phi : \mathbb{R}^n \to \mathbb{R}^N$$

Page 75:

Kernel functions

The originally given learning problem in input space

$$(x_1, d_1), \dots, (x_l, d_l)$$

reads equivalently in feature space as follows:

$$(\Phi(x_1), d_1), \dots, (\Phi(x_l), d_l)$$

Page 76:

Kernel functions

The dual optimization now reads in feature space as follows:

$$Q_\Phi(\alpha_1, \dots, \alpha_l) = -\tfrac{1}{2} \sum_{p=1}^{l} \sum_{q=1}^{l} \alpha_p \alpha_q d_p d_q\, \Phi(x_p)^T \Phi(x_q) + \sum_{p=1}^{l} \alpha_p$$

$$Q_\Phi(\alpha_1, \dots, \alpha_l) \to \max_{\alpha_1, \dots, \alpha_l \ge 0}$$

Page 77:

Kernel functions

Classification of a fresh vector z from the input space proceeds by computing the inner products of the embedded vectors as follows and comparing the result with the threshold:

$$\sum_{p=1}^{l} \alpha_p d_p \left(\Phi(z)^T \Phi(x_p)\right)$$

Page 78:

Kernel functions

For the process of learning (convex optimization) as well as for the process of classifying fresh vectors (generalization), the following operation is central and occurs in high numbers:

$$K(x, y) = \Phi(x)^T \Phi(y)$$

The function K is called a kernel function.

Page 79:

Kernel functions

Computing lots of inner products in feature space may be expensive, since the dimension N might be large compared to n. We should look for means to compute the inner products in input space:

$$K(x, y) = \Phi(x)^T \Phi(y) = k(x^T y) \quad \text{with some function } k : \mathbb{R} \to \mathbb{R}$$

Page 80:

Kernel functions

Example:

$$\Phi : \mathbb{R}^2 \to \mathbb{R}^5, \qquad \Phi(x, y) = (x,\; y,\; x^2,\; y^2,\; \sqrt{2}\, x y)$$

Inner product in feature space:

$$\Phi(x, y)^T \Phi(x', y') = x x' + y y' + x^2 x'^2 + 2 x x' y y' + y^2 y'^2$$
$$= (x, y)^T (x', y') + \left((x, y)^T (x', y')\right)^2 = k\!\left((x, y)^T (x', y')\right) \quad \text{with } k(r) = r + r^2$$
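A quick numerical check of this example kernel, assuming the embedding Φ(x, y) = (x, y, x², y², √2·xy) given above:

```python
# Feature-space inner product equals k(r) = r + r^2 with r = u^T v in input space
import numpy as np

def phi(v):
    x, y = v
    return np.array([x, y, x**2, y**2, np.sqrt(2) * x * y])

rng = np.random.default_rng(0)
u, v = rng.normal(size=2), rng.normal(size=2)

lhs = phi(u) @ phi(v)          # inner product in the 5-dimensional feature space
r = u @ v                      # inner product in the 2-dimensional input space
print(np.isclose(lhs, r + r**2))   # True
```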

Page 81:

Classification:

(Figure: the same two-layer architecture as before, but each hidden neuron now computes a kernel value $K(z, x_1), \dots, K(z, x_k), \dots, K(z, x_l)$ for the input $z = (z_1, \dots, z_n)$; the output neuron weights these with $\alpha_k d_k$ and applies a step function.)

Page 82:

Interpretation:
•  Hidden neurons measure the similarity between the input vector and the training vectors after embedding into feature space.
•  The output neuron uses class labels and Lagrange multipliers to weight the similarities and integrate them into a summarized net input.
•  A particularly natural way to measure the similarity of the input vector with some training vector is by Gaussian bell-shaped functions:

$$K(z, x_p) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{\|z - x_p\|^2}{2\sigma^2}}$$

Page 83:

•  It can be shown that this indeed is a kernel function (Mercer's theorem).
•  Other widely used kernels are polynomial kernels of degree d

$$K(z, x) = (z^T x + 1)^d$$

•  and tanh kernels that mimic the behaviour of MLPs (weight scaling β and threshold θ):

$$K(z, x) = \tanh(\beta\, z^T x - \theta)$$

The latter are kernels only for certain combinations of β and θ.
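Minimal sketches of the Gaussian, polynomial and tanh kernels just mentioned, plus a helper for the Gram matrix K_pq = K(x_p, x_q); the parameter names (sigma, d, beta, theta) are assumptions for illustration:

```python
import numpy as np

def gaussian_kernel(z, x, sigma=1.0):
    # Gaussian bell-shaped similarity, as on the previous slide
    return np.exp(-np.sum((z - x) ** 2) / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def polynomial_kernel(z, x, d=2):
    return (z @ x + 1.0) ** d

def tanh_kernel(z, x, beta=1.0, theta=0.0):
    # a valid kernel only for certain combinations of beta and theta
    return np.tanh(beta * (z @ x) - theta)

def gram_matrix(X, kernel, **params):
    l = len(X)
    return np.array([[kernel(X[p], X[q], **params) for q in range(l)] for p in range(l)])
```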

Page 84:

•  There are lots of further useful kernel functions, some more general, some tailored to specific applications.
•  There are lots of cooking recipes for building fresh kernels out of already constructed ones.

Page 85:

SVM

V.  SVM learnability theory

Page 86:

Generalization ability of the maximum margin classifier

•  Using lots of additional features (remember the embedding into a high-dimensional feature space) usually carries (as for MLPs) the danger of overfitting.
•  Remember the estimates for the VC-dimension of MLPs, of order $w \cdot \log(w)$ or $w^2 n^2$.

Page 87:

•  SVMs do not suffer from this problem: the expected error scales with m/R, where m is the maximum margin of a training set and R the radius of the training vectors, but it does not depend on the dimension of the feature space.
•  A more concrete estimate for the expected error ε is (with m the maximum margin, l the size of the training set, R the radius of the training set, δ the confidence):

$$\varepsilon(l, m, R, \delta) = \frac{2}{l}\left(\frac{64 R^2}{m^2}\, \log\frac{e\, l\, m}{8 R^2}\, \log\frac{32\, l}{m^2} + \log\frac{4}{\delta}\right)$$

Page 88:

A single formula

What confidence would you like to have?
Fix δ > 0, δ < 1 (for example δ = 1%).

What margin do you expect?
m would be nice (for example m = 2).

How many training data l are available?

Page 89:

Under the settings above, what error must be expected under a randomly drawn training set in the worst case?

$$\Pr_{T = ((x_1, d_1), \dots, (x_l, d_l))}\Big(\forall (w, b):\; \mu(w, b) \ge m \;\Rightarrow\; error_{gen}(w, b) - error_{emp,T}(w, b) \le \varepsilon\Big) \ge 1 - \delta$$

with

$$\varepsilon = \frac{2}{l}\left(\frac{64 R^2}{m^2}\, \log\frac{e\, l\, m}{8 R^2}\, \log\frac{32\, l}{m^2} + \log\frac{4}{\delta}\right), \qquad R = \max_{p=1,\dots,l} \|x_p\|$$

Page 90:

Do you see any problem with this estimation? What if you draw a random training set of size l and margin 1 instead of the desired margin 2? Using formula again with m = 1 changes l to l‘. Again draw a random training set of size l‘. Does this game finally come to an end?

Page 91:

•  The dependence on m/R instead of m is obvious: by scaling the training set with a positive factor one could increase the maximum margin m as much as desired without affecting the learning problem.
•  The existence of some difficulties with a sound mathematical formalization of this result should be mentioned: the maximum margin cannot be fixed in advance before randomly drawing a training set. The interested reader should take a look into the literature on structural risk minimization.

Page 92:

SVM

VI.  Soft margin

Page 93:

•  By allowing limited classification error (that is bad), a larger margin becomes possible (that is good).
•  The amount of classification error is measured by slack variables ξ_p, one for each training vector - the error may vary from training vector to training vector.
•  The inequality constraints now read as follows:

Page 94:

$$d_p = +1 \;\Rightarrow\; w^T x_p + b \ge +1 - \xi_p \quad \forall p = 1,\dots,l$$
$$d_p = -1 \;\Rightarrow\; w^T x_p + b \le -1 + \xi_p \quad \forall p = 1,\dots,l$$

$$d_p(w^T x_p + b) \ge 1 - \xi_p \quad \forall p = 1,\dots,l$$

$$-d_p(w^T x_p + b) + 1 - \xi_p \le 0 \quad \forall p = 1,\dots,l$$
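Given a candidate hyperplane, the smallest slack values satisfying the last constraint are ξ_p = max(0, 1 - d_p(w^T x_p + b)). A small sketch with assumed data and parameters:

```python
import numpy as np

X = np.array([[2.0, 2.0], [0.5, 0.4], [0.0, 0.0], [1.2, 1.0]])  # assumed training vectors
d = np.array([1.0, 1.0, -1.0, -1.0])                            # labels
w, b = np.array([1.0, 1.0]), -2.0                               # assumed hyperplane

xi = np.maximum(0.0, 1.0 - d * (X @ w + b))   # slack per training vector
print(xi)   # zero outside the margin corridor, positive for violations
```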

Page 95:

(Figure: a soft-margin separation - a larger margin, with the classification errors measured by slack variables.)

Page 96:

•  In margin maximization, use error values that are as small as possible.
•  This is expressed by the following function with a constant C that is chosen by the user:

$$f(w, \xi_1, \dots, \xi_l) = \tfrac{1}{2}\|w\|^2 + C \sum_{p=1}^{l} \xi_p^2, \qquad \xi_1, \dots, \xi_l \ge 0$$

•  C controls the balance between margin maximization and error tolerance.

Page 97:

Soft margin: Primal problem

$$f(w, \xi_1, \dots, \xi_l) = \tfrac{1}{2}\|w\|^2 + C \sum_{p=1}^{l} \xi_p^2 \;\to\; \min$$

under constraints

$$\xi_1, \dots, \xi_l \ge 0$$
$$-d_p(w^T x_p + b) + 1 - \xi_p \le 0 \quad \forall p = 1,\dots,l$$
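In practice this soft-margin problem is rarely solved by hand. A minimal sketch using scikit-learn's SVC (note that its soft margin penalizes C·Σξ_p, i.e. linear rather than squared slacks as above); data and the value of C are assumed for illustration:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0], [1.4, 1.4]])
y = np.array([1, 1, -1, -1, -1])          # the last point makes the set harder to separate

clf = SVC(kernel="linear", C=10.0)        # larger C = less error tolerance, smaller margin
clf.fit(X, y)
print(clf.support_vectors_)               # the support vectors found
print(clf.predict([[2.5, 2.0]]))          # classify a fresh vector z
```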

Page 98:

Soft margin: Lagrange function

$$L(w, b, \xi_1, \dots, \xi_l, \alpha_1, \dots, \alpha_l, \beta_1, \dots, \beta_l)$$
$$= \tfrac{1}{2}\|w\|^2 + C \sum_{p=1}^{l} \xi_p^2 + \sum_{p=1}^{l} \alpha_p \left(1 - \xi_p - d_p(w^T x_p + b)\right) - \sum_{p=1}^{l} \beta_p \xi_p$$

Page 99:

Soft margin: KKT conditions

$$-d_p(w^T x_p + b) + 1 - \xi_p \le 0 \quad \forall p = 1,\dots,l$$
$$-\xi_p \le 0 \quad \forall p = 1,\dots,l$$
$$\alpha_p \left(1 - \xi_p - d_p(w^T x_p + b)\right) = 0 \quad \forall p = 1,\dots,l$$
$$\beta_p\, \xi_p = 0 \quad \forall p = 1,\dots,l$$
$$\frac{\partial L}{\partial w_i} = w_i - \sum_{p=1}^{l} \alpha_p d_p x_{p,i} = 0 \quad \forall i = 1,\dots,n$$
$$\frac{\partial L}{\partial b} = -\sum_{p=1}^{l} \alpha_p d_p = 0$$
$$\frac{\partial L}{\partial \xi_i} = 2 C \xi_i - \alpha_i - \beta_i = 0 \quad \forall i = 1,\dots,l$$
$$\alpha_i \ge 0, \quad \beta_i \ge 0 \quad \forall i = 1,\dots,l$$

Page 100:

Soft margin: Deriving the solution of the primal problem from KKT

$$w = \sum_{p=1}^{l} \alpha_p d_p x_p, \qquad \xi_p = \frac{\alpha_p + \beta_p}{2C} \quad \forall p = 1,\dots,l$$

Bias b is indirectly derived from an equality constraint for a non-zero $\alpha_i$.

Page 101:

Soft margin: Dual function

$$Q(\alpha_1, \dots, \alpha_l, \beta_1, \dots, \beta_l) = \inf_{w, b, \xi_1, \dots, \xi_l} L(w, b, \xi_1, \dots, \xi_l, \alpha_1, \dots, \alpha_l, \beta_1, \dots, \beta_l)$$

where (as before)

$$w = \sum_{p=1}^{l} \alpha_p d_p x_p, \qquad \xi_p = \frac{\alpha_p + \beta_p}{2C} \quad \forall p = 1,\dots,l, \qquad \sum_{p=1}^{l} \alpha_p d_p = 0 \;\; \text{(w.l.o.g.)}$$

Page 102:

Soft margin: Inserting the values into the dual function

$$Q(\alpha_1, \dots, \alpha_l, \beta_1, \dots, \beta_l) = \tfrac{1}{2}\, w^T w + C \sum_{p=1}^{l} \xi_p^2 + \sum_{p=1}^{l} \alpha_p \left(1 - \xi_p - d_p(w^T x_p + b)\right) - \sum_{p=1}^{l} \beta_p \xi_p$$

Page 103:

Inserting the values into the dual function, continued

$$\tfrac{1}{2}\, w^T w + C \sum_{p=1}^{l} \xi_p^2 + \sum_{p=1}^{l} \alpha_p \left(1 - \xi_p - d_p(w^T x_p + b)\right) - \sum_{p=1}^{l} \beta_p \xi_p$$
$$= \tfrac{1}{2}\, w^T w + C \sum_{p=1}^{l} \xi_p^2 + \sum_{p=1}^{l} \alpha_p - \sum_{p=1}^{l} \alpha_p \xi_p - \sum_{p=1}^{l} \alpha_p d_p\, w^T x_p - b \sum_{p=1}^{l} \alpha_p d_p - \sum_{p=1}^{l} \beta_p \xi_p$$

Page 104:

Inserting the values into the dual function, continued

Using $\sum_{p=1}^{l} \alpha_p d_p = 0$ and $\xi_p = \frac{\alpha_p + \beta_p}{2C}$:

$$= \tfrac{1}{2}\, w^T w + \sum_{p=1}^{l} \frac{(\alpha_p + \beta_p)^2}{4C} + \sum_{p=1}^{l} \alpha_p - \sum_{p=1}^{l} \alpha_p \frac{\alpha_p + \beta_p}{2C} - \sum_{p=1}^{l} \alpha_p d_p\, w^T x_p - \sum_{p=1}^{l} \beta_p \frac{\alpha_p + \beta_p}{2C}$$

Page 105:

Inserting the values into the dual function, continued

Combining the slack terms:

$$= \tfrac{1}{2}\, w^T w - \sum_{p=1}^{l} \alpha_p d_p\, w^T x_p + \sum_{p=1}^{l} \alpha_p + \sum_{p=1}^{l} \frac{(\alpha_p + \beta_p)^2}{4C} - \sum_{p=1}^{l} \frac{(\alpha_p + \beta_p)^2}{2C}$$

Page 106:

Inserting the values into the dual function, continued

Inserting $w = \sum_{q=1}^{l} \alpha_q d_q x_q$:

$$= \tfrac{1}{2} \sum_{p=1}^{l} \sum_{q=1}^{l} \alpha_p \alpha_q d_p d_q\, x_p^T x_q - \sum_{p=1}^{l} \sum_{q=1}^{l} \alpha_p \alpha_q d_p d_q\, x_p^T x_q + \sum_{p=1}^{l} \alpha_p - \sum_{p=1}^{l} \frac{(\alpha_p + \beta_p)^2}{4C}$$

Page 107:

Inserting the values into the dual function, continued

$$= -\tfrac{1}{2} \sum_{p=1}^{l} \sum_{q=1}^{l} \alpha_p \alpha_q d_p d_q\, x_p^T x_q - \sum_{p=1}^{l} \frac{(\alpha_p + \beta_p)^2}{4C} + \sum_{p=1}^{l} \alpha_p$$

Page 108:

Final form of the dual function with constraints

$$Q(\alpha_1, \dots, \alpha_l, \beta_1, \dots, \beta_l) = -\tfrac{1}{2} \sum_{p=1}^{l} \sum_{q=1}^{l} \alpha_p \alpha_q d_p d_q\, x_p^T x_q - \sum_{p=1}^{l} \frac{(\alpha_p + \beta_p)^2}{4C} + \sum_{p=1}^{l} \alpha_p$$

$$\alpha_1, \dots, \alpha_l, \beta_1, \dots, \beta_l \ge 0, \qquad \sum_{p=1}^{l} \alpha_p d_p = 0$$

Page 109:

Wild mixture of kernel functions

linear: $K(x, y) = x^T y + c$

polynomial: $K(x, y) = (\alpha\, x^T y + c)^d$

Gauss: $K(x, y) = e^{-\frac{\|x - y\|^2}{2\sigma^2}}$

exponential: $K(x, y) = e^{-\frac{\|x - y\|}{2\sigma^2}}$

Laplace: $K(x, y) = e^{-\frac{\|x - y\|}{\sigma}}$

Page 110:

Wild mixture of kernel functions

sigmoidal: $K(x, y) = \tanh(\alpha\, x^T y + c)$

Cauchy: $K(x, y) = \frac{1}{1 + \|x - y\|^2}$

quadratic: $K(x, y) = \|x - y\|^2 + c^2$

inverse quadratic: $K(x, y) = \frac{1}{\|x - y\|^2 + c^2}$

Page 111:

Wild mixture of kernel functions

Wave: $K(x, y) = \frac{\theta}{\|x - y\|}\, \sin\frac{\|x - y\|}{\theta}$

Power: $K(x, y) = -\|x - y\|^d$

Log: $K(x, y) = -\log\left(\|x - y\|^d + 1\right)$

Spline: $K(x, y) = \prod_{i=1}^{n} \left(1 + x_i y_i + x_i y_i \min(x_i, y_i) - \frac{x_i + y_i}{2}\min(x_i, y_i)^2 + \frac{\min(x_i, y_i)^3}{3}\right)$

Page 112:

Wild mixture of kernel functions

histogram: $K(x, y) = \sum_{i=1}^{n} \min(x_i, y_i)$

generalized histogram: $K(x, y) = \sum_{i=1}^{n} \min(x_i^{\alpha}, y_i^{\beta})$

T-Student: $K(x, y) = \frac{1}{1 + \|x - y\|^d}$
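Minimal sketches of two of the kernels listed above (parameter names and test values are assumptions for illustration):

```python
import numpy as np

def histogram_kernel(x, y):
    # K(x, y) = sum_i min(x_i, y_i), typically used on non-negative feature vectors
    return np.sum(np.minimum(x, y))

def t_student_kernel(x, y, d=2):
    # K(x, y) = 1 / (1 + ||x - y||^d)
    return 1.0 / (1.0 + np.linalg.norm(x - y) ** d)

x, y = np.array([0.2, 0.5, 0.3]), np.array([0.1, 0.6, 0.3])
print(histogram_kernel(x, y), t_student_kernel(x, y))
```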