Transcript of: Machine Learning - Support Vector Machine (SVM)
Source: ml.informatik.uni-freiburg.de/_media/teaching/ss13/ml/sperschneider/svm.pdf
(For a more detailed presentation use Martin Riedmiller's slides: Riedmiller_svm)

Page 1:

Machine Learning

Support Vector Machine (SVM)

Prof. Dr. Volker Sperschneider

AG Maschinelles Lernen und Natürlichsprachliche Systeme

Institut für Informatik Technische Fakultät

Albert-Ludwigs-Universität Freiburg

[email protected]

Page 2:

SVM

I.   Large margin linear separability
II.  Optimization theory
III. Maximum margin classifier at work
IV.  Kernel functions and kernel trick
V.   SVM learnability theory
VI.  Extension to soft margin

Page 3:

SVM

I.  Large margin linear separability

Page 4:

Architecture

Inputs $x_1, \dots, x_n$ with weights $w_1, \dots, w_n$ and bias b:

$$x_1, \dots, x_n, w_1, \dots, w_n, b \in \mathbb{R}$$

$$net(w, b, x) = \sum_{i=1}^{n} w_i x_i + b$$

$$y = \begin{cases} +1 & \text{if } net(w, b, x) \ge 0 \\ -1 & \text{if } net(w, b, x) < 0 \end{cases}$$
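The decision rule above can be written directly in code. A minimal sketch (weights, bias and input are made-up values for illustration, not from the slides):

```python
# net(w, b, x) = sum_{i=1}^{n} w_i x_i + b, classified by the sign of the net input
import numpy as np

def net(w, b, x):
    return np.dot(w, x) + b

def classify(w, b, x):
    return 1 if net(w, b, x) >= 0 else -1

w, b = np.array([0.5, -1.0, 2.0]), 0.3   # hypothetical parameters
x = np.array([1.0, 0.0, -0.5])           # hypothetical input
print(classify(w, b, x))                 # -1, since net = 0.5 - 1.0 + 0.3 = -0.2
```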

Page 5:

Training set

$$(x_1, d_1), (x_2, d_2), \dots, (x_l, d_l), \qquad x_1, \dots, x_l \in \mathbb{R}^n, \qquad d_1, \dots, d_l \in \{-1, +1\}$$

Set of l labelled (classified) vectors. We assume that both positive and negative training vectors are present.

Page 6:

Hyperplanes and Halfspaces

$$H(w, b) = \{\, x \in \mathbb{R}^n \mid w^T x + b = 0 \,\}$$
$$H^+(w, b) = \{\, x \in \mathbb{R}^n \mid w^T x + b > 0 \,\}$$
$$H^-(w, b) = \{\, x \in \mathbb{R}^n \mid w^T x + b < 0 \,\}$$

Page 7:

Linear Separability

(Figure: hyperplane H(w,b) with halfspaces $H^+(w,b)$ and $H^-(w,b)$; the point $-\frac{b}{\|w\|^2}\, w$ lies on the hyperplane, since)

$$w^T\!\left(-\frac{b}{\|w\|^2}\, w\right) + b = -\frac{b\, w^T w}{\|w\|^2} + b = -b + b = 0$$

Page 8:

Distance of an arbitrary vector z to the hyperplane is

$$\frac{|w^T z + b|}{\|w\|}$$

Signed distance (> 0 for vectors in the positive halfspace, < 0 for vectors in the negative halfspace) of an arbitrary vector z to the hyperplane is

$$\frac{w^T z + b}{\|w\|}$$
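Both formulas are easy to evaluate; a small sketch with an assumed hyperplane and test vector:

```python
# (signed) distance of a vector z to the hyperplane H(w, b)
import numpy as np

def signed_distance(w, b, z):
    return (np.dot(w, z) + b) / np.linalg.norm(w)

w, b = np.array([3.0, 4.0]), -5.0        # hypothetical hyperplane 3x + 4y - 5 = 0
z = np.array([2.0, 2.0])
print(signed_distance(w, b, z))          # (6 + 8 - 5) / 5 = 1.8  (positive halfspace)
print(abs(signed_distance(w, b, z)))     # unsigned distance
```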

Page 9:

Take two arbitrary vectors x, y on the hyperplane:

$$w^T x + b = 0 = w^T y + b$$

The difference vector is perpendicular to w:

$$w^T(x - y) = w^T x - w^T y = (-b) - (-b) = 0$$

Thus the distance of the origin to the hyperplane is

$$\frac{|b|}{\|w\|}$$

Page 10:

(Figure: z in the positive halfspace, decomposed as z = u + v with u on the hyperplane, e.g. $u = -\frac{b}{\|w\|^2} w$, and v parallel to w.)

$$w^T z + b = w^T(u + v) + b = (w^T u + b) + w^T v = w^T v = \|w\| \cdot \|v\|$$

Page 11:

(Figure: z in the negative halfspace, decomposed as z = u - v with u on the hyperplane and v parallel to w.)

$$w^T z + b = w^T(u - v) + b = (w^T u + b) - w^T v = -w^T v = -\|w\| \cdot \|v\|$$

Page 12:

Unfavourable separating lines

Page 13:

Favourable separating line

Page 14:

due to large margin

Page 15:

Maximum Margin Separation

Given training set $T = \{(x_1, d_1), (x_2, d_2), \dots, (x_l, d_l)\}$, find a weight vector w and threshold b that maximize the margin of hyperplane H(w,b) w.r.t. T:

$$\mu(w, b) = \min_{k=1,\dots,l} \frac{|w^T x_k + b|}{\|w\|}$$

$$(w^*, b^*) = \arg\max_{w,b} \mu(w, b) = \arg\max_{w,b} \min_{k=1,\dots,l} \frac{|w^T x_k + b|}{\|w\|}$$

Page 16:

Normal form of maximum margin

The double occurrence of w in the definition of the margin, in numerator and denominator, can be avoided. Simply scale w, b with a suitable factor λ > 0. The scaled parameters define the same hyperplane and halfspaces as before. Use scaled w, b such that

$$\min_{k=1,\dots,l} \left| w^T x_k + b \right| = 1$$

Page 17:

Normal form of maximum margin

Constraints after scaling:

$$d_k = +1 \;\Rightarrow\; w^T x_k + b \ge +1 \quad \forall k = 1,\dots,l$$
$$d_k = -1 \;\Rightarrow\; w^T x_k + b \le -1 \quad \forall k = 1,\dots,l$$

Training vectors $x_k$ with $\left| w^T x_k + b \right| = 1$ are called support vectors.

Page 18:

Positive support vectors are separated from negative support vectors by a corridor of width $\frac{2}{\|w\|}$. Exercise: Prove this! Why is the statement not completely trivial? The term above is to be maximized under the normalized constraints. Alternatively, one can minimize $\tfrac{1}{2}\|w\|^2$ under the normalized constraints.

Page 19:

Constraints can be transformed into a uniform format:

$$d_k = +1 \;\Rightarrow\; w^T x_k + b \ge +1, \qquad d_k = -1 \;\Rightarrow\; w^T x_k + b \le -1 \quad \forall k = 1,\dots,l$$

$$d_k = +1 \;\Rightarrow\; -(w^T x_k + b) \le -1, \qquad d_k = -1 \;\Rightarrow\; (w^T x_k + b) \le -1 \quad \forall k = 1,\dots,l$$

$$d_k = +1 \;\Rightarrow\; -(w^T x_k + b) + 1 \le 0, \qquad d_k = -1 \;\Rightarrow\; (w^T x_k + b) + 1 \le 0 \quad \forall k = 1,\dots,l$$

$$-d_k(w^T x_k + b) + 1 \le 0 \quad \forall k = 1,\dots,l$$

Page 20:

Normal form of maximum margin

$$\tfrac{1}{2}\|w\|^2 \to \min$$

under constraints

$$-d_k(w^T x_k + b) + 1 \le 0 \quad \forall k = 1,\dots,l$$

Parameter b has vanished from the function to be minimized. Does this cause a problem? Does it make the optimization senseless? Why can we not simply let the norm of w tend to infinity?
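For a small, linearly separable toy set this quadratic program can be handed to a convex solver as it stands. A minimal sketch, assuming the cvxpy library and made-up data (neither is part of the slides):

```python
# Hard-margin primal problem: minimize 1/2 ||w||^2  s.t.  d_k (w^T x_k + b) >= 1
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])   # assumed training vectors
d = np.array([1.0, 1.0, -1.0, -1.0])                              # labels in {-1, +1}

w = cp.Variable(2)
b = cp.Variable()
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                     [cp.multiply(d, X @ w + b) >= 1])
problem.solve()
print(w.value, b.value)   # maximum-margin separating hyperplane for the toy data
```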

Page 21:

SVM

II.  Optimization theory

Optimization theory is presented here only insofar (and without proofs) as it is required for an understanding of support vector machines. For a more detailed presentation, use Martin Riedmiller's slides: Riedmiller_svm

Page 22:

Convexity makes life easier

A subset $\Omega \subseteq \mathbb{R}^n$ is convex if the following holds:

$$\forall x \in \Omega \; \forall y \in \Omega \; \forall \lambda \in [0,1]: \quad x + \lambda (y - x) \in \Omega$$

(Figure: line segment between x and y with the point $x + \lambda(y - x)$.)

Page 23:

Convexity makes life easier

A function f is convex if the following holds:

$$f(x + \lambda(y - x)) \le f(x) + \lambda (f(y) - f(x)) \quad \forall \lambda \in [0,1]$$

(Figure: graph of f over the segment between x and y with the point $x + \lambda(y - x)$.)

Page 24:

Convexity makes life easier

Consider a convex function on a convex domain: $f : \Omega \to \mathbb{R}$

A local minimum is a vector $x \in \Omega$ such that:

$$\exists r > 0 \; \forall y \, \big(\|y - x\| \le r \wedge y \in \Omega \;\Rightarrow\; f(x) \le f(y)\big)$$

A global minimum is a vector $x \in \Omega$ such that:

$$\forall y \, \big(y \in \Omega \;\Rightarrow\; f(x) \le f(y)\big)$$

Page 25:

Convexity makes life easier

Consider a convex function on a convex domain: $f : \Omega \to \mathbb{R}$

Theorem: Every local minimum is a global minimum.

Proof:
•  Let $x \in \Omega$ be a local minimum and $y \in \Omega$ arbitrary.
•  Choose $\lambda > 0,\ \lambda \le 1$ small enough such that

$$f(x) \le f(x + \lambda(y - x))$$

Page 26:

Convexity makes life easier

Using convexity we conclude:

$$f(x) \le f(x + \lambda(y - x)) \le f(x) + \lambda(f(y) - f(x))$$
$$0 \le \lambda(f(y) - f(x))$$
$$0 \le f(y) - f(x)$$
$$f(x) \le f(y)$$

Page 27:

Convexity makes life easier

Examples of convex functions:
•  linear functions - trivial
•  affine functions (= linear + constant) - trivial
•  square function (1-dimensional) - proof follows
•  sum of convex functions
•  squared Euclidean norm (n-dimensional) - follows from the results above
•  convex function scaled with a positive factor - easy

Page 28:

Convexity makes life easier

The square function is convex. Consider $x \ne y$ and $0 < \lambda < 1$:

$$(x + \lambda(y - x))^2 \le x^2 + \lambda(y^2 - x^2)$$
$$\Leftrightarrow\; x^2 + 2\lambda x(y - x) + \lambda^2 (y - x)^2 \le x^2 + \lambda(y - x)(y + x)$$
$$\Leftrightarrow\; 2\lambda x(y - x) + \lambda^2 (y - x)^2 \le \lambda(y - x)(y + x)$$
$$\Leftrightarrow\; 2 x(y - x) + \lambda (y - x)^2 \le (y - x)(y + x)$$
$$\Leftrightarrow\; \lambda (y - x)^2 \le (y - x)(y + x) - 2x(y - x) = (y - x)^2$$
$$\Leftrightarrow\; \lambda \le 1$$
$$\Leftrightarrow\; \text{true}$$
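The chain of equivalences can also be spot-checked numerically; a tiny sketch with randomly drawn x, y:

```python
# Numerical spot check of convexity of f(t) = t^2:
# (x + lam*(y - x))^2 <= x^2 + lam*(y^2 - x^2) for 0 <= lam <= 1
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.normal(size=2)
for lam in np.linspace(0.0, 1.0, 11):
    lhs = (x + lam * (y - x)) ** 2
    rhs = x ** 2 + lam * (y ** 2 - x ** 2)
    assert lhs <= rhs + 1e-12
print("inequality holds for all sampled lambda")
```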

Page 29:

Minimization under equalities

Differentiable function to be minimized: $f : \Omega \to \mathbb{R}$

Equality constraints: $h_p(x) = 0 \quad \forall p = 1,\dots,l$

Lagrange function:

$$L(x, \alpha_1, \dots, \alpha_l) = f(x) + \sum_{p=1}^{l} \alpha_p h_p(x)$$

Page 30:

Minimization under equalities

A necessary condition for a minimum x under the constraints is the existence of Lagrange multipliers $\alpha_1, \dots, \alpha_l$ with:

$$\nabla_x L(x, \alpha_1, \dots, \alpha_l) = \nabla_x f(x) + \sum_{p=1}^{l} \alpha_p \nabla_x h_p(x) = 0$$

$$h_p(x) = 0 \quad \forall p = 1,\dots,l$$

Under certain conditions this is also sufficient.

Page 31:

Minimization under equalities

In explicit terms:

$$\frac{\partial f}{\partial x_i}(x) + \sum_{p=1}^{l} \alpha_p \frac{\partial h_p}{\partial x_i}(x) = 0$$

$$h_p(x) = 0 \quad \forall p = 1,\dots,l$$

Page 32:

Example 1: max area rectangle

•  Find the rectangle with side lengths x and y, fixed circumference 2x + 2y = c, and maximum area xy.
•  Function to be minimized:

$$f(x, y) = -xy$$

•  Equality constraint:

$$2x + 2y - c = 0$$

Page 33:

Solution by a square

$$-y + 2\alpha = 0, \qquad -x + 2\alpha = 0, \qquad 2x + 2y - c = 0$$

$$x = 2\alpha, \qquad y = 2\alpha, \qquad 8\alpha = c$$

$$x = \frac{c}{4}, \qquad y = \frac{c}{4}, \qquad \alpha = \frac{c}{8}$$
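The stationary point can be checked numerically; a minimal sketch using scipy's constrained minimizer with an assumed circumference c (values are illustrative only):

```python
# Maximize the area x*y under 2x + 2y = c by minimizing f(x, y) = -xy
import numpy as np
from scipy.optimize import minimize

c = 20.0   # assumed circumference, not from the slides

res = minimize(
    lambda v: -v[0] * v[1],                                      # f(x, y) = -xy
    x0=np.array([1.0, 9.0]),                                     # arbitrary starting point
    constraints=[{"type": "eq", "fun": lambda v: 2 * v[0] + 2 * v[1] - c}],
)
print(res.x)   # expected to be close to [c/4, c/4] = [5, 5]
```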

Page 34:

Example 2: Entropy maximization

•  Function to be maximized:

$$f : [0,1]^n \to \mathbb{R}, \qquad f(x_1, \dots, x_n) = -\sum_{k=1}^{n} x_k \log x_k$$

•  Equality constraint:

$$\sum_{k=1}^{n} x_k - 1 = 0$$

Page 35:

Solution

$$L(x_1, \dots, x_n, \alpha) = -\sum_{k=1}^{n} x_k \log x_k + \alpha \left(1 - \sum_{k=1}^{n} x_k\right)$$

$$\frac{\partial L}{\partial x_1}(x_1, \dots, x_n, \alpha) = -\log x_1 - \log e - \alpha = 0$$
$$\vdots$$
$$\frac{\partial L}{\partial x_n}(x_1, \dots, x_n, \alpha) = -\log x_n - \log e - \alpha = 0$$

$$\sum_{k=1}^{n} x_k = 1$$

Page 36:

$$\log x_1 + \log e + \alpha = 0 = \dots = \log x_n + \log e + \alpha \;\Rightarrow\; x_1 = \dots = x_n$$

$$x_1 = \dots = x_n, \qquad \sum_{k=1}^{n} x_k = 1$$

$$\Rightarrow\; x_1 = \dots = x_n = \frac{1}{n}$$

Solution by the uniform probability distribution.

Page 37:

Example 3: Likelihood maximization

•  A random process with k independent possible events is observed, with numbers of occurrences $n_1, \dots, n_k$ for the events.
•  If the probabilities of the events were known to be $p_1, \dots, p_k$,
•  then the likelihood of this probability model under the observations above is defined by the likelihood function (next slide).
•  Equality constraint: the probabilities sum to 1.

Page 38:

•  The likelihood function is to be maximized:

$$f : [0,1]^k \to \mathbb{R}, \qquad f(p_1, \dots, p_k) = L(n_1, \dots, n_k, p_1, \dots, p_k) = \prod_{i=1}^{k} p_i^{n_i}$$

•  Equality constraint:

$$\sum_{i=1}^{k} p_i - 1 = 0$$

Page 39:

Exercise: Show that the empirical relative frequencies give the most likely probability model:

$$p_i = \frac{n_i}{n_1 + \dots + n_k} \quad \forall i = 1,\dots,k$$

The calculations are a little bit more complicated than in the examples before.

Page 40:

Minimization under inequalities

Differentiable function to be minimized: $f : \Omega \to \mathbb{R}$

Inequality constraints: $g_p(x) \le 0 \quad \forall p = 1,\dots,l$

Lagrange function:

$$L(x, \alpha_1, \dots, \alpha_l) = f(x) + \sum_{p=1}^{l} \alpha_p g_p(x)$$

Page 41:

Minimization under inequalities

A necessary condition for a minimum x under inequality constraints is the existence of Lagrange multipliers $\alpha_1, \dots, \alpha_l$ which fulfill the following KKT constraints (Karush, Kuhn, Tucker):

Page 42:

Minimization under inequalities

Karush-Kuhn-Tucker constraints:

$$\alpha_1, \dots, \alpha_l \ge 0$$

$$\nabla_x L(x, \alpha_1, \dots, \alpha_l) = \nabla_x f(x) + \sum_{p=1}^{l} \alpha_p \nabla_x g_p(x) = 0$$

$$g_p(x) \le 0 \quad \forall p = 1,\dots,l$$

$$\alpha_p\, g_p(x) = 0 \quad \forall p = 1,\dots,l$$

Note that there are as many equations as variables.

Page 43:

Duality

Primal problem:

$$f : \Omega \to \mathbb{R}, \quad \Omega \subseteq \mathbb{R}^n, \qquad f(x) \to \min_{x \in \Omega}$$

subject to the requirements

$$g_p(x) \le 0 \quad \forall p = 1,\dots,l$$

Page 44:

Duality

Lagrange function:

$$L(x, \alpha_1, \dots, \alpha_l) = f(x) + \sum_{p=1}^{l} \alpha_p g_p(x), \qquad \alpha_1, \dots, \alpha_l \ge 0$$

More compact:

$$L(x, \alpha) = f(x) + \sum_{p=1}^{l} \alpha_p g_p(x), \qquad \alpha \ge 0$$

The KKT conditions force us to solve equations under inequality constraints. This is often uncomfortable.

Page 45:

Duality

The dual problem ignores all inequality constraints and, for fixed α, minimizes the Lagrange function over x. This defines a lower bound for the primal problem:

$$Q(\alpha) = \inf_{x} L(x, \alpha)$$

Lemma 1: For arbitrary α ≥ 0:

$$Q(\alpha) \le \inf_{x \text{ with } g(x) \le 0} f(x)$$

Page 46:

Duality

Proof: Consider an arbitrary y with

$$g_p(y) \le 0 \quad \forall p = 1,\dots,l$$

Then:

$$Q(\alpha) = \inf_{x} L(x, \alpha) \le L(y, \alpha) = f(y) + \sum_{p=1}^{l} \alpha_p g_p(y) \le f(y)$$

Since this holds for all such y we conclude:

$$Q(\alpha) \le \inf_{x \text{ with } g(x) \le 0} f(x)$$

Page 47:

Duality

Having ignored the requirements is compensated in a second step by taking the greatest lower bound, that is, by maximizing over all Lagrange multipliers:

$$Q(\alpha) \to \max_{\alpha \ge 0}$$

Lemma 2: Provided the infima and the max exist,

$$\max_{\alpha \ge 0} Q(\alpha) \le \inf_{x \text{ with } g(x) \le 0} f(x)$$

Page 48:

Computing max inf L means finding a saddle point of the function L.

Page 49:

Duality

Lemma 3: Assume you found β ≥ 0 and y with

$$g_p(y) \le 0 \quad \forall p = 1,\dots,l, \qquad Q(\beta) = f(y)$$

For short: "Dual value meets primal value." Then

$$Q(\beta) = \max_{\alpha \ge 0} Q(\alpha) = \inf_{x \text{ with } g(x) \le 0} f(x) = f(y) = L(y, \beta)$$

For short: "Optimal dual = optimal primal = solution of KKT has been obtained."

Page 50:

Duality

Proof:

$$g_p(y) \le 0 \quad \forall p = 1,\dots,l$$

was used in Lemma 2 to conclude:

$$\max_{\alpha \ge 0} Q(\alpha) \le \inf_{x \text{ with } g(x) \le 0} f(x)$$

Using $Q(\beta) = f(y)$ we obtain

$$f(y) = Q(\beta) \le \max_{\alpha \ge 0} Q(\alpha) \le \inf_{x \text{ with } g(x) \le 0} f(x) \le f(y)$$

Page 51:

Duality

Proof continued: In the proof of Lemma 1 we showed:

$$Q(\beta) \le L(y, \beta) \le f(y)$$

Thus

$$f(y) = Q(\beta) = L(y, \beta)$$

So far, things were rather simple. The non-trivial part is (without proofs):

Page 52:

Duality

Lemma 4: Under certain conditions (that are fulfilled for the margin maximization problem: quadratic function, linear constraints, compact domains) equality holds:

$$\max_{\alpha \ge 0} Q(\alpha) = \inf_{x \text{ with } g(x) \le 0} f(x)$$

In particular, this means that max and min both exist.

Page 53:

Duality

Thus we have the choice:
•  Solve the primal problem
•  Solve the dual problem
•  Solve the KKT conditions

Page 54:

Minimization under equality and inequality constraints

Exercise:

Combine the formulas for the case of equality constraints and the case of inequality constraints into formulas for the combination of equality and inequality constraints.

Page 55:

Margin maximization: Primal form

$$\tfrac{1}{2}\|w\|^2 \to \min$$

$$-d_p(w^T x_p + b) + 1 \le 0 \quad \forall p = 1,\dots,l$$

Page 56:

Lagrange function

$$L(w, b, \alpha_1, \dots, \alpha_l) = \tfrac{1}{2}\|w\|^2 + \sum_{p=1}^{l} \alpha_p \left(-d_p(w^T x_p + b) + 1\right) = \tfrac{1}{2}\sum_{i=1}^{n} w_i^2 - \sum_{p=1}^{l} \alpha_p \left(d_p(w^T x_p + b) - 1\right)$$

Page 57:

Partial derivatives:

$$\frac{\partial L}{\partial w_i}(w, b, \alpha_1, \dots, \alpha_l) = w_i - \sum_{p=1}^{l} \alpha_p d_p x_{p,i}$$

$$\frac{\partial L}{\partial b}(w, b, \alpha_1, \dots, \alpha_l) = -\sum_{p=1}^{l} \alpha_p d_p$$

Page 58:

Karush-Kuhn-Tucker conditions

$$(1)\quad \alpha_1, \dots, \alpha_l \ge 0$$
$$(2)\quad w = \sum_{p=1}^{l} \alpha_p d_p x_p$$
$$(3)\quad \sum_{p=1}^{l} \alpha_p d_p = 0$$
$$(4)\quad -d_p(w^T x_p + b) + 1 \le 0 \quad \forall p = 1,\dots,l$$
$$(5)\quad \alpha_p \left(d_p(w^T x_p + b) - 1\right) = 0 \quad \forall p = 1,\dots,l$$

Page 59:

Margin maximization: Dual form

Consider the dual function Q and maximize it:

$$Q(\alpha_1, \dots, \alpha_l) = \inf_{w, b} L(w, b, \alpha_1, \dots, \alpha_l)$$

$$Q(\alpha_1, \dots, \alpha_l) \to \max_{\alpha_1, \dots, \alpha_l \ge 0}$$

Page 60:

Lagrange function:

$$L(w, b, \alpha_1, \dots, \alpha_l) = \tfrac{1}{2}\|w\|^2 + \sum_{p=1}^{l} \alpha_p \left(-d_p(w^T x_p + b) + 1\right)$$
$$= \tfrac{1}{2}\sum_{i=1}^{n} w_i^2 - \sum_{p=1}^{l} \alpha_p \left(d_p(w^T x_p + b) - 1\right)$$
$$= \tfrac{1}{2}\|w\|^2 - \sum_{p=1}^{l} \alpha_p d_p\, w^T x_p - b \sum_{p=1}^{l} \alpha_p d_p + \sum_{p=1}^{l} \alpha_p$$

Page 61:

If

$$\sum_{p=1}^{l} \alpha_p d_p \ne 0$$

then, due to the subterm $-b \sum_{p=1}^{l} \alpha_p d_p$ in the function L and the fact that the minimization of L also runs over b, we conclude that

$$\inf_{w, b} L(w, b, \alpha_1, \dots, \alpha_l) = -\infty$$

Thus this case does not participate in the maximization of inf L in the definition of Q. So we may assume:

$$\sum_{p=1}^{l} \alpha_p d_p = 0$$

Page 62:

The Lagrange function reduces to:

$$L(w, b, \alpha_1, \dots, \alpha_l) = \tfrac{1}{2}\|w\|^2 + \sum_{p=1}^{l} \alpha_p \left(-d_p(w^T x_p + b) + 1\right) = \tfrac{1}{2}\|w\|^2 - \sum_{p=1}^{l} \alpha_p d_p\, w^T x_p + \sum_{p=1}^{l} \alpha_p$$

For every fixed α, try to explain why:

$$\inf_{w} L(w, b, \alpha) \ne -\infty$$

Page 63:

Existence of the infimum fixes w by setting the gradient of L w.r.t. w to zero:

$$\nabla_w L(w, b, \alpha_1, \dots, \alpha_l) = w - \sum_{p=1}^{l} \alpha_p d_p x_p = 0$$

Thus:

$$w = \sum_{p=1}^{l} \alpha_p d_p x_p$$

This allows a further simplification of the function Q:

Page 64:

$$Q(\alpha_1, \dots, \alpha_l) = \tfrac{1}{2}\, w^T w - \sum_{p=1}^{l} \alpha_p d_p\, w^T x_p + \sum_{p=1}^{l} \alpha_p$$
$$= \tfrac{1}{2}\, w^T \sum_{p=1}^{l} \alpha_p d_p x_p - \sum_{p=1}^{l} \alpha_p d_p\, w^T x_p + \sum_{p=1}^{l} \alpha_p$$
$$= \tfrac{1}{2} \sum_{p=1}^{l} \alpha_p d_p\, w^T x_p - \sum_{p=1}^{l} \alpha_p d_p\, w^T x_p + \sum_{p=1}^{l} \alpha_p$$
$$= -\tfrac{1}{2} \sum_{p=1}^{l} \alpha_p d_p\, w^T x_p + \sum_{p=1}^{l} \alpha_p$$
$$= -\tfrac{1}{2} \sum_{p=1}^{l} \alpha_p d_p \left(\sum_{q=1}^{l} \alpha_q d_q x_q\right)^T x_p + \sum_{p=1}^{l} \alpha_p$$
$$= -\tfrac{1}{2} \sum_{p=1}^{l} \sum_{q=1}^{l} \alpha_p \alpha_q d_p d_q\, x_p^T x_q + \sum_{p=1}^{l} \alpha_p$$

Page 65:

Summary

$$Q(\alpha_1, \dots, \alpha_l) = -\tfrac{1}{2} \sum_{p=1}^{l} \sum_{q=1}^{l} \alpha_p \alpha_q d_p d_q\, x_p^T x_q + \sum_{p=1}^{l} \alpha_p$$

$$Q(\alpha_1, \dots, \alpha_l) \to \max_{\alpha_1, \dots, \alpha_l \ge 0}$$

Page 66:

The solution of the primal problem is obtained from this:

$$w = \sum_{p=1}^{l} \alpha_p d_p x_p$$

Bias b is determined as follows: Select a non-zero Lagrange multiplier $\alpha_p$ - why must it exist? Now use

$$\alpha_p \left(d_p(w^T x_p + b) - 1\right) = 0$$

and solve for b.
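Putting the last few slides together: a minimal sketch that maximizes the dual function Q(α) for a tiny, assumed 2-D training set with scipy's SLSQP solver, then recovers w and b as described above (data and solver choice are assumptions, not part of the slides):

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])  # assumed training vectors
d = np.array([1.0, 1.0, -1.0, -1.0])                             # labels
l = len(d)

G = (d[:, None] * d[None, :]) * (X @ X.T)        # G_pq = d_p d_q x_p^T x_q

def neg_Q(a):                                    # maximize Q by minimizing -Q
    return 0.5 * a @ G @ a - a.sum()

res = minimize(
    neg_Q,
    x0=np.zeros(l),
    bounds=[(0.0, None)] * l,                               # alpha_p >= 0
    constraints=[{"type": "eq", "fun": lambda a: a @ d}],   # sum_p alpha_p d_p = 0
)
alpha = res.x
w = (alpha * d) @ X                              # w = sum_p alpha_p d_p x_p
p = np.argmax(alpha)                             # index of a support vector (alpha_p > 0)
b = d[p] - w @ X[p]                              # from d_p (w^T x_p + b) = 1
print(w, b)
```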

Page 67:

SVM

III.  Maximum margin classifier at work

Page 68:

Generalization

•  Let a fresh vector z be presented to the net.
•  The inner product with the weight vector w is computed:

$$z^T w = z^T \sum_{p=1}^{l} \alpha_p d_p x_p = \sum_{p=1}^{l} \alpha_p d_p\, (z^T x_p)$$

•  The inner product of z with a support vector $x_p$ measures the similarity between these vectors: parallel vectors give a large inner product, orthogonal vectors give a zero inner product.

Page 69:

Generalization

•  The term $d_p$ gives the inner product the correct sign.
•  The term $\alpha_p$ weights the terms above appropriately (hopefully) to compute the net input.
•  The net input can alternatively be seen as computed by the following architecture, with a hidden linear neuron for each support vector whose weight vector is that support vector.

Page 70:

$$z^T w = z^T \sum_{p=1}^{l} \alpha_p d_p x_p = \sum_{p=1}^{l} \alpha_p d_p\, (z^T x_p) = net(z, w)$$

(Figure: two-layer architecture. Inputs $z_1, \dots, z_n$ feed a hidden linear neuron for each support vector $x_k$, whose weights are the components $x_{k,1}, \dots, x_{k,n}$ of that support vector; the output neuron weights the hidden activations with $\alpha_k d_k$, computes the net input shown above, and applies a step function.)

Page 71:

We obtain a quite natural and intuitive architecture:
•  Among the training vectors the support vectors are determined; only these play a role; they are the most representative among the training vectors.
•  Similarity with each support vector is computed. This is some sort of "case based reasoning": support vectors = cases.
•  Lagrange multipliers weight the similarities.

Page 72:

Main disadvantage - plan for a solution
•  Linear separability is seldom the case.
•  Embedding vectors of the low-dimensional input space into a higher-dimensional feature space may help: create additional features, whatever you think could be relevant to the problem.
•  Keep the additional complexity limited.

Page 73:

SVM

IV.  Kernel functions and kernel trick

Page 74:

Kernel functions

Use extra features that appear to be relevant for the problem solution, formally described as a function from the (lower-dimensional) input space to the (higher-dimensional) feature space:

$$\Phi : \mathbb{R}^n \to \mathbb{R}^N$$

Page 75:

Kernel functions

The originally given learning problem in input space

$$(x_1, d_1), \dots, (x_l, d_l)$$

reads equivalently in feature space as follows:

$$(\Phi(x_1), d_1), \dots, (\Phi(x_l), d_l)$$

Page 76:

Kernel functions

The dual optimization now reads in feature space as follows:

$$Q_\Phi(\alpha_1, \dots, \alpha_l) = -\tfrac{1}{2} \sum_{p=1}^{l} \sum_{q=1}^{l} \alpha_p \alpha_q d_p d_q\, \Phi(x_p)^T \Phi(x_q) + \sum_{p=1}^{l} \alpha_p$$

$$Q_\Phi(\alpha_1, \dots, \alpha_l) \to \max_{\alpha_1, \dots, \alpha_l \ge 0}$$

Page 77:

Kernel functions

Classification of a fresh vector z from the input space proceeds by computing the inner products of the embedded vectors as follows and comparing the result with the threshold:

$$\sum_{p=1}^{l} \alpha_p d_p \left(\Phi(z)^T \Phi(x_p)\right)$$

Page 78:

Kernel functions

For the process of learning (convex optimization) as well as for the process of classifying fresh vectors (generalization), the following operation is central and occurs in high numbers:

$$K(x, y) = \Phi(x)^T \Phi(y)$$

The function K is called a kernel function.

Page 79:

Kernel functions

Computing lots of inner products in feature space may be expensive, since the dimension N might be large compared to n. We should look for means to compute the inner products in input space:

$$K(x, y) = \Phi(x)^T \Phi(y) = k(x^T y) \quad \text{with some function } k : \mathbb{R} \to \mathbb{R}$$

Page 80:

Kernel functions

Example:

$$\Phi : \mathbb{R}^2 \to \mathbb{R}^5, \qquad \Phi(x, y) = (x,\; y,\; x^2,\; y^2,\; \sqrt{2}\, x y)$$

Inner product in feature space:

$$\Phi(x, y)^T \Phi(x', y') = x x' + y y' + x^2 x'^2 + 2 x x' y y' + y^2 y'^2$$
$$= (x, y)^T (x', y') + \left((x, y)^T (x', y')\right)^2 = k\!\left((x, y)^T (x', y')\right) \quad \text{with } k(r) = r + r^2$$
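A quick numerical check of this example kernel, assuming the embedding Φ(x, y) = (x, y, x², y², √2·xy) given above:

```python
# Feature-space inner product equals k(r) = r + r^2 with r = u^T v in input space
import numpy as np

def phi(v):
    x, y = v
    return np.array([x, y, x**2, y**2, np.sqrt(2) * x * y])

rng = np.random.default_rng(0)
u, v = rng.normal(size=2), rng.normal(size=2)

lhs = phi(u) @ phi(v)          # inner product in the 5-dimensional feature space
r = u @ v                      # inner product in the 2-dimensional input space
print(np.isclose(lhs, r + r**2))   # True
```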

Page 81:

Classification:

(Figure: the same two-layer architecture as before, but each hidden neuron now computes a kernel value $K(z, x_1), \dots, K(z, x_k), \dots, K(z, x_l)$ for the input $z = (z_1, \dots, z_n)$; the output neuron weights these with $\alpha_k d_k$ and applies a step function.)

Page 82:

Interpretation:
•  Hidden neurons measure the similarity between the input vector and the training vectors after embedding into feature space.
•  The output neuron uses class labels and Lagrange multipliers to weight the similarities and integrate them into a summarized net input.
•  A particularly natural way to measure the similarity of the input vector with some training vector is by Gaussian bell-shaped functions:

$$K(z, x_p) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{\|z - x_p\|^2}{2\sigma^2}}$$

Page 83:

•  It can be shown that this indeed is a kernel function (Mercer's theorem).
•  Other widely used kernels are polynomial kernels of degree d

$$K(z, x) = (z^T x + 1)^d$$

•  and tanh kernels that mimic the behaviour of MLPs (weight scaling β and threshold θ):

$$K(z, x) = \tanh(\beta\, z^T x - \theta)$$

The latter are kernels only for certain combinations of β and θ.
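Minimal sketches of the Gaussian, polynomial and tanh kernels just mentioned, plus a helper for the Gram matrix K_pq = K(x_p, x_q); the parameter names (sigma, d, beta, theta) are assumptions for illustration:

```python
import numpy as np

def gaussian_kernel(z, x, sigma=1.0):
    # Gaussian bell-shaped similarity, as on the previous slide
    return np.exp(-np.sum((z - x) ** 2) / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def polynomial_kernel(z, x, d=2):
    return (z @ x + 1.0) ** d

def tanh_kernel(z, x, beta=1.0, theta=0.0):
    # a valid kernel only for certain combinations of beta and theta
    return np.tanh(beta * (z @ x) - theta)

def gram_matrix(X, kernel, **params):
    l = len(X)
    return np.array([[kernel(X[p], X[q], **params) for q in range(l)] for p in range(l)])
```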

Page 84:

•  There are lots of further useful kernel functions, some more general, some tailored to specific applications.
•  There are lots of cooking recipes for building fresh kernels out of already constructed ones.

Page 85:

SVM

V.  SVM learnability theory

Page 86:

Generalization ability of the maximum margin classifier

•  Using lots of additional features (remember the embedding into a high-dimensional feature space) usually carries (as for MLPs) the danger of overfitting.
•  Remember the estimates for the VC-dimension of MLPs, of order $w \cdot \log(w)$ or $w^2 n^2$.

Page 87:

•  SVMs do not suffer from this problem: the expected error scales with m/R, where m is the maximum margin of a training set and R the radius of the training vectors, but it does not depend on the dimension of the feature space.
•  A more concrete estimate for the expected error ε is (with m the maximum margin, l the size of the training set, R the radius of the training set, δ the confidence):

$$\varepsilon(l, m, R, \delta) = \frac{2}{l}\left(\frac{64 R^2}{m^2}\, \log\frac{e\, l\, m}{8 R^2}\, \log\frac{32\, l}{m^2} + \log\frac{4}{\delta}\right)$$

Page 88:

A single formula

What confidence would you like to have?
Fix δ > 0, δ < 1 (for example δ = 1%).

What margin do you expect?
m would be nice (for example m = 2).

How many training data l are available?

Page 89:

Under the settings above, what error must be expected under a randomly drawn training set in the worst case?

$$\Pr_{T = ((x_1, d_1), \dots, (x_l, d_l))}\Big(\forall (w, b):\; \mu(w, b) \ge m \;\Rightarrow\; error_{gen}(w, b) - error_{emp,T}(w, b) \le \varepsilon\Big) \ge 1 - \delta$$

with

$$\varepsilon = \frac{2}{l}\left(\frac{64 R^2}{m^2}\, \log\frac{e\, l\, m}{8 R^2}\, \log\frac{32\, l}{m^2} + \log\frac{4}{\delta}\right), \qquad R = \max_{p=1,\dots,l} \|x_p\|$$

Page 90:

Do you see any problem with this estimation? What if you draw a random training set of size l and margin 1 instead of the desired margin 2? Using formula again with m = 1 changes l to l‘. Again draw a random training set of size l‘. Does this game finally come to an end?

Page 91:

•  The dependence on m/R instead of m is obvious: by scaling the training set with a positive factor one could increase the maximum margin m as much as desired without affecting the learning problem.
•  The existence of some difficulties with a sound mathematical formalization of this result should be mentioned: the maximum margin cannot be fixed in advance before randomly drawing a training set. The interested reader should take a look into the literature on structural risk minimization.

Page 92:

SVM

VI.  Soft margin

Page 93:

•  By allowing limited classification error (that is bad), a larger margin becomes possible (that is good).
•  The amount of classification error is measured by slack variables ξ_p, one for each training vector - the error may vary from training vector to training vector.
•  The inequality constraints now read as follows:

Page 94:

$$d_p = +1 \;\Rightarrow\; w^T x_p + b \ge +1 - \xi_p \quad \forall p = 1,\dots,l$$
$$d_p = -1 \;\Rightarrow\; w^T x_p + b \le -1 + \xi_p \quad \forall p = 1,\dots,l$$

$$d_p(w^T x_p + b) \ge 1 - \xi_p \quad \forall p = 1,\dots,l$$

$$-d_p(w^T x_p + b) + 1 - \xi_p \le 0 \quad \forall p = 1,\dots,l$$
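Given a candidate hyperplane, the smallest slack values satisfying the last constraint are ξ_p = max(0, 1 - d_p(w^T x_p + b)). A small sketch with assumed data and parameters:

```python
import numpy as np

X = np.array([[2.0, 2.0], [0.5, 0.4], [0.0, 0.0], [1.2, 1.0]])  # assumed training vectors
d = np.array([1.0, 1.0, -1.0, -1.0])                            # labels
w, b = np.array([1.0, 1.0]), -2.0                               # assumed hyperplane

xi = np.maximum(0.0, 1.0 - d * (X @ w + b))   # slack per training vector
print(xi)   # zero outside the margin corridor, positive for violations
```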

Page 95:

(Figure: a soft-margin separation - a larger margin, with the classification errors measured by slack variables.)

Page 96:

•  In margin maximization, use error values that are as small as possible.
•  This is expressed by the following function with a constant C that is chosen by the user:

$$f(w, \xi_1, \dots, \xi_l) = \tfrac{1}{2}\|w\|^2 + C \sum_{p=1}^{l} \xi_p^2, \qquad \xi_1, \dots, \xi_l \ge 0$$

•  C controls the balance between margin maximization and error tolerance.

Page 97:

Soft margin: Primal problem

$$f(w, \xi_1, \dots, \xi_l) = \tfrac{1}{2}\|w\|^2 + C \sum_{p=1}^{l} \xi_p^2 \;\to\; \min$$

under constraints

$$\xi_1, \dots, \xi_l \ge 0$$
$$-d_p(w^T x_p + b) + 1 - \xi_p \le 0 \quad \forall p = 1,\dots,l$$
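In practice this soft-margin problem is rarely solved by hand. A minimal sketch using scikit-learn's SVC (note that its soft margin penalizes C·Σξ_p, i.e. linear rather than squared slacks as above); data and the value of C are assumed for illustration:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0], [1.4, 1.4]])
y = np.array([1, 1, -1, -1, -1])          # the last point makes the set harder to separate

clf = SVC(kernel="linear", C=10.0)        # larger C = less error tolerance, smaller margin
clf.fit(X, y)
print(clf.support_vectors_)               # the support vectors found
print(clf.predict([[2.5, 2.0]]))          # classify a fresh vector z
```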

Page 98:

Soft margin: Lagrange function

$$L(w, b, \xi_1, \dots, \xi_l, \alpha_1, \dots, \alpha_l, \beta_1, \dots, \beta_l)$$
$$= \tfrac{1}{2}\|w\|^2 + C \sum_{p=1}^{l} \xi_p^2 + \sum_{p=1}^{l} \alpha_p \left(1 - \xi_p - d_p(w^T x_p + b)\right) - \sum_{p=1}^{l} \beta_p \xi_p$$

Page 99:

Soft margin: KKT conditions

$$-d_p(w^T x_p + b) + 1 - \xi_p \le 0 \quad \forall p = 1,\dots,l$$
$$-\xi_p \le 0 \quad \forall p = 1,\dots,l$$
$$\alpha_p \left(1 - \xi_p - d_p(w^T x_p + b)\right) = 0 \quad \forall p = 1,\dots,l$$
$$\beta_p\, \xi_p = 0 \quad \forall p = 1,\dots,l$$
$$\frac{\partial L}{\partial w_i} = w_i - \sum_{p=1}^{l} \alpha_p d_p x_{p,i} = 0 \quad \forall i = 1,\dots,n$$
$$\frac{\partial L}{\partial b} = -\sum_{p=1}^{l} \alpha_p d_p = 0$$
$$\frac{\partial L}{\partial \xi_i} = 2 C \xi_i - \alpha_i - \beta_i = 0 \quad \forall i = 1,\dots,l$$
$$\alpha_i \ge 0, \quad \beta_i \ge 0 \quad \forall i = 1,\dots,l$$

Page 100:

Soft margin: Deriving the solution of the primal problem from KKT

$$w = \sum_{p=1}^{l} \alpha_p d_p x_p, \qquad \xi_p = \frac{\alpha_p + \beta_p}{2C} \quad \forall p = 1,\dots,l$$

Bias b is indirectly derived from an equality constraint for a non-zero $\alpha_i$.

Page 101:

Soft margin: Dual function

$$Q(\alpha_1, \dots, \alpha_l, \beta_1, \dots, \beta_l) = \inf_{w, b, \xi_1, \dots, \xi_l} L(w, b, \xi_1, \dots, \xi_l, \alpha_1, \dots, \alpha_l, \beta_1, \dots, \beta_l)$$

where (as before)

$$w = \sum_{p=1}^{l} \alpha_p d_p x_p, \qquad \xi_p = \frac{\alpha_p + \beta_p}{2C} \quad \forall p = 1,\dots,l, \qquad \sum_{p=1}^{l} \alpha_p d_p = 0 \;\; \text{(w.l.o.g.)}$$

Page 102:

Soft margin: Inserting the values into the dual function

$$Q(\alpha_1, \dots, \alpha_l, \beta_1, \dots, \beta_l) = \tfrac{1}{2}\, w^T w + C \sum_{p=1}^{l} \xi_p^2 + \sum_{p=1}^{l} \alpha_p \left(1 - \xi_p - d_p(w^T x_p + b)\right) - \sum_{p=1}^{l} \beta_p \xi_p$$

Page 103:

Inserting the values into the dual function, continued

$$\tfrac{1}{2}\, w^T w + C \sum_{p=1}^{l} \xi_p^2 + \sum_{p=1}^{l} \alpha_p \left(1 - \xi_p - d_p(w^T x_p + b)\right) - \sum_{p=1}^{l} \beta_p \xi_p$$
$$= \tfrac{1}{2}\, w^T w + C \sum_{p=1}^{l} \xi_p^2 + \sum_{p=1}^{l} \alpha_p - \sum_{p=1}^{l} \alpha_p \xi_p - \sum_{p=1}^{l} \alpha_p d_p\, w^T x_p - b \sum_{p=1}^{l} \alpha_p d_p - \sum_{p=1}^{l} \beta_p \xi_p$$

Page 104:

Inserting the values into the dual function, continued

Using $\sum_{p=1}^{l} \alpha_p d_p = 0$ and $\xi_p = \frac{\alpha_p + \beta_p}{2C}$:

$$= \tfrac{1}{2}\, w^T w + \sum_{p=1}^{l} \frac{(\alpha_p + \beta_p)^2}{4C} + \sum_{p=1}^{l} \alpha_p - \sum_{p=1}^{l} \alpha_p \frac{\alpha_p + \beta_p}{2C} - \sum_{p=1}^{l} \alpha_p d_p\, w^T x_p - \sum_{p=1}^{l} \beta_p \frac{\alpha_p + \beta_p}{2C}$$

Page 105:

Inserting the values into the dual function, continued

Combining the slack terms:

$$= \tfrac{1}{2}\, w^T w - \sum_{p=1}^{l} \alpha_p d_p\, w^T x_p + \sum_{p=1}^{l} \alpha_p + \sum_{p=1}^{l} \frac{(\alpha_p + \beta_p)^2}{4C} - \sum_{p=1}^{l} \frac{(\alpha_p + \beta_p)^2}{2C}$$

Page 106:

Inserting the values into the dual function, continued

Inserting $w = \sum_{q=1}^{l} \alpha_q d_q x_q$:

$$= \tfrac{1}{2} \sum_{p=1}^{l} \sum_{q=1}^{l} \alpha_p \alpha_q d_p d_q\, x_p^T x_q - \sum_{p=1}^{l} \sum_{q=1}^{l} \alpha_p \alpha_q d_p d_q\, x_p^T x_q + \sum_{p=1}^{l} \alpha_p - \sum_{p=1}^{l} \frac{(\alpha_p + \beta_p)^2}{4C}$$

Page 107:

Inserting the values into the dual function, continued

$$= -\tfrac{1}{2} \sum_{p=1}^{l} \sum_{q=1}^{l} \alpha_p \alpha_q d_p d_q\, x_p^T x_q - \sum_{p=1}^{l} \frac{(\alpha_p + \beta_p)^2}{4C} + \sum_{p=1}^{l} \alpha_p$$

Page 108:

Final form of the dual function with constraints

$$Q(\alpha_1, \dots, \alpha_l, \beta_1, \dots, \beta_l) = -\tfrac{1}{2} \sum_{p=1}^{l} \sum_{q=1}^{l} \alpha_p \alpha_q d_p d_q\, x_p^T x_q - \sum_{p=1}^{l} \frac{(\alpha_p + \beta_p)^2}{4C} + \sum_{p=1}^{l} \alpha_p$$

$$\alpha_1, \dots, \alpha_l, \beta_1, \dots, \beta_l \ge 0, \qquad \sum_{p=1}^{l} \alpha_p d_p = 0$$

Page 109:

Wild mixture of kernel functions

linear: $K(x, y) = x^T y + c$

polynomial: $K(x, y) = (\alpha\, x^T y + c)^d$

Gauss: $K(x, y) = e^{-\frac{\|x - y\|^2}{2\sigma^2}}$

exponential: $K(x, y) = e^{-\frac{\|x - y\|}{2\sigma^2}}$

Laplace: $K(x, y) = e^{-\frac{\|x - y\|}{\sigma}}$

Page 110:

Wild mixture of kernel functions

sigmoidal: $K(x, y) = \tanh(\alpha\, x^T y + c)$

Cauchy: $K(x, y) = \frac{1}{1 + \|x - y\|^2}$

quadratic: $K(x, y) = \|x - y\|^2 + c^2$

inverse quadratic: $K(x, y) = \frac{1}{\|x - y\|^2 + c^2}$

Page 111:

Wild mixture of kernel functions

Wave: $K(x, y) = \frac{\theta}{\|x - y\|}\, \sin\frac{\|x - y\|}{\theta}$

Power: $K(x, y) = -\|x - y\|^d$

Log: $K(x, y) = -\log\left(\|x - y\|^d + 1\right)$

Spline: $K(x, y) = \prod_{i=1}^{n} \left(1 + x_i y_i + x_i y_i \min(x_i, y_i) - \frac{x_i + y_i}{2}\min(x_i, y_i)^2 + \frac{\min(x_i, y_i)^3}{3}\right)$

Page 112:

Wild mixture of kernel functions

histogram: $K(x, y) = \sum_{i=1}^{n} \min(x_i, y_i)$

generalized histogram: $K(x, y) = \sum_{i=1}^{n} \min(x_i^{\alpha}, y_i^{\beta})$

T-Student: $K(x, y) = \frac{1}{1 + \|x - y\|^d}$
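Minimal sketches of two of the kernels listed above (parameter names and test values are assumptions for illustration):

```python
import numpy as np

def histogram_kernel(x, y):
    # K(x, y) = sum_i min(x_i, y_i), typically used on non-negative feature vectors
    return np.sum(np.minimum(x, y))

def t_student_kernel(x, y, d=2):
    # K(x, y) = 1 / (1 + ||x - y||^d)
    return 1.0 / (1.0 + np.linalg.norm(x - y) ** d)

x, y = np.array([0.2, 0.5, 0.3]), np.array([0.1, 0.6, 0.3])
print(histogram_kernel(x, y), t_student_kernel(x, y))
```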