# Introductory Machine Learning - Manik Varma

Multiple Kernel Learning
Manik Varma
Microsoft Research India

A Quick Review of SVMs
[Figure: a binary training set with the separating hyperplane w^t x + b = 0, the margin hyperplanes w^t x + b = -1 and w^t x + b = +1, support vectors lying on the margins (ξ = 0), points inside the margin (ξ < 1), a misclassified point (ξ > 1), and margin = 2 / ||w||]

The C-SVM Primal and Dual
Primal: P = min_{w,ξ,b} w^t w + C 1^t ξ
s.t. Y(X^t w + b 1) ≥ 1 - ξ, ξ ≥ 0
Dual: D = max_α 1^t α - α^t Y K Y α
s.t. 1^t Y α = 0, 0 ≤ α ≤ C

Duality
Primal: P = min_x f0(x)
s.t. fi(x) ≤ 0 for 1 ≤ i ≤ N, hi(x) = 0 for 1 ≤ i ≤ M
Lagrangian: L(x,λ,ν) = f0(x) + Σi λi fi(x) + Σi νi hi(x)
Dual: D = max_{λ,ν} min_x L(x,λ,ν)

s.t. λ ≥ 0

Duality
The Lagrange dual is always concave (even if the primal is not convex) and might be an easier problem to optimize.
Weak duality: P ≥ D. Always holds.
Strong duality: P = D. Does not always hold; usually holds for convex problems. Holds for the SVM QP.

Karush-Kuhn-Tucker (KKT) Conditions
If strong duality holds, then for x*, λ* and ν* to be optimal the following KKT conditions must necessarily hold:
Primal feasibility: fi(x*) ≤ 0 & hi(x*) = 0 for all i
Dual feasibility: λ* ≥ 0
Stationarity: ∇x L(x*, λ*, ν*) = 0
Complementary slackness: λi* fi(x*) = 0
If x⁺, λ⁺ and ν⁺ satisfy the KKT conditions for a convex problem then they are optimal.
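To make the C-SVM dual above concrete, here is a minimal numpy sketch that maximizes the dual objective by projected gradient ascent on a synthetic dataset. The data, learning rate and iteration count are illustrative choices, and the bias term is dropped so that the equality constraint 1^t Y α = 0 disappears; it is a sketch of the idea, not a production solver.

```python
import numpy as np

def svm_dual_ascent(K, y, C=1.0, lr=0.001, iters=5000):
    """Maximize 1'a - a'YKYa subject to 0 <= a <= C (bias-free C-SVM dual)."""
    YKY = (y[:, None] * y[None, :]) * K              # (YKY)_ij = y_i y_j K_ij
    alpha = np.zeros(len(y))
    for _ in range(iters):
        grad = 1.0 - 2.0 * YKY @ alpha               # gradient of 1'a - a'YKYa
        alpha = np.clip(alpha + lr * grad, 0.0, C)   # project back onto the box [0, C]
    return alpha

# Two Gaussian blobs and a linear kernel
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.5, (20, 2)), rng.normal(+1, 0.5, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])
alpha = svm_dual_ascent(X @ X.T, y)
f = (alpha * y) @ (X @ X.T)                          # f(x_i) = sum_j alpha_j y_j K(x_j, x_i)
print("training accuracy:", np.mean(np.sign(f) == y))
```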

Some Popular Kernels
Linear: K(xi,xj) = xi^t xj
Polynomial: K(xi,xj) = (xi^t xj + c)^d
Gaussian (RBF): K(xi,xj) = exp(-Σk γk (xik - xjk)²)
Chi-squared: K(xi,xj) = exp(-γ χ²(xi,xj))
Sigmoid: K(xi,xj) = tanh(γ xi^t xj + c)

Advantages of Learning the Kernel
Improve accuracy and generalization.
Learn an RBF kernel: K(xi,xj) = exp(-γ Σk (xik - xjk)²)
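A minimal numpy sketch of the base kernels listed above. The function names are my own, the per-dimension γk parameterization of the RBF kernel follows the slide, and the χ² distance assumes non-negative histogram features; treat it as an illustration rather than a reference implementation.

```python
import numpy as np

def linear_kernel(X, Z):
    return X @ Z.T

def polynomial_kernel(X, Z, c=1.0, d=2):
    return (X @ Z.T + c) ** d

def rbf_kernel(X, Z, gammas):
    """K(xi, xj) = exp(-sum_k gamma_k (x_ik - x_jk)^2), one gamma per dimension."""
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2 * gammas).sum(axis=-1)
    return np.exp(-sq)

def chi2_kernel(X, Z, gamma=1.0, eps=1e-10):
    """K(xi, xj) = exp(-gamma * chi2(xi, xj)) for non-negative histograms."""
    diff2 = (X[:, None, :] - Z[None, :, :]) ** 2
    summ = X[:, None, :] + Z[None, :, :] + eps
    return np.exp(-gamma * (diff2 / summ).sum(axis=-1))

def sigmoid_kernel(X, Z, gamma=1.0, c=0.0):
    return np.tanh(gamma * X @ Z.T + c)
```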

Kernel Parameter Setting - Underfitting (RBF γ = 0.001)
[Figure: |distances| to the decision boundary, decision boundaries and the resulting classification for γ = 0.001]

Kernel Parameter Setting (RBF γ = 1.000)
[Figure: |distances|, decision boundaries and classification for γ = 1.000]

Kernel Parameter Setting - Overfitting (RBF γ = 100.000)
[Figure: |distances|, decision boundaries and classification for γ = 100.000]

Advantages of Learning the Kernel
Improve accuracy and generalization.
Learn an RBF kernel: K(xi,xj) = exp(-γ Σk (xik - xjk)²)
[Figure: test error as a function of γ]

Advantages of Learning the Kernel

Perform non-linear feature selection: learn an RBF kernel K(xi,xj) = exp(-Σk γk (xik - xjk)²)
Perform non-linear dimensionality reduction: learn K(Pxi, Pxj) where P is a low-dimensional projection matrix parameterized by θ
These are optimized for the task at hand, such as classification, regression, ranking, etc.

Multiple Kernel Learning
Learn a linear combination of given base kernels: K(xi,xj) = Σk dk Kk(xi,xj)
Can be used to combine heterogeneous sources of data
Can be used for descriptor (feature) selection

MKL Geometric Interpretation
MKL learns a linear combination of base kernels, K(xi,xj) = Σk dk Kk(xi,xj), which corresponds to concatenating the base feature maps scaled by √dk: φ(x) = [√d1 φ1(x); √d2 φ2(x); √d3 φ3(x)]
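A quick numeric check of this geometric interpretation (the toy features and weights below are made up for illustration): the kernel of the √dk-scaled, concatenated feature maps equals the weighted sum of the base kernels.

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(size=(5, 3))        # feature map 1 (e.g. shape features)
X2 = rng.normal(size=(5, 4))        # feature map 2 (e.g. colour features)
d = np.array([0.7, 0.3])            # kernel weights

K1, K2 = X1 @ X1.T, X2 @ X2.T       # base (linear) kernels
K_sum = d[0] * K1 + d[1] * K2       # MKL combination

# Concatenate the feature maps scaled by sqrt(d_k)
Phi = np.hstack([np.sqrt(d[0]) * X1, np.sqrt(d[1]) * X2])
K_concat = Phi @ Phi.T

print(np.allclose(K_sum, K_concat))  # True: both constructions give the same kernel
```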

MKL Toy Example
Suppose we're given a simplistic 1D shape feature s for a binary classification problem.
Define a linear shape kernel: Ks(si,sj) = si sj
The classification accuracy is 100% but the margin is very small.

MKL Toy Example
Suppose we're now given an additional 1D colour feature c.
Define a linear colour kernel: Kc(ci,cj) = ci cj
The classification accuracy is also 100% but the margin remains very small.

MKL Toy Example
MKL learns a combined shape-colour feature space: K(xi,xj) = d Ks(xi,xj) + (1 - d) Kc(xi,xj)
[Figure: decision boundaries in the (s, c) plane as d varies from d = 0 (colour kernel only) to d = 1 (shape kernel only)]

MKL Another Toy Example
[Figure: a second toy problem combining the shape and colour kernels as K(xi,xj) = d Ks(xi,xj) + (1 - d) Kc(xi,xj), again showing the decision boundaries for d = 0 and d = 1]
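A sketch of this toy experiment in code, using scikit-learn's precomputed-kernel SVM. The synthetic shape/colour data, the grid over d and the large (hard-margin-like) C are all illustrative choices: sweep d and keep the combination with the largest margin 1/||w||.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 40
y = np.array([-1] * (n // 2) + [+1] * (n // 2))
# 1D shape and colour features that each separate the classes, but only barely
s = y * 0.1 + rng.normal(0, 0.02, n)
c = y * 0.1 + rng.normal(0, 0.02, n)

Ks, Kc = np.outer(s, s), np.outer(c, c)          # linear shape / colour kernels

best = None
for d in np.linspace(0, 1, 11):
    K = d * Ks + (1 - d) * Kc                    # combined kernel
    svm = SVC(kernel="precomputed", C=1e3).fit(K, y)
    a = np.zeros(n)
    a[svm.support_] = np.abs(svm.dual_coef_[0])  # alpha_i for the support vectors
    w_norm2 = (a * y) @ K @ (a * y)              # ||w||^2 = sum_ij a_i a_j y_i y_j K_ij
    margin = 1.0 / np.sqrt(w_norm2)
    if best is None or margin > best[1]:
        best = (d, margin)

print("best d = %.1f, margin = %.4f" % best)
```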

Object Categorization
[Figure: example categories such as Chair, Schooner, Ketch, Taj and Panda, and a query image marked "?"]

The Caltech 101 Database
Database collected by Fei-Fei et al. [PAMI 2006]
[Figure: example Caltech 101 images of chairs and bikes]

Caltech 101 Features and Kernels
Features:
Geometric Blur [Berg and Malik, CVPR 01]
PHOW Gray & Colour [Lazebnik et al., CVPR 06]
Self Similarity [Shechtman and Irani, CVPR 07]

Kernels:
RBF for Geometric Blur
K(xi,xj) = exp(-γ χ²(xi,xj)) for the rest

Caltech 101 Experimental Setup
102 categories including Background_Google and Faces_easy
15 training and 15 test images per category
30 training and up to 15 test images per category
Results summarized over 3 random train/test splits

Caltech 101 MKL Results - Feature Comparison
[Figure: classification accuracy (%) with 15 and 30 training images per category for MKL, the equal-weight average (avg), and the individual gb, phowGray, phowColor and ssim kernels]

Caltech 101 Comparisons

| Method | 15 Training | 30 Training |
| --- | --- | --- |
| LP-Beta [Gehler & Nowozin, 09] | 74.6 ± 1.0 | 82.1 ± 0.3 |
| GS-MKL [Yang et al., ICCV 09] | 74 | 84.3 |
| Bayesian LMKL [Christoudias et al., Tech Rep 09] | 73.0 ± 1.3 | NA |
| In Defense of NN [Boiman et al., CVPR 08] | 72.8 | 79 |
| MKL [Vedaldi et al., ICCV 09] | 71.1 ± 0.6 | 78.2 ± 0.4 |
| LP-Beta [Gehler & Nowozin, ICCV 09] | 70.4 ± 0.8 | 77.7 ± 0.3 |
| Region Recognition [Gu et al., CVPR 08] | 65.0 | 73.1 |
| SVM-KNN [Zhang et al., CVPR 06] | 59.06 ± 0.56 | 66.23 ± 0.48 |

Caltech 101 Over Fitting?
[A sequence of slides examining whether MKL over-fits on Caltech 101]

Wikipedia MM Subset - Experimental Setup
33 topics chosen, each with more than 60 images
Ntrain = [10, 15, 20, 25, 30]; the remaining images are used for testing
Features: PHOG 180 & 360, Self Similarity, PHOW Gray & Colour, Gabor filters
Kernels: Pyramid Match Kernel & Spatial Pyramid Kernel

Wikipedia MM Subset - Results

| Method | Ntrain = 10 | 15 | 20 | 25 | 30 |
| --- | --- | --- | --- | --- | --- |
| Equal Weights | 38.9 ± 0.7 | 42.0 ± 0.6 | 44.8 ± 0.5 | 47.0 ± 0.5 | 49.2 ± 0.4 |
| MKL | 45.0 ± 1.0 | 50.1 ± 0.8 | 54.3 ± 0.8 | 56.1 ± 0.7 | 58.2 ± 0.6 |
| LMKL | 47.3 ± 1.6 | 53.4 ± 1.3 | 56.2 ± 0.9 | 57.8 ± 1.1 | 60.5 ± 1.0 |
| GS-MKL | 49.2 ± 1.2 | 56.6 ± 1.0 | 61.0 ± 1.0 | 64.3 ± 0.8 | 67.6 ± 0.9 |

LMKL [Gonen and Alpaydin, ICML 08]; GS-MKL [Yang et al., ICCV 09]

Feature Selection for Gender Identification
FERET faces [Moghaddam and Yang, PAMI 2002]
[Figure: example male and female face images]

Feature Selection for Gender Identification - Experimental Setup
1053 training and 702 testing images
We define an RBF kernel per pixel (252 kernels)
Results summarized over 3 random train/test splits

Feature Selection Results

[Table: gender classification accuracy (%) as a function of the number of selected pixels (10, 20, 30, 50, 80, 100, 150, 252), comparing AdaBoost, Baluja et al. [IJCV 2007], OWL-QN [ICML 2007], LP-SVM [COA 2004], BAHSIC [ICML 2007], SSVM [ICML 2007], QCQP MKL and GMKL. AdaBoost reaches 76.3 ± 0.9 with 10 pixels; Uniform MKL = 92.6 ± 0.9; Uniform GMKL = 94.3 ± 0.1.]

Object Detection
Localize a specified object of interest if it exists in a given image.

The PASCAL VOC Challenge Database

PASCAL VOC Database
[Figure: example PASCAL VOC images of cars and dogs]

PASCAL VOC 2009 Database Statistics
[Table 1: statistics of the main image sets - the number of images and annotated objects per class for the train, val, trainval and test sets, listing only the 'non-difficult' objects used in the evaluation. The twenty classes are aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, dining table, dog, horse, motorbike, person, potted plant, sheep, sofa, train and tv/monitor; person is by far the largest class (2779 trainval images, 5815 objects). Totals: train 3473 images / 8505 objects, val 3581 / 8713, trainval 7054 / 17218.]
[Table 2: statistics of the segmentation image sets. Totals: train 749 images / 1601 objects, val 750 / 1610, trainval 1499 / 3211.]

Detection By Classification
[Figure: image windows labelled "Bird" / "No Bird"]

Detect by classifying every image window at every position, orientation and scale.
The number of windows in an image runs into the hundreds of millions.
Even if we could classify a window in a second, it would take many days to detect a single object in an image.

Fast Detection Via a Cascade
[Figure: PHOW Gray & Colour, PHOG, PHOG Sym and Self Similarity features are quantized into visual words and concatenated into a feature vector, which is passed through a fast linear SVM, then a quasi-linear SVM, then a non-linear SVM, with jumping windows proposing candidate regions]

MKL Detection Overview
First stage: linear SVM, jumping windows / branch and bound. Time = O(#Windows)
Second stage: quasi-linear SVM, χ² kernel. Time = O(#Windows × #Dims)

Third stage: non-linear SVM, exponential χ² kernel. Time = O(#Windows × #Dims × #SVs)

PASCAL VOC Evaluation
Predictions are evaluated using precision-recall curves based on bounding box overlap.
Given the ground truth box Bgt and the predicted box Bp:
Area Overlap = |Bgt ∩ Bp| / |Bgt ∪ Bp|
A prediction is valid if the area overlap exceeds a threshold.
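A minimal sketch of this overlap criterion; the (xmin, ymin, xmax, ymax) box format and the function name are my own choices for illustration.

```python
def area_overlap(b_gt, b_p):
    """Intersection-over-union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(b_gt[2], b_p[2]) - max(b_gt[0], b_p[0]))
    iy = max(0.0, min(b_gt[3], b_p[3]) - max(b_gt[1], b_p[1]))
    inter = ix * iy
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(b_gt) + area(b_p) - inter
    return inter / union if union > 0 else 0.0

print(area_overlap((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```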

Some Examples of MKL Detections
[Figures: example MKL detections for aeroplanes, bicycles, cars, cows, horses and motorbikes]

Performance of Individual Kernels (car)

[Figure: precision-recall curves for the car class - MKL 50.4%, avg 49.9%, phow Gray 44.4%, phow Color 42.6%, phog360 40.9%, phog180 39.8%, ssim 39.1%]

MKL gives a big boost over any individual kernel.
MKL gives only marginally better performance than equal weights, but results in a much faster system.

PASCAL VOC 2009 Results
[Figure: detection results from the PASCAL VOC 2009 challenge]

MKL Formulations and Optimization
Kernel Target Alignment
Semi-Definite Programming-MKL (SDP)
Hyper kernels (SDP/SOCP)
Block l1-MKL (M-Y regularization + SMO)
Semi-Infinite Linear Programming-MKL (SILP)
Simple MKL / Generalized MKL (gradient descent)
Multi-class MKL
Hierarchical MKL
Local MKL
Mixed norm MKL (mirror descent)
SMO-MKL

Kernel Target Alignment
The ideal kernel is Kideal = yy^t, whose (i,j) entry is +1 when yi = yj and -1 otherwise.
[Figure: eigen-decompose the base kernel K into eigenvalues λ1, λ2 and eigenvectors v1, v2, and form the rank-one kernels K1 = v1v1^t and K2 = v2v2^t]
[Figure: the combined kernel Kopt = d1 K1 + d2 K2, with d1² + d2² = 1, is chosen to maximize alignment with Kideal; rank-one components well aligned with Kideal receive large weights and poorly aligned ones receive small weights]

Kernel Target Alignment [Cristianini et al. 2001]
Alignment: A(K1,K2) = <K1,K2> / (||K1|| ||K2||) where <K1,K2> = Σi Σj K1(xi,xj) K2(xi,xj)
Ideal kernel: Kideal = yy^t
Alignment to the ideal kernel: A(K, Kideal) = <K, yy^t> / (n ||K||)
Optimal kernel: Kopt = Σk dk Kk where Kk = vk vk^t (rank one)
Optimal alignment: A(Kopt) = Σk dk (vk^t y)² / (n √(Σk dk²))
Assume Σk dk² = 1. Lagrangian: L(λ,d) = Σk dk (vk^t y)² - λ (Σk dk² - 1)
Optimal weights: dk ∝ (vk^t y)²
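A small numpy sketch of this construction on synthetic data (the dataset and base kernel are invented for illustration): eigen-decompose a kernel, weight its rank-one components by their alignment with yy^t, and check that the alignment to the ideal kernel does not decrease.

```python
import numpy as np

def alignment(K1, K2):
    """A(K1, K2) = <K1, K2>_F / (||K1||_F ||K2||_F)."""
    return np.sum(K1 * K2) / (np.linalg.norm(K1) * np.linalg.norm(K2))

rng = np.random.default_rng(0)
y = np.array([-1.0] * 10 + [+1.0] * 10)
X = y[:, None] * 0.5 + rng.normal(0, 1.0, (20, 5))   # weak class signal plus noise
K = X @ X.T
K_ideal = np.outer(y, y)

lam, V = np.linalg.eigh(K)                            # eigenvectors v_k of K
d = (V.T @ y) ** 2                                    # optimal weights d_k ∝ (v_k' y)^2
d /= np.linalg.norm(d)                                # enforce sum_k d_k^2 = 1
K_opt = (V * d) @ V.T                                 # K_opt = sum_k d_k v_k v_k'

print("alignment of K     :", alignment(K, K_ideal))
print("alignment of K_opt :", alignment(K_opt, K_ideal))   # at least as large as for K
```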

Kernel Target Alignment
One of the first papers to propose kernel learning; established taking linear combinations of base kernels.
Efficient algorithm.
The formulation is based on l2 regularization. Some generalisation bounds have been given, but the task is not directly related to classification.
Does not easily generalize to other loss functions.

Brute Force Search Over d
[Figure: the (d1, d2) plane for Kopt = d1 K1 + d2 K2, with the candidate K = 0.5 K1 + 0.5 K2 marked]

Brute Force Search Over d
[Figure: further candidates in the (d1, d2) plane, such as K = 0.1 K1 + 0.8 K2 and K = 0.5 K1 + 0.5 K2]
[Figure: the search may also reach combinations with negative weights, such as Kopt = 1.2 K1 - 0.3 K2]

SDP-MKL
[Figure: the (d1, d2) plane for Kopt = d1 K1 + d2 K2, showing the region where K ⪰ 0 (the SDP constraint of Lanckriet et al.), the simplex Σk dk = 1, and an NP hard region outside it]

SDP-MKL [Lanckriet et al. 2002]
Minimise w^t w + C Σi ξi
Subject to yi [w^t φd(xi) + b] ≥ 1 - ξi, ξi ≥ 0
K = Σk dk Kk ⪰ 0 (positive semi-definite)
trace(K) = constant

SDP-MKL
Optimises an appropriate cost function depending on the task at hand.
Other loss functions are possible (square hinge, KTA, regression, etc.).
The optimisation involved is an SDP.
The optimization reduces to a QCQP if d ≥ 0.
Sparse kernel weights are learnt in the QCQP formulation.

Improving SDP
[Figure: the (d1, d2) plane with the SDP region K ⪰ 0 and the simplex Σk dk = 1, dk ≥ 0 targeted by Bach et al., Sonnenburg et al. and Rakotomamonjy et al.]

Block l1 MKL
The SDP-MKL dual reduces to a non-smooth QCQP when d ≥ 0:
Dual: min_α -1^t α + max_k α^t Y Kk Y α
s.t. 1^t Y α = 0, 0 ≤ α ≤ C
SMO cannot be applied since the objective is not differentiable.

M-Y Regularization

[Figure: f(x) = |x| and its Moreau-Yosida regularization]
FM(x) = min_y f(y) + ||y - x||M²
F is differentiable even when f is not.
The minima of F and f coincide.
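A quick numeric illustration of this envelope for f(x) = |x|, with the quadratic weight and evaluation grid chosen arbitrarily for the sketch: FM is smooth at 0 and shares the same minimum value and minimizer as f.

```python
import numpy as np

def my_envelope(f, x, mu=1.0, grid=np.linspace(-5, 5, 10001)):
    """F(x) = min_y f(y) + mu * (y - x)^2, evaluated by brute force over a grid."""
    return np.min(f(grid) + mu * (grid - x) ** 2)

f = np.abs
xs = np.linspace(-2, 2, 9)
F = np.array([my_envelope(f, x) for x in xs])
print(np.round(F, 3))          # quadratic (smooth) near 0, linear further out
print(F.min(), f(xs).min())    # both are minimized at x = 0 with value 0
```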

Block l1 MKL [Bach et al. 2004]
min (Σk ak ||wk||)² + C Σi ξi + Σk ak² ||wk||²
s.t. yi [Σk wk^t φk(xi) + b] ≥ 1 - ξi, ξi ≥ 0
M-Y regularization ensures differentiability.
Block l1 regularization ensures sparseness.
Optimisation is carried out via iterative SMO.

SILP-MKL [Sonnenburg et al. 2005]
Primal: min_{w,ξ,b} (Σk ||wk||)² + C Σi ξi
s.t. yi [Σk wk^t φk(xi) + b] ≥ 1 - ξi, ξi ≥ 0
where (implicitly) dk ≥ 0 and Σk dk = 1

SILP-MKL [Sonnenburg et al. 2005]
Dual: max_d min_α Σk dk (α^t Y Kk Y α - 1^t α)
s.t. 1^t Y α = 0, 0 ≤ α ≤ C, dk ≥ 0, Σk dk = 1

Iterative LP-QP solution? A naive alternating LP-QP solver will oscillate.

SILP-MKL [Sonnenburg et al. 2005]
Dual: max_{θ,d} θ
s.t. dk ≥ 0, Σk dk = 1,
Σk dk (α^t Y Kk Y α - 1^t α) ≥ θ for all α with 1^t Y α = 0, 0 ≤ α ≤ C
In each iteration, find the α that most violates the constraint and add it to the working set.
This can be done using a standard SVM by solving α* = argmin_α Σk dk (α^t Y Kk Y α - 1^t α)

The SILP (cutting plane) method can solve a 1M point problem with 20 kernels.
It generalizes to regression, novelty detection, etc.
However, the LP grows more complex with each iteration and the method does not scale well to a large number of kernels.

Cutting Planes

[Figure: x log(x) approximated from below by a growing set of tangent (cutting) planes]

For convex functions: f(x) = max_{(w,b)∈G} (w^t x + b) and ∂x f(x) = argmax_{w∈G} w^t x; G turns out to be the set of subgradients.
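A tiny numeric illustration of the cutting-plane picture for the convex function f(x) = x log(x) (the touch points and query point are arbitrary): the pointwise maximum of tangent planes lower-bounds f, and each added plane tightens the bound.

```python
import numpy as np

f  = lambda x: x * np.log(x)        # convex on x > 0
df = lambda x: np.log(x) + 1.0      # its derivative (the subgradient here)

def lower_bound(x, touch_points):
    """max over tangent planes f(x0) + f'(x0)(x - x0), each one a cutting plane."""
    return max(f(x0) + df(x0) * (x - x0) for x0 in touch_points)

x = 1.3
for cuts in ([0.5], [0.5, 2.0], [0.5, 1.0, 2.0]):
    print(len(cuts), "cuts:", round(lower_bound(x, cuts), 4), "<= f(x) =", round(f(x), 4))
```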

Gradient Descent Based Methods
Chapelle et al. [ML 2002]
Simple MKL [Rakotomamonjy et al., ICML 2007, JMLR 2008]
Generalized MKL [Varma & Ray, ICCV 2007; Varma & Babu, ICML 2009]
Local MKL [Gonen & Alpaydin, ICML 2008]
Hierarchical MKL [Bach, NIPS 2008]
Mixed norm MKL [Nath et al., NIPS 09]

GMKL Primal Formulation
Primal: P = min_{w,d,b} w^t w + Σi L(f(xi), yi) + r(d)
s.t. d ≥ 0
where f(x) = w^t φd(x) + b, L is a loss function and r is a regulariser on the kernel parameters.
This formulation is not convex in general.

GMKL Primal for Classification
Primal: P = min_d T(d) s.t. d ≥ 0
where T(d) = min_{w,ξ,b} w^t w + C 1^t ξ + r(d)
s.t. yi (w^t φd(xi) + b) ≥ 1 - ξi, ξi ≥ 0
We optimize P over d using gradient descent. We need to prove that ∇dT exists and calculate it efficiently.
We optimize T using a standard SVM solver.

Visualizing T
[Figure: the objective T plotted as a function of the kernel parameters d]

GMKL Differentiability
Primal: P = min_d T(d) s.t. d ≥ 0
W(d) = max_α 1^t α - α^t Y Kd Y α + r(d)
s.t. 1^t Y α = 0, 0 ≤ α ≤ C
T = W by the principle of strong duality.
∇dW exists, by Danskin's theorem, if Kd is strictly positive definite and ∇dK and ∇dr exist and are continuous.

GMKL Gradient
Primal: P = min_d T(d) s.t. d ≥ 0
W(d) = max_α 1^t α - α^t Y Kd Y α + r(d)
s.t. 1^t Y α = 0, 0 ≤ α ≤ C
∇dW = -α*^t Y (∇dK) Y α* + ∇dr, evaluated at the optimal α*; the terms involving ∂α*/∂d vanish at the optimum.

Final Algorithm
1. Initialise d0 randomly
2. Repeat until convergence:
   a) Form Kd using the current estimate of d
   b) Use any SVM solver to obtain α*
   c) Update dn+1 = max(0, dn - sn ∇dW), where sn is the step size
Code: http://research.microsoft.com/~manik/code/GMKL/download.html
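A compact sketch of this alternating scheme using scikit-learn's precomputed-kernel SVM. The synthetic data, the l2 regulariser r(d) = λ||d||², the step size and the fixed iteration count are all my own illustrative choices, not the released GMKL code.

```python
import numpy as np
from sklearn.svm import SVC

def gmkl_gradient_descent(Ks, y, C=10.0, lam=0.1, step=0.05, iters=50):
    """Learn weights d >= 0 for K_d = sum_k d_k K_k by projected gradient descent,
    with r(d) = lam * ||d||^2 standing in for the regulariser r."""
    d = np.full(len(Ks), 1.0 / len(Ks))                  # step 1: initialise d
    for _ in range(iters):                               # step 2: repeat
        Kd = sum(dk * Kk for dk, Kk in zip(d, Ks))       # 2a: form K_d from the current d
        svm = SVC(kernel="precomputed", C=C).fit(Kd, y)  # 2b: any SVM solver gives alpha*
        ay = np.zeros(len(y))
        ay[svm.support_] = svm.dual_coef_[0]             # ay_i = alpha_i * y_i
        # dW/dd_k = -(alpha y)' K_k (alpha y) + dr/dd_k
        grad = np.array([-(ay @ Kk @ ay) + 2 * lam * dk for Kk, dk in zip(Ks, d)])
        d = np.maximum(0.0, d - step * grad)             # 2c: projected gradient step
    return d

# Toy problem: only the first of three base kernels carries the label signal
rng = np.random.default_rng(0)
y = np.hstack([-np.ones(30), np.ones(30)])
X = y[:, None] + rng.normal(0, 0.5, (60, 2))             # informative features
Z = rng.normal(size=(60, 5))                             # pure-noise features
Ks = [X @ X.T, Z @ Z.T, np.eye(60)]
print(np.round(gmkl_gradient_descent(Ks, y), 2))         # learned kernel weights d
```

On this toy problem one would expect the informative kernel to end up with the largest weight, with the noise and identity kernels driven towards zero.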

L2 MKL Formulation
Primal: min_{w,b,d} Σk wk^t wk + C Σi ξi + d^t d
s.t. yi (Σk dk wk^t φk(xi) + b) ≥ 1 - ξi, ξi ≥ 0, dk ≥ 0
Dual: max_α 1^t α - (1/8) Σk (α^t Y Kk Y α)²
s.t. 1^t Y α = 0, 0 ≤ α ≤ C
Can be optimized via SMO: the roots of a cubic can be found analytically and the gradients can be maintained.