Multiple Kernel Learning
Manik Varma
Microsoft Research India

A Quick Review of SVMs
[Figure: a binary dataset with the separating hyperplane wᵀx + b = 0 and the margin hyperplanes wᵀx + b = −1 and wᵀx + b = +1, giving margin 2/‖w‖. Points with slack ξ > 1 are misclassified, points with 0 < ξ < 1 lie inside the margin, and support vectors lie on the margin hyperplanes (ξ = 0).]
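The following is not from the slides: a minimal sketch, using scikit-learn's SVC on made-up data, of training a linear C-SVM and reading off the quantities in the figure above, namely w, b, the margin 2/‖w‖ and the support vectors.

```python
# Train a linear C-SVM on two synthetic blobs and inspect w, b,
# the margin 2/||w|| and the support vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) - 2, rng.randn(20, 2) + 2])
y = np.hstack([-np.ones(20), np.ones(20)])

clf = SVC(kernel="linear", C=1.0).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]
print("margin = 2/||w|| =", 2 / np.linalg.norm(w))
print("number of support vectors:", clf.support_vectors_.shape[0])
```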
The C-SVM Primal and Dual
Primal: P = min over (w, b, ξ) of ½wᵀw + C1ᵀξ
s.t. Y(Xᵀw + b1) ≥ 1 − ξ, ξ ≥ 0
Dual: D = max over α of 1ᵀα − ½αᵀYKYα
s.t. 1ᵀYα = 0, 0 ≤ α ≤ C

Duality
Primal: P = min over x of f₀(x)
s.t. fᵢ(x) ≤ 0 for 1 ≤ i ≤ N, and hᵢ(x) = 0 for 1 ≤ i ≤ M
Lagrangian: L(x, λ, ν) = f₀(x) + Σᵢ λᵢfᵢ(x) + Σᵢ νᵢhᵢ(x)
Dual: D = max over (λ ≥ 0, ν) of min over x of L(x, λ, ν)

Duality
The Lagrange dual is always concave (even if the primal is not convex) and might be an easier problem to optimize.
Weak duality, P ≥ D, always holds.
Strong duality, P = D, does not always hold; it usually holds for convex problems, and it holds for the SVM QP.

Karush-Kuhn-Tucker (KKT) Conditions
If strong duality holds, then for x*, λ* and ν* to be optimal the following KKT conditions must necessarily hold:
Primal feasibility: fᵢ(x*) ≤ 0 and hᵢ(x*) = 0
Dual feasibility: λ* ≥ 0
Stationarity: ∇ₓL(x*, λ*, ν*) = 0
Complementary slackness: λᵢ*fᵢ(x*) = 0
Conversely, if x⁺, λ⁺ and ν⁺ satisfy the KKT conditions for a convex problem then they are optimal.

Some Popular Kernels
Linear: K(xᵢ, xⱼ) = xᵢᵀΣ⁻¹xⱼ
Polynomial: K(xᵢ, xⱼ) = (xᵢᵀΣ⁻¹xⱼ + c)ᵈ
Gaussian (RBF): K(xᵢ, xⱼ) = exp(−Σₖ γₖ(xᵢₖ − xⱼₖ)²)
Chi-squared: K(xᵢ, xⱼ) = exp(−γχ²(xᵢ, xⱼ))
Sigmoid: K(xᵢ, xⱼ) = tanh(γxᵢᵀxⱼ − c)

Advantages of Learning the Kernel
Improve accuracy and generalization.
Learn an RBF kernel: K(xᵢ, xⱼ) = exp(−Σₖ γₖ(xᵢₖ − xⱼₖ)²)

Kernel Parameter Setting - Underfitting
[Figure: decision boundary of an RBF SVM with γ = 0.001, underfitting the data.]
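As an illustration (not part of the original deck), here is a minimal NumPy sketch of the kernels listed above, taking Σ to be the identity so the linear kernel reduces to xᵢᵀxⱼ; the per-dimension γ in the RBF kernel is exactly the kind of parameter that kernel learning tunes.

```python
import numpy as np

def linear(xi, xj):
    return xi @ xj

def polynomial(xi, xj, c=1.0, d=2):
    return (xi @ xj + c) ** d

def rbf(xi, xj, gamma):
    # gamma may be a scalar or one weight per dimension
    return np.exp(-np.sum(gamma * (xi - xj) ** 2))

def chi_squared(xi, xj, gamma=1.0, eps=1e-12):
    # intended for non-negative histogram features
    chi2 = np.sum((xi - xj) ** 2 / (xi + xj + eps))
    return np.exp(-gamma * chi2)
```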
Advantages of Learning the Kernel
Improve accuracy and generalization.
Learn an RBF kernel: K(xᵢ, xⱼ) = exp(−Σₖ γₖ(xᵢₖ − xⱼₖ)²)
[Figure: test error as a function of γ.]

Advantages of Learning the Kernel
Perform non-linear feature selection.
Learn an RBF kernel: K(xᵢ, xⱼ) = exp(−Σₖ γₖ(xᵢₖ − xⱼₖ)²)
Perform non-linear dimensionality reduction.
Learn K(Pxᵢ, Pxⱼ), where P is a low dimensional projection matrix parameterized by θ.
These are optimized for the task at hand, such as classification, regression, ranking, etc.

Multiple Kernel Learning
Learn a linear combination of given base kernels: K(xᵢ, xⱼ) = Σₖ dₖKₖ(xᵢ, xⱼ)
Can be used to combine heterogeneous sources of data.
Can be used for descriptor (feature) selection.

MKL Geometric Interpretation
MKL learns a linear combination of base kernels, K(xᵢ, xⱼ) = Σₖ dₖKₖ(xᵢ, xⱼ), which corresponds to the stacked feature map
φd(x) = [√d₁ φ₁(x); √d₂ φ₂(x); √d₃ φ₃(x)]

MKL Toy Example
Suppose we're given a simplistic 1D shape feature for a binary classification problem.
Define a linear shape kernel: Ks(sᵢ, sⱼ) = sᵢsⱼ
The classification accuracy is 100% but the margin is very small.

MKL Toy Example
Suppose we're now given an additional 1D colour feature.
Define a linear colour kernel: Kc(cᵢ, cⱼ) = cᵢcⱼ
The classification accuracy is also 100% but the margin remains very small.

MKL Toy Example
MKL learns a combined shape-colour feature space: K(xᵢ, xⱼ) = dKs(xᵢ, xⱼ) + (1 − d)Kc(xᵢ, xⱼ)
[Figure: the combined (s, c) feature space; d = 1 recovers the shape-only axis and d = 0 the colour-only axis, while intermediate d gives a much larger margin.]

MKL Another Toy Example
[Figure: a second toy problem shown in the combined shape-colour feature space for d = 0 and d = 1.]
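A hedged sketch of this toy example (the 1D feature values are invented for illustration): build K = dKs + (1 − d)Kc as a precomputed kernel and watch the margin grow when the two weak features are combined.

```python
import numpy as np
from sklearn.svm import SVC

# Each feature alone separates the classes, but only with a tiny margin.
s = np.array([[-1.0], [-0.1], [0.1], [1.0]])   # 1D shape feature
c = np.array([[-0.1], [-1.0], [1.0], [0.1]])   # 1D colour feature
y = np.array([-1, -1, 1, 1])

Ks, Kc = s @ s.T, c @ c.T                      # linear base kernels
for d in [0.0, 0.5, 1.0]:
    K = d * Ks + (1 - d) * Kc
    clf = SVC(kernel="precomputed", C=1e3).fit(K, y)
    # margin = 2/||w||, with ||w||^2 = sum_ij alpha_i alpha_j y_i y_j K_ij
    a = np.zeros(len(y))
    a[clf.support_] = np.abs(clf.dual_coef_[0])
    w2 = (a * y) @ K @ (a * y)
    print(f"d={d:.1f}  margin={2/np.sqrt(w2):.3f}")
```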
Object Categorization
[Figure: example images of a chair, a schooner, a ketch, the Taj and a panda, plus a query image marked "?".]

The Caltech 101 Database
Database collected by Fei-Fei et al. [PAMI 2006]
[Figures: example Caltech 101 images of chairs and bikes.]

Caltech 101 Features and Kernels
Features:
Geometric Blur [Berg and Malik, CVPR 01]
PHOW Gray & Colour [Lazebnik et al., CVPR 06]
Self Similarity [Shechtman and Irani, CVPR 07]
Kernels:
RBF for Geometric Blur; K(xᵢ, xⱼ) = exp(−γχ²(xᵢ, xⱼ)) for the rest.

Caltech 101 Experimental Setup
102 categories, including Background_Google and Faces_easy.
15 training and 15 test images per category, or 30 training and up to 15 test images per category.
Results summarized over 3 random train/test splits.

Caltech 101 MKL Results
[Figure: feature comparison. MKL reaches 59.06 ± 0.56 (15 training images) and 66.23 ± 0.48 (30 training images), compared against Bayesian LMKL [Christoudias et al., Tech Rep 09], In Defense of NN [Boiman et al., CVPR 08], Region Recognition [Gu et al., CVPR 08] and SVM-KNN [Zhang et al., CVPR 06].]

Caltech 101 Over Fitting?
[Figures: a sequence of slides examining whether the MKL results overfit.]

Wikipedia MM Subset
Experimental setup:
33 topics chosen, each with more than 60 images.
Ntrain = [10, 15, 20, 25, 30]; the remaining images are used for testing.
Features: PHOG 180 & 360, Self Similarity, PHOW Gray & Colour, Gabor filters.
Kernels: Pyramid Match Kernel & Spatial Pyramid Kernel.

Wikipedia MM Subset
[Figure: accuracy as a function of Ntrain, compared against GS-MKL [Yang et al., ICCV 09].]

Feature Selection for Gender Identification
FERET faces [Moghaddam and Yang, PAMI 2002]
[Figure: example male and female face images.]

Feature Selection for Gender Identification
Experimental setup:
1053 training and 702 testing images.
We define an RBF kernel per pixel (252 kernels).
Results summarized over 3 random train/test splits.

Feature Selection Results
[Table: accuracy with the number of pixels selected in parentheses. MKL: 91 (58.3) and 90.8 (252), with Uniform MKL = 92.6 ± 0.9; GMKL: 91.6 (146.3) and 95.5 (69.6), with Uniform GMKL = 94.3 ± 0.1.]

Object Detection
Localize a specified object of interest, if it exists, in a given image.

The PASCAL VOC Challenge Database
PASCAL VOC Database
[Figures: example images of cars and dogs.]

PASCAL VOC 2009 Database Statistics
Table 1: Statistics of the main image sets. Table 2: Statistics of the segmentation image sets. Object statistics list only the `non-difficult' objects used in the evaluation.
[Tables: per-class counts of images (img) and objects (obj) for the train, val, trainval and test sets, for classes such as Aeroplane, Bicycle and Bird; one recoverable row reads 749 img / 1601 obj (train), 750 / 1610 (val), 1499 / 3211 (trainval), with test counts withheld.]

Detection By Classification
[Figure: image windows classified as Bird / No Bird.]
Detect by classifying every image window at every position, orientation and scale.
The number of windows in an image runs into the hundreds of millions.
Even if we could classify a window in a second, it would take us many days to detect a single object in an image.

Fast Detection Via a Cascade
[Figure: a feature vector built from PHOW Gray, PHOW Colour, PHOG, PHOG Sym, Visual Words and Self Similarity, fed through a cascade of a fast linear SVM, a quasi-linear SVM and a non-linear SVM, with jumping windows.]

MKL Detection Overview
First stage: linear SVM with jumping windows/branch and bound; time = O(#Windows).
Second stage: quasi-linear SVM with a χ² kernel; time = O(#Windows * #Dims).
Third stage: non-linear SVM with an exponential χ² kernel; time = O(#Windows * #Dims * #SVs).

PASCAL VOC Evaluation
Predictions are evaluated using precision-recall curves based on bounding box overlap.
Area Overlap = |Bgt ∩ Bp| / |Bgt ∪ Bp|, where Bgt is the ground truth box and Bp the predicted box.
A prediction is valid if Area Overlap > 0.5, the standard VOC criterion.
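For concreteness, a small sketch (not from the slides) of the overlap criterion, with boxes given as (x1, y1, x2, y2) corner coordinates.

```python
# Intersection-over-union of a ground-truth and a predicted bounding box.
def area_overlap(bgt, bp):
    ix1, iy1 = max(bgt[0], bp[0]), max(bgt[1], bp[1])
    ix2, iy2 = min(bgt[2], bp[2]), min(bgt[3], bp[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(bgt) + area(bp) - inter
    return inter / union

print(area_overlap((0, 0, 10, 10), (5, 0, 15, 10)))  # 50/150 = 0.33, invalid
```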
Some Examples of MKL Detections
[Figures: example detections for aeroplanes, bicycles, cars, cows, horses and motorbikes.]

Performance of Individual Kernels
[Figure: per-kernel detection performance on the car category.]
MKL gives a big boost over any individual kernel.
MKL gives only marginally better performance than equal weights, but results in a much faster system.

PASCAL VOC 2009 Results
[Figure: detection results on PASCAL VOC 2009.]

MKL Formulations and Optimization
Kernel Target Alignment
Semi-Definite Programming-MKL (SDP)
Hyperkernels (SDP/SOCP)
Block l1-MKL (M-Y regularization + SMO)
Semi-Infinite Linear Programming-MKL (SILP)
Simple MKL/Generalized MKL (gradient descent)
Multi-class MKL
Hierarchical MKL
Local MKL
Mixed norm MKL (mirror descent)
SMO-MKL
Kernel Target Alignment
The ideal kernel is Kideal = yyᵀ: entries are +1 for same-class pairs and −1 for different-class pairs.

Kernel Target Alignment
[Figure: eigendecomposition eig(K) → λ₁, λ₂, v₁, v₂ of a base kernel K; rank-one kernels K₁ = v₁v₁ᵀ (small weight d₁) and K₂ = v₂v₂ᵀ (large weight d₂) are combined as Kopt = d₁K₁ + d₂K₂ such that d₁² + d₂² = 1, maximizing the alignment of Kopt to Kideal.]

Kernel Target Alignment [Cristianini et al. 2001]
Alignment: A(K₁, K₂) = <K₁, K₂> / (‖K₁‖F ‖K₂‖F), where <K₁, K₂> = Σᵢ Σⱼ K₁(xᵢ, xⱼ)K₂(xᵢ, xⱼ)
Ideal kernel: Kideal = yyᵀ
Alignment to ideal: A(K, Kideal) = <K, yyᵀ> / (n‖K‖F)
Optimal kernel: Kopt = Σₖ dₖKₖ, where Kₖ = vₖvₖᵀ (rank 1)
Optimal alignment: A(Kopt, Kideal) = Σₖ dₖ(vₖᵀy)² / (n√(Σₖ dₖ²))
Assume Σₖ dₖ² = 1. Lagrangian: L(λ, d) = Σₖ dₖ(vₖᵀy)² − λ(Σₖ dₖ² − 1)
Optimal weights: dₖ ∝ (vₖᵀy)²
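A minimal sketch of these formulas (assumptions: binary labels y ∈ {−1, +1}, a single base kernel whose eigenvectors supply the rank-one kernels): compute the alignment to the ideal kernel and the closed-form weights dₖ ∝ (vₖᵀy)².

```python
import numpy as np

def alignment_to_ideal(K, y):
    # <K, yy^T>_F = y^T K y, and ||yy^T||_F = n
    n = len(y)
    return (y @ K @ y) / (n * np.linalg.norm(K, "fro"))

rng = np.random.RandomState(0)
X = rng.randn(30, 5)
y = np.sign(rng.randn(30))
K = X @ X.T                        # some base kernel

lam, V = np.linalg.eigh(K)         # K = sum_k lam_k v_k v_k^T
d = (V.T @ y) ** 2                 # optimal weights d_k ~ (v_k^T y)^2
d /= np.linalg.norm(d)             # enforce sum_k d_k^2 = 1
K_opt = (V * d) @ V.T              # sum_k d_k v_k v_k^T

print(alignment_to_ideal(K, y), alignment_to_ideal(K_opt, y))
```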
Kernel Target Alignment
One of the first papers to propose kernel learning; it established taking linear combinations of base kernels.
Efficient algorithm.
The formulation is based on l2 regularization.
Some generalisation bounds have been given, but the task is not directly related to classification.
Does not easily generalize to other loss functions.

Brute Force Search Over d
[Figure: the (d₁, d₂) plane with candidate combinations such as K = 0.5K₁ + 0.5K₂ and K = 0.1K₁ + 0.8K₂ for Kopt = d₁K₁ + d₂K₂; combinations such as Kopt = 1.2K₁ − 0.3K₂ need not be positive semi-definite.]

SDP-MKL
[Figure: the (d₁, d₂) plane; requiring K ⪰ 0 (an SDP constraint, Lanckriet et al.) carves out the feasible region for Kopt = d₁K₁ + d₂K₂ and avoids the NP hard region.]

SDP-MKL [Lanckriet et al. 2002]
Minimise wᵀw + C Σᵢ ξᵢ
Subject to yᵢ[wᵀφd(xᵢ) + b] ≥ 1 − ξᵢ, ξᵢ ≥ 0
K = Σₖ dₖKₖ ⪰ 0 (positive semi-definite), trace(K) = constant

SDP-MKL
Optimises an appropriate cost function depending on the task at hand.
Other loss functions are possible (square hinge, KTA, regression, etc.).
The optimisation involved is an SDP.
The optimization reduces to a QCQP if d ≥ 0.
Sparse kernel weights are learnt in the QCQP formulation.

Improving SDP
[Figure: the (d₁, d₂) plane with the SDP region K ⪰ 0; Bach et al., Sonnenburg et al. and Rakotomamonjy et al. restrict the weights to the simplex Σₖ dₖ = 1, dₖ ≥ 0.]
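To make the brute force search over d sketched above concrete, here is an illustrative helper (best_weights is a hypothetical name, not from the deck): scan the simplex d₁ + d₂ = 1 with both weights non-negative, train an SVM on each combined kernel, and keep the weights with the best cross-validation accuracy.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def best_weights(K1, K2, y, steps=11):
    # Grid search over d1 in [0, 1] with d2 = 1 - d1.
    best = (None, -np.inf)
    for d1 in np.linspace(0, 1, steps):
        K = d1 * K1 + (1 - d1) * K2
        acc = cross_val_score(SVC(kernel="precomputed"), K, y, cv=3).mean()
        if acc > best[1]:
            best = (d1, acc)
    return best  # (d1, accuracy); d2 = 1 - d1
```

This is exponential in the number of kernels, which is exactly why the formulations below replace the grid with a principled optimization.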
Block l1-MKL
The SDP-MKL dual reduces to a non-smooth QCQP when d ≥ 0:
Dual: min over α of maxₖ ½αᵀYKₖYα − 1ᵀα
s.t. 1ᵀYα = 0, 0 ≤ α ≤ C
SMO cannot be applied since the objective is not differentiable.

M-Y Regularization
[Figure: f(x) = |x| and its Moreau-Yosida regularization.]
FM(x) = min over y of f(y) + ‖y − x‖²M
F is differentiable even when f is not, and the minima of F and f coincide.

Block l1-MKL [Bach et al. 2004]
Min (Σₖ γₖ‖wₖ‖₂)² + C Σᵢ ξᵢ + Σₖ aₖ²‖wₖ‖₂²
s.t. yᵢ[Σₖ wₖᵀφₖ(xᵢ) + b] ≥ 1 − ξᵢ, ξᵢ ≥ 0
M-Y regularization ensures differentiability; block l1 regularization ensures sparseness.
Optimisation is carried out via iterative SMO.

SILP-MKL [Sonnenburg et al. 2005]
Primal: min over (w, ξ, b) of ½(Σₖ ‖wₖ‖₂)² + C Σᵢ ξᵢ
s.t. yᵢ[Σₖ wₖᵀφₖ(xᵢ) + b] ≥ 1 − ξᵢ, ξᵢ ≥ 0
where (implicitly) dₖ ≥ 0 and Σₖ dₖ = 1.

SILP-MKL [Sonnenburg et al. 2005]
Dual: max over d of min over α of Σₖ dₖSₖ(α), where Sₖ(α) = ½αᵀYKₖYα − 1ᵀα
s.t. 1ᵀYα = 0, 0 ≤ α ≤ C, dₖ ≥ 0, Σₖ dₖ = 1
Iterative LP-QP solution? A naive alternating LP-QP solver will oscillate.

SILP-MKL [Sonnenburg et al. 2005]
Dual: max θ
s.t. dₖ ≥ 0, Σₖ dₖ = 1, and θ ≤ Σₖ dₖSₖ(α) for all α with 1ᵀYα = 0, 0 ≤ α ≤ C
In each iteration, find the α that most violates the constraint θ ≤ Σₖ dₖSₖ(α) and add it to the working set. This can be done using a standard SVM by solving α* = argmin over α of Σₖ dₖSₖ(α).
The SILP (cutting plane) method can solve a 1M point problem with 20 kernels.
It generalizes to regression, novelty detection, etc.
However, the LP grows more complex with each iteration, and the method does not scale well to a large number of kernels.

Cutting Planes
[Figure: the convex function x log(x) lower-bounded by a growing set of tangent (cutting) planes.]
For convex functions: f(x) = max over (w ∈ G, b) of wᵀx + b, and ∂ₓf(x) = argmax over w ∈ G of wᵀx; G turns out to be the set of subgradients.

Gradient Descent Based Methods
Chapelle et al. [ML 2002]
Simple MKL [Rakotomamonjy et al., ICML 2007; JMLR 2008]
Generalized MKL [Varma & Ray, ICCV 2007; Varma & Babu, ICML 2009]
Local MKL [Gonen & Alpaydin, ICML 2008]
Hierarchical MKL [Bach, NIPS 2008]
Mixed norm MKL [Nath et al., NIPS 09]

GMKL Primal Formulation
Primal: P = min over (w, d, b) of ½wᵀw + Σᵢ L(f(xᵢ), yᵢ) + r(d)
s.t. d ≥ 0
where f(x) = wᵀφd(x) + b, L is a loss function and r is a regulariser on the kernel parameters.
This formulation is not convex in general.

GMKL Primal for Classification
Primal: P = min over d of T(d) s.t. d ≥ 0
T(d) = min over (w, b, ξ) of ½wᵀw + C1ᵀξ + r(d)
s.t. yᵢ(wᵀφd(xᵢ) + b) ≥ 1 − ξᵢ, ξᵢ ≥ 0
We optimize P using gradient descent, and we optimize T using a standard SVM solver. We need to prove that ∇dT exists and calculate it efficiently.

Visualizing T
[Figures: plots of T as a function of the kernel parameters d.]

GMKL Differentiability
Primal: P = min over d of T(d) s.t. d ≥ 0
W(d) = max over α of 1ᵀα − ½αᵀYKdYα + r(d)
s.t. 1ᵀYα = 0, 0 ≤ α ≤ C
T = W by the principle of strong duality.
∇dW exists, by Danskin's theorem, if K is strictly positive definite and if ∇dK and ∇dr exist and are continuous.

GMKL Gradient
∂W/∂dₖ = −½α*ᵀY(∂K/∂dₖ)Yα* + ∂r/∂dₖ, where α* is the SVM dual solution at the current d.

Final Algorithm
1. Initialise d⁰ randomly.
2. Repeat until convergence:
a) Form K using the current estimate of d.
b) Use any SVM solver to obtain α*.
c) Update dⁿ⁺¹ = max(0, dⁿ − εₙ∇W).
Code: http://research.microsoft.com/~manik/code/GMKL/download.html
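A minimal sketch of the loop above, assuming binary labels in {−1, +1}, kernels combined as K = Σₖ dₖKₖ, the simple regulariser r(d) = dᵀd, and a fixed step size; the paper's actual choices of regulariser and step-size rule may differ.

```python
import numpy as np
from sklearn.svm import SVC

def gmkl(Ks, y, C=1.0, lr=0.1, iters=100):
    d = np.ones(len(Ks)) / len(Ks)                      # 1. initialise d
    for _ in range(iters):                              # 2. repeat
        K = sum(dk * Kk for dk, Kk in zip(d, Ks))       # a) form K
        svm = SVC(kernel="precomputed", C=C).fit(K, y)  # b) solve for alpha*
        a = np.zeros(len(y))
        a[svm.support_] = np.abs(svm.dual_coef_[0])
        ay = a * y
        # c) dW/dd_k = -0.5 (alpha*y)^T K_k (alpha*y) + dr/dd_k, r(d) = d^T d
        grad = np.array([-0.5 * ay @ Kk @ ay for Kk in Ks]) + 2 * d
        d = np.maximum(0, d - lr * grad)                # projected step
    return d
```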
L2-MKL (SMO-MKL) Formulation
Min over (w, b, d) of Σₖ wₖᵀwₖ + C Σᵢ ξᵢ + dᵀd
s.t. yᵢ(Σₖ dₖwₖᵀφₖ(xᵢ) + b) ≥ 1 − ξᵢ, ξᵢ ≥ 0, dₖ ≥ 0

Dual: max over α of 1ᵀα − (1/8) Σₖ (αᵀYKₖYα)²
s.t. 1ᵀYα = 0, 0 ≤ α ≤ C
Can be optimized via SMO: the roots of a cubic can be found analytically, and gradients can be maintained.