Support Vector Machines
Transcript
Support Vector Machines
Applications of Neural Network Theory – Support Vector Machines –
Doc. RNDr. Iveta Mrázová, CSc.
Department of Theoretical Computer Science, Faculty of Mathematics and Physics, Charles University in Prague
(course ATNS, NAIL013)

Optimal separating hyperplane

Assume that the training patterns $(x_1, y_1), \dots, (x_l, y_l)$, with $x_i \in R^n$ and $y_i \in \{+1, -1\}$, can be separated by a hyperplane $(w \cdot x) + b = 0$. The optimal hyperplane is the one for which
- the separation is error-free, and
- the distance of the vectors closest to the hyperplane is maximal.

$(w \cdot x_i) + b \ge 1$ if $y_i = +1$
$(w \cdot x_i) + b \le -1$ if $y_i = -1$

Both inequalities are subsumed by $y_i [(w \cdot x_i) + b] \ge 1$, $i = 1, \dots, l$ (this involves both $w$ and $b$), together with the minimization of $\Phi(w) = \|w\|^2$.

Optimal separating hyperplane (2)

Canonical hyperplane: $\min_{x_i \in X^*} |(w \cdot x_i) + b| = 1$, where $X^* = \{x_1, \dots, x_l\}$. This constraint on the hyperplane parameters bounds the VC-dimension.

VC-dimension

Theorem (V. N. Vapnik, 1995): The subset of canonical hyperplanes $f(x, w, b) = \mathrm{sgn}((w \cdot x) + b)$ defined on the set of vectors $X^* = \{x_1, \dots, x_l \mid x_i \in R^n,\ 1 \le i \le l\}$ bounded by $\|x_i - a\| \le R$ for $x_i \in X^*$ (where $a$ is the center of a sphere of radius $R$), and satisfying the condition $\|w\| \le A$, has VC-dimension $h$ bounded by
$h \le \min([R^2 A^2], n) + 1$.

Construction of the optimal hyperplane

Minimize $\Phi(w) = \frac{1}{2}(w \cdot w)$ subject to $y_i [(x_i \cdot w) + b] \ge 1$, $i = 1, \dots, l$. The solution is the saddle point of the functional
$L(w, b, \alpha) = \frac{1}{2}(w \cdot w) - \sum_{i=1}^{l} \alpha_i \{ [(x_i \cdot w) + b]\, y_i - 1 \}$,
where the $\alpha_i$ are Lagrange multipliers. The functional is minimized with respect to $w$ and $b$ and maximized with respect to $\alpha_i > 0$.

Construction of the optimal hyperplane (2)

At the saddle point the solution should satisfy
$\frac{\partial L(w_0, b_0, \alpha^0)}{\partial b} = 0$ and $\frac{\partial L(w_0, b_0, \alpha^0)}{\partial w} = 0$,
which yields the properties of the optimal hyperplane:
1. The coefficients of the optimal hyperplane should satisfy $\sum_{i=1}^{l} \alpha_i^0 y_i = 0$, with $\alpha_i^0 \ge 0$, $i = 1, \dots, l$.
2. The optimal hyperplane (the vector $w_0$) is a linear combination of the training vectors: $w_0 = \sum_{i=1}^{l} y_i \alpha_i^0 x_i$, with $\alpha_i^0 \ge 0$, $i = 1, \dots, l$.

Construction of the optimal hyperplane (3)

3. Only the so-called support vectors can have nonzero coefficients $\alpha_i^0$ in the expansion of $w_0$:
$w_0 = \sum_{\text{support vectors}} y_i \alpha_i^0 x_i$, $\alpha_i^0 \ge 0$.
The separating hyperplane satisfies the condition
$\alpha_i^0 \{ [(x_i \cdot w_0) + b_0]\, y_i - 1 \} = 0$.
This leads to maximizing the functional
$W(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$
in the non-negative quadrant $\alpha_i \ge 0$, $i = 1, \dots, l$, subject to $\sum_{i=1}^{l} \alpha_i y_i = 0$.

Classification

Let $\alpha^0 = (\alpha_1^0, \dots, \alpha_l^0)$ be the solution; then $w_0 = \sum_{\text{support vectors}} y_i \alpha_i^0 x_i$, and classification is performed by
$f(x) = \mathrm{sgn}\left( \sum_{\text{support vectors}} y_i \alpha_i^0 (x_i \cdot x) - b_0 \right)$,
where $x_i$ is a support vector, $\alpha_i^0$ the corresponding Lagrange coefficient, and $b_0$ a constant (threshold):
$b_0 = \frac{1}{2} \left[ (w_0 \cdot x^*(1)) + (w_0 \cdot x^*(-1)) \right]$,
with $x^*(1)$ a support vector of the first class and $x^*(-1)$ a support vector of the second class.

Example

To construct a separating surface corresponding to a polynomial of degree two, create a feature space $Z$ with $N = \frac{n(n+3)}{2}$ coordinates for $x = (x_1, \dots, x_n)$:
$z_1 = x_1, \dots, z_n = x_n$ ($n$ coordinates),
$z_{n+1} = (x_1)^2, \dots, z_{2n} = (x_n)^2$ ($n$ coordinates),
$z_{2n+1} = x_1 x_2, \dots, z_N = x_n x_{n-1}$ ($\frac{n(n-1)}{2}$ coordinates).
A separating hyperplane constructed in the new (feature) space $Z$ corresponds to a polynomial of degree two in the input space.
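As an illustration of this construction, a short Python sketch of the degree-2 feature map (the slides contain no code; the function name and the test vector are ours):

```python
# A minimal sketch of the explicit degree-2 feature map described above,
# producing N = n*(n+3)/2 coordinates: linear terms, squares, cross terms.
import numpy as np

def quadratic_feature_map(x):
    """Map x in R^n to z in R^N with N = n(n+3)/2."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    linear = x                                    # z_1..z_n = x_1..x_n
    squares = x ** 2                              # z_{n+1}..z_{2n} = x_i^2
    cross = [x[i] * x[j] for i in range(n) for j in range(i + 1, n)]  # x_i x_j, i < j
    z = np.concatenate([linear, squares, np.array(cross)])
    assert z.shape[0] == n * (n + 3) // 2         # matches N from the slide
    return z

x = np.array([1.0, 2.0, 3.0])                     # n = 3, so N = 9
print(quadratic_feature_map(x))
```

A linear hyperplane fitted to these $z$ vectors is exactly a degree-2 polynomial surface in the original $x$ coordinates.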
Problems

- The separating surface does not necessarily generalize well. However, if the optimal hyperplane can be constructed from a small number of support vectors (relative to the size of the training set), the system will exhibit a high degree of generalization.
- High dimension of the feature space: this remains a problem even when the optimal hyperplane generalizes correctly and can in theory be found. The way out is the convolution of the dot product.

Remarks

When constructing the optimal separating hyperplane in the feature space $Z$, the feature space need not be considered in explicit form; it is only necessary to be able to express the dot product for support vectors in the feature space,
$(z_i \cdot z) = K(x_i, x)$,
where $z$ is the image (in the feature space) of the vector $x$ (from the input space).

Support Vector Machines (SV-machines)

Idea: the input pattern (vector $x$) is mapped into a high-dimensional feature space $Z$ by a nonlinear, a priori chosen mapping; the optimal separating hyperplane is then constructed in this space.
[Figure: the input space is mapped into the feature space, where the optimal separating hyperplane is constructed.]

Construction of SV-machines

The convolution of the dot product allows the construction of decision functions that are nonlinear in the input space,
$f(x) = \mathrm{sgn}\left( \sum_{\text{support vectors}} y_i \alpha_i K(x_i, x) - b_0 \right)$,
and equivalent to linear decision functions in the feature space; $K(x_i, x)$ denotes the convolution of the dot product.

Construction of SV-machines (2)

To find the coefficients $\alpha_i$ (in the separable case), it suffices to maximize the functional
$W(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$
subject to $\sum_{i=1}^{l} \alpha_i y_i = 0$, $\alpha_i \ge 0$, $i = 1, \dots, l$.
(Nonlinear programming; see, e.g., J. J. Moré, G. Toraldo (1991): "On the solution of large quadratic programming problems with bound constraints," SIAM Journal on Optimization, 1 (1), pp. 93-113.) This constitutes the learning step.
The complexity of the construction depends on the number of support vectors rather than on the dimension of the feature space.

Construction of SV-machines (3)

Using different functions for the convolution of the dot product, one can construct learning machines with different types of nonlinear separating surfaces in the input space:
- polynomial learning machines,
- RBF-machines,
- two-layer neural networks.

Construction of SV-machines (4)

Decision rule: $y = \mathrm{sgn}\left( \sum_{i=1}^{l} y_i \alpha_i F(x_i, x) - b \right)$.
[Figure: network architecture. Input vector $x = (x_1, \dots, x_n)$; a nonlinear transformation $F(x_1, x), \dots, F(x_l, x)$ derived from the support vectors $x_1, \dots, x_l$; weights $y_1 \alpha_1, \dots, y_l \alpha_l$; output given by the decision rule.]

Two-layer neural networks

Choice of kernel: $K(x, x_i) = S[v (x \cdot x_i) + c]$, where $S(u)$ is a sigmoidal function, e.g. $\tanh(vu + c)$, bounded by 1 in absolute value.
Construction of SV-machines: $f(x, \alpha) = \mathrm{sgn}\left( \sum_{i=1}^{l} \alpha_i S(v (x \cdot x_i) + c) + b \right)$.

Two-layer neural networks (2)

The following is determined automatically:
1. the architecture of the two-layer machine: the number $l$ of hidden neurons (~ the number of support vectors),
2. the weight vectors $w_i = x_i$ of the neurons in the first (hidden) layer (~ the support vectors),
3. the weight vector of the second layer (the values $\alpha$).
Example: $K(x, x_i) = \tanh\left( \frac{b\,(x \cdot x_i)}{256} - c \right)$ with $b = 2$, $c = 1$.

Applications of Neural Network Theory – SVM-machines – supplementary materials –
Bing Liu: http://www.cs.uic.edu/~liub/WebMiningBook.html
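A minimal Python sketch of the SV-machine decision rule with the sigmoidal kernel just described (all identifiers are illustrative; the coefficients, labels, and threshold $b_0$ would come from training):

```python
# Sketch of f(x) = sgn( sum_i y_i a_i K(x_i, x) - b0 ) with the tanh kernel
# from the two-layer-network slide (b = 2, c = 1 gives v = 2/256).
import numpy as np

def tanh_kernel(x, xi, v=2.0 / 256.0, c=1.0):
    # K(x, x_i) = tanh(v <x . x_i> - c), cf. the example on the slide
    return np.tanh(v * np.dot(x, xi) - c)

def svm_decision(x, support_vectors, alphas, labels, b0, kernel=tanh_kernel):
    s = sum(yi * ai * kernel(xi, x)
            for xi, ai, yi in zip(support_vectors, alphas, labels))
    return np.sign(s - b0)
```

Each support vector acts as one hidden neuron; the kernel evaluation plays the role of the hidden layer, and the weighted sum with threshold $b_0$ is the output neuron.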
Introduction to SVM

Support vector machines were invented by V. Vapnik and his co-workers in the 1970s in Russia and became known to the West in 1992. SVMs are linear classifiers that find a hyperplane to separate two classes of data, positive and negative. Kernel functions are used for nonlinear separation. SVM not only has a rigorous theoretical foundation, but also performs classification more accurately than most other methods in applications, especially for high-dimensional data. It is perhaps the best classifier for text classification.

Basic concepts

Let the set of training examples D be {(x1, y1), (x2, y2), …, (xr, yr)}, where xi = (x1, x2, …, xn) is an input vector in a real-valued space X ⊆ Rn and yi is its class label (output value), yi ∈ {1, −1}; 1 denotes the positive class and −1 the negative class. SVM finds a linear function of the form (w: weight vector)

f(x) = ⟨w · x⟩ + b

$y_i = \begin{cases} 1 & \text{if } \langle w \cdot x_i \rangle + b \ge 0 \\ -1 & \text{if } \langle w \cdot x_i \rangle + b < 0 \end{cases}$

The hyperplane

The hyperplane that separates positive and negative training data is ⟨w · x⟩ + b = 0. It is also called the decision boundary (surface). There are many possible hyperplanes; which one should we choose?

Maximal margin hyperplane

SVM looks for the separating hyperplane with the largest margin. Machine learning theory says this hyperplane minimizes the error bound.

Linear SVM: separable case

Assume the data are linearly separable. Consider a positive data point (x+, 1) and a negative one (x−, −1) that are closest to the hyperplane ⟨w · x⟩ + b = 0. We define two parallel hyperplanes, H+ and H−, that pass through x+ and x− respectively; H+ and H− are also parallel to ⟨w · x⟩ + b = 0.

Compute the margin

Now let us compute the distance between the two margin hyperplanes H+ and H−. Their distance is the margin (d+ + d− in the figure). Recall from vector algebra that the (perpendicular) distance from a point xi to the hyperplane ⟨w · x⟩ + b = 0 is

$\frac{|\langle w \cdot x_i \rangle + b|}{\|w\|}$  (36)

where ‖w‖ is the norm of w,

$\|w\| = \sqrt{\langle w \cdot w \rangle} = \sqrt{w_1^2 + w_2^2 + \dots + w_n^2}$  (37)

Compute the margin (cont.)

Let us compute d+. Instead of computing the distance from x+ to the separating hyperplane ⟨w · x⟩ + b = 0, we pick any point xs on ⟨w · x⟩ + b = 0 and compute the distance from xs to ⟨w · x+⟩ + b = 1 by applying the distance Eq. (36) and noticing that ⟨w · xs⟩ + b = 0:

$d_+ = \frac{|\langle w \cdot x_s \rangle + b - 1|}{\|w\|} = \frac{1}{\|w\|}$  (38)

$\text{margin} = d_+ + d_- = \frac{2}{\|w\|}$  (39)

An optimization problem!

Definition (Linear SVM: separable case): Given a set of linearly separable training examples D = {(x1, y1), (x2, y2), …, (xr, yr)}, learning is to solve the following constrained minimization problem:

Minimize: $\frac{\langle w \cdot w \rangle}{2}$
Subject to: $y_i(\langle w \cdot x_i \rangle + b) \ge 1,\ i = 1, 2, \dots, r$  (40)

The constraint $y_i(\langle w \cdot x_i \rangle + b) \ge 1$ summarizes
⟨w · xi⟩ + b ≥ 1 for yi = 1,
⟨w · xi⟩ + b ≤ −1 for yi = −1.

Solve the constrained minimization

Standard Lagrangian method:

$L_P = \frac{1}{2}\langle w \cdot w \rangle - \sum_{i=1}^{r} \alpha_i \left[ y_i(\langle w \cdot x_i \rangle + b) - 1 \right]$  (41)

where the αi ≥ 0 are the Lagrange multipliers. Optimization theory says that an optimal solution to (41) must satisfy certain conditions, called Kuhn-Tucker conditions, which are necessary (but not sufficient). Kuhn-Tucker conditions play a central role in constrained optimization.
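Before turning to the Kuhn-Tucker conditions, a small numeric illustration of Eqs. (36)-(39) (the vectors here are arbitrary examples of ours, not from the slides):

```python
# Point-to-hyperplane distance, Eq. (36), and the margin 2/||w||, Eq. (39),
# for a hypothetical weight vector and bias.
import numpy as np

w = np.array([2.0, 1.0])                       # hypothetical weight vector
b = -4.0                                       # hypothetical bias
xi = np.array([3.0, 2.0])                      # some data point

norm_w = np.linalg.norm(w)                     # ||w||, Eq. (37)
dist = abs(np.dot(w, xi) + b) / norm_w         # distance to <w . x> + b = 0, Eq. (36)
margin = 2.0 / norm_w                          # margin of the canonical hyperplane, Eq. (39)
print(dist, margin)                            # 4/sqrt(5) and 2/sqrt(5)
```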
Kuhn-Tucker conditions

$\frac{\partial L_P}{\partial w} = w - \sum_{i=1}^{r} y_i \alpha_i x_i = 0$  (48)
$\frac{\partial L_P}{\partial b} = -\sum_{i=1}^{r} y_i \alpha_i = 0$  (49)
$y_i(\langle w \cdot x_i \rangle + b) - 1 \ge 0,\ i = 1, \dots, r$  (50)
$\alpha_i \ge 0,\ i = 1, \dots, r$  (51)
$\alpha_i \left[ y_i(\langle w \cdot x_i \rangle + b) - 1 \right] = 0,\ i = 1, \dots, r$  (52)

Eq. (50) is the original set of constraints. The complementarity condition (52) shows that only those data points on the margin hyperplanes (i.e., H+ and H−) can have αi > 0, since for them yi(⟨w · xi⟩ + b) − 1 = 0. These points are called the support vectors; all other points have αi = 0.

Solve the problem

In general, Kuhn-Tucker conditions are necessary for an optimal solution, but not sufficient. However, for our minimization problem with a convex objective function and linear constraints, the Kuhn-Tucker conditions are both necessary and sufficient for an optimal solution. Solving the optimization problem is still a difficult task due to the inequality constraints. However, the Lagrangian treatment of the convex optimization problem leads to an alternative dual formulation of the problem, which is easier to solve than the original problem (called the primal).

Dual formulation

From primal to dual: set to zero the partial derivatives of the Lagrangian (41) with respect to the primal variables (i.e., w and b), and substitute the resulting relations back into the Lagrangian, i.e., substitute (48) and (49) into the original Lagrangian (41) to eliminate the primal variables:

$L_D = \sum_{i=1}^{r} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{r} y_i y_j \alpha_i \alpha_j \langle x_i \cdot x_j \rangle$  (55)

Dual optimization problem

This dual formulation is called the Wolfe dual: maximize (55) subject to $\sum_{i=1}^{r} y_i \alpha_i = 0$ and $\alpha_i \ge 0$, $i = 1, \dots, r$.  (56)
For the convex objective function and linear constraints of the primal, it has the property that the maximum of LD occurs at the same values of w, b and αi as the minimum of LP (the primal). Solving (56) requires numerical techniques and clever strategies, which are beyond our scope.

The final decision boundary

After solving (56), we obtain the values for αi, which are used to compute the weight vector w and the bias b using Equations (48) and (52) respectively. The decision boundary is

$\langle w \cdot x \rangle + b = \sum_{i \in sv} y_i \alpha_i \langle x_i \cdot x \rangle + b = 0$  (57)

Testing: use (57). Given a test instance z,

$\mathrm{sign}(\langle w \cdot z \rangle + b) = \mathrm{sign}\left( \sum_{i \in sv} \alpha_i y_i \langle x_i \cdot z \rangle + b \right)$  (58)

If (58) returns 1, the test instance z is classified as positive; otherwise, it is classified as negative.

Linear SVM: non-separable case

The linearly separable case is the ideal situation. Real-life data may contain noise or errors: a class label may be incorrect, or there may be randomness in the application domain. Recall that in the separable case the problem was

Minimize: $\frac{\langle w \cdot w \rangle}{2}$
Subject to: $y_i(\langle w \cdot x_i \rangle + b) \ge 1,\ i = 1, 2, \dots, r$

With noisy data, the constraints may not be satisfiable; then there is no solution!

Relax the constraints

To allow errors in data, we relax the margin constraints by introducing slack variables ξi (≥ 0) as follows:
⟨w · xi⟩ + b ≥ 1 − ξi for yi = 1,
⟨w · xi⟩ + b ≤ −1 + ξi for yi = −1.
The new constraints:
Subject to: yi(⟨w · xi⟩ + b) ≥ 1 − ξi, i = 1, …, r,
ξi ≥ 0, i = 1, 2, …, r.

Geometric interpretation

[Figure: two error data points xa and xb (circled) lying in the wrong regions.]

Penalize errors in objective function

We need to penalize the errors in the objective function. A natural way of doing this is to assign an extra cost to errors and change the objective function to

Minimize: $\frac{\langle w \cdot w \rangle}{2} + C \left( \sum_{i=1}^{r} \xi_i \right)^k$  (60)

k = 1 is commonly used; it has the advantage that neither ξi nor its Lagrange multipliers appear in the dual formulation.
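The slides only say that solving the dual requires numerical techniques. As a minimal, illustrative sketch (the toy data, variable names, and the choice of SciPy's SLSQP solver are our assumptions, not from the slides), the dual with the equality constraint and box constraint 0 ≤ αi ≤ C, which anticipates the soft-margin dual (72) derived below, can be handed to a general-purpose solver:

```python
# Solve the dual max L_D s.t. sum_i alpha_i y_i = 0, 0 <= alpha_i <= C
# on separable toy data; a large C approximates the hard-margin case.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 2.5], [0.0, 0.0], [-1.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
C = 10.0
Q = (y[:, None] * y[None, :]) * (X @ X.T)      # Q_ij = y_i y_j <x_i . x_j>

def neg_dual(a):                               # minimize -L_D, cf. Eq. (55)/(72)
    return 0.5 * a @ Q @ a - a.sum()

cons = {"type": "eq", "fun": lambda a: a @ y}  # sum_i alpha_i y_i = 0
res = minimize(neg_dual, np.zeros(len(y)), method="SLSQP",
               bounds=[(0.0, C)] * len(y), constraints=[cons])
alpha = res.x
w = (alpha * y) @ X                            # w = sum_i y_i alpha_i x_i
sv = (alpha > 1e-6) & (alpha < C - 1e-6)       # free support vectors (0 < alpha < C)
b = np.mean(y[sv] - X[sv] @ w)                 # b from the complementarity conditions
print(np.sign(X @ w + b))                      # recovers the training labels
```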
New optimization problem

Minimize: $\frac{\langle w \cdot w \rangle}{2} + C \sum_{i=1}^{r} \xi_i$  (61)
Subject to: $y_i(\langle w \cdot x_i \rangle + b) \ge 1 - \xi_i,\ i = 1, 2, \dots, r$,
$\xi_i \ge 0,\ i = 1, 2, \dots, r$.

This formulation is called the soft-margin SVM. The primal Lagrangian is

$L_P = \frac{1}{2}\langle w \cdot w \rangle + C \sum_{i=1}^{r} \xi_i - \sum_{i=1}^{r} \alpha_i \left[ y_i(\langle w \cdot x_i \rangle + b) - 1 + \xi_i \right] - \sum_{i=1}^{r} \mu_i \xi_i$  (62)

where αi, μi ≥ 0 are the Lagrange multipliers.

Kuhn-Tucker conditions

$\frac{\partial L_P}{\partial w} = w - \sum_{i=1}^{r} y_i \alpha_i x_i = 0$  (63)
$\frac{\partial L_P}{\partial b} = -\sum_{i=1}^{r} y_i \alpha_i = 0$  (64)
$\frac{\partial L_P}{\partial \xi_i} = C - \alpha_i - \mu_i = 0$  (65)
together with the original constraints and the complementarity conditions
$\alpha_i \left[ y_i(\langle w \cdot x_i \rangle + b) - 1 + \xi_i \right] = 0$  (70)
$\mu_i \xi_i = 0$  (71)

From primal to dual

As in the linearly separable case, we transform the primal to a dual by setting to zero the partial derivatives of the Lagrangian (62) with respect to the primal variables (i.e., w, b and ξi), and substituting the resulting relations back into the Lagrangian; i.e., we substitute Equations (63), (64) and (65) into the primal Lagrangian (62). From Equation (65), C − αi − μi = 0, we can deduce that αi ≤ C because μi ≥ 0.

Dual

The dual of (61) is: maximize

$L_D = \sum_{i=1}^{r} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{r} y_i y_j \alpha_i \alpha_j \langle x_i \cdot x_j \rangle$
subject to $\sum_{i=1}^{r} y_i \alpha_i = 0$ and $0 \le \alpha_i \le C$, $i = 1, \dots, r$.  (72)

Interestingly, ξi and its Lagrange multipliers μi are not in the dual. The objective function is identical to that of the separable case; the only difference is the constraint αi ≤ C.

Find primal variable values

The dual problem (72) can be solved numerically. The resulting αi values are then used to compute w and b: w is computed using Equation (63), and b using the Kuhn-Tucker complementarity conditions (70) and (71). Since we have no values for the ξi, we need to get around this. From Equations (65), (70) and (71) we observe that if 0 < αi < C, then both ξi = 0 and yi(⟨w · xi⟩ + b) − 1 + ξi = 0. Thus we can use any training data point xj for which 0 < αj < C and Equation (69) (with ξj = 0) to compute b:

$b = \frac{1}{y_j} - \sum_{i=1}^{r} y_i \alpha_i \langle x_i \cdot x_j \rangle$  (73)

(65), (70) and (71) in fact tell us more

(74) shows a very important property of SVM: the solution is sparse in αi. Many training data points are outside the margin area, and their αi's in the solution are 0. Only those data points that are on the margin (i.e., yi(⟨w · xi⟩ + b) = 1, which are the support vectors in the separable case), inside the margin (i.e., αi = C and yi(⟨w · xi⟩ + b) < 1), or errors have non-zero αi. Without this sparsity property, SVM would not be practical for large data sets.

The final decision boundary

The final decision boundary is (noting that many αi's are 0)

$\langle w \cdot x \rangle + b = \sum_{i=1}^{r} y_i \alpha_i \langle x_i \cdot x \rangle + b = 0$  (75)

The decision rule for classification (testing) is the same as in the separable case, i.e., sign(⟨w · x⟩ + b). Finally, we also need to determine the parameter C in the objective function; it is normally chosen through the use of a validation set or cross-validation.

How to deal with nonlinear separation?

The SVM formulations above require linear separation, but real-life data sets may need nonlinear separation. To deal with nonlinear separation, the same formulation and techniques as in the linear case are still used; we only transform the input data into another space (usually of a much higher dimension), so that a linear decision boundary can separate positive and negative examples in the transformed space. The transformed space is called the feature space; the original data space is called the input space.
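As the slides note above, C is normally chosen with a validation set or cross-validation. A hedged sketch using scikit-learn (the library choice and toy data are ours, not referenced by the slides):

```python
# Grid-search the soft-margin parameter C of a linear SVM by 5-fold
# cross-validation on synthetic two-class data.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 1.0, (50, 2)),   # positive cluster
               rng.normal(-2.0, 1.0, (50, 2))]) # negative cluster
y = np.array([1] * 50 + [-1] * 50)

grid = GridSearchCV(SVC(kernel="linear"),       # soft-margin linear SVM
                    param_grid={"C": [0.01, 0.1, 1.0, 10.0, 100.0]},
                    cv=5)                       # 5-fold cross-validation
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```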
Space transformation

The basic idea is to map the data in the input space X to a feature space F via a nonlinear mapping φ:

$\phi: X \to F,\quad x \mapsto \phi(x)$  (76)

After the mapping, the original training data set {(x1, y1), (x2, y2), …, (xr, yr)} becomes

{(φ(x1), y1), (φ(x2), y2), …, (φ(xr), yr)}  (77)

Geometric interpretation

[Figure.] In this example the transformed space is also 2-D, but usually the number of dimensions in the feature space is much higher than in the input space.

The optimization problem in (61) becomes

Minimize: $\frac{\langle w \cdot w \rangle}{2} + C \sum_{i=1}^{r} \xi_i$
Subject to: $y_i(\langle w \cdot \phi(x_i) \rangle + b) \ge 1 - \xi_i$, $\xi_i \ge 0$, $i = 1, 2, \dots, r$.

An example space transformation

Suppose our input space is 2-dimensional, and we choose the following transformation (mapping) from 2-D to 3-D:

$(x_1, x_2) \mapsto (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)$

The training example ((2, 3), −1) in the input space is transformed into ((4, 9, 8.5), −1) in the feature space.

Problem with explicit transformation

The potential problem with this explicit data transformation followed by the linear SVM is that it may suffer from the curse of dimensionality: the number of dimensions in the feature space can be huge for some useful transformations, even with a reasonable number of attributes in the input space, which makes the computation infeasible. Fortunately, an explicit transformation is not needed.

Kernel functions

Notice that in the dual formulation, both the construction of the optimal hyperplane (79) in F and the evaluation of the corresponding decision function (80) only require the dot products ⟨φ(x) · φ(z)⟩ and never the mapped vector φ(x) in its explicit form. This is a crucial point. Thus, if we have a way to compute the dot product ⟨φ(x) · φ(z)⟩ using the input vectors x and z directly, there is no need to know the feature vector φ(x), or even φ itself. In SVM, this is done through the use of kernel functions, denoted by K:

K(x, z) = ⟨φ(x) · φ(z)⟩  (82)

An example kernel function

Polynomial kernel: K(x, z) = ⟨x · z⟩^d. Let us compute the kernel with degree d = 2 in a 2-dimensional space, with x = (x1, x2) and z = (z1, z2):

$\langle x \cdot z \rangle^2 = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2$  (83)
$= \langle (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2) \cdot (z_1^2, z_2^2, \sqrt{2}\, z_1 z_2) \rangle = \langle \phi(x) \cdot \phi(z) \rangle$  (84)

This shows that the kernel ⟨x · z⟩² is a dot product in a transformed feature space.

Kernel trick

The derivation in (84) is only for illustration purposes; we do not need to find the mapping function. We can apply the kernel function directly by replacing all the dot products ⟨φ(x) · φ(z)⟩ in (79) and (80) with the kernel function K(x, z) (e.g., the polynomial kernel ⟨x · z⟩^d in (83)). This strategy is called the kernel trick.

Is it a kernel function?

The question is: how do we know whether a function is a kernel without performing a derivation such as the one in (84)? That is, how do we know that a kernel function is indeed a dot product in some feature space? This question is answered by Mercer's theorem, which we will not discuss here.

Commonly used kernels

It is clear that the idea of a kernel generalizes the dot product in the input space; the dot product itself is also a kernel, with the feature map being the identity.
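A quick numeric check of the identity in (84) (the test vectors are arbitrary choices of ours):

```python
# Verify that <x . z>^2 equals <phi(x) . phi(z)> for
# phi(x1, x2) = (x1^2, x2^2, sqrt(2) x1 x2).
import numpy as np

def phi(x):
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2.0) * x[0] * x[1]])

x = np.array([2.0, 3.0])
z = np.array([1.0, 4.0])
lhs = np.dot(x, z) ** 2                        # polynomial kernel K(x, z) = <x . z>^2
rhs = np.dot(phi(x), phi(z))                   # explicit dot product in feature space
print(lhs, rhs)                                # both 196.0
```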
Some other issues in SVM

- SVM works only in a real-valued space; for a categorical attribute, we need to convert its categorical values to numeric values.
- SVM does only two-class classification. For multi-class problems, some strategies can be applied, e.g., one-against-rest and error-correcting output coding (a one-against-rest sketch follows below).
- The hyperplane produced by SVM is hard for human users to understand, and the matter is made worse by kernels. Thus, SVM is commonly used in applications that do not require human understanding.
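A minimal one-against-rest sketch (the dataset and the use of scikit-learn are illustrative choices, not from the slides):

```python
# One-against-rest multi-class classification: one binary SVM is trained
# per class, each separating that class from all the others.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)                    # 3-class toy problem
clf = OneVsRestClassifier(SVC(kernel="rbf", C=1.0))  # one binary SVM per class
clf.fit(X, y)
print(clf.predict(X[:5]), len(clf.estimators_))      # predictions and 3 underlying SVMs
```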