
Support Vector Machine


History

SVM was invented by Vladimir N. Vapnik and Alexey Ya. Chervonenkis (Institute of Control Sciences of the Russian Academy of Sciences, Moscow, Russia) in the framework of the “Generalized Portrait Method” for computer learning and pattern recognition of linearly separable data. The development of these ideas started in 1962, and they were first published in 1964. In 1992, Vapnik and colleagues presented what is called the ‘kernel trick,’ which allows SVM to classify linearly non-separable data while still using hard margins. The soft-margin SVM, which extends the hard-margin version, was introduced by Cortes and Vapnik in 1995.

Hard margin SVM can work only when data are completely linearly separable without any errors (noise or outliers). In cases of errors, either the margin is smaller or hard margin SVM fails. Soft margin SVM was proposed by Vapnik to solve this problem by introducing slack variables. SVM has become popular because of its success in both regression and classification tasks:

  • Support Vector Machine (SVM): used for classification.
  • Support Vector Regression (SVR): used for regression.

However, it is most widely used for classification tasks.


Overview of Support Vector Machine

A support vector machine (SVM) performs classification by finding an optimal separating hyperplane: one that leaves the largest possible fraction of points of the same class on the same side, maximizes the distance of either class from the hyperplane, and thereby minimizes the risk of misclassifying both the training samples and unseen test samples. The main goal of a support vector machine is to find the optimal separating hyperplane that maximizes the margin of the training data, i.e., the distance between the hyperplane and the closest data point.

If a hyperplane is very close to a data point, its margin will be small. The further a hyperplane is from a data point, the larger its margin will be. This means that the optimal hyperplane will be the one with the largest margin.

The two class labels are +1 (a positive example) and −1 (a negative example). The main use of SVM is to perform binary classification of data that are linearly separable. It attempts to find the best line, plane, or hyperplane that separates the data of the two classes.

In 2 dimensions, a line must be found.

In 3 dimensions, a plane (surface) must be found.

In 4 or more dimensions, a hyperplane must be found.

The equation of the linear classifier is as follows:

𝑔(π‘₯)=𝑠𝑖𝑔𝑛(𝑀𝑑π‘₯+𝑏)    Eq. 1

where w = weight vector (orientation of the hyperplane), b = bias, and x = the input feature vector.

Classification Rule

If W^T X1 + b > 0, then X1 ∈ Class1, where X1 lies on the positive side.

If W^T X1 + b < 0, then X1 ∈ Class2, where X1 lies on the negative side.

SVM attempts to maximize the distance between the separating boundary and the feature vectors of the two classes, regardless of whether a feature vector belongs to Class1 or to Class2.

If (Xi ∈ Class1 and Yi = +1), then W^T Xi + b > 0, so Yi (W^T Xi + b) > 0.

If (Xi ∈ Class2 and Yi = −1), then W^T Xi + b < 0, so again Yi (W^T Xi + b) > 0.

An unknown feature vector P is classified into either Class1 or Class2 using the W and b obtained after designing the classifier.

If W^T P + b > 0, then P ∈ Class1.

If W^T P + b < 0, then P ∈ Class2.
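As a small illustration of this decision rule, the sketch below applies g(x) = sign(w^T x + b) to two points; the weight vector, bias, and test points are hypothetical values, not taken from the text.

```python
import numpy as np

def classify(x, w, b):
    """Apply the SVM decision rule g(x) = sign(w^T x + b)."""
    score = np.dot(w, x) + b
    return "Class1" if score > 0 else "Class2"

# Hypothetical weight vector and bias, as if obtained from training.
w = np.array([0.25, 0.25])
b = -0.5

print(classify(np.array([3.0, 2.0]), w, b))  # score 0.75 > 0 -> Class1 (positive side)
print(classify(np.array([0.0, 1.0]), w, b))  # score -0.25 < 0 -> Class2 (negative side)
```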

Types of SVM classifiers include:

- Linear SVM Classifiers; and

- Non-Linear SVM Classifiers.


Linear Support Vector Machine (LSVM)

This is a type of SVM classifier in which the data are linearly separable and have low variance. It has become popular for solving classification tasks due to its fast and simple online application to large-scale datasets. The distance from any point to the separator is given by Eq. 2:

π‘Ÿ=𝑀𝑑π‘₯+𝑏||𝑀||    Eq.2

Moreover, the length of the optimal margin can be computed using Eq. 3:

(w / ||w||) · (x2 − x1) = width = 2 / ||w||    Eq. 3

To design an SVM, ||w|| must be minimized, which is equivalent to maximizing the margin. By training the classifier we determine the weight vector W and the bias b: starting from initial values of W and b, for every training sample belonging to Class1 we check whether W^T X + b > 0. If not, we modify W and b so that the position and/or orientation of the hyperplane changes and that particular X, taken from Class1, moves to the positive side of the hyperplane. Similarly, if we take a vector X from Class2, we check whether W^T X + b < 0. If it is not negative, we again modify W and b so that that particular X moves to the negative side of the hyperplane.
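A minimal sketch of this correction loop is given below. It is only a perceptron-style illustration of repeatedly adjusting W and b for misclassified samples (the toy data and the iteration limit are assumptions); the margin-maximizing formulation of the SVM is given in the next subsection.

```python
import numpy as np

# Hypothetical, linearly separable samples: +1 = Class1, -1 = Class2.
X = np.array([[2., 2.], [3., 1.], [-1., -2.], [-2., -1.]])
y = np.array([1, 1, -1, -1])

w = np.zeros(2)   # initial weight vector
b = 0.0           # initial bias

for _ in range(100):                           # repeat until no sample is misclassified
    errors = 0
    for xi, yi in zip(X, y):
        if yi * (np.dot(w, xi) + b) <= 0:      # xi lies on the wrong side of the hyperplane
            w += yi * xi                       # rotate/shift the hyperplane toward the correct side
            b += yi
            errors += 1
    if errors == 0:
        break

print(w, b)
```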

SVM as a Minimizing Problem

Maximizing 2/||w|| is the same as minimizing ||w||/2 (or, equivalently, ||w||²/2). Hence SVM becomes a minimization problem:

min (1/2)||w||²   s.t.   y_i (w · x_i + b) ≥ 1, ∀ x_i    Eq. 4

y_i ∈ {+1, −1} (the class labels)

We are now optimizing a quadratic function subject to linear constraints. Moreover, quadratic optimization problems are a standard, well-known class of mathematical optimization problems and many algorithms exist to solve them.

Therefore, the SVM is transformed into a minimization problem where we endeavor to maximize the margin between the two classes. It is used for binary classification where data is linearly separable.
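To illustrate this formulation, the sketch below fits a linear SVM on a hypothetical, linearly separable toy dataset using scikit-learn's SVC with a very large C (an assumption used to approximate the hard margin), and then reports w, b, and the margin width 2/||w|| from Eq. 3.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data.
X = np.array([[2., 2.], [3., 3.], [-2., -2.], [-3., -3.]])
y = np.array([1, 1, -1, -1])

# A very large C approximates the hard-margin formulation of Eq. 4.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]          # weight vector (orientation of the hyperplane)
b = clf.intercept_[0]     # bias
print("w =", w, "b =", b)
print("margin width = 2/||w|| =", 2 / np.linalg.norm(w))  # Eq. 3
```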


Non-Linear SVM Classifier

Used with non-linearly separable data, and in situations where a non-linear region separates the groups more efficiently, SVM handles this by using a (non-linear) kernel function to map the data, which cannot be separated by a linear hyperplane in the original space, into a different space where a (linear) hyperplane can be used to perform the separation.


Kernel Function


The kernel function is used to convert a non-linear problem into a linear one, so that a linear model can separate data that are not linearly separable, while also reducing the computation cost. It relies on the inner product of the transformed vectors; consider a binary classification problem with input x whose data are not linearly separable.

By mapping the input examples to a new representation of higher dimensionality, we can use a linear model to classify a non-linear dataset.

π’ˆ(𝒙)= π’˜π’•∅(𝒙)+𝒃   Eq. 5

π’ˆ(𝒙)= Ξ£∝π’Šπ’Š ∈𝒔𝒗∅(π’™π’Š)𝒕 ∅(𝒙)+𝒃  Eq. 6

𝑲(𝒙𝒂 ,𝒙𝒃)= ∅(𝒙𝒂) ∅(𝒙𝒃)  Eq. 7

where K is the kernel function and Φ is the mapping from X to an (inner product) feature space.
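As a concrete illustration of Eq. 7, the sketch below defines a degree-2 polynomial kernel K(x_a, x_b) = (x_a · x_b + 1)² and passes it to scikit-learn's SVC, which accepts a callable kernel; the XOR-style toy data and the parameter C = 100 are assumptions made for this example.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical XOR-style data: no straight line separates the two classes.
X = np.array([[0., 0.], [1., 1.], [0., 1.], [1., 0.]])
y = np.array([1, 1, -1, -1])

def poly2_kernel(A, B):
    # K(x_a, x_b) = (x_a . x_b + 1)^2: an inner product in an implicit quadratic feature space.
    return (A @ B.T + 1.0) ** 2

clf = SVC(kernel=poly2_kernel, C=100.0).fit(X, y)
print(clf.predict(X))   # with a large C, points separable in the feature space are labelled correctly
```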


Kernel Trick

This is the kernel function that transforms the data into a higher-dimensional feature space to make a linear separation possible, while computing the inner product of that space without ever visiting it; the cost therefore depends on the number of examples rather than on the dimension of the feature space.

Consider an example with two features that are not linearly separable.

The kernel is the inner product between the transformations of x and x'. Let

z^T z' = K(x, x')   (the kernel)    Eq. 8

z = Φ(x)    Eq. 9

L(α) = Σ_{n=1}^{N} α_n − (1/2) Σ_{n=1}^{N} Σ_{m=1}^{N} y_n y_m α_n α_m z_n^T z_m    Eq. 10

where y_n ∈ {+1, −1} are the class labels and z_n = Φ(x_n).

Conditions: α_n ≥ 0 for n = 1, 2, …, N and Σ_{n=1}^{N} α_n y_n = 0

g(x) = sign(w^T z + b)    Eq. 11

To evaluate these expressions we only need the inner products z_n^T z_m, i.e., the kernel values.

b is obtained by solving y_m (w^T z_m + b) = 1 for any support vector z_m    Eq. 12

Using the original method for classification (without the kernel function), the transformation function converts the data from non-linearly separable to linearly separable as follows. Consider two 2-dimensional points y_i = (y_i1, y_i2) and y_j = (y_j1, y_j2) and the mapping:

Φ(y) = (y1², y2², √2 y1 y2)

where Φ(y) is a transformation function that converts 2 dimensions into 3 dimensions.

By using the decision boundary in 3-dimensional space:

β0 + β1 y1² + β2 y2² + β3 √2 y1 y2 = 0    Eq. 13

For point i, we map it into the 3-dimensional space, which takes 4 operations:

Φ(y_i) = Φ(y_i1, y_i2)    Eq. 14

= (y_i1², y_i2², √2 y_i1 y_i2)    Eq. 15

On the other hand, for point j we again use 4 operations:

Φ(y_j) = Φ(y_j1, y_j2)    Eq. 16

= (y_j1², y_j2², √2 y_j1 y_j2)    Eq. 17

Finally, we compute the dot product between the two vectors (the similarity measure):

Φ(y_i) · Φ(y_j) = y_i1² y_j1² + y_i2² y_j2² + 2 y_i1 y_i2 y_j1 y_j2    Eq. 18

Here, we use 3 multiplication operations and 2 addition operations.

The total number of operations using the original method is: 4 + 4 + 3 + 2 = 13 operations.

Using the kernel trick method:

K(y_i, y_j) = (y_i · y_j)² = ({y_i1, y_i2} · {y_j1, y_j2})²    Eq. 19

= (y_i1 y_j1 + y_i2 y_j2)²    Eq. 20

= y_i1² y_j1² + y_i2² y_j2² + 2 y_i1 y_i2 y_j1 y_j2    Eq. 21

The number of operations used by the kernel is as follows:

- three multiplication operations (two for the dot product and one for the square operation); and

- one addition operation.

The total is 3 + 1 = 4 operations, which is fewer than the 13 operations of the original method: the original method maps the data from 2-dimensional space to 3-dimensional space and then applies the required operations, while the kernel trick stays in the same 2-dimensional space and computes the same result as in the 3-dimensional space.
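The equality of Eq. 18 and Eq. 21 can also be checked numerically. The sketch below uses two hypothetical points and compares the explicit 3-dimensional mapping with the kernel-trick computation.

```python
import numpy as np

# Two hypothetical 2-dimensional points for the worked example above.
yi = np.array([1.0, 2.0])
yj = np.array([3.0, 4.0])

def phi(y):
    # Explicit 2-D -> 3-D mapping: (y1^2, y2^2, sqrt(2)*y1*y2)
    return np.array([y[0] ** 2, y[1] ** 2, np.sqrt(2) * y[0] * y[1]])

explicit = phi(yi) @ phi(yj)   # original method: map both points, then take the dot product (Eq. 18)
kernel = (yi @ yj) ** 2        # kernel trick: dot product in 2-D, then square (Eq. 21)

print(explicit, kernel)        # both give 121.0 for these points
```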


Kernel Types

  •  Polynomial
  •  Gaussian
  •  Gaussian Radial Basis Function (RBF)
  •  Laplace RBF
  •  Hyperbolic tangent
  •  Sigmoid
  •  Bessel function of the first kind kernel
  •  ANOVA radial basis
  •  Linear splines kernel in one dimension
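For reference, a few of the kernels listed above can be written as short functions; the hyperparameter values (gamma, r, d, sigma) below are arbitrary assumptions, and the formulas follow commonly used definitions.

```python
import numpy as np

# Common closed forms for some of the kernels listed above.
def polynomial(x, y, gamma=1.0, r=1.0, d=3):
    return (gamma * np.dot(x, y) + r) ** d

def gaussian_rbf(x, y, gamma=0.5):
    return np.exp(-gamma * np.sum((x - y) ** 2))

def laplace_rbf(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y, ord=1) / sigma)

def sigmoid(x, y, gamma=0.1, r=0.0):   # also known as the hyperbolic tangent kernel
    return np.tanh(gamma * np.dot(x, y) + r)

a, b = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(polynomial(a, b), gaussian_rbf(a, b), laplace_rbf(a, b), sigmoid(a, b))
```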





