History
SVM was invented by Vladimir N. Vapnik and Alexey Ya. Chervonenkis (Institute of Control Sciences of the Russian Academy of Sciences, Moscow, Russia) in the framework of the “Generalized Portrait Method” for computer learning and pattern recognition for linearly separated data. The development of these ideas started in 1962 and they were first published in 1964. In 1992, Vapnik et al. presented the concept of applying what is called the ‘kernel trick,’ which allows the use of the SVM to classify linearly non-separable data hard margins. The soft margin was given by Vapnik et al. (1995), which extended the version of hard margin SVM.
Hard margin SVM can work only when data are completely linearly separable without any errors (noise or outliers). In cases of errors, either the margin is smaller or hard margin SVM fails. Soft margin SVM was proposed by Vapnik to solve this problem by introducing slack variables. SVM has become popular because of its success in both regression and classification tasks:
- Support Vector Machine (SVM): used for classification.
- Support Vector Regression (SVR): used for regression.
However, it is widely used in classification objectives.
Overview of Support Vector Machine
A support vector machine (SVM) performs classification by finding an optimal separating hyperplane by leaving the largest possible fraction of points of the same class on the same side and maximizing the distance of either class from the hyperplane and minimizing the risk of misclassifying the training samples and the unseen test samples. The main goal of a support vector machine is to find the optimal separating hyperplane which maximizes the margin of the training data which is the distance between the hyperplane and the closest data point.
If a hyperplane is very close to a data point, its margin will be small. The further a hyperplane is from a data point, the larger its margin will be. This means that the optimal hyperplane will be the one with the largest margin.
The two classes labels are +1 (a positive example) and −1 (negative example). The
8
the main use of it is to perform the binary classification of the data which are linearly separable. It attempts to find the best plane or hyperplane that separates the data between two classes.
In 2 dimensions, a line must be found.
In 3 dimensions, a plane (surface) must be found.
In 4 or more dimensions, a hyperplane must be found.
The equation of the linear classifier is as follows:
π(π₯)=π πππ(π€π‘π₯+π) Eq. 1
where w = weight vector (orientation of the hyperplane), b = bias, and x = the input feature vector.
Classification Rule
If WtX1 + b > 0 then X1 ∈ Class1, where X1 lies on the positive side
If WtX1 + b < 0 then X1 ∈ Class2, where X1 lies on the negative side.
SVM attempts to maximize the distance of the separating boundary between the two classes by maximizing the distance of the separating plane of the feature vectors whether the feature vector belongs to Class1 or to Class2.
If (Xi ∈ Class1 and Yi = +1) then Yi (W0Xi + b) > 0.
If (Xi ∈ Class2 and Yi = −1) then Yi (W0Xi + b) < 0.
Unknown feature vector (P) has to be classified to either Class1 or Class2 using W and b that were obtained after designing the classifier.
If WP + b > 0 then P ∈ Class1.
If WP + b < 0 then P ∈ Class2.
Types of SVM classifiers include:
- Linear SVM Classifiers; and
- Non-Linear SVM Classifiers.
Linear Support Vector Machine (LSVM)
This is a type of SVM classifier where the data used in this form of SVM are linear repeatable and with low variance. It has become popular for solving classification tasks due to its fast and simple online application to large-scale datasets. The distance from any point to the separator can be illustrated in Equation
π=π€π‘π₯+π||π€|| Eq.2
Moreover, the length of the optimal margin can be computed using Eq. 3:
π€||π€|| .(π₯2 .π₯1)=π€πππ‘β= 2||π€|| Eq. 3
To design an SVM, W must be minimized simultaneously while maximizing bias b. By training the classifier, we determine what W vector and b bias are by using the initial value of W and b and for every training sample belonging to Class1, we have attempted to see whether Wt. X + b > 0. If not, then we have modified W and b such that the position and/or orientation of the hyperplane is so modified that a particular X, which is taken from Class1, is moved to the positive side of this hyperplane. Similarly, if we take a vector from Class2, we check whether this Wt. X < 0, where X is taken from Class2. If it is not negative, then again we have modified W and b such that a particular X is moved to the negative side of the hyperplane.
SVM as a Minimizing Problem
Maximizing 2/|w| is the same as minimizing |w|/2. Hence SVM becomes a minimization problem:
min12||π€||2π .π‘.π¦π(π€.π₯π+π)≥1,∀ π₯π Eq. 4
Yi = {+, −} classes
We are now optimizing a quadratic function subject to linear constraints. Moreover, quadratic optimization problems are a standard, well-known class of mathematical optimization problems and many algorithms exist to solve them.
Therefore, the SVM is transformed into a minimization problem where we endeavor to maximize the margin between the two classes. It is used for binary classification where data is linearly separable.
Non-Linear SVM Classifier
Used with non-linear separated data and the situations where a non-linear region can separate the groups more efficiently, SVM handles this by using a kernel function (non-linear) to map the data into a different space where a hyperplane (linear) cannot
π=π€π‘π₯+π||π€|| Eq. 2
be used to perform the separation.
Kernel Function
The kernel function is used to convert non-linear space into linear space so we can use a linear model to separate non-linear separated data and reduce the computation cost. This uses the inner product of the new vectors where we have a binary classification with x input and single features (non-linear separated data).
By mapping the input example to a new representation or too high a dimensionality, we can use a linear model to classify the non-linear dataset.
π(π)= ππ∅(π)+π Eq. 5
π(π)= Ξ£∝ππ ∈ππ∅(ππ)π ∅(π)+π Eq. 6
π²(ππ ,ππ)= ∅(ππ) ∅(ππ) Eq. 7
where K is kernel function and ∅ is the mapping from X to an (inner product) feature space.
Kernel Trick
This is the kernel function that transforms data into a higher dimensional feature space to make it possible to perform a linear separation, and compute the inner product of the definition space without visiting it and making the number of dimensions depend on the number of examples, not on the dimension of the space that is defined.
Let y = {π¦1,π¦2} with two feature spaces that are non-linear separable.
The inner product is a function between the transformation of x and x':
Let
π§π‘π§′=πΎ(π₯,π₯′) πππ ππππππ Eq. 8
y = ∅(π₯) Eq. 9
π(πΌ)= Ξ£πΌπ− 12ππ=1ΣΣπ§ππ§ππΌππΌππππππ=1ππ=1ππ Eq. 10
Conditions: πΌπ ≥ 0 for n = 1, 2… N and Ξ£∝ππ¦πππ=1=0
π(π₯)=sin (π€π‘.z +b) Eq. 11
We need ππππ.
π= π§π(π€π‘π¦π+π)=1 Eq. 12
Using the original method for classification without using kernel function, the transformation function will convert from non-linear separable to linear separable as follows:
Ξ¦(π¦)→ π¦12,π¦22 ,√2π¦1π¦2
Where;
Ξ¦(π¦) is a transformation function that converts 2 dimensions into 3 dimensions.
By using the decision boundary in 3-dimensional space:
π½0+π½1π¦12+π½3√2π¦π¦2=0 Eq. 13
For i point here, we use 4 operations for i point.
On the other hand, we have j point:
Again, we use 4 operations for j point, and finally, we compute the dot product between the vector (similarity measure):
(Ξ¦(yi),Ξ¦(yj)) = π¦π12π¦π12+ π¦π2 2π¦π22+ 2π¦π1π¦π22π¦π1π¦π2 Eq. 18
Here, we use 3 operations of the product and 2 additional operations.
The total number of operations using the original method is: 4 + 4 + 3 + 2 = 13 operations.
Using the kernel trick method:
(π¦π,π¦π)2=({π¦π1,π¦π2},{π¦π1,π¦π2})2 Eq. 19
Ξ¦(π¦π)=(π¦π1,π¦π2) Eq. 14
=(π¦π12,π¦π22,√2π¦π1π¦π2) Eq. 15
Ξ¦(π¦π)=(π¦π1,π¦π2) Eq. 16
=(π¦π12,π¦π22,√2π¦π1π¦π2) Eq. 17
= (π¦π1π¦π2+π¦π1π¦π2)2 Eq. 20
= π¦π12π¦π12+ π¦π2 2π¦π22+ 2π¦π1π¦π22π¦π1π¦π2 Eq. 21
The number of operations used by the kernel is as follows:
- Using three multiplication operations (two for the dot product and one for the square operation); and
- Using one additional operation. The total number is 3 + 1 = 4 operations used by the kernel, which is less than the original method than the mapping data from 2-dimensional space to 3-dimensional space followed by applying the required operations while using the kernel trick which is about to stay in the same 2-dimensional space and computing the same result as in the 3-dimensional space.
Kernel Types
- Polynomial
- Gaussian
- Gaussian Radial Basis Function (RBF)
- Laplace RBF
- Hyperbolic tangent
- Sigmoid
- Bessel function of the first kernel
- ANOVA radial basis
- Linear splines kernel in one dimension
- A. Roy, D. Dutta, P. Bhattacharya, and S. Choudhury, “Filter and Fuzzy C means based feature extraction and classification of diabetic retinopathy using support vector machines,” Proc. 2017 IEEE Int. Conf. Commun. Signal Process. ICCSP 2017, vol. 2018-Janua, no. c, pp. 1844–1848, 2018.
0 Comments:
Post a Comment