Support Vector Machines (SVM)
Invented by Vapnik & Lerner in 1963 (in Moscow).
Support Vector Machine (SVM) is a supervised machine learning algorithm which can be used for both classification or regression challenges.
SVMs are mostly used for classification tasks, but they can also be used for regression.
SVM is binary classification algorithm. Given a set of points of 2 types in N dimensional place, SVM generates a (N - 1) dimensional hyperplane to separate those points into 2 groups. Say you have some points of 2 types in a paper which are linearly separable. SVM will find a straight line which separates those points into 2 types and situated as far as possible from all those points.
Support Vector machines have some special data points nearest to the hyperplane, which we call Support Vectors and a separating hyperplane which is known as Support Vector Machine.
A line that is used to classify one class from another is also called a hyperplane.
Classifies data by finding the linear decision boundary (hyperplane) that separates all data points of one class from those of the other class. The best hyperplane for an SVM is the one with the largest margin between the two classes, when the data is linearly separable. If the data is not linearly separable, a loss function is used to penalize points on the wrong side of the hyperplane. SVMs use a kernel transform to transform nonlinearly separable data into higher dimensions where a linear decision boundary can be found.
SVMs have been extensively used for solving complex classification problems such as image recognition, voice detection etc. And able to deal with computationally heavy data sets, classifying nonlinearly separable data.
SVMs are linear models that require numeric attributes.
SVM is a binary classification model that constructs a hyperplane to separate observations into two classes.
Support Vectors are simply the coordinates of data points which are nearest to the optimal separating hyperplane.
Many classification algorithms such as SVM (Support Vector machines) and Random Forest (RF) which are known to generally perform well in classification tasks, do not output the class probabilities along with the output classes. Both SVM and RF output scores which can be used for ranking outputs, but they cannot be used for calculating expected utility of different alternatives in decision situations. There are several methods such as Platts scaling, binning etc., which are used to provide probability estimates to the output classes from these labels. But these methods affect the ranking order of examples. We can use Venn Aber's predictor for estimating probabilities for output labels. Venn Aber’s is shown to be equivalent to Venn Predictors based on Isotonic regression. We explicitly maintain the order of the ranking examples and at the same time maximize the likelihood of the probability estimates. We use negative Log Likelihood and Gini index for comparing the different methods.
The hyperplane in 2D can be represented by a line, whereas a hyperplane in 3D can be represented by a plane.
A positive value would mean that the set of values of the features is in one class; a negative value would imply it belongs to the other class. A value of zero would imply that the point lies on the line/hyperplane.
There can be several lines (hyperplanes) possible in the same data set.
The dimension of the hyperplane is calculated as number of features - 1.
If you derive the d-dimensional hyperplane from d attributes/n-dimensional space, you can write the expression as follows:
∑i = 1d(Wi.Xi)+W0 = 0
W 0 + W 1 x 1 + W 2 x 2 + . . . + W n − 1 x n − 1 + W n x n = 0
When SVM model is overfitting, adding more training points and increasing value of C will reduce model overfitting problem. Because if you train the model on a large number of data, it will try to learn from the data rather fitting to the each data points. and if you increase the value of C, ultimately you are allowing your model to misclassify some data points. And, hence, the model will not be overfitted in these two cases.
- Linear SVM
- Kernel SVM
Maximal Margin Classifier:
A margin is calculated by considering only the points closest to the hyperplane.
The best line is the one that maintains the largest possible equal distance from the nearest points of both the classes, that also referred as a Maximal Margin Classifier.
Maximal margin line (i.e hyperplane) would be considered the best fit for the given data.
The Maximal Margin Classifier divides the data set in such a way that it is equidistant from both the classes. Thus, it maintains an equal distance from both classes, making the model less biased to the training data. Also, training errors are reduced.
Two major constraints that need to be taken into account while maximising the margin.
i) The standardisation of coefficients such that the summation of the square of the coefficients of all the attributes is equal to 1.
For example, if you have 20 attributes, then the summation of square of the coefficients should be ∑i=0 to 20 (Wi^2) = 1
ii) The maximal margin hyperplane should also follow the constraint: (li x (Wi.Yi)) ⩾ M
where, li = label (1/-1 or L/-L), Wi = coefficient of attributes, Yi = data points of all the attributes in each row
Maximal Margin Classifier will perform perfectly on the training data set. But on the unseen data, it may perform poorly.
Maximal Margin Separator is more generalisable because it takes ensures a minimum margin of safety from both the classes; this makes the model less biased towards any one of the classes.
Unfortunately the maximal margin classifier only works if there exists a separating hyperplane for the data. If the cases cannot be perfectly separated by a hyperplane, then we can never satisfy the conditions of the maximal margin classifier.
It can be extremely sensitive to individual observations. In other words, the model can drastically change if a few points are changed.
Soft Margin Classifier/Support Vector Classifier:
The support vector classifier works well when the data is partially intermingled (i.e. most of the data can be classified correctly with some misclassifications).
The hyperplane that allows certain points to be deliberately misclassified is also called the Support Vector Classifier.
Only the data points that lie closest to the hyperplane are useful for constructing the classifier.
The soft margin is used in constructing the Soft Margin Classifier which allows some points to be misclassified, whereas the hard margin ensures no points are misclassified.
SVCs are relatively immune to outliers, because SVCs are formulated from the support vector points. It implies that the SVC (i.e hyperplane) will not be changed if we do not change the support vectors.
(li x (Wi.Yi)) ⩾ M(1−εi)
A slack variable (ε) is used to control misclassifications.
The value of slack, tolerance, lies between 0 and +infinity.
For points which are at a distance of more than M, i.e. at a safe distance from the hyperplane, the value of the slack variable is 0.
If a data point is correctly classified but falls inside the margin (or violates the margin), then the value of its slack ϵ is between 0 and 1.
If a data point is incorrectly classified (i.e. it violates the hyperplane), the value of epsilon (ϵ) > 1.
The summation of all the slack variables/epsilons of each data point is denoted by cost or 'C'.
Parameter C is called the tuning parameter — one of the hyperparameters in SVM.
- When C is large, the slack variables can be large, the model is flexible, more generalisable, and less likely to overfit, it has a high bias and low variance.
- When C is small, we force the individual slack variables to be small, the model is less flexible, less generalisable, and more likely to overfit, it has a high variance and less bias.
Kernels:
SVM is capable of fitting non-linear boundaries using a simple and elegant method known as kernel trick.
The support vector machine is an extension of the support vector classifier that enlarges the feature space by using kernels.
Kernels enable the linear SVM model to separate nonlinearly separable data points.
We can transform nonlinear boundaries to linear boundaries by applying certain functions to the original attributes.
The original space (X, Y) is called the original attribute space, and the transformed space (X', Y') is called the feature space.
Process of transforming the original attributes into a new feature space is called ‘feature transformation’. The number of attributes increases, there is an exponential increase in the number of dimensions in the transformed feature space.
SVM algorithm only needs the pairwise inner products or dot products of the form (Xi^T.Xj), to find a best fit model.
Kernel functions use this fact to bypass the explicit transformation process from the attribute space to the feature space, and rather do it implicitly.
Kernel Types:
- Linear kernel - same as support vector classifier or hyperplane, without any transformation. Requires one tuning parameter 'c' to select the best-fit linear model.
- Polynomial Kernel - capable of creating nonlinear, polynomial decision boundaries.
- Radial Basis Function (RBF) Kernel - capable of transforming highly nonlinear feature spaces to linear ones and capable of creating elliptical (i.e. enclosed) decision boundaries. Needs two tuning parameters - Sigma/Gamma and 'c'.
Sigma/Gamma is used to increase the nonlinearity in the decision surface.
The higher the value of sigma, the more is the nonlinearity introduced; the lower the value of sigma, the lesser is the nonlinearity.
Reducing sigma and increasing C (i.e. increasing the summation of epsilons) makes the model more generalisable. It means that the model allows some data point to be misclassified. Thus, the problem of overfitting can be solved by this condition.
Advantages of SVM:
- Effective in high dimensional spaces
- R^N > samples, that is, SVM is effective in cases where number of dimensions is greater than the number of samples
- Different Kernel functions for various decision functions
- We can add kernel functions
- It works really well with clear margin of separation
- It is effective in high dimensional spaces
- It uses a subset of training points in the decision function (called support vectors), so it is also memory efficient
- SVMs can handle large feature space
- These can handle nonlinear feature interaction
- They do not rely on the entire dimensionality of the data for the transformation
Disadvantages of SVM:
- Poor performance when #features > #samples
- SVMs do not provide probability estimates
- It doesn’t perform well, when we have large data set because the required training time is higher
- It also doesn’t perform very well, when the data set has more noise i.e. target classes are overlapping
- SVM doesn’t directly provide probability estimates, these are calculated using an expensive five-fold cross-validation. It is related SVC method of Python scikit-learn library
- SVMs are not efficient in terms of computational cost when the number of observations is large
- It is tricky and time-consuming to find the appropriate kernel for a given data
SVM algorithm in Python Language:
In Python, SVM algorithm implemented using sci-kit learn (sk learn) package.
from sklearn.svm import SVC
mdl = SVC()
model = SVC(C = 1)
model = svm.SVC(kernel='linear', c=1, gamma=1)
clf = svm.SVC(gamma=0.001, C=100)
sklearn.svm.SVC(C=1.0, kernel='rbf', degree=3, gamma=0.0, coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, random_state=None)
svc = svm.SVC(kernel='linear', C=1, gamma=0).fit(X, y)
poly_svm = svm.SVC(kernel='poly', C=C, degree=3).fit(X_tr, Y_tr)
svc = SVC(kernel="rbf", C=1)
>>> clf = svm.SVC(gamma='scale')
model.fit(X, y)
mdl.fit(ds.data, ds.target)
>>> clf.set_params(kernel='linear').fit(X, y)
>>> clf.set_params(kernel='rbf', gamma='scale').fit(X, y)
predicted = model.predict(x_test)
predicted = mdl.predict(ds.data)
model.score(X, y)
print(metrics.classification_report(ds.target, predicted))
confusion_matrix(y_true=y_test, y_pred=y_pred)
print("accuracy", metrics.accuracy_score(y_test, y_pred)) # accuracy
print("precision", metrics.precision_score(y_test, y_pred)) # precision
print("recall", metrics.recall_score(y_test, y_pred)) # recall/sensitivity
from sklearn.svm import LinearSVC
clf = LinearSVC()
classifier_liblinear = svm.LinearSVC()
Support Vector Regression (SVR) - A type of regression from SVM (Support Vector Machine).
from sklearn.svm import SVR
svr_rbf = SVR(kernel='rbf', C=1e3)
estimator = SVR(kernel="linear")
svr_linear = SVR(kernel='linear', C=1e3)
svr_sigmoid = SVR(kernel='sigmoid', C=1e3)
clf = svm.SVR(kernel='rbf', C=100, gamma=0.001)
svr_rbf.fit(X, y)
y_pred_sigmoid = svr_sigmoid.predict(X)
Solved MNIST (identifying hand written digits) problem with SVM algorithm
https://www.kaggle.com/thirumani/mnist-svm
SVM algorithm in R language :
kernlab library / package (and ksvm function) to implement Machine Learning algorithm SVM in R programming language.
library(kernlab)
ksvm(x, y = NULL, scaled = TRUE, type = NULL, kernel ="rbfdot", kpar = "automatic", C = 1, nu = 0.2, epsilon = 0.1, prob.model = FALSE, class.weights = NULL, cross = 0, fit = TRUE, cache = 40, tol = 0.001, shrinking = TRUE, ..., subset, na.action = na.omit)
ksvm(x, y = NULL, type = NULL, C = 1, nu = 0.2, epsilon = 0.1, prob.model = FALSE, class.weights = NULL, cross = 0, fit = TRUE, cache = 40, tol = 0.001, shrinking = TRUE, ...)
ksvm(x, y = NULL, type = NULL, kernel = "stringdot", kpar = list(length = 4, lambda = 0.5), C = 1, nu = 0.2, epsilon = 0.1, prob.model = FALSE, class.weights = NULL, cross = 0, fit = TRUE, cache = 40, tol = 0.001, shrinking = TRUE, ..., na.action = na.omit)
model_1 <- ksvm(spam ~ ., data = train, scale = FALSE, C=1)
model_10 <- ksvm(digit ~ ., data = train, scale = FALSE, C=10)
Model_linear <- ksvm(letter~ ., data = train, scale = FALSE, kernel = "vanilladot")
model_rbf <- ksvm(number ~ ., data =train, scale=FALSE, kernel = "rbfdot")
filter <- ksvm(type~., data=spamtrain, kernel="rbfdot", kpar=list(sigma=0.05), C=5, cross=3)
irismodel <- ksvm(Species~., data=iris, type="C-bsvc", kernel=rbf, C=10, prob.model=TRUE)
model_rbf <- ksvm(number ~ ., data =train, scale=FALSE, kernel = "polydot")
model_poly_10 <- ksvm(Digit ~ ., data = train_data, scale = FALSE, C=10, kernel = "polydot", cross=3, kpar = list(degree = 2))
tsv <- ksvm(reuters, rlabels, kernel="stringdot", kpar=list(length=5), cross=3, C=10)
regm <- ksvm(x, y, epsilon=0.01, kpar=list(sigma=16), cross=3)
Ktest <- as.kernelMatrix(crossprod(t(xtest), t(x[SVindex(svp2), ])))
summary(model_1)
rbfdot(sigma = 1)
rbf <- rbfdot(sigma=0.1)
rbfkernel(x, y)
polydot(degree = 1, scale = 1, offset = 1)
tanhdot(scale = 1, offset = 1)
vanilladot()
laplacedot(sigma = 1)
besseldot(sigma = 1, order = 1, degree = 1)
anovadot(sigma = 1, degree = 1)
splinedot()
prop.table(table(eval_linear==test_data[,1]))
Related Articles: Machine Learning Algorithms
No comments:
Post a Comment