Machine learning is currently one of the most actively researched fields in the world. Over the last two decades, it has provided solutions to a wide range of problems, from object detection to self-driving automobiles, and has driven rapid technological growth.
The two main types of machine learning are supervised and unsupervised learning. The most basic difference between them is that in supervised learning an output is available for each input, and the algorithm has to determine the relationship between the input and the output by itself, while in unsupervised learning no output is available.
The two most basic types of supervised machine learning are classification and regression. Classification is basically the categorization of input data into a set of output classes.
Here, the output takes distinct values, and the hypothesis function only returns one of a finite set of values, each representing a class. In regression, the output corresponding to each input is continuous, so for a new input the hypothesis function can return an arbitrary value rather than one drawn from a fixed set of outputs. To learn more about the difference between classification and regression, refer here.
In this article, we will analyze some of the most popular machine learning classification algorithms in practical use today.
Classification, in machine learning specifically, refers to assigning each input to one of a fixed number of classes. It is a supervised learning task, which means we have a set of output data that corresponds to the given input data for the learning stage of the model.
The output data categorizes the input data into a specific number of classes. For instance, suppose we have image data as input and the output is a Boolean value, 0 or 1, where 1 means that the input picture is of a cat and 0 means that it is not. Our algorithm's task, as a classification algorithm, is to learn from this training data and compute a hypothesis function.
The hypothesis function will be such that when an image is given to the algorithm, it will compute an output which will be either 0 or 1, where 1 means that the input image is of a cat and 0 means that it is not.
Classification methods in machine learning give rise to a concept of two types of learning algorithms:
- Lazy Learner
- Eager Learner
Both of these have a different approach for classifying the input data. Lazy learners wait for the test data to start the classification. They basically store the original data, as it is, without the learning process and start learning as they are classifying the test data.
This gives an algorithm that takes less time learning and more time classifying the data. On the other hand, eager learners start the learning process, otherwise known as training, as soon as they get the data and do not wait for a test set to proceed. When they get the test data, they only perform the classification using the already learned features.
This gives an algorithm that takes more time to train but very little time to classify new data.
Classification algorithms are extremely useful in many artificial intelligence-based systems, whether they are object detectors or generative networks. Classification is also widely used in computer vision and in natural language processing. Hence, today we will look at some of the most popular machine learning classification algorithms. So, without further ado, let us dive right in:
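The contrast between the two kinds of learner can be sketched in a few lines of plain Python. The class names `LazyNearestNeighbor` and `EagerNearestCentroid` are hypothetical and chosen only for this illustration: the lazy learner merely stores the data in `fit()` and does all its work in `predict()`, while the eager learner summarises each class as a centroid during `fit()` so that prediction is cheap.

```python
import math
from collections import Counter


class LazyNearestNeighbor:
    """Lazy learner: fit() only stores the data, predict() does the work."""

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, point):
        # Every stored example is scanned at prediction time.
        distances = [math.dist(point, x) for x in self.X]
        return self.y[distances.index(min(distances))]


class EagerNearestCentroid:
    """Eager learner: fit() reduces each class to a centroid up front."""

    def fit(self, X, y):
        sums, counts = {}, Counter(y)
        for x, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(x))
            for i, value in enumerate(x):
                acc[i] += value
        self.centroids = {
            label: [v / counts[label] for v in acc] for label, acc in sums.items()
        }
        return self

    def predict(self, point):
        # Only one distance per class is needed at prediction time.
        return min(self.centroids, key=lambda c: math.dist(point, self.centroids[c]))


X = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
y = ["a", "a", "b", "b"]
lazy = LazyNearestNeighbor().fit(X, y)
eager = EagerNearestCentroid().fit(X, y)
print(lazy.predict((2.0, 1.0)), eager.predict((8.5, 9.0)))
```

Both toy models agree on these points; the difference lies in where the computation happens, not in the answer.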
Logistic regression is a classification algorithm based on the concepts of statistics and probability. It uses a set of independent variables to predict an output class. Logistic regression is an algorithm that was developed to classify a binary problem, that is it could only classify two output classes as the hypothesis function gives a probability value between 0 and 1.
But now, in Sklearn, it can be used not only for binary classification but also for multi-class classification based on the fact that multiple hypothesis functions can be built based on the number of output classes.
Logistic regression finds a relationship between the dependent variable and a set of independent variables. As said above, the hypothesis gives a probability value between 0 and 1. A threshold value is then used to map this probability to one of the two binary classes. A probability lower than the threshold is mapped to 0, whereas one higher than the threshold is mapped to 1.
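The thresholding step can be illustrated with a small sketch. The weights, bias and the helper name `predict_class` are made up for this example and are not learned from any data; the sketch only shows how a linear score is squashed into a probability by the sigmoid function and then mapped to a class.

```python
import math


def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(features, weights, bias, threshold=0.5):
    """Return (probability, class) for one input.

    Probabilities at or above the threshold are mapped to class 1,
    everything else to class 0.
    """
    z = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(z)
    return probability, int(probability >= threshold)


# Hypothetical parameters, chosen by hand for illustration only.
weights, bias = [0.8, -0.4], -0.2
prob, label = predict_class([2.0, 1.0], weights, bias)
print(round(prob, 4), label)
```

Raising the threshold makes the classifier more reluctant to predict class 1, which is one way to trade false positives against false negatives.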
Naïve Bayes is a family of probabilistic machine learning classifiers based on Bayes' theorem. Bayes' theorem is defined as follows:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$
This equation gives the probability of event A occurring given that event B has already occurred. Naïve Bayes assumes that all input features are independent of each other. Even when the features actually depend on one another, the classifier still treats each of them as contributing to the class probability independently, which is why it is called naïve. This makes it simpler than most other machine learning classification algorithms. In the case of n features, each feature is therefore assumed to contribute to the hypothesis independently of the remaining features once the class is known.
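One common way to write this independence assumption is sketched below. The symbols are chosen here for illustration: $y$ stands for the class (event A in Bayes' theorem) and $x_1, \dots, x_n$ for the n features (event B).

$$P(x_i \mid y, x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n) = P(x_i \mid y)$$

Substituting this into Bayes' theorem and dropping the denominator, which is the same for every class, gives the quantity that is compared across classes:

$$P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)$$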
There are several types of Naïve Bayes classifiers. These include:
- Multinomial Naïve Bayes
- Bernoulli Naïve Bayes
- Gaussian Naïve Bayes
The most widely used of these is Gaussian Naïve Bayes. It assumes that, within each class, the feature values are sampled from a Gaussian (normal) distribution. For the implementation, we are also going to use the Gaussian Naïve Bayes classifier.
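As a rough sketch of what the Gaussian assumption means in practice, the example below fits a mean and variance per class for a single feature and scores a new value by prior times Gaussian likelihood. The data and the names `fit_gaussian` and `posterior_scores` are invented for this illustration, and the sketch is not how any particular library implements the classifier.

```python
import math


def fit_gaussian(values):
    """Estimate the mean and (population) variance of one feature."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance


def gaussian_pdf(x, mean, variance):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(
        2 * math.pi * variance
    )


# Toy training data: one feature (height in cm) for two made-up classes.
samples = {"short": [150.0, 155.0, 160.0], "tall": [180.0, 185.0, 190.0]}
total = sum(len(values) for values in samples.values())


def posterior_scores(x):
    """Unnormalised posterior per class: prior * Gaussian likelihood."""
    scores = {}
    for label, values in samples.items():
        mean, variance = fit_gaussian(values)
        prior = len(values) / total
        scores[label] = prior * gaussian_pdf(x, mean, variance)
    return scores


scores = posterior_scores(158.0)
predicted = max(scores, key=scores.get)
print(predicted)
```

With several features, the per-feature likelihoods would simply be multiplied together, which is exactly where the independence assumption is used.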
Decision Tree is one of the most widely used classification algorithms due to its simplicity. It also provides a good prediction for datasets that are not composed of overly complex attributes. It is in the form of a flowchart, which can also be thought of as a tree with a root with branches emerging out of it. It performs a series of tests on the attributes of the input data in a single node to split the node into further branches based on the output of the test. Each new branch leads to another node which, in the same way, splits into more branches. This is what gives it a tree-like structure. All these nodes and branches end in final nodes called leaf nodes which give the output as a single class.
The calculation of the node split in a tree is based on a technique called the Attribute Selection Measure (ASM). There are two basic methods of ASM:
- Information Gain
- Gini Index
Information Gain is based on the measure of changes in entropy in the whole dataset based on each attribute. In other words, we calculate how much information an attribute possesses. The attribute with the highest information gain is used as the one on which the test is performed.
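A minimal sketch of the calculation, using made-up labels: entropy is computed for the parent node, then the size-weighted entropy of the child nodes is subtracted from it. The function names are chosen for this illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(parent_labels, child_groups):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = len(parent_labels)
    weighted = sum(len(group) / total * entropy(group) for group in child_groups)
    return entropy(parent_labels) - weighted


labels = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]  # attribute separates the classes
useless_split = [["yes", "no"], ["yes", "no"]]  # attribute tells us nothing

print(information_gain(labels, perfect_split))
print(information_gain(labels, useless_split))
```

The attribute behind the first split would be chosen for the node, since it removes all of the uncertainty about the class.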
The Gini index measures the impurity of the nodes produced by a split, that is, how mixed the classes in each node are. An attribute with a lower Gini index is preferred for use in a node.
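The Gini calculation can be sketched in the same style. The helper names and the toy splits below are assumptions made for this example: a node's impurity is one minus the sum of squared class proportions, and a split is scored by the size-weighted impurity of its child nodes.

```python
from collections import Counter


def gini_impurity(labels):
    """1 - sum of squared class proportions; 0 means a perfectly pure node."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def weighted_gini(groups):
    """Gini index of a split: child impurities weighted by child size."""
    total = sum(len(group) for group in groups)
    return sum(len(group) / total * gini_impurity(group) for group in groups)


pure_split = [["a", "a"], ["b", "b"]]
mixed_split = [["a", "b"], ["a", "b"]]

print(weighted_gini(pure_split))
print(weighted_gini(mixed_split))
```

The split with the lower value is the better candidate for the node.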
The support vector machine (SVM) is one of the most powerful algorithms used in machine learning. It is a complex algorithm that works well even on complex datasets. The main idea behind the SVM algorithm is to determine a hyperplane in an n-dimensional space, where n is the number of features in the dataset, that separates the classes in the dataset. The optimal hyperplane is the one with the maximum margin, that is, the maximum distance between the hyperplane and the nearest data points of each class.
The margin can either be a hard margin or a soft margin. In the case of a hard margin, the algorithm cannot afford to have any incorrectly classified data points, while in the case of a soft margin, the classifier can have a few outliers be incorrectly classified in order to achieve an overall better performance.
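The difference between the two margins can be made concrete with a small sketch. It assumes the usual linear formulation, in which the hyperplane is `w . x + b = 0`, labels are -1 or +1, the margin width is `2 / ||w||`, and each point's slack is `max(0, 1 - y * (w . x + b))`. The hyperplane, the points and the penalty `C` are made up for illustration and nothing is being trained here.

```python
import math


def decision_value(w, b, x):
    """Signed score of a point relative to the hyperplane w . x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def margin_width(w):
    """Distance between the two margin boundaries: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


def slack(w, b, x, y):
    """How far a point violates the margin (0 if it is on the correct side)."""
    return max(0.0, 1.0 - y * decision_value(w, b, x))


def soft_margin_objective(w, b, points, labels, C=1.0):
    """0.5 * ||w||^2 plus C times the total slack; smaller is better."""
    total_slack = sum(slack(w, b, x, y) for x, y in zip(points, labels))
    return 0.5 * sum(wi * wi for wi in w) + C * total_slack


# Hypothetical hyperplane x1 = 3 and three labelled points.
w, b = [1.0, 0.0], -3.0
points = [(1.0, 2.0), (5.0, 1.0), (3.5, 0.0)]
labels = [-1, 1, -1]

slacks = [slack(w, b, x, y) for x, y in zip(points, labels)]
print(margin_width(w), slacks, soft_margin_objective(w, b, points, labels))
```

A hard margin would reject this hyperplane because the third point has non-zero slack, whereas a soft margin accepts it and simply pays a penalty proportional to `C`.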
K-Nearest Neighbors (KNN) is a lazy learner classification algorithm. It stores the whole training data in an n-dimensional space. It does not construct a general function to classify new instances, but rather just stores the dataset and works on each new instance from scratch. The classification is done by a simple majority voting scheme. In order to classify a new instance, it looks at the K training points closest to the new point. These points are referred to as the nearest neighbors of the instance. The label that the majority of the neighbors have is assigned to the new instance. The "K" is the number of neighbors used for the classification.
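The voting scheme can be sketched as follows. The function name `knn_predict`, the use of Euclidean distance and the toy points are assumptions made for this illustration.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest neighbours."""
    # Rank every stored training point by its distance to the query.
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


train_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = ["red", "red", "red", "blue", "blue", "blue"]

print(knn_predict(train_points, train_labels, (2, 2), k=3))
print(knn_predict(train_points, train_labels, (6.5, 6.5), k=3))
```

Note that nothing is computed until a query arrives, which is what makes KNN a lazy learner.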
The random forest algorithm is an ensemble learning technique where we use multiple decision trees to construct a classification model based on majority voting from each of the decision trees. During training, each of the multiple decision trees is provided with a subset of the whole dataset. Then a majority voting technique is used to output a single most probable prediction using the outputs from each of the trees. Each decision tree in the random forest works as described above in the decision tree section.
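The two ingredients described above, giving each tree its own sample of the data and combining the trees by majority vote, can be sketched as below. The "trees" here are hypothetical hand-written rules standing in for trained decision trees, and the sampling is done with replacement, which is an assumption of this sketch.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, one such sample per tree."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


def forest_predict(trees, sample):
    """Ask every tree for a label and combine the answers by voting."""
    return majority_vote([tree(sample) for tree in trees])


# Stand-ins for trained trees: each maps a 2-feature sample to a label.
trees = [
    lambda x: "cat" if x[0] > 0.5 else "dog",
    lambda x: "cat" if x[1] > 0.5 else "dog",
    lambda x: "cat" if x[0] + x[1] > 1.2 else "dog",
]

rows = [(0.1, 0.2), (0.9, 0.8), (0.4, 0.7), (0.6, 0.3)]
sample_for_one_tree = bootstrap_sample(rows, random.Random(0))

print(forest_predict(trees, (0.9, 0.2)))
print(forest_predict(trees, (0.9, 0.8)))
```

For the first input the trees disagree and the vote decides; for the second they are unanimous.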
To learn more about the basics of random forest and the decision tree algorithm, visit our deep dive into the random forest classification algorithm.
Now that we have looked over the basics of each classification method, we must head over to the implementation of the algorithms and see which algorithm works best.
We will be implementing the algorithms in Python using Sklearn. Sklearn, or scikit-learn, is an extensive library for developing machine learning models.
It provides a large number of ready-made tools for building ML models, including both classification and regression models. Moreover, the Sklearn library also has a collection of performance metrics for evaluating the models.
Before moving on to the implementation of these algorithms, we have to install Sklearn in the same Python environment we are working in:
Using the pip command, install as follows:
pip install -U scikit-learn
In a conda environment, install using:
conda install scikit-learn
Now, let us dive into the implementation of the algorithms:
First, we will import all the required modules, including the classification models. We will use a toy dataset that ships with the sklearn library. It is the same dataset we used in our deep dive into the random forest classification algorithm.
# Classification algorithms
# Importing the libraries
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, ConfusionMatrixDisplay
from sklearn.datasets import load_digits

# plot_confusion_matrix was removed in scikit-learn 1.2; alias its replacement
# so the calls further down work unchanged (requires scikit-learn >= 1.0)
plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator

# Importing the classification models
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
The matplotlib.pyplot library is used to visualize image-based data. The train_test_split function is used for splitting the whole dataset into two subsets, the training data and the testing data. The metrics module provides a number of functions for evaluating the models we develop.
load_digits is the function used to load a dataset that ships with the Sklearn library. After that, we import all the classes necessary for the development of our machine learning classification algorithms. Now, we import the dataset and plot it for visualization.
# Importing the dataset
dataset = load_digits()

# Visualizing the dataset
_, axes = plt.subplots(nrows=1, ncols=4, figsize=(10, 3))
for ax, image, label in zip(axes, dataset.images, dataset.target):
    ax.set_axis_off()
    ax.imshow(image, cmap=plt.cm.gray_r)
    ax.set_title("Training: %i" % label)
Here we load the data using the load_digits function and then use the matplotlib library to visualize it.
Now, we unpack the data into X and Y sets, where X is the input features and Y are the output targets corresponding to instances in X. Then we split the whole data into two subsets of training and testing.
# Unpacking the data into X (input) and Y (target output)
n_samples = len(dataset.images)
X = dataset.images.reshape((n_samples, -1))
Y = dataset.target

# Splitting the dataset into the training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25)
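To make the splitting step less of a black box, here is a rough standard-library sketch of what a shuffled split does. The function name `simple_train_test_split`, the `seed` parameter and the rounding rule are assumptions of this sketch and are not taken from Sklearn's implementation.

```python
import math
import random


def simple_train_test_split(X, y, test_size=0.25, seed=0):
    """Shuffle the row indices once, then cut them into test and train parts."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(X) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # Inputs and targets are indexed together so each row keeps its label.
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Toy data: the target is always 10 times the first feature.
X_demo = [[i, i + 1] for i in range(8)]
y_demo = [i * 10 for i in range(8)]
X_tr, X_te, y_tr, y_te = simple_train_test_split(X_demo, y_demo)
print(len(X_tr), len(X_te))
```

Fixing the seed makes the split reproducible, which is useful when comparing several models on exactly the same test set.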
After splitting the data, we can now move on to the training and evaluation of each classification model.
# Logistic Regression
# Fitting LR Classification to the Training set
classifier = LogisticRegression(max_iter=1500)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Computing Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of model: ", accuracy)
plot_confusion_matrix(classifier, X_test, y_test, cmap='Blues', display_labels=dataset.target_names)
plt.tight_layout()
plt.show()
Here, we first create an instance of the LogisticRegression class. We set max_iter, the maximum number of iterations, to 1500 because for this particular dataset the model could not converge within the default iteration limit. If you encounter a warning about reaching the iteration limit, you can increase the number of iterations to see whether the model converges.
We then use the fit() method to train the model. After training, the model is used to create predictions for the testing dataset using the predict() method. Then, we finally compute the accuracy of the model. In the end, we print the accuracy and plot the confusion matrix to evaluate the performance of the model.
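For readers who want to see what these two evaluation tools compute, the sketch below reimplements them in plain Python on a handful of made-up labels. The function names are chosen for this illustration; rows of the matrix correspond to true classes and columns to predicted classes.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def confusion_matrix(y_true, y_pred, labels):
    """matrix[i][j] counts samples of true class i predicted as class j."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


# Made-up labels for three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

print(accuracy(y_true, y_pred))
for row in confusion_matrix(y_true, y_pred, labels=[0, 1, 2]):
    print(row)
```

The diagonal holds the correct predictions, so the accuracy is the sum of the diagonal divided by the total number of samples.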
Now, we move on to develop the other classification models in the same way we did the Logistic Regression one.
# Naive Bayes
# Fitting Gaussian Naive Bayes Classification to the Training set
classifier = GaussianNB()
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Computing Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of model: ", accuracy)
plot_confusion_matrix(classifier, X_test, y_test, cmap='Blues', display_labels=dataset.target_names)
plt.tight_layout()
plt.show()
# Decision Tree
# Fitting Decision Tree Classification to the Training set
classifier = DecisionTreeClassifier()
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Computing Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of model: ", accuracy)
plot_confusion_matrix(classifier, X_test, y_test, cmap='Blues', display_labels=dataset.target_names)
plt.tight_layout()
plt.show()
# Support Vector Machine
# Fitting SVM Classification to the Training set
classifier = SVC()
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Computing Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of model: ", accuracy)
plot_confusion_matrix(classifier, X_test, y_test, cmap='Blues', display_labels=dataset.target_names)
plt.tight_layout()
plt.show()
# K-Nearest Neighbor
# Fitting KNN Classification to the Training set
classifier = KNeighborsClassifier()
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Computing Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of model: ", accuracy)
plot_confusion_matrix(classifier, X_test, y_test, cmap='Blues', display_labels=dataset.target_names)
plt.tight_layout()
plt.show()
# Random Forest
# Fitting Random Forest Classification to the Training set
classifier = RandomForestClassifier(n_estimators=10, criterion='entropy')
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Computing Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of model: ", accuracy)
plot_confusion_matrix(classifier, X_test, y_test, cmap='Blues', display_labels=dataset.target_names)
plt.tight_layout()
plt.show()
As the accuracies and confusion matrices above show, K-Nearest Neighbor gives the highest accuracy at 98.88%, followed by the support vector machine at 98.66%. The algorithm with the lowest accuracy is the decision tree. Because train_test_split shuffles the data randomly, the exact figures will vary slightly from run to run.
This suggests that K-Nearest Neighbor is the best fit for the dataset we used to evaluate these models. However, there is no universal benchmark dataset that lets us declare one classification model better than another in general. Each algorithm responds to a dataset differently, and an algorithm that performs poorly on one dataset may be the most effective on another.
This is why none of the models here is obsolete; all of them are still used frequently in the field of machine learning.