Basics of Classification Models

Classification algorithms are supervised learning concepts which categorizes data into classes. For example if you wanted to categorize dogs by large or small you could use a classification model to take the data for dogs and classify them as either large or small. Data used for a classification model can be structured or unstructured. Classification models can be used to output data as binary results, yes or no, 1’s or 0’s, win or loss, etc. They are used very commonly in speech recognition, face detection systems etc.

Classifiers are algorithms used to map input data and put them into specific categories. The classifier is then used in the classification model which takes the training data, and uses it to predict the testing data or draw conclusions in the testing data, in a similar way to linear regression models. The results of classifications can be as follows

Binary Classification

As explained before this is a model that has 2 output results. For dogs it can be if a dog is small or large.

Multi-Class Classification

Classification with multiple results as the output. An example of this is the breeds of dogs.

Multi-Label Classification

Classification with multiple results and labels for each result. An example for this would be grouping dogs by breeds, colors and sizes

The final model effectiveness can be evaluated much in the same way a linear regression model would be evaluated using train, test split and methods such as cross validation to analyze the strength of the final model.

Types of learners in Classification Models

Classification models are broken up into 2 types of learning categories

Lazy learners which store the training data and just look for the most related data in the testing data. Spend more time predicting than eager learners but less time preparing the training data. Classification is done by taking the most related data in the training set to compare to the testing data.

Eager learners construct a classification model based on the entire set of training data to commit to a single hypothesis that is used to classify the testing set. They take a lot of time in training and less time in prediction (decision tree, naive bayes)

Classification algorithms

Logistic regression: The goal is to find the best fitting relationship between dependent variable and one or more independent variables to determine an outcome, always binary. Fits to the data in a similar way to linear regression. The dependent variable is always binary. This model assumes data is free of missing values and assumes the predictors are independent of each other.

Naive Bayes Classifier: Uses the basic bayes theorem that finds the probability that a point in the data set falls into a certain class P(class | training variable 1, training variable 2, training variable 3… training n). and will compare that to P(class | test 2) to see what class it falls into. It’s a very simple but effective model and compares favorably to other models even with its simplicity.

Stochastic Gradient Descent: In a very basic sense it will take each feature and plug it into a gradient function, calculate the step size (minimization factor) of each feature by multiplying the gradient descent by the learning rate and subtract the old feature by that step size. This process is repeated until the gradient converges to 0

K-Nearest Neighbor: A lazy learning algorithm that stores instances of training data in an n-dimensional space and then takes the test values and classifies them by which of those n-dimensional spaces it falls into.

Decision Tree: It breaks down the training data into specific structures using if-then rules. The rules are learned sequentially using 1 training data point at a time. The process continues until a termination in the tree is met. The testing data will go through the tree to find its classification. form of eager learner, training set takes much longer to run.

Random Forest: Method of creating multiple decision trees on subsets of the entirety of the data set. the testing set will then fall into one of those subsets and not a termination point on the decision tree. This method generally works better than the decision tree because it reduces overfit the data to specific termination points. Very slow model and complex implementation

Artificial Neural Networks: Takes all the data and the results of the data and maps the features to the results by weighing the effect of each feature on the final target. The features will go through multiple hidden layers to find the proper final output and weight of each feature on the final classifications of the results. Extremely complex eager learner model

Support Vector Machines: Represents training data as points in space and creates categories by using gaps as wide as possible in the data. much more learning on the training data.

Evaluating the Model

One method is the holdout method, which separates the data into a training set and a testing set to measure the accuracy of the model on the testing set using the training set to, as the name states, train the model.

Another method is cross validation, which is a test to ensure there is no overfitting of the training dataset. This works by separating the data into different sets of training and testing data so that in each case the model will train a set of data and test the rest, then train a different set of the same data then test the rest.

There are also classification reports which give a printed out version of results from the model

And lastly we have the ROC curve, or receiver operator characteristics curve, shows a visual representation of the true positive rate vs the false positive rate where the area under the curve is the measure of accuracy of the model.

Which models to choose

Just like with linear regression the best model is the one that gives you the most accuracy, all these models can be used on the data depending on what is necessary for a specific problem. Determining the best method comes down to its final accuracy as well as the time it takes to achieve the result. For example for a very large data set a decision tree or artificial neural network may work best but a naive bayes model may be more effective because it gives a good result in a much shorter amount of time.