Introduction
Logistic Regression, also known as logit regression, is a statistical method for solving classification problems, i.e. problems that require a true-or-false (1 or 0) decision. The method has gained popularity in recent years due to its ease of use and good performance on data sets with many features and instances. To explain how it works, we must first understand what a classifier is: a classifier is a function that determines which class an input belongs to, based on a set of training data. There are two ways to build a classifier: 1) supervised learning, where the classifier is trained on labeled samples; 2) unsupervised learning, where no labels are provided.
The main concept of Logistic Regression is to estimate the probability that an instance belongs to a particular class: when the estimated probability is greater than 0.5 (i.e. 50%), the model predicts that the instance belongs to that class, and otherwise that it does not. Since the name contains "regression", you may think of Logistic Regression as similar to Linear Regression. This is true insofar as both methods find the best-fit equation for the given data, but while Linear Regression predicts continuous variables, Logistic Regression deals with discrete ones. For instance, where Linear Regression might predict the price of a house, Logistic Regression predicts whether the house is costly or not based on certain factors. This behavior makes Logistic Regression a binary classifier.
What are classification problems?
The basic idea of classification is to sort items into groups so that items with similar features end up together. In machine learning, there are mainly three types of classification problems: binary, multiclass, and multilabel classification. Classification involving only two classes is known as binary classification (for instance, to check whether an email is spam or not, we use binary classification algorithms). Classification involving more than two classes is known as multiclass classification, and classification where each instance can carry more than one label is known as multilabel classification. As a quick illustration (the arrays and class encodings below are made up), the three label layouts can be pictured as NumPy arrays:
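```python
import numpy as np

# Binary classification: one label per instance, two possible classes,
# e.g. spam (1) vs. not spam (0).
y_binary = np.array([0, 1, 1, 0])

# Multiclass classification: one label per instance, more than two classes,
# e.g. cat (0), dog (1), or bird (2).
y_multiclass = np.array([2, 0, 1, 1])

# Multilabel classification: several labels per instance,
# e.g. an article tagged with any subset of {politics, sports, tech}.
y_multilabel = np.array([[1, 0, 1],
                         [0, 1, 0],
                         [1, 1, 0],
                         [0, 0, 1]])
```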
Concept of Logistic Regression
The logistic function is basically a sigmoid function (an S-shaped function) that produces a number between 0 and 1. The middle value is 0.5: any value above 0.5 can be categorized as positive, and any value below 0.5 as negative.
The sigmoid function:

σ(t) = 1 / (1 + e^(-t))

where t is the output of the hypothesis function, t = θ1 + θ2·x.
Here, once we obtain the values of θ1 and θ2, they can be plugged into the sigmoid equation to get a probability between 0 and 1, which determines whether the class is positive or negative. A minimal sketch (with made-up values for θ1 and θ2; in practice they are learned from the training data):
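```python
import numpy as np

def sigmoid(t):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1 / (1 + np.exp(-t))

# Illustrative parameters: theta1 is the intercept, theta2 the slope.
theta1, theta2 = -2.0, 1.5

for x in [0.0, 1.0, 2.0, 3.0]:
    t = theta1 + theta2 * x    # hypothesis function
    print(x, sigmoid(t))       # always strictly between 0 and 1
```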
Decision Boundaries
Since Logistic Regression estimates probabilities, the instances can be divided into positive and negative classes according to those probabilities, using a line known as a decision boundary. The decision boundary can be any line that separates the instances of the two classes. With a single feature, this boundary is easy to compute: the probability equals 0.5 exactly where the hypothesis θ1 + θ2·x is 0. A small sketch, using the intercept and coefficient fitted later in this article:
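```python
# The estimated probability is exactly 0.5 where theta1 + theta2 * x = 0,
# so with one feature the decision boundary is the point x = -theta1 / theta2.
theta1, theta2 = -2.63630685, 0.92109544   # values fitted later in this article
boundary = -theta1 / theta2
print(boundary)   # ~2.86 hours of study: above this point the model predicts "pass"
```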
Estimating probabilities
The table below shows the number of hours each student spent studying, and whether they passed (1) or failed (0):

| Hours | 0.50 | 0.75 | 1.00 | 1.25 | 1.50 | 1.75 | 1.75 | 2.00 | 2.25 | 2.50 | 2.75 | 3.00 | 3.25 |
|-------|------|------|------|------|------|------|------|------|------|------|------|------|------|
| Pass  | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 0    | 1    | 0    | 1    | 0    | 1    |
The coefficient and intercept obtained by fitting a logistic regression to this data (with scikit-learn, as shown later) are 0.92109544 and -2.63630685 respectively.
For understanding, let's consider the first and last x values, i.e. 0.50 and 3.25, and plug them into the equation:

f(x) = 1 / (1 + e^(-(0.92109544 × 0.50 - 2.63630685))) = 0.10194855 < 0.50 ≅ 0

f(x) = 1 / (1 + e^(-(0.92109544 × 3.25 - 2.63630685))) = 0.58837538 > 0.50 ≅ 1
Here 0.10194855 and 0.58837538 are the estimated probabilities of passing when a student spends 0.50 and 3.25 hours studying, respectively. The first can be categorized as the negative class since it is less than 0.5, and the second as the positive class, which means a student who spends 3.25 hours studying is more likely to pass the exam than one who spends only 0.50 hours. With this simple example, you can see how Logistic Regression is used in binary classification problems.
Cost Function (Cross-Entropy Loss)
Now we know how a Logistic Regression model estimates probabilities and makes predictions. But how is it trained? The main objective of training is to make the model predict high probabilities for positive instances and low probabilities for negative ones. For a single training instance, the following cost function can be used:

c(θ) = -log(p̂) if y = 1, and c(θ) = -log(1 - p̂) if y = 0

where p̂ is the estimated probability and y the true label.
The cost function for the whole training set is the average cost over all m instances:

J(θ) = -(1/m) Σᵢ [yᵢ·log(p̂ᵢ) + (1 - yᵢ)·log(1 - p̂ᵢ)]
This function is known as the Cross-Entropy Loss or Log Loss function. Minimizing it is equivalent to maximizing the log-likelihood of the true labels under the predicted probabilities, and training finds the weights that minimize this cost. The cost function of Logistic Regression is convex, so it is possible to use Gradient Descent to optimize the model. A rough sketch of such a training loop follows (an unregularized batch gradient descent, so the resulting weights will differ somewhat from scikit-learn's default L2-regularized fit shown below):
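```python
import numpy as np

def log_loss(y, p_hat, eps=1e-15):
    """Cross-entropy / log loss averaged over the training set."""
    p_hat = np.clip(p_hat, eps, 1 - eps)      # avoid log(0)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

# Hours-studied data from the table above; X carries a column of ones
# so that theta[0] acts as the intercept and theta[1] as the slope.
hours = np.array([0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75,
                  2.00, 2.25, 2.50, 2.75, 3.00, 3.25])
y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1])
X = np.column_stack([np.ones_like(hours), hours])

theta = np.zeros(2)
lr = 0.1
for _ in range(10000):
    p_hat = 1 / (1 + np.exp(-X @ theta))      # current probability estimates
    grad = X.T @ (p_hat - y) / len(y)         # gradient of the log loss
    theta -= lr * grad                        # gradient descent step

print(theta, log_loss(y, 1 / (1 + np.exp(-X @ theta))))
```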
Implementing Logistic Regression in Python
```python
>>> from sklearn.linear_model import LogisticRegression
>>> import numpy as np
>>> X_train = np.array([[0.50], [0.75], [1.00], [1.25], [1.50], [1.75], [1.75],
...                     [2.00], [2.25], [2.50], [2.75], [3.00], [3.25]])
>>> Y_train = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1])  # 1-D label array, as scikit-learn expects
>>> model = LogisticRegression()
>>> model.fit(X_train, Y_train)
```
>>> print("Coefficient:",model.coef_)>>> print("Intercept:", model.intercept_)Coefficient: [[0.92109544]]Intercept: [-2.63630685]
Determining the probabilities:
```python
>>> for i in range(len(X_train)):
...     x = model.coef_ * X_train[i][0] + model.intercept_  # slope-intercept equation
...     p = 1 / (1 + np.exp(-x))                            # logistic function
...     print("Probability:{} {}".format(i + 1, p))
...
Probability:1 [[0.10194855]]
Probability:2 [[0.12504648]]
Probability:3 [[0.15248899]]
Probability:4 [[0.18468279]]
Probability:5 [[0.22189388]]
Probability:6 [[0.2641732]]
Probability:7 [[0.2641732]]
Probability:8 [[0.31128557]]
Probability:9 [[0.36265894]]
Probability:10 [[0.41737267]]
Probability:11 [[0.47419934]]
Probability:12 [[0.53170228]]
Probability:13 [[0.58837538]]
```

Instead of applying the logistic function by hand, scikit-learn can return the same estimates directly via predict_proba; its second column holds the probability of the positive class, matching the loop above:
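```python
>>> model.predict_proba(X_train)[:, 1]
array([0.10194855, 0.12504648, 0.15248899, 0.18468279, 0.22189388,
       0.2641732 , 0.2641732 , 0.31128557, 0.36265894, 0.41737267,
       0.47419934, 0.53170228, 0.58837538])
```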
Here, the probabilities are less than 0.5 except for the last two. Now let us look at the predictions:
```python
>>> model.predict(X_train)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
```
The predictions agree with the probabilities: if the probability is greater than 0.50 the instance is categorized as the positive class (1), and if it is less than 0.50 it is categorized as the negative class (0). As a final sanity check (on the training data only, so this says nothing about generalization), the fraction of correct predictions can be read off with score:
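```python
>>> model.score(X_train, Y_train)   # 9 of the 13 training instances match their true labels
0.6923076923076923
```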