Introduction
Logistic Regression is one of the most widely used models for problems that require categorizing instances into groups of similar features, which we call classification problems. As we know, Logistic Regression is a binary classifier that deals with two classes; for instance, it can be used to predict whether a person is at risk of heart disease or not. But classification does not always stop at two classes. In some cases we need more than two classes, and in such cases we can extend binary Logistic Regression to a multiclass model known as Multinomial Logistic Regression or Softmax Regression.
This article focuses on an in-depth understanding of Multinomial Logistic Regression and on when and where it can be used in machine learning.
What is Multinomial Logistic Regression?
Training multiple binary classifiers and combining them for multiclass classification makes sense, but there is a better way of doing multiclass classification: generalizing Logistic Regression to support multiple classes directly, which is known as Multinomial Logistic Regression or Softmax Regression.
In binary Logistic Regression, we deal with two classes, i.e., y ∈ {0, 1}, which we used to predict whether a person has a risk of heart disease (1) or not (0). Multinomial Logistic Regression handles K classes, i.e., y ∈ {1, 2, 3, ..., K}.
The softmax function is used to generalize Logistic Regression to support multiple classes. We provide an input vector along with the coefficients to the softmax function, and it gives an output vector of K probabilities, one per class, telling us how likely it is that the data belongs to each class. The class with the highest probability is the predicted class. The actual label, in turn, is represented as a vector in which the correct class is 1 and the remaining classes are 0; for example, with four classes and the second class being the correct one, this vector is [0, 1, 0, 0]. Such a vector is known as a one-hot vector.
We can represent the probability of getting the correct result mathematically as p(y = 1 | x): the probability of getting the correct class for a given input, i.e., the probability that y = 1 for a given x.
The Softmax Function
In the case of Logistic Regression, we use a sigmoid function for predicting the results. The generalized version of the sigmoid is known as the softmax function. The softmax function accepts an input vector z = [z1, z2, ..., zK] and produces an output vector that is a probability distribution. It can be expressed as:

σ(z)_i = e^(z_i) / Σ_{j=1}^{K} e^(z_j),   for i = 1, 2, ..., K

where:
- σ = the softmax function.
- z = the input vector of the given data.
- e^(z_i) = the exponential of the i-th element of the input vector (the numerator).
- Σ_{j=1}^{K} e^(z_j) = the sum of the exponentials of all K elements, which normalizes the outputs into a probability distribution (the denominator).
- K = the number of classes.
The softmax function is used in many places, such as the activation function in neural networks, classification problems, Reinforcement Learning, etc.
Let's understand the softmax function with an example. Consider an input vector z = [3, 4.5, -1].

The softmax for a vector z = [z1, z2, ..., zK] is:

σ(z) = [ e^(z1)/Σ_{j=1}^{K} e^(zj), e^(z2)/Σ_{j=1}^{K} e^(zj), ..., e^(zK)/Σ_{j=1}^{K} e^(zj) ]

The softmax for the input vector z = [3, 4.5, -1] is:

σ(z) = [ e^3/(e^3 + e^4.5 + e^-1), e^4.5/(e^3 + e^4.5 + e^-1), e^-1/(e^3 + e^4.5 + e^-1) ]

The resulting vector, rounded, is:

softmax(z) ≈ [0.1818, 0.8149, 0.0033]

The output of the softmax function is always between 0 and 1, and the probabilities sum up to 1.
Now let's see how softmax can be implemented in Python.
import numpy as np

def softmax(x):
    """Compute the softmax of each value in x"""
    return np.exp(x) / np.sum(np.exp(x), axis=0)

input_vector = [3, 4.5, -1]
softmax(input_vector)

output:
array([0.18181803, 0.81485186, 0.00333011])
Applying Softmax in Logistic Regression
So we have looked at the softmax function; now we need to know how to apply it to Logistic Regression to switch to multiclass. When we apply softmax in Logistic Regression, the input to the softmax is the dot product of the weight vector (w) and the input vector (x) plus a bias (b) term, i.e., w·x + b:

input = [w1, w2, w3, ..., wk] · [x1, x2, x3, ..., xk] + bias

In the multiclass case, each class has its own weight vector and bias, so the weights form a matrix W and the biases form a vector b, and the softmax receives one such score per class.
Can you recall the slope-intercept equation y = mx + b? You can see that the inputs are given in the form of the slope-intercept equation, so our objective is to find the optimal parameters (W and b) that minimize the cost and maximize the probability of predicting the correct class for the given data. We don't need to bother much about finding the optimal parameters, since Gradient Descent will take care of it.
We can write this as a single equation for the output ŷ:
ŷ = softmax(Wx + b) or ŷ = σ(Wx + b)
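As a quick illustration, here is a minimal sketch of this forward pass in NumPy; the shapes, the random example values, and the helper softmax function are assumptions chosen only for demonstration:

import numpy as np

def softmax(z):
    # Subtract the max before exponentiating for numerical stability
    exp = np.exp(z - np.max(z))
    return exp / np.sum(exp)

K, n = 3, 4                      # assumed: 3 classes, 4 input features
rng = np.random.default_rng(0)
W = rng.random((K, n))           # one weight vector per class
b = rng.random(K)                # one bias per class
x = rng.random(n)                # a single input example

y_hat = softmax(W @ x + b)       # ŷ: a probability for each of the K classes
print(y_hat, y_hat.sum())        # the probabilities sum to 1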
Learning in Multinomial Logistic Regression
How are the weights (W) and bias (b) learned? Well, the Logistic model learns by comparing its predictions against the labels provided in the training data. What we want is to learn the parameters W and b that produce outputs ŷ close to the actual y values and thereby reduce the cost.
To get the optimal weights and reduce the cost, we measure the distance from the predicted values ŷ to the actual values y. This distance is given by the loss function, more specifically the cross-entropy loss function of binary Logistic Regression. Minimizing the loss function reduces the distance from the predicted values ŷ to the actual values y. Generalizing this loss function to support K classes gives us the loss function of Multinomial Logistic Regression.
Deriving cross-entropy loss function
For an observation x, the parameters that maximize the likelihood of the observed data under our model can be estimated by a method known as Maximum Likelihood Estimation (MLE). Taking the negative log of this likelihood gives us the cross-entropy loss function.
Now that you have a little intuition about the loss function, let's look at it more closely and see how we can derive the loss function for binary Logistic Regression and then generalize it to get the loss function of Multinomial Logistic Regression.
We are interested in learning the weights (parameters) that maximize the probability of getting the correct result, p(y|x) (the probability of y for a given x).
Since binary Logistic Regression deals with only two outcomes (0 and 1), we can use the Bernoulli distribution here. Notice that we are first deriving the loss function of binary Logistic Regression and then generalizing it, so don't get confused.
When we apply the Bernoulli distribution to p(y|x) we get:

p(y|x) = ŷ^y (1 - ŷ)^(1 - y)

Let's apply log on both sides:

log p(y|x) = y log ŷ + (1 - y) log(1 - ŷ)

When we substitute ŷ = σ(w·x + b) and express the negative of this log-likelihood as the cross-entropy loss L_CE, we get:

L_CE(ŷ, y) = -[ y log σ(w·x + b) + (1 - y) log(1 - σ(w·x + b)) ]
So a good
model predicts probabilities close to 0 for negative instances and close to 1
for positive instances.
Let's understand this concept with an example.
Suppose our Logistic model predicts a probability of 0.70 for the positive class, and the instance really is positive, i.e., y = 1:

L_CE(ŷ, y) = -[ 1 · ln(0.70) + (1 - 1) · ln(1 - 0.70) ]
           = -ln(0.70)
           ≈ 0.36

Now suppose the model again predicts a probability of 0.70 for the positive class, but the instance is actually negative, i.e., y = 0:

L_CE(ŷ, y) = -[ 0 · ln(0.70) + (1 - 0) · ln(1 - 0.70) ]
           = -ln(1 - 0.70)
           ≈ 1.2
In the first case, when the model predicts a probability of 0.70 for the positive class and y = 1, the cost is low, about 0.36. But in the second case, where the model predicts 0.70 for the positive class even though y = 0, the cost is about 1.2, much higher than in the first case.
In the first case the model assigns a high probability to the correct class, so the loss is small; in the second case it assigns a high probability to the wrong class, and because of that the cost increases.
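These two numbers are easy to verify. Here is a tiny sketch; the helper name binary_cross_entropy is ours, not from any library:

import numpy as np

def binary_cross_entropy(y_hat, y):
    # Cross-entropy loss for a single binary prediction y_hat with true label y
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(binary_cross_entropy(0.70, 1))  # ~0.36: confident and correct
print(binary_cross_entropy(0.70, 0))  # ~1.20: confident but wrong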
Now that you have a better understanding of the loss function, let's see how we can generalize it to support K classes.
Generalizing loss function
For Multinomial Logistic Regression, we represent both the true label y and the output ŷ as vectors. The actual label y is a one-hot vector over the K classes, where y_c = 1 if c is the correct class and the remaining elements are 0. With these labels, the model predicts a vector ŷ containing K probabilities, one per class.
If c is the correct class, where y_c = 1, every term of the loss except the one for the correct class drops out, just as the right-hand term dropped out in the binary case. The label vector y and output vector ŷ now range over K classes, with elements y_k and ŷ_k respectively, and the loss becomes a (negative) sum of logs over the K classes. So the loss function for a single training example is:

L_CE(ŷ, y) = -Σ_{k=1}^{K} y_k log ŷ_k

For all m training examples, the loss function is:

J = -(1/m) Σ_{i=1}^{m} Σ_{k=1}^{K} y_k^(i) log ŷ_k^(i)
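As a small illustration of the single-example loss, here is a sketch with made-up vectors (the values below are ours, chosen only for demonstration):

import numpy as np

# Assumed example: K = 3 classes, correct class at index 1
y = np.array([0, 1, 0])              # one-hot true label
y_hat = np.array([0.2, 0.7, 0.1])    # predicted probability distribution

loss = -np.sum(y * np.log(y_hat))    # only the correct class contributes
print(loss)                          # -log(0.7) ≈ 0.36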
Implementing Multinomial Logistic Regression in Python
Now let's implement all the crazy stuff we learned so far programmatically in Python. First of all, we need a dataset for our model to work on. Here I'm using the MNIST dataset, which is a collection of 70,000 small images of handwritten digits collected from American high school students and Census Bureau employees.
The MNIST dataset is one of the best practical datasets for classification problems.
So let's load the dataset available in keras.datasets:
import pandas as pd
import numpy as np
from keras.datasets import mnist

# Splitting the dataset
(X_train, y_train), (X_test, y_test) = mnist.load_data()
First, we loaded the dataset from Keras and assigned it to the corresponding train and test variables. The images in this dataset are stored as matrices, so if you want to see the images of the digits, use matplotlib's imshow method.
import matplotlib.pyplot as plt

plt.imshow(X_train[0])
Reshaping the dataset:
# Flattening each 28 x 28 image into a single row of 28*28 = 784 features
X_train = X_train.reshape(60000, 28*28)
X_test = X_test.reshape(10000, 28*28)

Each image is a 28 x 28 matrix of pixel values, so we flatten every image into a vector of 28 * 28 = 784 features before training.
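A quick sanity check of the resulting shapes (assuming the reshape above has been run):

print(X_train.shape)  # (60000, 784)
print(X_test.shape)   # (10000, 784)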
One-Hot Encoding
def OneHot(y, c):
    # Constructing a zero matrix with one row per label and one column per class
    y_encoded = np.zeros((len(y), c))
    # Placing a 1 in the column of the correct class for each row
    y_encoded[np.arange(len(y)), y] = 1
    return y_encoded
The above function converts the labels into one-hot vectors of ones and zeros for the given number of classes, as in the small example below.
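For instance, with the made-up labels [1, 0, 2] and c = 3 classes, the function produces one row per label:

print(OneHot(np.array([1, 0, 2]), 3))
# [[0. 1. 0.]
#  [1. 0. 0.]
#  [0. 0. 1.]]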
Softmax Function
# Softmax Function
def Softmax(z):
    # Subtracting the max before exponentiating for numerical stability
    exp = np.exp(z - np.max(z))
    # Normalizing each row so it becomes a probability distribution
    for i in range(len(z)):
        exp[i] /= np.sum(exp[i])
    return exp
Training the model
Now comes the important part: we are going to implement a function that trains the model and finds the optimal parameters (w and b) that reduce the cost. In each iteration the cost shrinks, moving us toward the best possible parameters. Let's see how we can implement that.
def fit(X, y, c, epochs, learn_rate):
    # Splitting the number of training examples and features
    (m, n) = X.shape
    # Selecting random weights and bias
    w = np.random.random((n, c))
    b = np.random.random(c)
    loss_arr = []
    # Training
    for epoch in range(epochs):
        # Hypothesis function
        z = X@w + b
        # Computing gradient of loss w.r.t w and b
        grad_for_w = (1/m)*np.dot(X.T, Softmax(z) - OneHot(y, c))
        grad_for_b = (1/m)*np.sum(Softmax(z) - OneHot(y, c))
        # Updating w and b
        w = w - learn_rate * grad_for_w
        b = b - learn_rate * grad_for_b
        # Computing the loss
        loss = -np.mean(np.log(Softmax(z)[np.arange(len(y)), y]))
        loss_arr.append(loss)
        print("Epoch: {} , Loss: {}".format(epoch, loss))
    return w, b, loss_arr

# Normalizing the training set
X_train = X_train/300

# Training the model
w, b, loss = fit(X_train, y_train, c=10, epochs=1000, learn_rate=0.10)

output:
Epoch: 0 , Loss: 4.048884450473747
Epoch: 1 , Loss: 3.1123844928930318
Epoch: 2 , Loss: 2.4359512147361935
Epoch: 3 , Loss: 2.0743646205439106
Epoch: 4 , Loss: 1.7426834190627996
Epoch: 5 , Loss: 1.5270608054318329
Epoch: 6 , Loss: 1.3661773662434502
Epoch: 7 , Loss: 1.253514604554096
Epoch: 8 , Loss: 1.1604735954233256
Epoch: 9 , Loss: 1.092909563196898
Epoch: 10 , Loss: 1.0287242505816592
Epoch: 11 , Loss: 0.9819879297108901
Epoch: 12 , Loss: 0.9330864749109451
Epoch: 13 , Loss: 0.8970693055728086
Epoch: 14 , Loss: 0.8597687440748668
Epoch: 15 , Loss: 0.8307884325356042
Epoch: 16 , Loss: 0.8026402805231958
....
The function accepts five parameters: the training data X and y, the number of classes, the number of iterations (epochs), and the learning rate.
When you run the code, you can see that the cost is reduced in each iteration. At the end of the iterations you'll have the minimum cost reached and the corresponding optimal weights.
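Since fit also returns the list of per-epoch losses, one optional way to inspect training is to plot that list with matplotlib (already imported earlier); this is just a small sketch:

plt.plot(loss)                     # per-epoch losses returned by fit
plt.xlabel("Epoch")
plt.ylabel("Cross-entropy loss")
plt.title("Training loss per epoch")
plt.show()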
Making Predictions
Now it's time to test our model; let's see how it performs on the training data.
def predict(X, w, b):
    z = X@w + b
    y_hat = Softmax(z)
    # Returning highest probability class.
    return np.argmax(y_hat, axis=1)

predictions = predict(X_train, w, b)
actual_values = y_train

print("Predictions:", predictions)
print("Actual values:", actual_values)

accuracy = np.sum(actual_values==predictions)/len(actual_values)
print(accuracy)

Predictions: [5 0 4 ... 5 6 8]
Actual values: [5 0 4 ... 5 6 8]
0.8780666666666667
The accuracy score is pretty good in my case: the model is able to classify the handwritten digits in the training set with an accuracy of about 88%.
What about the test data?
# Normalizing the test set
X_test = X_test/300

test_predictions = predict(X_test, w, b)
test_actual = y_test

print("Test predictions:", test_predictions)
print("Test actual:", test_actual)

test_accuracy = np.sum(test_actual==test_predictions)/len(test_actual)
print(test_accuracy)

Test predictions: [7 2 1 ... 4 8 6]
Test actual: [7 2 1 ... 4 5 6]
0.8853
The test score also looks good; the model predicts the test set with an accuracy of approximately 88%.