Introduction
Activation functions are crucial when it comes to Neural Networks. At its most basic, an activation function is nothing but a mathematical function that determines whether a neuron should fire or not. In other words, it determines the output of a neuron based on its input. There are many activation functions used in Neural Networks, depending on the problem that needs to be solved. In this article, we'll discuss some of the most common activation functions and when to use them. To be precise, we'll look at 5 of them. But before that, we need a better understanding of why we use activation functions in Neural Networks and what exactly an activation function does.
What Are Activation Functions and Why Are They Needed?
Now, if you think about Neural Networks in general, what makes them special is their capability to learn really complex patterns and non-linear relationships in the data they are trained on. To achieve this, the Neural Network needs to identify how the patterns are distributed and what relationships exist in the given data, and an activation function is what helps Neural Networks do it.
Since most real-world data is non-linear in nature, an activation function helps the Neural Network introduce this non-linearity, enabling it to learn and model more complex patterns in real-world datasets. This is what makes Neural Networks such a powerful tool for learning almost any kind of complex data.
That is the basic intuition behind activation functions. But what happens if we don't apply an activation to the neurons? Well, if we skip it, the whole Neural Network collapses into a simple linear model, like Linear Regression or Logistic Regression, which can only capture linear relationships.
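To see this concretely, here is a minimal sketch (the weights and the input are just random numbers made up for illustration) showing that two linear layers stacked without an activation in between collapse into a single linear layer:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))      # a toy input vector
W1 = rng.normal(size=(3, 4))   # weights of the first "layer"
W2 = rng.normal(size=(2, 3))   # weights of the second "layer"

# Two linear layers applied one after the other...
two_layers = W2 @ (W1 @ x)

# ...are exactly the same as one linear layer with weights W2 @ W1
one_layer = (W2 @ W1) @ x

print(np.allclose(two_layers, one_layer))  # True

No matter how many such layers we stack, the result is still just one big linear transformation.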
To make this easier to see, here is an illustration made using the TensorFlow Neural Network Playground:
Applying a linear activation and not applying an activation at all have the same effect. Either way, you can see on the right side that the network is not able to classify or separate the blue and orange data points, since it is only doing a linear fit. The training and testing loss is also pretty bad.
Now let's see what happens when we apply an activation function:
Here, we applied ReLU, one of the most commonly used activation functions in Neural Networks, and you can see that the network is able to find the pattern in the data and classify it much more precisely. The loss is significantly lower as well.
If you want to play around with this tool I'll leave the link here:
This is because activation functions like ReLU let some neurons activate while others remain dormant. The activated neurons can identify and capture the patterns in the data, leading to more precise classification results. Additionally, by activating a subset of neurons and suppressing the rest, the overall loss of the Neural Network can be significantly reduced, making it more effective.
This illustration provides a straightforward way to understand the significance of Activation Functions in Neural Networks. Now, we can delve into the discussion of the most commonly used activation functions in Neural Networks.
Common Activation Functions
1. Sigmoid Function
The Sigmoid Activation Function is a popular choice when it comes to activation functions in Neural Networks. It is a mathematical function that maps any real-valued number to a value between 0 and 1, which makes it useful in binary classification problems.
If you look at the graph of the Sigmoid Function, you can see that it is shaped like an "S", which is why it is also known as an "S-shaped function". The mathematical formula of the sigmoid function is given below:
\[f(x) = \frac{1}{1 + e^{-x}}\]
If you look more closely at the formula, you'll see that since \(e^{-x}\) is always positive, the output of the sigmoid function is always strictly between 0 and 1. Because of this property, the sigmoid is commonly used for modeling probabilities of binary outcomes. For instance, it can be used to predict the probability of an item belonging to a particular class in classification.
When to use Sigmoid Function?
The sigmoid function is commonly used in the output layer of a Deep Neural Network. It is particularly useful for binary classification problems, where the output variable can take on only two values, such as "yes" or "no", or 1 or 0. The sigmoid outputs a probability between 0 and 1, which can be interpreted as the likelihood of a particular class.
For instance, if you want to predict whether a particular message is spam or not, you would use a sigmoid at the output layer, which gives you the probability of the message being spam.
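As a rough sketch of how this looks in practice (assuming TensorFlow/Keras here; the layer sizes and the 20-dimensional input are just placeholders), the spam classifier above would end in a single neuron with a sigmoid activation:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),                     # placeholder feature vector for a message
    tf.keras.layers.Dense(16, activation='relu'),    # hidden layer
    tf.keras.layers.Dense(1, activation='sigmoid'),  # outputs the probability of "spam"
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])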
When not to use Sigmoid Function?
Even though the sigmoid function is a popular choice, there are situations where it is not the best fit. For very large or very small inputs, the output of the sigmoid saturates very close to 1 or 0, and its gradient becomes almost zero. This causes what is known as the Vanishing Gradient Problem, which shows up especially in very deep networks: the gradients become too small to contribute anything to the weight updates during backpropagation, leading to wasted training time and poor learning.
So the sigmoid function is generally not recommended in the hidden layers of Deep Neural Networks.
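To see why, here is a small illustrative sketch: the derivative of the sigmoid is \(\sigma(x)(1-\sigma(x))\), which peaks at 0.25 and shrinks toward zero as the input moves away from zero, so very little gradient is left to flow backwards:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)  # derivative of the sigmoid

for x in [0, 2, 5, 10]:
    print(x, sigmoid_grad(x))
# 0  -> 0.25
# 2  -> ~0.105
# 5  -> ~0.0066
# 10 -> ~0.000045  (practically no gradient flows back)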
Sigmoid Function using Python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

sigmoid(10)
# Output: 0.9999546021312976
2. Tanh (Hyperbolic Tangent)
Tanh, commonly known as the Hyperbolic Tangent Function, is another well-known activation function. It is similar to the sigmoid, but its values range from -1 to +1 and are centered at 0. The Tanh function is often considered an improvement over the sigmoid because it maps the input to a wider range and produces a zero-centered, more balanced output. Here is the mathematical formula of the Hyperbolic Tangent Function:
\[f(x) = \frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}\]
This formula produces outputs between -1 and +1. Because of this, the Tanh function can be used in Neural Networks that need to represent both negative and positive outcomes. The Tanh function is centered around 0, which makes it easier to model data that has both positive and negative values.
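In fact, Tanh is just a rescaled and shifted sigmoid, \(\tanh(x) = 2 \cdot \text{sigmoid}(2x) - 1\), which is a quick way to see why the two curves look so similar; here is a small sketch verifying this numerically:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-5, 5, 11)
# same S-shape as the sigmoid, but stretched to (-1, 1) and centered at 0
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))  # True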
When to use Tanh?
The Tanh function can be used when the input data ranges between negative and positive values. For instance, in a Neural Network built for sentiment analysis, it can be used to model the sentiment of text data: negative sentiment is represented by negative values, positive sentiment by positive values, and neutral sentiment by values at or close to zero.
When not to use Tanh?
Tanh may not be a great option when your dataset is very large and a cheaper alternative is available: it involves computing exponentials, which can become computationally expensive across very large datasets and very deep networks. Tanh is also prone to the same Vanishing Gradient Problem discussed for the sigmoid, since its output saturates at -1 and +1, although to a lesser extent than the sigmoid.
Tanh Function using Python
import numpy as np

def tanh(x):
    return np.tanh(x)

tanh(-10)
# Output: -0.9999999958776927
3. ReLU
ReLU (Rectified Linear Unit) is a popular choice for most problems that can be solved using Neural Networks. ReLU is a simple function that returns the input value x if it is positive, and 0 otherwise. From the graph, we can see that when the input is greater than zero the output grows linearly with it, and when the input is zero or negative the output stays flat at zero. The formula is simple:
\[f(x) = max(0, x)\]
As per the formula, when the value is positive it returns the value itself, and when it is negative it returns zero. Because of this simple behaviour, ReLU leaves neurons inactive whenever they receive negative inputs, which makes it a solid default choice for most problems involving image recognition, speech, and so on.
When to use ReLU?
ReLU is most commonly used in large Deep Neural Networks with many hidden layers. Since ReLU only requires a simple thresholding operation, it stays computationally efficient even when the dataset and the network are large. ReLU also helps avoid the Vanishing Gradient Problem: for positive inputs its gradient is exactly 1, so gradients do not shrink as they propagate back through many layers of the network.
When not to use ReLU?
Even though ReLU is good for most problems, it is not always the best choice. ReLU has a well-known problem called "dying ReLU", where neurons get stuck in a state where their output is always zero. This can happen when the input to a particular neuron is always negative and stays negative for a long time, so the neuron never activates. When this occurs, the network cannot learn the patterns that depend on that neuron. So ReLU may not be a good fit when your data contains negative values that are important for the task at hand and the network needs to learn from them.
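Here is a minimal sketch of why a neuron "dies" (the pre-activation values are made up for illustration): for negative inputs the ReLU output is zero and so is its gradient, so no learning signal reaches the weights behind that neuron:

import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_grad(x):
    # gradient of ReLU: 1 for positive inputs, 0 otherwise
    return (x > 0).astype(float)

pre_activations = np.array([-3.2, -0.7, -5.1])  # a neuron that only ever sees negative inputs
print(relu(pre_activations))       # [0. 0. 0.] -> the neuron never fires
print(relu_grad(pre_activations))  # [0. 0. 0.] -> no gradient, so its weights stop updating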
ReLU using Python
import numpy as np

def relu(x):
    return np.maximum(0, x)

relu(-5)
# Output: 0
4. Softmax
The softmax function is the popular choice for multiclass classification tasks. It takes a vector of numbers as input and produces another vector containing a probability distribution over the classes. For instance, given an input vector \(x = \begin{bmatrix} x_{1} & x_{2} & \cdots & x_{n} \end{bmatrix}\), the Softmax function computes a new vector \(y = \begin{bmatrix} y_{1} & y_{2} & \cdots & y_{n} \end{bmatrix}\), where each element \(y_{i}\) is given by:
\[y_{i} = \frac{e^{x_{i}}}{\sum_{j=1}^{n}e^{x_{j}}}\]
where \(e^{x_{i}}\) is the exponential of the \(i\)-th element of the input vector,
\(n\) is the number of classes, and
\(\sum_{j=1}^{n}e^{x_{j}}\) is the sum of the exponentials over all the elements, which normalizes the outputs so they add up to 1.
If you scan the equation, you'll see that the softmax function applies the exponential to each element of the input vector and then divides it by the sum of all the exponentials to produce the output vector. If you add up all the values in the output vector, you'll get exactly 1. This ensures that the output values can be interpreted as probabilities.
Because of this behaviour, Softmax is the standard choice for multi-class classification tasks where there are multiple classes to distinguish. The output layer of the Neural Network produces a vector of real numbers, the "logits", one for each category, and Softmax is applied to those logits to turn them into the probability of each class.
When to use Softmax?
As we said, Softmax is most commonly used for multi-class classification problems. For instance, if you want to classify images of dogs, cats, and rabbits, Softmax is the best option.
When not to use Softmax?
Well, if you are dealing with a binary classification problem where there are only two classes to distinguish, there is no need to apply the Softmax function; you can simply use the sigmoid instead. Also, if your output is not meant to be a probability distribution over classes, applying the Softmax function doesn't make much sense.
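In fact, for two classes the two options agree: a softmax over two logits gives exactly the same probability as a sigmoid applied to their difference, as this small sketch shows:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(x):
    return np.exp(x) / np.sum(np.exp(x))

z = np.array([2.0, -1.0])     # two arbitrary logits
print(softmax(z)[0])          # probability of class 0: ~0.9526
print(sigmoid(z[0] - z[1]))   # same value from a single sigmoid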
Softmax using Python
import numpy as np

def softmax(x):
    return np.exp(x) / np.sum(np.exp(x))

softmax([2, 5, 6, 8, 2])
# Output: array([0.00208285, 0.04183507, 0.1137195 , 0.84027975, 0.00208285])
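One practical note on the naive implementation above: np.exp overflows for large logits, so real implementations usually subtract the maximum logit first, which leaves the result unchanged because the constant cancels out in the ratio. A sketch of that standard trick:

import numpy as np

def softmax_stable(x):
    x = np.asarray(x, dtype=float)
    shifted = x - np.max(x)   # subtracting a constant does not change the result
    exps = np.exp(shifted)
    return exps / np.sum(exps)

print(softmax_stable([1000, 1001, 1002]))  # fine, whereas np.exp(1000) would overflow
# array([0.09003057, 0.24472847, 0.66524096])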
5. Swish
Swish is a relatively new member of the activation function family, introduced in 2017 by Google Brain researchers. It was proposed because it was found to outperform some other activation functions in certain cases. Here is the formula of the Swish activation function:
\[f(x) = x \cdot \text{sigmoid}(\beta x)\]
The equation is straightforward, but note the beta parameter, which controls the shape of the function. If you look at the graph of the Swish function, there is a small dip just below zero; the beta parameter lets you adjust how pronounced this dip is whenever needed.
The Swish function is built around the sigmoid, but its output is the input scaled by the sigmoid rather than the sigmoid itself. It also behaves much like ReLU for large positive inputs, while staying smooth and allowing small negative outputs near zero.
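To make the role of beta concrete, here is a small sketch: with a very large beta the sigmoid factor acts like an on/off switch and Swish behaves almost exactly like ReLU, while with a tiny beta the sigmoid stays near 0.5 and Swish flattens into the linear function x/2:

import numpy as np

def swish(x, beta=1.0):
    return x * (1 / (1 + np.exp(-beta * x)))

x = np.linspace(-5, 5, 11)
print(np.allclose(swish(x, beta=50), np.maximum(0, x), atol=1e-2))  # ~ReLU for large beta
print(np.allclose(swish(x, beta=1e-4), x / 2, atol=1e-2))           # ~x/2 for tiny beta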
When to use Swish?
Swish can be used in Deep Neural Networks and on large datasets, since it is still reasonably cheap to compute. Swish also addresses the dying ReLU problem, the situation where specific neurons are permanently blocked from firing: because Swish approaches zero smoothly instead of cutting off sharply, small negative inputs still produce small non-zero outputs and gradients, so neurons can recover and keep learning. This smooth gradient also makes Swish well suited to scenarios that require fast, stable optimization in large Deep Neural Networks.
When not to use Swish?
Swish may not be a good choice when the inputs contain large negative values, since those are pushed toward zero anyway, or when they sit very close to zero, where the dip in the curve matters most. It is also unnecessary if you don't need the small negative activations it allows; in that case, the simpler ReLU is usually the better choice.
Swish using Python
import numpy as np

def swish(x, beta=5):
    return x * (1 / (1 + np.exp(-beta * x)))

swish(1)
# Output: 0.9933071490757153
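As a side note (this assumes TensorFlow 2.x; check your framework's docs), Swish with beta fixed at 1 is what most frameworks ship built-in, sometimes under the name SiLU, so in practice you would usually just select it by name instead of writing it yourself:

import tensorflow as tf

# Swish with beta = 1 (also known as SiLU), selected by its built-in name
layer = tf.keras.layers.Dense(64, activation='swish')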
Conclusion
In conclusion, activation functions play a crucial role in the performance of neural networks. They introduce non-linearity, enabling neural networks to learn complex relationships between inputs and outputs. In this article, we have discussed five popular activation functions, namely Sigmoid, Tanh, ReLU, Softmax, and Swish, and when to use or avoid them. It is important to note that the choice of activation function depends on the nature of the problem, the architecture of the network, and the available data. Therefore, it is crucial to experiment with different activation functions to determine the most suitable one for a given problem. As the field of deep learning continues to grow, more activation functions will likely be developed, and researchers will continue to investigate their effectiveness for various applications.