Why ReLU function after every layer in CNN?-CodePudding

I am taking intro to ML on Coursera offered by Duke, which I recommend if you are interested in ML. The instructors of this course explained that "We typically include nonlinearities between layers of a neural network.There's a number of reasons to do so.For one, without anything nonlinear between them, successive linear transforms (fully connected layers) collapse into a single linear transform, which means the model isn't any more expressive than a single layer. On the other hand, intermediate nonlinearities prevent this collapse, allowing neural networks to approximate more complex functions." I am curious that, if I apply ReLU, aren't we losing information since ReLU is transforming every negative value to 0? Then how is this transformation more expressive than that without ReLU?

In Multilayer Perceptron, I tried to run MLP on MNIST dataset without a ReLU transformation, and it seems that the performance didn't change much (92% with ReLU and 90% without ReLU). But still, I am curious why this tranformation gives us more information rather than lose information.

CodePudding user response：

the first point is that without nonlinearities, such as the ReLU function, in a neural network, the network is limited to performing linear combinations of the input. In other words, the network can only learn linear relationships between the input and output. This means that the network can't approximate complex functions that are not linear, such as polynomials or non-linear equations.

Consider a simple example where the task is to classify a 2D data point as belonging to one of two classes based on its coordinates (x, y). A linear classifier, such as a single-layer perceptron, can only draw a straight line to separate the two classes. However, if the data points are not linearly separable, a linear classifier will not be able to classify them accurately. A nonlinear classifier, such as a multi-layer perceptron with a nonlinear activation function, can draw a curved decision boundary and separate the two classes more accurately.

ReLU function increases the complexity of the neural network by introducing non-linearity, which allows the network to learn more complex representations of the data. The ReLU function is defined as f(x) = max(0, x), which sets all negative values to zero. By setting all negative values to zero, the ReLU function creates multiple linear regions in the network, which allows the network to represent more complex functions.

For example, suppose you have a neural network with two layers, where the first layer has a linear activation function and the second layer has a ReLU activation function. The first layer can only perform a linear transformation on the input, while the second layer can perform a non-linear transformation. By having a non-linear function in the second layer, the network can learn more complex representations of the data.

In the case of your experiment, it's normal that the performance did not change much when you removed the ReLU function, because the dataset and the problem you were trying to solve might not be complex enough to require a ReLU function. In other words, a linear model might be sufficient for that problem, but for more complex problems, ReLU can be a critical component to achieve good performance.

It's also important to note that ReLU is not the only function to introduce non-linearity and other non-linear activation functions such as sigmoid and tanh could be used as well. The choice of activation function depends on the problem and dataset you are working with.

CodePudding user response：

Neural networks are inspired by the structure of brain. Neurons in the brain transmit information between different areas of the brain by using electrical impulses and chemical signals. Some signals are strong and some are not. Neurons with weak signals are not activated.

Neural networks work in the same fashion. Some input features have weak and some have strong signals. These depend on the features. If they are weak, the related neurons aren't activated and don't transmit the information forward. We know that some features or inputs aren't crucial players in contributing to the label. For the same reason, we don't bother with feature engineering in neural networks. The model takes care of it. Thus, activation functions help here and tell the model which neurons and how much information they should transmit.