Do linear activation and ReLU activation behave the same when using kernel_constraint NonNeg?


Recently, I worked with kernel constraints in Keras to restrict gradients during training. For my use case (regression), I found the NonNeg constraint pretty useful.

From my understanding, the NonNeg class restricts the gradients to be only positive (presumably using the absolute gradients). Therefore, I wonder whether there is any difference between a linear activation, layers.Dense(1, activation="linear", kernel_constraint="non_neg"), and a ReLU activation, layers.Dense(1, activation="relu", kernel_constraint="non_neg"), once the NonNeg constraint is added. Do you have any insights?
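For context, a minimal sketch of the two variants I have in mind (the feature dimension and the model around them are just placeholders):

```
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(10,))  # placeholder feature dimension

# Variant A: linear activation with the NonNeg weight constraint
out_linear = layers.Dense(1, activation="linear", kernel_constraint="non_neg")(inputs)

# Variant B: ReLU activation with the same constraint
out_relu = layers.Dense(1, activation="relu", kernel_constraint="non_neg")(inputs)

model_linear = keras.Model(inputs, out_linear)
model_relu = keras.Model(inputs, out_relu)
```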

CodePudding user response:

A kernel_constraint affects the weights of the layer. It works by applying the constraint function to the weights after each gradient step. NonNeg, in particular, sets all negative weights to 0 (it does not use the absolute value). Thus:

  • It does not affect the gradients at all, apart from the indirect effect that keeping the weights >= 0 has on subsequent gradients.
  • It does not actually "respect" gradients either -- if the gradient-based optimization pushes a weight to a value < 0, the constraint will set it to 0 directly.
  • It is not the same as a ReLU activation, since relu sets the activations to >= 0, which is once again completely different from having weights and/or gradients >= 0.
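To make the "sets negative weights to 0" point concrete, here is a minimal hand-written sketch of what the constraint does to a kernel (not the actual Keras source):

```
import tensorflow as tf

def non_neg_like(w):
    # Zero out negative entries element-wise; no absolute value involved.
    return w * tf.cast(w >= 0.0, w.dtype)

w = tf.Variable([[-0.3], [0.7], [-1.2]])
w.assign(non_neg_like(w))   # this is what happens to the kernel after each update
print(w.numpy())            # [[0.], [0.7], [0.]]
```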

The one thing you could say is: if your inputs are all >= 0 and you constrain the weights to be >= 0 via kernel_constraint=NonNeg, then the layer outputs will necessarily be >= 0 (assuming the bias is non-negative or absent), so relu will indeed have no effect and you could just as well use a linear activation.
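A quick numeric check of that claim (non-negative random inputs, both layers forced to share the same non-negative kernel, and use_bias=False so a bias term cannot push the outputs below zero):

```
import numpy as np
from tensorflow.keras import layers

x = np.random.rand(8, 5).astype("float32")   # inputs are all >= 0

linear_layer = layers.Dense(1, activation="linear",
                            kernel_constraint="non_neg", use_bias=False)
relu_layer = layers.Dense(1, activation="relu",
                          kernel_constraint="non_neg", use_bias=False)
linear_layer.build((None, 5))
relu_layer.build((None, 5))

# The constraint only kicks in during training, so set non-negative weights by hand
w = np.abs(np.random.randn(5, 1)).astype("float32")
linear_layer.set_weights([w])
relu_layer.set_weights([w])

print(np.allclose(linear_layer(x).numpy(), relu_layer(x).numpy()))  # True
```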

CodePudding user response:

It depends on the inputs of the layer. (Note: this constraint affects only the "weights" of the layer, not the gradients)

So, if you have "only positive weights", the following will happen:

  • If the input is positive:

    • Outputs will be positive (positive x positive = positive)
    • ReLU never gets to do its job; it is equivalent to linear
  • If the input can be negative:

    • Outputs can be negative (positive x negative = negative)
    • ReLU does its job; it differs from linear (see the numeric check after this list)
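A short numeric check of both cases, with hand-picked numbers and the bias ignored:

```
import numpy as np

w = np.array([0.5, 1.0])        # weights kept positive by NonNeg

x_pos = np.array([2.0, 3.0])    # all-positive input
x_neg = np.array([-4.0, 1.0])   # input with a negative component

for x in (x_pos, x_neg):
    pre = float(x @ w)          # "linear" output
    post = max(pre, 0.0)        # "relu" output
    print(pre, post)
# 4.0 4.0   -> ReLU changes nothing, same as linear
# -1.0 0.0  -> ReLU clips, different from linear
```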

So, if you stack two layers like this:

  • First layer with "ReLU" or "sigmoid" (only positive outputs)
  • Next layer with NonNeg constraint

The second layer will then always produce non-negative results, so a ReLU on the second layer has no effect; it behaves just like a linear activation.

If you stack an entire model with NonNeg and ReLU, only the first layer will take advantage of ReLU properly (and only if the input data can be negative).
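Here is a rough sketch of that stacking argument; the layer sizes are arbitrary, biases are dropped so the sign reasoning holds exactly, and the constraint is applied by hand since it normally only runs after training updates:

```
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

first = layers.Dense(8, activation="relu", use_bias=False)
second = layers.Dense(1, activation="relu",
                      kernel_constraint="non_neg", use_bias=False)
model = keras.Sequential([keras.Input(shape=(4,)), first, second])

# Pretend training already pushed the second kernel through the constraint
second.kernel.assign(second.kernel_constraint(second.kernel))

x = np.random.randn(16, 4).astype("float32")
h = first(x)                           # >= 0 thanks to the first ReLU
pre = tf.matmul(h, second.kernel)      # non-negative times non-negative
print(bool(tf.reduce_min(pre) >= 0))   # True: the second ReLU never clips anything
```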


Out of curiosity, here is how this kernel constraint actually works:

  • Keras calculates the gradients
  • The optimizer applies the gradient update to the weights
  • If any weight turns out to be negative after this step
    • The constraint sets it to 0

Gradients can still be negative, of course. If the updates could only push the weights in one direction, the weights would eventually run off toward infinity. The kernel constraint doesn't change the gradients in any way.
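A rough sketch of that order as a manual SGD step (this mimics the bookkeeping described above; it is not the actual Keras training loop):

```
import tensorflow as tf
from tensorflow.keras import layers

layer = layers.Dense(1, kernel_constraint="non_neg", use_bias=False)
layer.build((None, 3))

x = tf.random.normal((8, 3))
y = tf.random.normal((8, 1))

with tf.GradientTape() as tape:
    loss = tf.reduce_mean((layer(x) - y) ** 2)

grad = tape.gradient(loss, layer.kernel)       # 1. gradient, any sign allowed
layer.kernel.assign_sub(0.1 * grad)            # 2. apply the update to the weights
layer.kernel.assign(layer.kernel_constraint(layer.kernel))  # 3. clamp negatives to 0
print(layer.kernel.numpy())                    # no entries below zero
```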
