Here's my custom softplus activation:
def my_softplus(z):
    return tf.math.log(tf.exp(tf.cast(z, tf.float32)) + 1)
If I run a small test:
my_softplus([-3.0, -1.0, 0.0, 2.0])
it returns
<tf.Tensor: shape=(4,), dtype=float32, numpy=array([0.04858733, 0.31326166, 0.6931472 , 2.126928])>
When I run TensorFlow's own softplus activation function:
tf.keras.activations.softplus([-3.0, -1.0, 0.0, 2.0])
I get
<tf.Tensor: shape=(4,), dtype=float32, numpy=array([0.04858736, 0.31326172, 0.6931472 , 2.126928 ], dtype=float32)>
The results are very similar, except for the last few digits, which differ.
When I fit the following model on a subset of the MNIST dataset:
model2 = models.Sequential()
model2.add(layers.Flatten(input_shape=(28, 28)))
model2.add(layers.Dense(16,
                        activation="softplus",  # or my_softplus <- the activation being swapped
                        kernel_initializer=my_glorot_initializer,
                        kernel_regularizer=my_l1_regularizer,
                        # kernel_constraint=my_positive_weights
                        ))
model2.add(layers.Dense(16, activation="relu"))
model2.add(layers.Dense(10, activation="softmax"))
model2.compile(optimizer="rmsprop",
               loss=tf.keras.losses.SparseCategoricalCrossentropy(),
               metrics=["accuracy"])
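(The fit call itself is not shown; a call along these lines, with hypothetical arguments and the partial_x_train/x_val split from the answer below, produces per-epoch logs of the kind shown next.)
# Hypothetical fit call; the actual arguments are not in the question.
history2 = model2.fit(partial_x_train, partial_y_train,
                      epochs=20,                       # matches "Epoch 1/20" in the log
                      validation_data=(x_val, y_val),  # produces the val_loss / val_accuracy columns
                      verbose=2)                       # produces the compact per-epoch lines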
The fitting returns something like
Epoch 1/20
20/20 - 2s - loss: -2.9399e-01 - accuracy: 0.1064 - val_loss: -2.1013e-01 - val_accuracy: 0.1136
Epoch 2/20
20/20 - 1s - loss: -9.9094e-02 - accuracy: 0.1064 - val_loss: 0.0140 - val_accuracy: 0.1136
However, when I use my custom my_softplus activation function, I get NaN losses.
Why is that?
Note: You can comment out the kernel_initializer and kernel_regularizer when building the model, and the results will be similar.
Note 2: Here's a link to a Google Colab notebook with an MWE.
Answer:
In Colab, you did not normalize the data:
# creating a validation set
x_val = x_train[:50000]
partial_x_train = x_train[50000:]
y_val = y_train[:50000]
partial_y_train = y_train[50000:]
So the network had to process very large input values, which yielded a NaN loss.
Example (your implementation):
def my_softplus(z):
    return tf.math.log(tf.exp(tf.cast(z, tf.float32)) + 1)
my_softplus(100)
>> <tf.Tensor: shape=(), dtype=float32, numpy=inf>
When you use softplus (TF's built-in) as the activation in the dense layer, it guards against underflow and overflow.
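For contrast, the built-in activation handles the same large input without overflowing, since softplus(x) ≈ x for large x:
tf.keras.activations.softplus(tf.constant(100.0))  # ≈ 100.0, not inf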
In your setup, if you want your custom activation to give similar results, you need to normalize the data.
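A minimal sketch of that normalization step, assuming the notebook loads MNIST via tf.keras.datasets.mnist:
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

# Scale pixel values from [0, 255] to [0, 1] before splitting off the validation set,
# so the dense layers never receive large raw inputs.
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0

x_val, partial_x_train = x_train[:50000], x_train[50000:]
y_val, partial_y_train = y_train[:50000], y_train[50000:]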
Source code of Softplus: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/softplus_op.h#L31-L58
In case the link changes, I'll copy it here.
template <typename Device, typename T>
struct Softplus {
  // Computes Softplus activation.
  //
  // features: any shape.
  // activations: same shape as "features".
  void operator()(const Device& d, typename TTypes<T>::ConstTensor features,
                  typename TTypes<T>::Tensor activations) {
    // Choose a threshold on x below which exp(x) may underflow
    // when added to 1, but for which exp(x) is always within epsilon of the
    // true softplus(x). Offset of 2 from machine epsilon checked
    // experimentally for float16, float32, float64. Checked against
    // softplus implemented with numpy's log1p and numpy's logaddexp.
    static const T threshold =
        Eigen::numext::log(Eigen::NumTraits<T>::epsilon()) + T(2);
    // Value above which exp(x) may overflow, but softplus(x) == x
    // is within machine epsilon.
    auto too_large = features > features.constant(-threshold);
    // Value below which exp(x) may underflow, but softplus(x) == exp(x)
    // is within machine epsilon.
    auto too_small = features < features.constant(threshold);
    auto features_exp = features.exp();
    activations.device(d) = too_large.select(
        features,                       // softplus(x) ~= x for x large
        too_small.select(features_exp,  // softplus(x) ~= exp(x) for x small
                         features_exp.log1p()));
  }
};
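For completeness, here is a sketch of a numerically stable replacement for my_softplus in Python. It uses the standard identity softplus(x) = max(x, 0) + log1p(exp(-|x|)) rather than the kernel's exact thresholding, but it sidesteps the overflow in the same spirit:
import tensorflow as tf

def stable_softplus(z):
    # exp(-|z|) never overflows, and log1p keeps precision when its
    # argument is tiny; the result equals log(1 + exp(z)).
    z = tf.cast(z, tf.float32)
    return tf.maximum(z, 0.0) + tf.math.log1p(tf.exp(-tf.abs(z)))

stable_softplus(100.0)                   # ≈ 100.0 instead of inf
stable_softplus([-3.0, -1.0, 0.0, 2.0])  # matches tf.keras.activations.softplus
Dropping this in as the Dense layer's activation avoids the inf/NaN path, although normalizing the inputs is still the better fix.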