what is the difference between keras.MeanSquaredError and reduce_sum(square(diff))?


I've been trying to figure this out for a couple of hours now. The simple fix for the program below is to just use keras.MSE, but I want to understand why my version doesn't work more than I want this program to work.

It seems to me the mean of the square of the difference ought to be really close to keras.MSE. I expect small differences, but mine starts out close and just gets worse and worse, and I can't figure out why.

step=0 theirs= 13.1761 mine= 14.0251
step=5 theirs= 10.3337 mine= 11.8363
…
step=90 theirs=  0.0361 mine=  6.9888
step=95 theirs=  0.0332 mine=  6.9604

I've been source diving all through keras and tensorflow. I got down to backend.mean(tf.math.squared_difference(y_pred, y_true), axis=-1) in keras/losses.py, and that seems really similar to mine. I admit the code behind tf.math.squared_difference makes little sense to me, but it works out roughly the same as tf.square(y_true-y_pred) in ipython.
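For example, a quick check along these lines (the values here are just made up) gives the same numbers for both forms when the shapes match:

import tensorflow as tf

a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[0.5, 2.5], [2.0, 6.0]])

# Both compute (a - b)^2 element-wise; squared_difference is a single op for the same arithmetic.
print(tf.math.squared_difference(a, b))  # [[0.25 0.25] [1. 4.]]
print(tf.square(a - b))                  # [[0.25 0.25] [1. 4.]]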

I'm definitely missing something.

Here's my tiny program:

import tensorflow as tf
import numpy as np

def small_ds():
    in_t = tf.cast(np.random.randint(5, size=(24, 2)), tf.float32)
    out_t = tf.reduce_sum(in_t, axis=-1)
    return in_t, out_t

def small_model():
    i = tf.keras.layers.Input(shape=(2,))
    d = i
    d = tf.keras.layers.Dense(32, activation="LeakyReLU")(d)
    d = tf.keras.layers.Dense(32, activation="LeakyReLU")(d)
    d = tf.keras.layers.Dense(32, activation="LeakyReLU")(d)
    o = tf.keras.layers.Dense(1, activation="LeakyReLU")(d)
    m = tf.keras.Model(inputs=i, outputs=o)
    return m

def what_is_happening_here():
    opt = tf.keras.optimizers.Adam()
    tf_mse = tf.keras.losses.MeanSquaredError()

    @tf.function
    def my_mse(y_true, y_pred):
        return tf.reduce_mean(tf.square(y_true-y_pred))

    m = small_model()

    @tf.function
    def train_step(x_input, y_true):
        with tf.GradientTape() as tape:
            y_pred = m(x_input, training=True)
            theirs = tf_mse(y_true, y_pred)
            mine   = my_mse(y_true, y_pred)
        grad = tape.gradient(theirs, m.trainable_variables)
        opt.apply_gradients(zip(grad, m.trainable_variables))
        return theirs, mine

    x_input, y_true = small_ds()
    for step in range(100):
        theirs, mine = train_step(x_input, y_true)
        if (step % 5) == 0:
            print(f'step={step} theirs={theirs:8.4f} mine={mine:8.4f}')

if __name__ == '__main__':
    what_is_happening_here()

edit:

I was initially convinced by the first answer, but I don't think it's quite right. If I generate some completely random vectors and run them through mine vs theirs, with and without reductions, everything is identical. This version of my_mse is slightly different from the one above, but the same strange non-linear differences show up if I run the modified version through the training loop above.

I think the optimizer or the graph is doing something else that I can't see.
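For anyone else poking at this: one way to see what the traced step actually receives is to drop a tf.print of the runtime shapes into train_step. This is just the train_step from above with one line added (it relies on the m, opt, tf_mse and my_mse defined there):

@tf.function
def train_step(x_input, y_true):
    with tf.GradientTape() as tape:
        y_pred = m(x_input, training=True)
        # tf.print (unlike plain print) still fires inside a traced tf.function
        tf.print("y_true shape:", tf.shape(y_true), "y_pred shape:", tf.shape(y_pred))
        theirs = tf_mse(y_true, y_pred)
        mine   = my_mse(y_true, y_pred)
    grad = tape.gradient(theirs, m.trainable_variables)
    opt.apply_gradients(zip(grad, m.trainable_variables))
    return theirs, mine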

I realize this is a trivial problem in the grand scheme of things, but I'd like to be able to write my own loss functions at some point, and I really don't trust them to work the same as the native ones.

I've also tried wrapping my loss function in the tf.keras.losses.Loss class, but everything turns out the same (i.e., it still doesn't work).
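For reference, by "wrapping" I mean something along these lines (a minimal sketch; the name MyMSE is just for illustration):

import tensorflow as tf

class MyMSE(tf.keras.losses.Loss):
    # The base class applies the reduction; call() only returns the per-sample loss.
    def call(self, y_true, y_pred):
        return tf.reduce_mean(tf.square(y_true - y_pred), axis=-1)

Calling MyMSE()(y_true, y_pred) then goes through the base class reduction, but any shape handling inside call() is still up to you.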

In [22]: tf_mse = tf.keras.losses.MeanSquaredError(reduction=tf.keras.losses.Reduction.NONE)
    ...: my_mse = lambda x,y: tf.reduce_mean(tf.square(x-y), axis=-1)
    ...:
    ...: tf_mser = tf.keras.losses.MeanSquaredError()
    ...: my_mser = lambda x,y: tf.reduce_mean(my_mse(x,y))
    ...:
    ...: y_true = tf.cast(np.random.randint(10, size=(6,1)), tf.float32)
    ...: y_pred = tf.cast(np.random.randint(10, size=(6,1)), tf.float32)
    ...:
    ...: i = tf.keras.layers.Input(shape=(1,))
    ...: o = tf.keras.layers.Dense(32)(i)
    ...: m = tf.keras.Model(inputs=i, outputs=o)
    ...:
    ...: m_pred = m(y_pred)
    ...:
    ...: for a,b in [(tf_mse, tf_mser), (my_mse, my_mser)]:
    ...:     print(f'{a(y_true, y_pred).numpy()} -> {b(y_true, y_pred)}')
    ...:
    ...: for a,b in [(tf_mse, tf_mser), (my_mse, my_mser)]:
    ...:     print(f'{a(y_true, m_pred).numpy()} -> {b(y_true, m_pred)}')
[ 9.  4.  4.  9. 16.  4.] -> 7.666666507720947
[ 9.  4.  4.  9. 16.  4.] -> 7.666666507720947
[20.47 30.15 19.62  9.   12.8  19.62] -> 18.608726501464844
[20.47 30.15 19.62  9.   12.8  19.62] -> 18.608726501464844

edit2:

ok, I get it. I get it. If I change the small_ds above to do this, everything works fine:

def small_ds():
    in_t = tf.cast(np.random.randint(5, size=(24, 2)), tf.float32)
    out_t = tf.expand_dims(tf.reduce_sum(in_t, axis=-1), -1)
    return in_t, out_t

I think the secret to figuring this out was still in the first answer, though it wasn't spelled out exactly... The first mean in the real MSE works on axis -1, so if your shape is (24,), it's always going to reduce it completely -- regardless of the reduction setting.

By adding the expand_dims, my poor man's recreation of mse works just like the native one.
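Part of what was throwing my version off, I think, is plain broadcasting: with y_true shaped (24,) and y_pred shaped (24, 1), y_true - y_pred broadcasts to (24, 24), so my reduce_mean was averaging the squared difference between every target and every prediction, not just the matching pairs. A tiny check (shapes only, the values don't matter):

import tensorflow as tf

y_true = tf.zeros((24,))     # label shape before the expand_dims fix
y_pred = tf.zeros((24, 1))   # what the Dense(1) model actually outputs

print((y_true - y_pred).shape)   # (24, 24) -- broadcast, not element-wise

y_true = tf.expand_dims(y_true, -1)
print((y_true - y_pred).shape)   # (24, 1) -- element-wise, as intended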

CodePudding user response:

According to the docs, the MeanSquaredError loss function has the parameter reduction, which is set to losses_utils.ReductionV2.AUTO by default. Now this means that:

the reduction option will be determined by the usage context. For almost all cases this defaults to SUM_OVER_BATCH_SIZE.

So I think it depends on which reduction method you are using and your batch size. Try changing your small_ds() method like this:

def small_ds():
  in_t = tf.cast(np.random.randint(5, size=(1, 2)), tf.float32)
  out_t = tf.reduce_sum(in_t, axis=-1)
  return in_t, out_t

You will notice that your results are identical for a batch size of 1:

Input shape:  (1, 2)
step=0 theirs= 30.9056 mine= 30.9056
step=5 theirs= 21.4109 mine= 21.4109
step=10 theirs= 13.2141 mine= 13.2141
....
step=75 theirs=  0.0004 mine=  0.0004
step=80 theirs=  0.0055 mine=  0.0055
step=85 theirs=  0.0054 mine=  0.0054
step=90 theirs=  0.0015 mine=  0.0015
step=95 theirs=  0.0000 mine=  0.0000

Example with reduction='none':

y_true = tf.constant([[0., 2.], [0., 0.]])
y_pred = tf.constant([[3., 1.], [2., 5.]])

tf_mse = tf.keras.losses.MeanSquaredError(reduction='none')
print(tf_mse(y_true, y_pred).numpy())

my_mse = tf.reduce_mean(tf.square(y_true-y_pred))
print(my_mse)
'''
[ 5.  14.5]
tf.Tensor(9.75, shape=(), dtype=float32)
'''

And with tf.reduce_mean:

y_true = tf.constant([[0., 2.], [0., 0.]])
y_pred = tf.constant([[3., 1.], [2., 5.]])

tf_mse = tf.keras.losses.MeanSquaredError(reduction='none')
print(tf.reduce_mean(tf_mse(y_true, y_pred).numpy()))

my_mse = tf.reduce_mean(tf.square(y_true-y_pred))
print(my_mse)
'''
tf.Tensor(9.75, shape=(), dtype=float32)
tf.Tensor(9.75, shape=(), dtype=float32)
'''

And with reduction='sum':

y_true = tf.constant([[0., 2.], [0., 0.]])
y_pred = tf.constant([[3., 1.], [2., 5.]])

tf_mse = tf.keras.losses.MeanSquaredError(reduction='sum')
print(tf_mse(y_true, y_pred).numpy())

my_mse = tf.reduce_mean(tf.square(y_true-y_pred))
print(my_mse)
'''
19.5
tf.Tensor(9.75, shape=(), dtype=float32)
'''
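And for completeness, with the default reduction (SUM_OVER_BATCH_SIZE) and matching shapes, the built-in loss and the plain tf.reduce_mean agree, since the mean of the per-sample means equals the mean over all elements here:

y_true = tf.constant([[0., 2.], [0., 0.]])
y_pred = tf.constant([[3., 1.], [2., 5.]])

tf_mse = tf.keras.losses.MeanSquaredError()  # default reduction: SUM_OVER_BATCH_SIZE
print(tf_mse(y_true, y_pred).numpy())

my_mse = tf.reduce_mean(tf.square(y_true-y_pred))
print(my_mse)
'''
9.75
tf.Tensor(9.75, shape=(), dtype=float32)
'''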