I have TFF code that takes a slightly different optimization path across training runs, despite having set all the operator-level seeds, the numpy seed for sampling clients in each round, etc. The FAQ section on the TFF website does discuss randomness and what to expect in TFF, but I found the answer slightly confusing. Is it the case that some aspects of the randomness can't be directly controlled even after setting all the operator-level seeds one can, because one can't control the way sub-sessions are started and ended?
To be more specific, these are all the seeds my code already sets: an operator-level seed for `dataset.shuffle`, the seed argument of `create_tf_dataset_from_all_clients`, seeds for the `keras.initializers`, and `np.random.seed` for per-round client sampling (which uses numpy). I have verified that the initial model state is the same across runs, but as soon as training starts, the model states begin diverging across runs. The divergence is gradual/slow in most cases, but not always.
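Roughly, the seeding boils down to something like this (a simplified sketch; the dataset, model, and names here are placeholders, not my actual code):

```python
import numpy as np
import tensorflow as tf

SEED = 42

# numpy seed that governs per-round client sampling
np.random.seed(SEED)
client_ids = [f"client_{i}" for i in range(100)]  # placeholder IDs
sampled_clients = np.random.choice(client_ids, size=10, replace=False)

# operator-level seed on every dataset shuffle
# (create_tf_dataset_from_all_clients is likewise called with seed=SEED)
def preprocess(ds: tf.data.Dataset) -> tf.data.Dataset:
    return ds.shuffle(buffer_size=100, seed=SEED).batch(20)

# seeded Keras initializers, which make the initial model state identical
initializer = tf.keras.initializers.GlorotUniform(seed=SEED)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, kernel_initializer=initializer),
])
```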
The full code is quite complex, so I am not including it here.
CodePudding user response:
There is one more source of non-determinism that is very hard to control: summation of float32 numbers is not associative, so the result of adding up many values depends on the order in which they are added.
When you simulate a number of clients in a round, the TFF executor does not have a way to control the order in which the model updates are added together. As a result, there can be small differences in the low-order bits of the float32 values. While this may sound negligible, it can add up over a number of rounds (I have seen it take hundreds, but it can also be fewer), and eventually cause different loss/accuracy/model-weight trajectories, as the gradients start to be computed at slightly different points.
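You can see this effect outside of TFF with a few lines of numpy (a minimal sketch, not TFF's actual aggregation code): accumulating the same float32 values in two different orders generally produces two slightly different totals.

```python
import numpy as np

rng = np.random.default_rng(0)
updates = rng.standard_normal(10_000).astype(np.float32)

# Accumulate the same values in two different orders.
total_a = np.float32(0.0)
for u in updates:
    total_a += u

total_b = np.float32(0.0)
for u in updates[::-1]:
    total_b += u

# The totals typically differ in the last few bits.
print(total_a, total_b, total_a == total_b)
```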
BTW, this tutorial has more info on best practices for controlling randomness in TFF.