I'm working with MobileNets and trying to understand the intuition for why activations have 16 bits while weights have 8 bits. Empirically I see it, but intuitively, what's the reason for the big gap between 8 and 16 bits (other than that most deployment hardware doesn't offer anything between 8 and 16, which is a different discussion)? AKA why aren't the activations also 8 bits?
I think part of my misunderstanding is that I don't understand what "activations" means in this context. The weights are the parts that get optimized by gradient descent, but what exactly are the activations? I know what an activation function is, e.g. sigmoid/relu, but I don't understand what the activations are or why they need to be stored with the model in addition to the weights and biases (which the link below doesn't talk about; quantization of the biases matters for optimization too)
https://www.tensorflow.org/lite/performance/model_optimization
CodePudding user response:
Activations are the actual signals propagating through the network. They have nothing to do with the activation function; that is just a name collision. They can be kept at higher precision because they are not part of the model: they don't affect storage or download size, and since you are not training, you never need to keep activations beyond the current layer, so their memory cost during inference stays small.
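To make that split concrete, here is a minimal sketch (placeholders, not a drop-in recipe) of TensorFlow Lite's post-training "16x8" mode, which is exactly the 8-bit-weights / 16-bit-activations configuration being discussed; the SavedModel path and the random representative dataset are stand-ins for your own MobileNet and real calibration inputs.

import numpy as np
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("mobilenet_savedmodel")  # placeholder path

def representative_dataset():
    # A few sample inputs so the converter can calibrate activation ranges.
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# 16-bit activations with 8-bit weights.
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.EXPERIMENTAL_TFLITE_BUILTINS_ACTIVATIONS_INT16_WEIGHTS_INT8
]
tflite_model = converter.convert()

The 8-bit weights end up in the serialized .tflite file, while the 16-bit activation tensors only exist at runtime, which is why they don't show up in the model size.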
For example, for an MLP we have something along the lines of
a1 = relu(W1 x + b1)
a2 = relu(W2 a1 + b2)
...
an = Wn a(n-1) + bn
where each W and b is an 8-bit parameter, and the activations are a1, ..., an. The thing is, you only need the previous and current layer: to compute ak you just need a(k-1), not the earlier ones, so storing them at higher precision during computation is just a good tradeoff.
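To see that tradeoff in code, here is a toy numpy sketch that simulates the scheme in floating point (real kernels do the arithmetic in integers, and the shapes and helper names here are made up for illustration): weights and biases are quantized once to 8 bits and stored, while each activation is re-quantized to 16 bits on the fly and immediately overwrites the previous one.

import numpy as np

def quantize(x, num_bits):
    # Symmetric per-tensor quantization to signed integers.
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# A random 3-layer MLP standing in for trained float parameters.
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(32, 64)), rng.normal(size=64)),
          (rng.normal(size=(64, 64)), rng.normal(size=64)),
          (rng.normal(size=(64, 10)), rng.normal(size=10))]

# Weights and biases are quantized once, offline, to 8 bits and shipped with the model.
q_layers = [(quantize(W, 8), quantize(b, 8)) for W, b in layers]

def forward(x):
    # Activations are quantized to 16 bits on the fly; only the current one is kept.
    a, a_scale = quantize(x, 16)
    for i, ((qW, w_scale), (qb, b_scale)) in enumerate(q_layers):
        z = dequantize(a, a_scale) @ dequantize(qW, w_scale) + dequantize(qb, b_scale)
        if i < len(q_layers) - 1:
            z = np.maximum(z, 0.0)  # relu on the hidden layers, as in the equations above
        a, a_scale = quantize(z, 16)  # re-quantizing overwrites the previous activation
    return dequantize(a, a_scale)

out = forward(rng.normal(size=(1, 32)).astype(np.float32))

Counting bytes makes the asymmetry obvious: the 8-bit weights dominate what gets stored (about 6.8k parameters in this toy model), while at most 64 16-bit activation values are alive at any moment during inference.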