Why is the weight of BERT's intermediate.dense [3072, 768]? 3072 = 4 * H, but why 4 times?

Time:10-02

bert.encoder.layer.0.attention.self.query.weight [768, 768] nn.Linear
bert.encoder.layer.0.attention.self.query.bias [768]
bert.encoder.layer.0.attention.self.key.weight [768, 768] nn.Linear
bert.encoder.layer.0.attention.self.key.bias [768]
bert.encoder.layer.0.attention.self.value.weight [768, 768] nn.Linear
bert.encoder.layer.0.attention.self.value.bias [768]
bert.encoder.layer.0.attention.output.dense.weight [768, 768] nn.Linear
bert.encoder.layer.0.attention.output.dense.bias [768]
bert.encoder.layer.0.attention.output.LayerNorm.weight [768] nn.LayerNorm
bert.encoder.layer.0.attention.output.LayerNorm.bias [768]
bert.encoder.layer.0.intermediate.dense.weight [3072, 768] nn.Linear
bert.encoder.layer.0.intermediate.dense.bias [3072]
bert.encoder.layer.0.output.dense.weight [768, 3072] nn.Linear
bert.encoder.layer.0.output.dense.bias [768]
bert.encoder.layer.0.output.LayerNorm.weight [768] nn.LayerNorm
bert.encoder.layer.0.output.LayerNorm.bias [768]
These are the parameters of BERT layer 0. I understand the others, but not the 3072. Every explanation I have read says only that it equals 4 * H (H = 768), without saying why. Is it related to the number of attention heads, or to something else?
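
As far as I know, the 3072 is not derived from the attention heads. It is the width of the feed-forward block, a separate hyperparameter that the BERT paper sets to 4H (the Hugging Face config exposes it as intermediate_size). The shape reads [3072, 768] rather than [768, 3072] because nn.Linear stores its weight as [out_features, in_features]. Below is a minimal sketch in plain Python that rebuilds the listing above from those two numbers. The helper name bert_layer_param_shapes and its arguments are made up for illustration and are not part of any library.

```python
def bert_layer_param_shapes(hidden_size=768, intermediate_size=None, layer=0):
    """Hypothetical helper: expected parameter shapes of one BERT encoder layer."""
    if intermediate_size is None:
        # Assumption: follow the 4 * H convention used by BERT-base.
        intermediate_size = 4 * hidden_size
    h, ff = hidden_size, intermediate_size
    prefix = f"bert.encoder.layer.{layer}"
    shapes = {}

    # Attention projections map H -> H, so the head count never appears
    # in these shapes (the heads only split H internally).
    for name in ("self.query", "self.key", "self.value", "output.dense"):
        shapes[f"{prefix}.attention.{name}.weight"] = (h, h)
        shapes[f"{prefix}.attention.{name}.bias"] = (h,)
    shapes[f"{prefix}.attention.output.LayerNorm.weight"] = (h,)
    shapes[f"{prefix}.attention.output.LayerNorm.bias"] = (h,)

    # Feed-forward block: expand H -> ff, then project ff -> H.
    # Weights are laid out as [out_features, in_features], as in nn.Linear.
    shapes[f"{prefix}.intermediate.dense.weight"] = (ff, h)
    shapes[f"{prefix}.intermediate.dense.bias"] = (ff,)
    shapes[f"{prefix}.output.dense.weight"] = (h, ff)
    shapes[f"{prefix}.output.dense.bias"] = (h,)
    shapes[f"{prefix}.output.LayerNorm.weight"] = (h,)
    shapes[f"{prefix}.output.LayerNorm.bias"] = (h,)
    return shapes


shapes = bert_layer_param_shapes()
for name, shape in shapes.items():
    print(name, list(shape))
```

Passing a different intermediate_size changes only the four feed-forward entries, which suggests the factor of 4 is a design choice and not something forced by the rest of the layer.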