bert.encoder.layer.0.attention.self.query.weight [768, 768] nn.Linear
bert.encoder.layer.0.attention.self.query.bias [768]
bert.encoder.layer.0.attention.self.key.weight [768, 768] nn.Linear
bert.encoder.layer.0.attention.self.key.bias [768]
bert.encoder.layer.0.attention.self.value.weight [768, 768] nn.Linear
bert.encoder.layer.0.attention.self.value.bias [768]
bert.encoder.layer.0.attention.output.dense.weight [768, 768] nn.Linear
bert.encoder.layer.0.attention.output.dense.bias [768]
bert.encoder.layer.0.attention.output.LayerNorm.weight [768] nn.LayerNorm
bert.encoder.layer.0.attention.output.LayerNorm.bias [768]
bert.encoder.layer.0.intermediate.dense.weight [3072, 768] nn.Linear
bert.encoder.layer.0.intermediate.dense.bias [3072]
bert.encoder.layer.0.output.dense.weight [768, 3072] nn.Linear
bert.encoder.layer.0.output.dense.bias [768]
bert.encoder.layer.0.output.LayerNorm.weight [768] nn.LayerNorm
bert.encoder.layer.0.output.LayerNorm.bias [768]
These are the parameters of BERT layer 0. The rest of them I can follow, but where does this 3072 come from? Every explanation I've found only says it equals 4 * H (4 * 768), but why? Does it have something to do with the attention heads?
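For what it's worth, the 3072 belongs to the feed-forward sub-layer, not to the attention heads: BERT-base follows the Transformer convention of intermediate_size = 4 * hidden_size = 3072 (a design choice, not a derived quantity), while the 12 heads only partition the 768 hidden dimension (12 × 64). A minimal PyTorch sketch of that feed-forward block, which produces exactly the [3072, 768] and [768, 3072] weights in the listing (variable names are mine, not from the checkpoint):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

H = 768              # hidden_size
I = 4 * H            # intermediate_size = 3072, a fixed 4x expansion by convention
num_heads = 12       # heads split H (768 = 12 * 64); they never touch I

# The two nn.Linear modules behind the intermediate/output weights above.
# Note nn.Linear stores weight as [out_features, in_features].
intermediate = nn.Linear(H, I)   # weight shape [3072, 768]
output = nn.Linear(I, H)         # weight shape [768, 3072]

x = torch.randn(1, 5, H)         # (batch, seq_len, hidden)
h = F.gelu(intermediate(x))      # expand 768 -> 3072, apply GELU
y = output(h)                    # project 3072 -> 768 again

print(intermediate.weight.shape)  # torch.Size([3072, 768])
print(output.weight.shape)        # torch.Size([768, 3072])
print(y.shape)                    # torch.Size([1, 5, 768])
```

So the layer widens each token's representation by 4x, applies a nonlinearity, and projects back; the factor 4 is simply what "Attention Is All You Need" used (d_ff = 2048 for d_model = 512) and BERT kept.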