MultiHeadAttention giving very different values between versions (PyTorch/TensorFlow)


I'm trying to port a transformer that was written in PyTorch over to TensorFlow. Everything was going pretty well until the two versions of MultiHeadAttention started giving extremely different outputs. Both are implementations of the multi-head attention described in the paper "Attention Is All You Need", so they should be able to produce the same output.

I'm converting

self_attn = nn.MultiheadAttention(dModel, nheads, dropout=dropout)

to

self_attn = MultiHeadAttention(num_heads=nheads, key_dim=dModel, dropout=dropout)

For my tests, dropout is 0.

I'm calling them with:

self_attn(x, x, x)

where x is a tensor with shape=(10, 128, 50)
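
For reference, here's a minimal version of my test setup (dModel, nheads, and the dropout value are just the placeholders I'm testing with):

import torch
import torch.nn as nn
import tensorflow as tf

dModel, nheads, dropout = 50, 5, 0.0  # placeholder values for the test

# PyTorch version
x_pt = torch.rand(10, 128, 50)
pt_attn = nn.MultiheadAttention(dModel, nheads, dropout=dropout)
pt_out = pt_attn(x_pt, x_pt, x_pt)  # comes back as a tuple

# TensorFlow version
x_tf = tf.random.uniform((10, 128, 50))
tf_attn = tf.keras.layers.MultiHeadAttention(num_heads=nheads, key_dim=dModel, dropout=dropout)
tf_out = tf_attn(x_tf, x_tf, x_tf)  # comes back as a single tensor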

As expected from the documentation, the PyTorch version returns a tuple of two tensors, (attn_output, attn_output_weights), with attn_output having shape [10, 128, 50].

I'm having trouble getting the TensorFlow version to do the same thing. With TensorFlow I only get one tensor back (shape [10, 128, 50]), and it matches neither tensor of the PyTorch tuple. Based on the TensorFlow documentation I should be getting something comparable.

How can I get them to operate the same way? I'm guessing I'm doing something wrong with TensorFlow, but I can't figure out what.

CodePudding user response:

By default, nn.MultiheadAttention returns a tuple of two tensors:

  • attn_output -- the result of the attention operation

  • attn_output_weights -- the attention weights, averaged(!) over the heads (see the shape sketch after this list)
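
A quick shape check (a minimal sketch; embed_dim=50 and num_heads=5 are assumed values chosen to match the question's tensor):

import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=50, num_heads=5)  # assumed sizes
x = torch.rand(10, 128, 50)  # (seq_length, batch_size, embed_dim)

attn_output, attn_output_weights = attn(x, x, x)
print(attn_output.shape)          # torch.Size([10, 128, 50])
print(attn_output_weights.shape)  # torch.Size([128, 10, 10]) -- averaged over heads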

tf.keras.layers.MultiHeadAttention, by contrast, returns only one tensor by default: attention_output (which corresponds to PyTorch's attn_output). The attention weights of all heads are returned as well if the parameter return_attention_scores is set to True, like this:

output, scores = self_attn(x, x, x, return_attention_scores=True)

The scores tensor should also be averaged over the head axis to achieve full correspondence with PyTorch:

scores = tf.math.reduce_mean(scores, 1)
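
Putting it together, a minimal sketch (num_heads=5 and key_dim=10 are assumed values here; note that key_dim in Keras is the size of each head, so an embedding size of 50 split over 5 heads gives 10, not 50):

import tensorflow as tf

# Assumed sizes for illustration: batch=128, seq=10, embed=50, 5 heads.
self_attn = tf.keras.layers.MultiHeadAttention(num_heads=5, key_dim=10, dropout=0.0)

x = tf.random.normal((128, 10, 50))  # (batch_size, seq_length, embed_dim)

output, scores = self_attn(x, x, x, return_attention_scores=True)
# output: (128, 10, 50)
# scores: (128, 5, 10, 10) -- one attention map per head

scores = tf.math.reduce_mean(scores, 1)
# scores: (128, 10, 10) -- comparable to PyTorch's attn_output_weights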

While rewriting, keep in mind that by default (as in the snippet in the question) nn.MultiheadAttention expects input of shape (seq_length, batch_size, embed_dim), while tf.keras.layers.MultiHeadAttention expects (batch_size, seq_length, embed_dim).
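
For example, reusing the self_attn layer from the sketch above, the conversion could look like this:

import tensorflow as tf

x = tf.random.normal((10, 128, 50))  # PyTorch layout: (seq_length, batch_size, embed_dim)

# Rearrange to the (batch_size, seq_length, embed_dim) layout Keras expects.
x_tf = tf.transpose(x, perm=[1, 0, 2])  # -> (128, 10, 50)

output, scores = self_attn(x_tf, x_tf, x_tf, return_attention_scores=True)

# Transpose back if the rest of the model keeps the PyTorch layout.
output = tf.transpose(output, perm=[1, 0, 2])  # -> (10, 128, 50)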
