Masking layer vs attention_mask parameter in MultiHeadAttention


I use a MultiHeadAttention layer in my transformer model (the model is very similar to named entity recognition models). Because my sequences come in different lengths, I pad them and use the attention_mask parameter of MultiHeadAttention to mask the padding. If I used a Masking layer before MultiHeadAttention instead, would it have the same effect as the attention_mask parameter? Or should I use both: attention_mask and the Masking layer?
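For reference, a minimal sketch of the explicit-mask approach described in the question (not the asker's actual model; the padding ID 0, vocabulary size, and layer sizes are assumptions):

```python
import tensorflow as tf

d_model = 16
token_ids = tf.constant([[5, 7, 9, 0, 0, 0]])            # one padded sequence, 0 = padding
embeddings = tf.keras.layers.Embedding(100, d_model)(token_ids)

# Boolean padding mask, shape (batch, seq_len): True for real tokens.
padding_mask = tf.not_equal(token_ids, 0)
# MultiHeadAttention expects (batch, query_len, key_len); broadcast into a 2D mask.
attention_mask = padding_mask[:, tf.newaxis, :] & padding_mask[:, :, tf.newaxis]

mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=d_model)
out = mha(query=embeddings, value=embeddings, attention_mask=attention_mask)
print(out.shape)  # (1, 6, 16)
```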

CodePudding user response:

The Masking layer keeps the input values as they are and creates a mask vector that is propagated to the following layers if they can use one (like RNN layers). You can use it if you implement your own model, for example if you want to keep the mask vector around for later use. If you use models from Hugging Face, the masking operations are already built in, so there is no need to add a Masking layer at the beginning.
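A minimal sketch of that behaviour: the Masking layer leaves the values unchanged but computes a mask that a mask-aware layer (here an LSTM, chosen only for illustration) consumes automatically:

```python
import tensorflow as tf

x = tf.constant([[[1.0], [2.0], [0.0], [0.0]]])     # trailing zeros are padding
masking = tf.keras.layers.Masking(mask_value=0.0)
masked = masking(x)                                  # values unchanged

print(masking.compute_mask(x))   # [[ True  True False False]]
lstm = tf.keras.layers.LSTM(4)   # picks up the propagated mask, skips padded steps
print(lstm(masked).shape)        # (1, 4)
```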

CodePudding user response:

The TensorFlow guide Masking and padding with Keras may be helpful.
The following is an excerpt from that guide.

When using the Functional API or the Sequential API, a mask generated by an Embedding or Masking layer will be propagated through the network for any layer that is capable of using them (for example, RNN layers). Keras will automatically fetch the mask corresponding to an input and pass it to any layer that knows how to use it.

tf.keras.layers.MultiHeadAttention also supports automatic mask propagation as of TF 2.10.0: https://github.com/tensorflow/tensorflow/releases/tag/v2.10.0
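So with TF 2.10 or later you can rely on automatic propagation instead of passing attention_mask yourself. A minimal sketch (the vocabulary size, layer sizes, and the Dense output head are assumptions for illustration):

```python
import tensorflow as tf  # requires TF >= 2.10

inputs = tf.keras.Input(shape=(None,), dtype="int32")
# mask_zero=True makes the Embedding layer generate the padding mask;
# MultiHeadAttention picks it up automatically, no attention_mask argument needed.
x = tf.keras.layers.Embedding(100, 16, mask_zero=True)(inputs)
x = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=16)(x, x)
outputs = tf.keras.layers.Dense(5, activation="softmax")(x)  # e.g. per-token NER tags
model = tf.keras.Model(inputs, outputs)

model(tf.constant([[5, 7, 9, 0, 0, 0]]))  # padded positions are masked in attention
```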
