I have been doing some NLP categorisation tasks and noticed that my models train much faster if I use post-padding instead of pre-padding, and was wondering why that is the case.
I am using Google Colab to train these models with the GPU runtime. Here is my preprocessing code:
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.preprocessing import OneHotEncoder

PADDING = 'post'
# Tokenising the input strings and padding
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(X)
X_tokenized = tokenizer.texts_to_sequences(X)
X_padded = pad_sequences(X_tokenized, maxlen=80, truncating='post', padding=PADDING)
X_train = np.array(X_padded)
# Encoding output one
y1 = y1.to_numpy().reshape(-1, 1) # Reshape to an array of features
encoder_1 = OneHotEncoder() # Instantiate encoder
y1 = encoder_1.fit_transform(y1) # Fit encoder to output
y1 = y1.toarray() # Make output a numpy array
# Encoding output two
y2 = y2.to_numpy().reshape(-1, 1)
encoder_2 = OneHotEncoder()
y2 = encoder_2.fit_transform(y2)
y2 = y2.toarray()
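To make the difference between the two modes concrete, here is a tiny standalone example (with made-up sequences) of what 'post' versus 'pre' padding produces:
from tensorflow.keras.preprocessing.sequence import pad_sequences

seqs = [[5, 3, 8], [2, 7]]
print(pad_sequences(seqs, maxlen=5, padding='post'))
# [[5 3 8 0 0]
#  [2 7 0 0 0]]
print(pad_sequences(seqs, maxlen=5, padding='pre'))
# [[0 0 5 3 8]
#  [0 0 0 2 7]]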
Now to create my model:
from tensorflow.keras.layers import Input, Embedding, Bidirectional, LSTM, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.losses import CategoricalCrossentropy

# --- MODEL PARAMETERS ---
vocab_size = len(tokenizer.index_word) + 1  # +1 for the padding index 0
y1_size = len(encoder_1.categories_[0])
y2_size = len(encoder_2.categories_[0])
embedding_size = 175
units = 96
# --- MODEL ARCHITECTURE ---
inputs = Input(shape=(None,))
input_embeddings = Embedding(vocab_size, embedding_size, mask_zero=True)(inputs)
shared_lstm = Bidirectional(LSTM(units, return_sequences=True, dropout=0.3))(input_embeddings)
y1_lstm = Bidirectional(LSTM(units, dropout=0.3))(shared_lstm)
y1_dense = Dense(y1_size, activation='softmax', name='y1')(y1_lstm)
y2_lstm = Bidirectional(LSTM(units, dropout=0.3))(shared_lstm)
y2_dense = Dense(y2_size, activation='softmax', name='y2')(y2_lstm)
split_shared_model = Model(inputs=inputs, outputs=[y1_dense, y2_dense])
Which is then compiled as:
split_shared_model.compile(
    optimizer='adam',
    loss=CategoricalCrossentropy(),
    metrics=['accuracy']
)
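Since the two output layers are named 'y1' and 'y2', the compile step could equivalently be written with per-output dictionaries (this is just a sketch of the same call, not a change to the setup above):
split_shared_model.compile(
    optimizer='adam',
    loss={'y1': CategoricalCrossentropy(), 'y2': CategoricalCrossentropy()},
    metrics={'y1': 'accuracy', 'y2': 'accuracy'}
)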
The summary of the model is as follows:
__________________________________________________________________________________________________
 Layer (type)                      Output Shape         Param #    Connected to
==================================================================================================
 input_4 (InputLayer)              [(None, None)]       0          []
 embedding_3 (Embedding)           (None, None, 175)    19075      ['input_4[0][0]']
 bidirectional_8 (Bidirectional)   (None, None, 192)    208896     ['embedding_3[0][0]']
 bidirectional_9 (Bidirectional)   (None, 192)           221952     ['bidirectional_8[0][0]']
 bidirectional_10 (Bidirectional)  (None, 192)           221952     ['bidirectional_8[0][0]']
 y1 (Dense)                        (None, 912)           176016     ['bidirectional_9[0][0]']
 y2 (Dense)                        (None, 617)           119081     ['bidirectional_10[0][0]']
==================================================================================================
Total params: 966,972
Trainable params: 966,972
Non-trainable params: 0
__________________________________________________________________________________________________
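For completeness, the fit() call is along these lines (the batch size and validation split shown here are just placeholder values):
split_shared_model.fit(
    X_train,
    {'y1': y1, 'y2': y2},    # targets keyed by output layer name
    epochs=50,
    batch_size=32,           # placeholder batch size
    validation_split=0.1     # placeholder hold-out fraction
)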
After calling the fit() method the model starts training. Below is an intermediate result with the above settings:
Epoch 1/50
398/2647 [===>..........................] - ETA: 1:28 - loss: 8.7918 - y1_loss: 4.9236 - y2_loss: 3.8682 - y1_accuracy: 0.1495 - y2_accuracy: 0.3204
---------------------------------------------------------------------------
However, if I change PADDING to 'pre', I find that training is much slower!
Epoch 1/50
90/2647 [>.............................] - ETA: 45:52 - loss: 9.8153 - y1_loss: 5.3961 - y2_loss: 4.4192 - y1_accuracy: 0.1243 - y2_accuracy: 0.2788
Can anyone explain why this is? I think it might have something to do with the Embedding layer and its masking, but I am not sure.
CodePudding user response:
This is related to the underlying LSTM implementation. There are in fact two: a "native TensorFlow" one and a highly optimized cuDNN-based implementation, which is much faster. However, the latter can only be used under specific conditions (certain parameter settings etc.). You can find the details in the docs. The main point here is:
Inputs, if use masking, are strictly right-padded.
This implies that the pre-padding version cannot use the efficient implementation, which explains the much slower runtime. I don't think there is a reasonable workaround here except sticking with post-padding.
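As a reference, here is a sketch of an LSTM layer that keeps the fast kernel eligible; the flagged arguments mirror the documented requirements, while the unit count and the (non-recurrent) dropout value are just placeholders:
from tensorflow.keras.layers import LSTM

fast_lstm = LSTM(
    96,
    activation='tanh',               # required: tanh activation
    recurrent_activation='sigmoid',  # required: sigmoid recurrent activation
    recurrent_dropout=0.0,           # required: no recurrent dropout
    unroll=False,                    # required: no unrolling
    use_bias=True,                   # required: bias enabled
    return_sequences=True,
    dropout=0.3                      # plain input dropout does not break eligibility
)
# On top of this, masked inputs must be right-padded (padding='post'),
# which is exactly the condition the question runs into.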
Note that sometimes TensorFlow actually prints a warning that it had to fall back to the inefficient implementation, but for me this has been inconsistent. Keep an eye out for any additional warning output in the pre-padding case.