Keras model with fasttext word embedding


I am trying to train a language model in Keras that predicts the last word of a sentence given all the previous words. I would like to embed my inputs using a learned fasttext embedding model.

I managed to preprocess my text data and embed it using fasttext. My training data consists of sentences of 40 tokens each. I created two NumPy arrays, X and y, as inputs, with y being what I want to predict.

X has shape (44317, 39, 300), where 44317 is the number of example sentences, 39 is the number of input tokens per sentence, and 300 is the dimension of the word embedding.

y has shape (44317, 300) and contains, for each example, the embedding of the last token of the sentence.
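
For reference, a minimal sketch of how such arrays could be built, assuming a gensim fastText model ft and a list sentences of 44317 token lists of length 40 (both names are hypothetical):

import numpy as np

# Hypothetical objects: ft is a gensim fastText model with 300-dim vectors,
# sentences is a list of 44317 lists of 40 tokens each.
embedded = np.array([[ft.wv[token] for token in sent] for sent in sentences])
X = embedded[:, :-1, :]   # embeddings of the first 39 tokens -> (44317, 39, 300)
y = embedded[:, -1, :]    # embedding of the last token       -> (44317, 300)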

My code for the Keras model goes as follows (inspired by this):

# importing the needed tensorflow.keras components
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import InputLayer, LSTM, Dense

model = Sequential()  
model.add(InputLayer((None, 300)))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(100))
model.add(Dense(100, activation='relu'))
model.add(Dense(300, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, batch_size=128, epochs=20)
model.save('model.h5')

However, the accuracy I get while training this model is extremely low (around 1.5%). I think there is some component of the Keras model that I have misunderstood, because if I don't embed my inputs and instead add an Embedding layer in place of the InputLayer, I get an accuracy of about 60 percent.

My main doubt is the value of 300 in my second Dense layer: I read that this should correspond to the vocabulary size of my word embedding model (which is 48000), but if I put anything other than 300 there I get a dimension error. So I understand that I'm doing something wrong, but I can't figure out how to fix it.

PS: I have also tried y = to_categorical(y, num_classes=vocab_size), with vocab_size the vocabulary size of my word embedding, and changing 300 to this same value in the second Dense layer. However, it then tries to create an array of shape (13295100, 48120) instead of what I expect, (44317, 48120) (13295100 = 44317 × 300, because to_categorical is being applied to every float of the embedding vectors rather than to one integer label per sentence).

CodePudding user response:

It's very difficult to train RNN models on a next-word prediction task. LSTM/GRU layers do not have enough capacity to extract rich features from text.

There are two ways to address the issue:

  1. Predict characters instead of word classes.
  2. Use a transformer model. For example, BERT is good at extracting features and predicting a masked word (a minimal sketch follows).
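
For the second option, a minimal sketch using the Hugging Face transformers fill-mask pipeline (the library, model name, and example sentence are illustrative and not part of the original answer):

# pip install transformers
from transformers import pipeline

# Wrap a pretrained BERT model in a fill-mask pipeline.
unmasker = pipeline('fill-mask', model='bert-base-uncased')

# Ask BERT to predict the masked (here: last) word of a sentence.
print(unmasker("The cat sat on the [MASK]."))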

CodePudding user response:

If you really want to use the word vectors from Fasttext, you will have to incorporate them into your model using a weight matrix and Embedding layer. The goal of the embedding layer is to map each integer sequence representing a sentence to its corresponding 300-dimensional vector representation:

import gensim.downloader as api
import numpy as np
import tensorflow as tf

def load_doc(filename):
    file = open(filename, 'r')
    text = file.read()
    file.close()
    return text

fasttext = api.load("fasttext-wiki-news-subwords-300")
embedding_dim = 300

in_filename = 'data.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')

tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(lines)
text_sequences = tokenizer.texts_to_sequences(lines)
text_sequences = tf.keras.preprocessing.sequence.pad_sequences(text_sequences, padding='post')
vocab_size = len(tokenizer.word_index) + 1

text_sequences = np.array(text_sequences)
X, y = text_sequences[:, :-1], text_sequences[:, -1]
y = tf.keras.utils.to_categorical(y, num_classes=vocab_size)
max_length = X.shape[1]

# Build the embedding weight matrix: row i holds the fastText vector for the
# word with tokenizer index i; words missing from fastText get a random vector.
weight_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in tokenizer.word_index.items():
    try:
        embedding_vector = fasttext[word]
        weight_matrix[i] = embedding_vector
    except KeyError:
        weight_matrix[i] = np.random.uniform(-5, 5, embedding_dim)

# The Embedding layer is initialized with the pretrained fastText weight matrix.
sentence_input = tf.keras.layers.Input(shape=(max_length,))
x = tf.keras.layers.Embedding(vocab_size, embedding_dim, weights=[weight_matrix],
                              input_length=max_length)(sentence_input)

x = tf.keras.layers.LSTM(100, return_sequences=True)(x)
x = tf.keras.layers.LSTM(100)(x)
x = tf.keras.layers.Dense(100, activation='relu')(x)
output = tf.keras.layers.Dense(vocab_size, activation='softmax')(x)
model = tf.keras.Model(sentence_input, output)

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, batch_size=5, epochs=20)                                 

Note that I am using the dataset and preprocessing steps from the tutorial you linked.
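
At inference time, the predicted word is the argmax over the softmax output, mapped back to text through the tokenizer (a minimal sketch using the variables above; the seed sentence is just an example):

# Encode and pad a new prompt the same way as the training data.
seed = tokenizer.texts_to_sequences(["the quick brown fox jumps over the"])
seed = tf.keras.preprocessing.sequence.pad_sequences(seed, maxlen=max_length, padding='post')

probs = model.predict(seed)                         # shape (1, vocab_size)
predicted_index = int(np.argmax(probs, axis=-1)[0])
print(tokenizer.index_word[predicted_index])        # map the index back to the word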

CodePudding user response:

I was using accuracy as a metric and a classification loss (categorical cross-entropy) for a continuous output.

Instead, I changed the loss to mean squared error and the metric to mean_absolute_percentage_error. The results still seem a little weak, but they are much better.
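
A minimal sketch of that change, assuming the final Dense(300) layer uses a linear activation instead of softmax (so the network regresses the embedding directly), and assuming fasttext is a gensim KeyedVectors object used to map the predicted vector back to the nearest word:

model.compile(loss='mean_squared_error',
              optimizer='adam',
              metrics=['mean_absolute_percentage_error'])
model.fit(X, y, batch_size=128, epochs=20)

# Map a predicted 300-dim vector back to the closest fastText vocabulary words.
pred = model.predict(X[:1])[0]
print(fasttext.similar_by_vector(pred, topn=5))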
