The data is stored in a CSV file:
Sentence # Word POS Tag
Sentence1 YASHAWANTHA NNP B-PER
Sentence1 K NNP I-PER
Sentence1 S NNP I-PER
Sentence1 Mobile NNP O
Sentence1 : : O
Sentence1 -7353555773 JJ O
I am trying to take the dataset, which has the columns Sentence #, Word, POS and Tag, and convert every entry in the Word column to a Word2Vec vector.
Here I read in the dataset and split it into sentences:
from gensim.models import Word2Vec
import pandas as pd
data = pd.read_csv(path_to_csv)
class SentenceGetter(object):
    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        # Collect (word, POS, tag) triples for every row of one sentence
        agg_func = lambda s: [(w, p, t) for w, p, t in zip(s["Word"].values.tolist(), s["POS"].values.tolist(), s["Tag"].values.tolist())]
        self.grouped = self.data.groupby("Sentence #").apply(agg_func)
        self.sentences = [s for s in self.grouped]
    def get_next(self):
        try:
            # Keys follow the "Sentence1" pattern shown in the CSV sample
            s = self.grouped["Sentence{}".format(self.n_sent)]
            self.n_sent += 1  # advance to the next sentence
            return s
        except KeyError:
            return None
getter = SentenceGetter(data)
sentences = getter.sentences
Now I convert all words to their corresponding Word2Vec vectors. Despite its name, word2idx is a dictionary that maps each word string to its Word2Vec vector, not to an integer index:
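# `words` is assumed to be the list of unique entries in the Word column (its definition is not shown)
vec_words = [[i] for i in words]
# gensim < 4.0 API; in gensim >= 4.0 use vector_size=30 and vec_model.wv.key_to_index
vec_model = Word2Vec(vec_words, min_count=1, size=30)
word2idx = {}
for key in vec_model.wv.vocab:
    word2idx[key] = vec_model.wv[key]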
Then, for the Tag column, I use simple enumeration:
tag2idx = {t: i for i, t in enumerate(tags)}
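The snippets use words, tags, num_words and num_tags without showing where they come from. A minimal sketch, assuming they are built from the unique values of the Word and Tag columns (that is an assumption, the question does not show it). The rows list just repeats the sample rows from the CSV so the example runs without pandas:

```python
# Sample rows copied from the CSV shown in the question: (sentence, word, POS, tag)
rows = [
    ("Sentence1", "YASHAWANTHA", "NNP", "B-PER"),
    ("Sentence1", "K", "NNP", "I-PER"),
    ("Sentence1", "S", "NNP", "I-PER"),
    ("Sentence1", "Mobile", "NNP", "O"),
    ("Sentence1", ":", ":", "O"),
    ("Sentence1", "-7353555773", "JJ", "O"),
]

# Assumption: vocabulary and tag set are the unique column values.
# sorted() keeps the index assignment stable between runs.
words = sorted({r[1] for r in rows})
tags = sorted({r[3] for r in rows})
num_words = len(words)
num_tags = len(tags)

tag2idx = {t: i for i, t in enumerate(tags)}
print(num_words, num_tags, tag2idx)
```

Using a plain set() without sorting also works, but the tag indices can then differ from run to run, which makes saved models harder to reuse.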
I then pad the words and tags:
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
max_len = 60
X = [[word2idx[w[0]] for w in s] for s in sentences]
X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=num_words-1)
y = [[tag2idx[w[2]] for w in s] for s in sentences]
y = pad_sequences(maxlen=max_len, sequences=y, padding="post", value=tag2idx["O"])
y= [to_categorical(i, num_classes = num_tags) for i in y]
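To make the target shape concrete, here is a NumPy sketch of what the tag padding and to_categorical steps produce for a single sentence. It uses np.eye as a stand-in for to_categorical, and the example tag2idx ordering is assumed:

```python
import numpy as np

max_len = 60
tag2idx = {"B-PER": 0, "I-PER": 1, "O": 2}  # example mapping, ordering is an assumption
num_tags = len(tag2idx)

# Tag indices for the sample sentence from the CSV
sent_tags = [tag2idx[t] for t in ["B-PER", "I-PER", "I-PER", "O", "O", "O"]]

# Post-pad with the index of "O" up to max_len
padded = np.full(max_len, tag2idx["O"], dtype=int)
padded[:len(sent_tags)] = sent_tags

# One-hot encode: row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(num_tags, dtype="float32")[padded]
print(one_hot.shape)  # (60, 3)
```

Stacked over all sentences this gives y the shape (n_sentences, max_len, num_tags), which is what the TimeDistributed softmax output is compared against.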
Then define the model:
from sklearn.model_selection import train_test_split
from tensorflow.keras import Model, Input
from tensorflow.keras.layers import LSTM, Embedding, Dense
from tensorflow.keras.layers import TimeDistributed, SpatialDropout1D, Bidirectional
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)
input_word = Input(shape=(max_len,))
model = Embedding(input_dim=num_words, output_dim=max_len, input_length=max_len)(input_word)
model = SpatialDropout1D(0.1)(model)
model = Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1))(model)
out = TimeDistributed(Dense(num_tags, activation="softmax"))(model)
model = Model(input_word, out)
model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
Then fit the model:
import numpy as np
history = model.fit(
    x_train, np.array(y_train),
    validation_split=0.2,
    batch_size=32,
    epochs=1,
    verbose=1,
)
The fitting step raises the following error, and I am unsure how to fix it:
Input 0 of layer "spatial_dropout1d_2" is incompatible with the layer: expected ndim=3, found ndim=4. Full shape received: (None, 60, 30, 60)
Answer:
Before padding, the shape of
X = [[word2idx[w[0]] for w in s] for s in sentences]
X = np.array(X)
print(X.shape)
is (3, 6, 30) for the 3 sentences in the CSV file, and (3, 60, 30) after padding, where 30 is the Word2Vec vector size. The model, however, starts with an Embedding layer that expects integer word indices, i.e. an input of shape (3, 60).
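To see where the fourth dimension in the error comes from: an Embedding layer looks up one vector per integer in its input, so it appends a new axis of size output_dim to whatever shape it receives. A rough NumPy sketch of that lookup, with plain array indexing standing in for the Keras layer (num_words = 7 is a made-up value for illustration):

```python
import numpy as np

max_len = 60        # padded sentence length, as in the question
wrd2vec_size = 30   # Word2Vec vector size, as in the question
num_words = 7       # hypothetical vocabulary size, for illustration only

# Stand-in for the Embedding weights: one row of size output_dim per index.
# The question sets output_dim=max_len, hence 60 columns.
embedding_table = np.zeros((num_words, max_len))

# What the Embedding layer is designed for: integer word indices, (batch, max_len)
idx_input = np.zeros((3, max_len), dtype=int)
print(embedding_table[idx_input].shape)   # (3, 60, 60)

# What the question feeds it: already-embedded words, (batch, max_len, 30)
vec_input = np.zeros((3, max_len, wrd2vec_size), dtype=int)
print(embedding_table[vec_input].shape)   # (3, 60, 30, 60)
```

The second shape matches the (None, 60, 30, 60) reported in the error, which SpatialDropout1D rejects because it expects a 3-D tensor.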
Without changing the rest, you can modify the network so that it takes the Word2Vec vectors directly and drops the Embedding layer:
wrd2vec_size = 30
input_word = Input(shape=(max_len, wrd2vec_size))
x = SpatialDropout1D(0.1)(input_word)
x = Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1))(x)
out = TimeDistributed(Dense(num_tags, activation="softmax"))(x)
model = Model(input_word, out)
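One caveat with leaving the padding step as it is: pad_sequences defaults to dtype="int32", so the float Word2Vec values would be cast to integers, and value=num_words-1 is a word index rather than a sensible filler for a vector. Passing dtype="float32" and value=0.0 to pad_sequences should avoid both. The NumPy sketch below shows the intended result; pad_vectors is a hypothetical helper written for this illustration, not a Keras function, and it truncates long sentences from the end, which is my own choice:

```python
import numpy as np

def pad_vectors(sentences, max_len, vec_size, value=0.0):
    """Pad or truncate lists of word vectors to shape (n, max_len, vec_size)."""
    out = np.full((len(sentences), max_len, vec_size), value, dtype="float32")
    for i, sent in enumerate(sentences):
        sent = np.asarray(sent[:max_len], dtype="float32")
        if len(sent):  # skip empty sentences
            out[i, :len(sent)] = sent
    return out

# Two toy "sentences" of 2 and 3 words, each word a 30-dimensional vector
rng = np.random.default_rng(0)
toy = [rng.normal(size=(2, 30)).tolist(), rng.normal(size=(3, 30)).tolist()]

X = pad_vectors(toy, max_len=60, vec_size=30)
print(X.shape, X.dtype)  # (2, 60, 30) float32
```

An array of this shape and dtype is what Input(shape=(max_len, wrd2vec_size)) expects for each batch.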