Word2Vec dimensions incorrect

The data is stored in a CSV file:

Sentence #   Word          POS   Tag
Sentence1    YASHAWANTHA   NNP   B-PER
Sentence1    K             NNP   I-PER
Sentence1    S             NNP   I-PER
Sentence1    Mobile        NNP   O
Sentence1    :             :     O
Sentence1    -7353555773   JJ    O

I am trying to take a dataset with the columns Sentence #, Word, POS, and Tag, and convert every entry in the Word column to a Word2Vec vector.

Here I read in the dataset and split it into sentences:

from gensim.models import Word2Vec
import pandas as pd

data = pd.read_csv(path_to_csv)

class SentenceGetter(object):
    def __init__(self, data):
        self.n_sent = 1
        self.data = data

        agg_func = lambda s: [(w, p, t) for w, p, t in zip(s["Word"].values.tolist(),s["POS"].values.tolist(), s["Tag"].values.tolist())]
        self.grouped = self.data.groupby("Sentence #").apply(agg_func)
        self.sentences = [s for s in self.grouped]
    
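    def get_next(self):
        try:
            # NOTE: this key format must match the values in the "Sentence #"
            # column (the sample above uses "Sentence1").
            s = self.grouped["Sentence: {}".format(self.n_sent)]
            self.n_sent += 1  # advance to the next sentence
            return s
        except KeyError:
            return None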

getter = SentenceGetter(data)
sentences = getter.sentences
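
For reference, the structure of `sentences` can be illustrated without pandas. This is only a sketch using the sample rows above; `group_sentences` is a hypothetical helper, and it keeps sentences in first-seen order, whereas pandas `groupby` sorts the group keys by default.

```python
# Sample rows copied from the CSV excerpt: (sentence id, word, POS, tag).
rows = [
    ("Sentence1", "YASHAWANTHA", "NNP", "B-PER"),
    ("Sentence1", "K", "NNP", "I-PER"),
    ("Sentence1", "S", "NNP", "I-PER"),
    ("Sentence1", "Mobile", "NNP", "O"),
    ("Sentence1", ":", ":", "O"),
    ("Sentence1", "-7353555773", "JJ", "O"),
]


def group_sentences(rows):
    """Group rows by sentence id into lists of (word, pos, tag) tuples."""
    grouped = {}
    for sent_id, word, pos, tag in rows:
        grouped.setdefault(sent_id, []).append((word, pos, tag))
    return list(grouped.values())


sentences = group_sentences(rows)
print(len(sentences), len(sentences[0]))  # 1 6
print(sentences[0][0])                    # ('YASHAWANTHA', 'NNP', 'B-PER')
```

Each sentence is a list of tuples, which is why the later code reads the word as `w[0]` and the tag as `w[2]`.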

Now I convert all words to their Word2Vec vectors. Despite its name, word2idx is a dictionary that maps each word string to its Word2Vec vector:

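# Each word is treated as its own one-word "sentence" for training.
vec_words = [[i] for i in words]
# gensim 3.x API; gensim 4.x renamed size to vector_size and wv.vocab to wv.key_to_index.
vec_model = Word2Vec(vec_words, min_count=1, size=30)
word2idx = {}
for key in vec_model.wv.vocab:
    word2idx[key] = vec_model.wv[key]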

Then for the Tag column I use simple enumeration:

tag2idx = {t: i for i, t in enumerate(tags)}

I then pad the words and tags:

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

max_len = 60
X = [[word2idx[w[0]] for w in s] for s in sentences]
X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=num_words-1)
y = [[tag2idx[w[2]] for w in s] for s in sentences]
y = pad_sequences(maxlen=max_len, sequences=y, padding="post", value=tag2idx["O"])
y = [to_categorical(i, num_classes=num_tags) for i in y]

Then define the model:

import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras import Model, Input
from tensorflow.keras.layers import LSTM, Embedding, Dense
from tensorflow.keras.layers import TimeDistributed, SpatialDropout1D, Bidirectional

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)
input_word = Input(shape=(max_len,))
model = Embedding(input_dim=num_words, output_dim=max_len, input_length=max_len)(input_word)
model = SpatialDropout1D(0.1)(model)
model = Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1))(model)
out = TimeDistributed(Dense(num_tags, activation="softmax"))(model)
model = Model(input_word, out)

model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

Then fit the model:

history = model.fit(
    x_train, np.array(y_train),
    validation_split=0.2,
    batch_size=32, 
    epochs=1,
    verbose=1,    
)

This fitting step raises the following error, and I am unsure how to fix it:

Input 0 of layer "spatial_dropout1d_2" is incompatible with the layer: expected ndim=3, found ndim=4. Full shape received: (None, 60, 30, 60)

CodePudding user response：

The shape before padding of

import numpy as np

X = [[word2idx[w[0]] for w in s] for s in sentences]
X = np.array(X)
print(X.shape)

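is (3, 6, 30) for 3 sentences in the CSV file, and (3, 60, 30) after padding, where 30 is the Word2Vec vector size. The model, however, expects an input of shape (3, 60): its Embedding layer is meant for integer word indices and appends its own output_dim axis, which is how the 4D shape (None, 60, 30, 60) in the error arises.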
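
The shape arithmetic can be traced with a small sketch. `embedding_output_shape` is a hypothetical helper written for illustration, not a Keras function; it only encodes the rule that an embedding lookup replaces every input element with a vector of length `output_dim`.

```python
def embedding_output_shape(input_shape, output_dim):
    """Shape after an embedding lookup: one extra trailing axis of size output_dim."""
    return tuple(input_shape) + (output_dim,)


max_len = 60    # padded sentence length from the question
w2v_size = 30   # Word2Vec vector size from the question

# Intended use: a (batch, max_len) array of integer word ids.
print(embedding_output_shape((None, max_len), output_dim=max_len))
# (None, 60, 60)

# What was actually fed in: (batch, max_len, w2v_size) word vectors.
print(embedding_output_shape((None, max_len, w2v_size), output_dim=max_len))
# (None, 60, 30, 60)
```

The second result matches the shape reported in the error, and SpatialDropout1D accepts only 3D input.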

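Without changing the rest, you can modify the network so that it takes the word vectors directly and drops the Embedding layer (then compile and fit the new model as before):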

wrd2vec_size = 30
input_word = Input(shape=(max_len, wrd2vec_size))
x = SpatialDropout1D(0.1)(input_word)
x = Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1))(x)
out = TimeDistributed(Dense(num_tags, activation="softmax"))(x)

model = Model(input_word, out)
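
One caveat about the padding step, stated as an assumption to verify against your Keras version: `pad_sequences` defaults to `dtype="int32"`, which would truncate float word vectors, and `value=num_words-1` is a word index rather than a vector. Passing `dtype="float32"` and `value=0.0` is probably closer to what is intended. The pure-Python sketch below shows the target result; `pad_vectors` is a hypothetical helper, and the toy vectors are made up for illustration.

```python
def pad_vectors(sentences, max_len, dim, pad_value=0.0):
    """Post-pad (or truncate) each sentence of word vectors to max_len rows."""
    padded = []
    for sent in sentences:
        rows = [list(vec) for vec in sent[:max_len]]  # truncate long sentences
        rows += [[pad_value] * dim for _ in range(max_len - len(rows))]
        padded.append(rows)
    return padded


# Two toy sentences with 3-dimensional "word vectors".
toy = [
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    [[0.7, 0.8, 0.9]],
]

padded = pad_vectors(toy, max_len=4, dim=3)
print(len(padded), len(padded[0]), len(padded[0][0]))  # 2 4 3
print(padded[1][1])                                    # [0.0, 0.0, 0.0]
```

The result has shape (sentences, max_len, vector size) and keeps the float values, which is the layout that `Input(shape=(max_len, wrd2vec_size))` expects.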