I have a bunch of sentences that I am trying to classify. For each sentence, I generated a word embedding using word2vec. I also performed a cluster analysis which clustered the sentences into 3 separate clusters.
What I want to do is use the cluster id (1-3) as a feature for my model. However, I am not entirely sure how to do this, and I can't seem to find an article that clearly explains it.
I was thinking I could create a one-hot encoding for the cluster id and then somehow combine it with the word embedding, but I am really not sure what to do here.
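To make my intent concrete, something like this is roughly what I had in mind (just a sketch, assuming cluster ids 1-3 and a fixed-length 300-dimensional sentence vector; I don't know if this is the right approach):
import numpy as np

embedding = np.random.random(300)      # stand-in for one word2vec sentence vector
cluster_id = 2                         # one of 1, 2, 3

one_hot = np.zeros(3)
one_hot[cluster_id - 1] = 1.0          # map ids 1-3 to positions 0-2

combined = np.concatenate([embedding, one_hot])  # 303-dimensional feature vector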
I already have a model that will take the word embedding and classify the sentence:
from sklearn import svm, metrics
from sklearn.model_selection import train_test_split

X = Data['word_embedding'].values   # word2vec sentence vectors
y = Data['category'].values         # target labels
indices = Data.index.values

X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(
    X, y, indices, test_size=0.3, random_state=428)

clf = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
DSVM = clf.fit(X_train, y_train)
prediction = DSVM.predict(X_test)
print(metrics.classification_report(y_test, prediction))
where X is the word embedding and y is the category. I'm just not sure how to add the cluster id as a feature.
CodePudding user response:
Assuming you want to use TensorFlow, you can either one-hot encode the ids or map them to n-dimensional trainable vectors (randomly initialized) with an Embedding
layer. Here is an example with an Embedding
layer, where each id is mapped to a 10-dimensional vector that is then repeated 50 times to match the maximum sentence length (so each word position gets the same 10-dimensional vector for a given input). Afterwards, the two tensors are simply concatenated:
import tensorflow as tf

word_embedding_dim = 300
max_sentence_length = 50

word_embedding_input = tf.keras.layers.Input((max_sentence_length, word_embedding_dim))
id_input = tf.keras.layers.Input((1, ))

# input_dim=4 so that cluster ids 1-3 are valid indices (index 0 stays unused)
embedding_layer = tf.keras.layers.Embedding(4, 10) # or one-hot encode
x = embedding_layer(id_input)
x = tf.keras.layers.RepeatVector(max_sentence_length)(x[:, 0, :])
output = tf.keras.layers.Concatenate()([word_embedding_input, x])
model = tf.keras.Model([word_embedding_input, id_input], output)
print(model.summary())
Model: "model_1"
____________________________________________________________________________________________________
 Layer (type)                        Output Shape       Param #   Connected to
====================================================================================================
 input_17 (InputLayer)               [(None, 1)]        0         []
 embedding_3 (Embedding)             (None, 1, 10)      40        ['input_17[0][0]']
 tf.__operators__.getitem            (None, 10)         0         ['embedding_3[0][0]']
 (SlicingOpLambda)
 input_16 (InputLayer)               [(None, 50, 300)]  0         []
 repeat_vector_1 (RepeatVector)      (None, 50, 10)     0         ['tf.__operators__.getitem[0][0]']
 concatenate (Concatenate)           (None, 50, 310)    0         ['input_16[0][0]',
                                                                   'repeat_vector_1[0][0]']
====================================================================================================
Total params: 40
Trainable params: 40
Non-trainable params: 0
____________________________________________________________________________________________________
None
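A quick way to sanity-check the wiring is to push dummy inputs through the model (a sketch with random data; the batch size of 4 is arbitrary):
import numpy as np

dummy_words = np.random.random((4, max_sentence_length, word_embedding_dim)).astype("float32")
dummy_ids = np.random.randint(1, 4, size=(4, 1))   # cluster ids 1-3
print(model([dummy_words, dummy_ids]).shape)       # (4, 50, 310)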
If you do not have a 2D sequence input but sentence-level embeddings instead, it is even easier:
import tensorflow as tf

sentence_embedding_dim = 300

sentence_embedding_input = tf.keras.layers.Input((sentence_embedding_dim,))
id_input = tf.keras.layers.Input((1, ))

# again, input_dim=4 so that cluster ids 1-3 are valid indices
embedding_layer = tf.keras.layers.Embedding(4, 10) # or one-hot encode
x = embedding_layer(id_input)
output = tf.keras.layers.Concatenate()([sentence_embedding_input, x[:, 0, :]])
model = tf.keras.Model([sentence_embedding_input, id_input], output)
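The "or one-hot encode" comment can be made concrete with tf.one_hot instead of a trainable Embedding (a sketch, assuming ids 1-3; depth=4 leaves index 0 unused):
import tensorflow as tf

sentence_embedding_input = tf.keras.layers.Input((300,))
id_input = tf.keras.layers.Input((1,), dtype=tf.int32)

# fixed one-hot features instead of learned vectors
one_hot = tf.keras.layers.Lambda(lambda t: tf.one_hot(t[:, 0], depth=4))(id_input)
output = tf.keras.layers.Concatenate()([sentence_embedding_input, one_hot])
model = tf.keras.Model([sentence_embedding_input, id_input], output)  # output shape (None, 304)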
Here is a solution with numpy and sklearn for reference:
import numpy as np
from sklearn.preprocessing import OneHotEncoder

samples = 10
word_embedding_dim = 300
max_sentence_length = 50

# cluster ids 1-3, one per sample
ids = np.random.randint(low=1, high=4, size=(samples, 1))
enc = OneHotEncoder(handle_unknown='ignore')
ids = enc.fit_transform(ids).toarray()[:, None, :]            # (10, 1, 3)

X_train = np.random.random((samples, max_sentence_length, word_embedding_dim))
ids = np.repeat(ids, max_sentence_length, axis=1)             # (10, 50, 3)
X_train = np.concatenate([X_train, ids], axis=-1)
print(X_train.shape)
# (10, 50, 303)
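To tie this back to the question's SVM: with sentence-level embeddings the whole thing collapses to a single concatenation, and the result plugs straight into the existing svm.SVC pipeline (a sketch with random stand-in data):
import numpy as np
from sklearn.preprocessing import OneHotEncoder

samples = 10
sentence_embeddings = np.random.random((samples, 300))    # stand-in for word2vec vectors
cluster_ids = np.random.randint(1, 4, size=(samples, 1))  # cluster ids 1-3

enc = OneHotEncoder(handle_unknown='ignore')
id_features = enc.fit_transform(cluster_ids).toarray()    # (10, 3)

X = np.concatenate([sentence_embeddings, id_features], axis=1)  # (10, 303)
# X can now replace the word-embedding-only X in the SVC code above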