I have a dataset with one text feature and four other features. The Sentence-BERT vectorizer transforms the text into tensors. I can use these tensors directly with a machine learning classifier. Can I replace the text column with the tensors? And how can I train the model? The code below is how I transform the text into vectors:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/LaBSE')
sentence_embeddings = model.encode(X_train['tweet'].tolist(), convert_to_tensor=True, show_progress_bar=True)
sentence_embeddings1 = model.encode(X_test['tweet'].tolist(), convert_to_tensor=True, show_progress_bar=True)
CodePudding user response:
Let's assume this is your data:
import pandas as pd
X_train = pd.DataFrame({
    'tweet': ['foo', 'foo', 'bar'],
    'feature1': [1, 1, 0],
    'feature2': [1, 0, 1],
})
y_train = [1, 1, 0]
and that you want to use it with the sklearn API (cross-validation, pipelines, grid search, and so on). There is a utility named ColumnTransformer that applies a transformer of your choice to selected columns of a pandas data frame and leaves the rest to you. What you have to do is define a function that encodes the text and wrap it in FunctionTransformer, which turns it into a regular sklearn transformer.
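from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import FunctionTransformer
model = SentenceTransformer('mrm8488/bert-tiny-finetuned-squadv2')  # smaller model, swapped in to save time and computation :)
embedder = FunctionTransformer(lambda item: model.encode(list(item), convert_to_tensor=True, show_progress_bar=False).detach().cpu().numpy())  # the column arrives as a pandas Series, so hand encode a plain list of strings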
After that, you can use it like any other transformer to map your text column into the embedding space:
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[('embedder', embedder, 'tweet')],
    remainder='passthrough'
)
X_train = preprocessor.fit_transform(X_train)  # X_train.shape => (len(df), your_transformer_model_hidden_dim + your_features_count)
X_train now holds the data you wanted, and it works with the rest of the sklearn ecosystem.
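To make the resulting layout concrete, here is a minimal sketch of the same combination done by hand with NumPy. The embedding values are made up (a hypothetical 4-dimensional stand-in for what model.encode would return), and the variable names are only for illustration. With remainder='passthrough', ColumnTransformer places the transformer output first and the untouched columns after it, which is what the hstack below mirrors.

```python
import numpy as np

# Stand-in for model.encode(...): 3 tweets, hypothetical 4-dimensional embeddings.
embeddings = np.array([
    [0.1, 0.2, 0.3, 0.4],
    [0.1, 0.2, 0.3, 0.4],
    [0.9, 0.8, 0.7, 0.6],
])

# The remaining numeric columns (feature1, feature2) from the example data.
other_features = np.array([
    [1, 1],
    [1, 0],
    [0, 1],
])

# Embedding columns first, passthrough columns after, one row per tweet.
X_manual = np.hstack([embeddings, other_features])
print(X_manual.shape)  # (3, 6)
```

So the text column is effectively replaced by one numeric column per embedding dimension, and any classifier that accepts a 2-D array can train on the result.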
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
output:
GaussianNB(priors=None, var_smoothing=1e-09)
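Since the question also has a test set, one way to keep train and test preprocessing identical is to chain the preprocessor and the classifier in a Pipeline, so that predict applies the same column mapping to new data. The sketch below is an illustration under assumptions: fake_encode is a hypothetical stand-in for model.encode (so it runs without downloading a model), and the X_test rows are invented.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def fake_encode(texts):
    """Hypothetical stand-in for model.encode: maps each text to a 2-d vector."""
    return np.array([[len(t), t.count("o")] for t in texts], dtype=float)


X_train = pd.DataFrame({
    "tweet": ["foo", "foo", "bar"],
    "feature1": [1, 1, 0],
    "feature2": [1, 0, 1],
})
y_train = [1, 1, 0]

# Invented test rows with the same columns as the training frame.
X_test = pd.DataFrame({
    "tweet": ["foo", "bar"],
    "feature1": [1, 0],
    "feature2": [0, 1],
})

embedder = FunctionTransformer(lambda col: fake_encode(list(col)))
preprocessor = ColumnTransformer(
    transformers=[("embedder", embedder, "tweet")],
    remainder="passthrough",
)

# Preprocessing and classifier travel together, so fit/predict take raw data frames.
clf = Pipeline([("prep", preprocessor), ("gnb", GaussianNB())])
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
print(predictions.shape)  # (2,)
```

With a real SentenceTransformer, the lambda would call model.encode instead of fake_encode; the rest of the pipeline should stay the same.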
Caveat: the numerical features and the tweet embeddings should be on the same scale; otherwise some features will dominate the others and degrade performance.
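One possible way to address the scale issue (an assumption on my part, not the only option) is to standardize every column after the embedding and numeric features have been combined. In the sketch below, fake_encode is again a hypothetical stand-in for model.encode that returns deliberately small values, and feature1 is given made-up large values to show the mismatch being removed.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def fake_encode(texts):
    """Hypothetical stand-in for model.encode: small-valued 2-d vectors."""
    return np.array([[0.01 * len(t), 0.01 * t.count("o")] for t in texts])


X_train = pd.DataFrame({
    "tweet": ["foo", "fooo", "bar"],
    "feature1": [100.0, 250.0, 40.0],  # deliberately on a much larger scale
    "feature2": [1, 0, 1],
})

embedder = FunctionTransformer(lambda col: fake_encode(list(col)))
features = Pipeline([
    ("columns", ColumnTransformer([("embedder", embedder, "tweet")],
                                  remainder="passthrough")),
    ("scale", StandardScaler()),  # zero mean, unit variance per column
])

X_scaled = features.fit_transform(X_train)
print(X_scaled.shape)  # (3, 4)
```

Whether standardizing the embedding dimensions helps depends on the classifier; scaling only the numeric columns (by giving them their own StandardScaler entry in the ColumnTransformer) is a reasonable alternative.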