I have a fake-news detection problem: the models predict the binary labels "1" and "0" by vectorizing the 'tweet' column. I use three different models for detection, and I want to combine them with an ensemble method to increase accuracy, but they use different vectorizers.
I have three KNN models. The first and second vectorize the 'tweet' column using TF-IDF:
from sklearn.feature_extraction.text import TfidfVectorizer
vector = TfidfVectorizer(max_features =5000, ngram_range=(1,3))
X_train = vector.fit_transform(X_train['tweet']).toarray()
X_test = vector.transform(X_test['tweet']).toarray()  # transform only: refitting on the test set would leak and change the vocabulary
For the third model I used fastText for sentence vectorization:
%%time
sent_vec = []
for index, row in X_train.iterrows():
    sent_vec.append(avg_feature_vector(row['tweet']))

%%time
sent_vec1 = []
for index, row in X_test.iterrows():
    sent_vec1.append(avg_feature_vector(row['tweet']))
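The avg_feature_vector helper isn't shown in the question; a minimal sketch of what such a function typically looks like, using a toy embedding table in place of a loaded fastText model (the table, dimension, and fallback behavior here are assumptions):

```python
import numpy as np

DIM = 4  # real fastText models typically use 100-300 dimensions
word_vectors = {            # toy embedding table standing in for a fastText model
    "fake": np.ones(DIM),
    "news": np.full(DIM, 2.0),
}

def avg_feature_vector(sentence, dim=DIM):
    """Average the word vectors of a sentence; zero vector if no word is known."""
    vecs = [word_vectors[w] for w in sentence.split() if w in word_vectors]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

print(avg_feature_vector("fake news"))  # -> [1.5 1.5 1.5 1.5]
```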
After scaling, my third model fits the input like this:
scaler.fit(sent_vec)
scaled_X_train = scaler.transform(sent_vec)
scaled_X_test = scaler.transform(sent_vec1)
.
.
.
knn_model1.fit(scaled_X_train, y_train)
Now I want to combine the three models so that the ensemble returns the majority vote, just like VotingClassifier, but I have no idea how to deal with the different inputs (TF-IDF & fastText). Is there another way to do that?
CodePudding user response:
You can create a custom MyVotingClassifier that takes already-fitted models instead of model instances yet to be trained. In VotingClassifier, sklearn takes unfitted classifiers as input, trains them, and then applies voting to the predicted results. You can write something like the following; it may not be exactly what you need, but you can adapt it for your purpose.
from collections import Counter
clf1 = knn_model_1.fit(X1, y)
clf2 = knn_model_2.fit(X2, y)
clf3 = knn_model_3.fit(X3, y)
class MyVotingClassifier:
    def __init__(self, **models):
        self.models = models

    def predict(self, dict_X):
        '''
        dict_X = {'knn_model_1': X1, 'knn_model_2': X2, 'knn_model_3': X3}
        '''
        preds = []
        for model_name in dict_X:
            model = self.models[model_name]
            preds.append(model.predict(dict_X[model_name]))
        # Transpose so each tuple holds the votes of all models for one sample
        preds = list(zip(*preds))
        # Majority vote per sample
        final_pred = [Counter(votes).most_common(1)[0][0] for votes in preds]
        return final_pred
ensemble_model = MyVotingClassifier(knn_model_1=clf1, knn_model_2=clf2, knn_model_3=clf3)
ensemble_model.predict({'knn_model_1': X1, 'knn_model_2': X2, 'knn_model_3': X3}) # Input the pre-processed `X`s
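Another option is to wrap each vectorizer and KNN inside a sklearn Pipeline, so a standard VotingClassifier can work directly on raw text. A hedged sketch with toy texts and labels (all illustrative assumptions); the fastText-based model would need a custom transformer wrapping avg_feature_vector, which is omitted here:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier

# Toy data standing in for the real 'tweet' column and labels
texts = ["fake news here", "real report today", "fake claim again",
         "verified real story", "fake hoax spreading", "real coverage live"]
labels = [1, 0, 1, 0, 1, 0]

# Each pipeline carries its own vectorizer, so inputs stay raw text
pipe1 = Pipeline([("tfidf", TfidfVectorizer(ngram_range=(1, 1))),
                  ("knn", KNeighborsClassifier(n_neighbors=3))])
pipe2 = Pipeline([("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
                  ("knn", KNeighborsClassifier(n_neighbors=3))])

# Hard voting = majority vote over the pipelines' predictions
ensemble = VotingClassifier([("m1", pipe1), ("m2", pipe2)], voting="hard")
ensemble.fit(texts, labels)
print(ensemble.predict(["fake hoax here"]))
```

The advantage is that fitting, voting, and cross-validation all go through the usual sklearn API instead of a hand-rolled class.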