I have a fake-news detection problem: the models predict the binary labels "1" and "0" by vectorizing the 'tweet' column. I use three different models for detection, and I want to combine them with an ensemble method to increase accuracy, but they use different vectorizers.
I have three KNN models. The first and second vectorize the 'tweet' column using TF-IDF:
from sklearn.feature_extraction.text import TfidfVectorizer
vector = TfidfVectorizer(max_features =5000, ngram_range=(1,3))
X_train = vector.fit_transform(X_train['tweet']).toarray()
X_test = vector.transform(X_test['tweet']).toarray()  # transform only: refitting on the test set would leak and change the vocabulary
For the third model I used fastText for sentence vectorization:
%%time
sent_vec = []
for index, row in X_train.iterrows():
    sent_vec.append(avg_feature_vector(row['tweet']))

%%time
sent_vec1 = []
for index, row in X_test.iterrows():
    sent_vec1.append(avg_feature_vector(row['tweet']))
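The avg_feature_vector helper isn't shown in the question; a minimal sketch of what such a function typically looks like, using a toy embedding table in place of a loaded fastText model (the table, dimension, and fallback behavior here are assumptions):

```python
import numpy as np

DIM = 4  # real fastText models typically use 100-300 dimensions
word_vectors = {            # toy embedding table standing in for a fastText model
    "fake": np.ones(DIM),
    "news": np.full(DIM, 2.0),
}

def avg_feature_vector(sentence, dim=DIM):
    """Average the word vectors of a sentence; zero vector if no word is known."""
    vecs = [word_vectors[w] for w in sentence.split() if w in word_vectors]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

print(avg_feature_vector("fake news"))  # -> [1.5 1.5 1.5 1.5]
```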
After scaling, my third model fits the input like this:
scaler.fit(sent_vec)
scaled_X_train = scaler.transform(sent_vec)
scaled_X_test = scaler.transform(sent_vec1)
.
.
.
knn_model1.fit(scaled_X_train, y_train)
Now I want to combine the three models so that the ensemble returns the majority vote, just like VotingClassifier, but I have no idea how to deal with the different inputs (TF-IDF & fastText). Is there another way to do that?
CodePudding user response:
You can create a custom MyVotingClassifier that takes already-fitted models instead of model instances yet to be trained. In VotingClassifier, sklearn takes unfitted classifiers as input, trains them, and then applies voting to the predicted results. You can write something like the following; it may not be exactly what you need, but you can adapt it for your purpose.
from collections import Counter
clf1 = knn_model_1.fit(X1, y)
clf2 = knn_model_2.fit(X2, y)
clf3 = knn_model_3.fit(X3, y)
class MyVotingClassifier:
    def __init__(self, **models):
        self.models = models

    def predict(self, dict_X):
        '''
        dict_X = {'knn_model_1': X1, 'knn_model_2': X2, 'knn_model_3': X3}
        '''
        preds = []
        for model_name in dict_X:
            model = self.models[model_name]
            preds.append(model.predict(dict_X[model_name]))
        # Transpose so each tuple holds the votes of all models for one sample
        preds = list(zip(*preds))
        # Majority vote per sample
        final_pred = [Counter(votes).most_common(1)[0][0] for votes in preds]
        return final_pred
ensemble_model = MyVotingClassifier(knn_model_1=clf1, knn_model_2=clf2, knn_model_3=clf3)
ensemble_model.predict({'knn_model_1': X1, 'knn_model_2': X2, 'knn_model_3': X3}) # Input the pre-processed `X`s
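Another option is to wrap each vectorizer and KNN inside a sklearn Pipeline, so a standard VotingClassifier can work directly on raw text. A hedged sketch with toy texts and labels (all illustrative assumptions); the fastText-based model would need a custom transformer wrapping avg_feature_vector, which is omitted here:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier

# Toy data standing in for the real 'tweet' column and labels
texts = ["fake news here", "real report today", "fake claim again",
         "verified real story", "fake hoax spreading", "real coverage live"]
labels = [1, 0, 1, 0, 1, 0]

# Each pipeline carries its own vectorizer, so inputs stay raw text
pipe1 = Pipeline([("tfidf", TfidfVectorizer(ngram_range=(1, 1))),
                  ("knn", KNeighborsClassifier(n_neighbors=3))])
pipe2 = Pipeline([("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
                  ("knn", KNeighborsClassifier(n_neighbors=3))])

# Hard voting = majority vote over the pipelines' predictions
ensemble = VotingClassifier([("m1", pipe1), ("m2", pipe2)], voting="hard")
ensemble.fit(texts, labels)
print(ensemble.predict(["fake hoax here"]))
```

The advantage is that fitting, voting, and cross-validation all go through the usual sklearn API instead of a hand-rolled class.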