I am building a binary classifier with a random forest. Beforehand, I ran feature selection, keeping the features that individually gave a high AUC score. However, when I tried to compute the AUC for the final model, I couldn't. The code is below. Sorry that I can't share the dataset.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.feature_selection import VarianceThreshold
df_process_label1 = 'AAA'
X = df_process.iloc[:,200:500]
y = df_process[df_process_label1].values
import sklearn
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 0)
constant_filter = VarianceThreshold(threshold = 0.01)
constant_filter.fit(X_train)
X_train_filter = constant_filter.transform(X_train)
X_test_filter = constant_filter.transform(X_test)
roc_auc = []
for features in X_train.columns:
    clf = RandomForestClassifier(n_estimators = 100, random_state=0)
    clf.fit(X_train[features].to_frame(), y_train)
    y_pred = clf.predict(X_test[features].to_frame())
    roc_auc.append(roc_auc_score(y_test, y_pred))
roc_values = pd.Series(roc_auc)
roc_values.index = X_train.columns
roc_values.sort_values(ascending = False, inplace =True)
sel = roc_values[roc_values>0.5]
sel
X_train_roc = X_train[sel.index]
X_test_roc = X_test[sel.index]
def run_randomForest(X_train, X_test, y_train, y_test):
    clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=1)
    clf.fit(X_train, y_train)
    y_pred1 = clf.predict(X_test)
    print('Accuracy on test set: ', accuracy_score(y_test, y_pred))
    print(roc_auc_score(y_test, RandomForestClassifier.predict_proba(X_test)[:,1]))
%time
run_randomForest(X_train_roc, X_test_roc, y_train, y_test)
However, one error keeps appearing over and over again:
TypeError: predict_proba() missing 1 required positional argument: 'X'
Do you know how to fix it? Thanks in advance!
CodePudding user response:
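You should use clf.predict_proba(X_test) instead of RandomForestClassifier.predict_proba(X_test). predict_proba is an instance method, so when you call it on the class, X_test is taken as self and the X argument is reported as missing. Call it on clf, the fitted instance. I think you also need to fix this part: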
y_pred1 = clf.predict(X_test)
print('Accuracy on test set: ', accuracy_score(y_test, y_pred))
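You assign the predictions to y_pred1 but pass y_pred to accuracy_score. Inside the function, y_pred resolves to the global variable left over from the feature-selection loop, so the printed accuracy belongs to the last single-feature model, not to this one.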
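Putting both fixes together, the function could look roughly like the sketch below. Since the original dataset is not available, the data here is a synthetic stand-in generated with make_classification, which is an assumption for illustration only. The return value and the y_score name are also additions that are not in the original code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split


def run_randomForest(X_train, X_test, y_train, y_test):
    clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=1)
    clf.fit(X_train, y_train)

    # Hard class labels from this model, used for accuracy.
    y_pred = clf.predict(X_test)
    # Probabilities from the fitted instance (clf), not from the class.
    # Column 1 corresponds to clf.classes_[1], the positive class for 0/1 labels.
    y_score = clf.predict_proba(X_test)[:, 1]

    acc = accuracy_score(y_test, y_pred)
    auc = roc_auc_score(y_test, y_score)
    print('Accuracy on test set: ', acc)
    print('ROC AUC on test set: ', auc)
    return acc, auc


# Synthetic stand-in for the missing dataset (assumption, not the original data).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
acc, auc = run_randomForest(X_train, X_test, y_train, y_test)
```

With your own data, you would call run_randomForest(X_train_roc, X_test_roc, y_train, y_test) exactly as before.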