I built a KNN model for classification. Unfortunately, my model's accuracy is only around 80%, and I would like to get a better result. Can I ask for some tips? Maybe I used too many predictors?
My data: https://www.openml.org/search?type=data&sort=runs&id=53&status=active
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV
heart_disease = pd.read_csv('heart_disease.csv', sep=';', decimal=',')
y = heart_disease['heart_disease']
X = heart_disease.drop(["heart_disease"], axis=1)
correlation_matrix = heart_disease.corr()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
scaler = MinMaxScaler(feature_range=(-1,1))
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
knn_3 = KNeighborsClassifier(n_neighbors=3, n_jobs=-1)
knn_3.fit(X_train, y_train)
y_train_pred = knn_3.predict(X_train)
labels = ['0', '1']
print('Training set')
print(pd.DataFrame(confusion_matrix(y_train, y_train_pred), index = labels, columns = labels))
print(accuracy_score(y_train, y_train_pred))
print(f1_score(y_train, y_train_pred))
y_test_pred = knn_3.predict(X_test)
print('Test set')
print(pd.DataFrame(confusion_matrix(y_test, y_test_pred), index = labels, columns = labels))
print(accuracy_score(y_test, y_test_pred))
print(f1_score(y_test, y_test_pred))
hyperparameters = {'n_neighbors' : range(1, 15), 'weights': ['uniform','distance']}
knn_best = GridSearchCV(KNeighborsClassifier(), hyperparameters, n_jobs = -1, error_score = 'raise')
knn_best.fit(X_train,y_train)
print(knn_best.best_params_)
y_train_pred_best = knn_best.predict(X_train)
y_test_pred_best = knn_best.predict(X_test)
print('Training set')
print(pd.DataFrame(confusion_matrix(y_train, y_train_pred_best), index = labels, columns = labels))
print(accuracy_score(y_train, y_train_pred_best))
print(f1_score(y_train, y_train_pred_best))
print('Test set')
print(pd.DataFrame(confusion_matrix(y_test, y_test_pred_best), index = labels, columns = labels))
print(accuracy_score(y_test, y_test_pred_best))
print(f1_score(y_test, y_test_pred_best))
CodePudding user response:
There are a few things you can try to improve the accuracy of your KNN model.
First, you can tune the model's hyperparameters, such as the number of nearest neighbors to consider or the distance metric used to measure the similarity between points.
You are already running a grid search over n_neighbors and weights; widening that grid with cross-validation, for example to also search the distance metric, may find a better combination, as in the sketch below.
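A minimal sketch of a wider grid, reusing the X_train/y_train from your code; the n_neighbors range, the p values, and the scoring='f1' choice are illustrative assumptions, not tuned recommendations:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

hyperparameters = {
    'n_neighbors': range(1, 30),
    'weights': ['uniform', 'distance'],
    'p': [1, 2],  # Minkowski power: 1 = Manhattan, 2 = Euclidean
}
# scoring='f1' assumes the binary 0/1 target used in the question
knn_search = GridSearchCV(KNeighborsClassifier(), hyperparameters,
                          cv=5, scoring='f1', n_jobs=-1)
knn_search.fit(X_train, y_train)
print(knn_search.best_params_, knn_search.best_score_)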
You can also try preprocessing your data to make it more suitable for KNN. For example, you can reduce the dimensionality of the data with a technique like principal component analysis (PCA). Distance-based methods such as KNN degrade as the number of dimensions grows, so removing redundant dimensions can make the nearest neighbors more meaningful.
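A sketch of that idea, reusing the already-scaled X_train/X_test from your code; n_components=5 is an arbitrary placeholder, not a recommendation:
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

pca = PCA(n_components=5)  # placeholder; inspect pca.explained_variance_ratio_ to choose it
X_train_pca = pca.fit_transform(X_train)  # X_train is already scaled above
X_test_pca = pca.transform(X_test)
knn_pca = KNeighborsClassifier(n_neighbors=3)
knn_pca.fit(X_train_pca, y_train)
print(knn_pca.score(X_test_pca, y_test))  # test accuracy after PCA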
Additionally, you can try using a different classification algorithm altogether, such as logistic regression or a decision tree. These algorithms may be better suited to your data and can potentially yield better results than KNN.
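For a quick baseline comparison under default settings (both classifiers have hyperparameters of their own worth tuning):
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

for model in (LogisticRegression(max_iter=1000),
              DecisionTreeClassifier(random_state=123)):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))  # test accuracy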
Another thing you can try is an ensemble method, such as bagging, to combine multiple KNN models and potentially improve their accuracy. Ensemble methods can be effective at reducing overfitting and improving the generalizability of your model (boosting is another ensemble option, though it is usually paired with base learners other than KNN). A sketch follows.
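A sketch of bagging KNN, written against scikit-learn >= 1.2 (older versions name the first argument base_estimator); n_estimators and max_samples are illustrative values:
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

bagged_knn = BaggingClassifier(
    estimator=KNeighborsClassifier(n_neighbors=3),  # 'base_estimator' before scikit-learn 1.2
    n_estimators=20,   # number of KNN models in the ensemble
    max_samples=0.5,   # each model sees a random half of the training set
    n_jobs=-1,
    random_state=123,
)
bagged_knn.fit(X_train, y_train)
print(bagged_knn.score(X_test, y_test))  # test accuracy of the ensemble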
CodePudding user response:
Just a small part of an answer: how to find the best value for n_neighbors.
import numpy as np
import matplotlib.pyplot as plt

errlist = []  # error rate for each candidate k
for i in range(1, 40):  # try n_neighbors from 1 to 39
    knn_i = KNeighborsClassifier(n_neighbors=i)
    knn_i.fit(X_train, y_train)
    errlist.append(np.mean(knn_i.predict(X_test) != y_test))  # fraction of misclassified test points
Plot the error rate against k to see the best n_neighbors:
plt.plot(range(1, 40), errlist)
plt.show()
Feel free to change the range bounds.