I have been reading about performing hyperparameter tuning for the KNN algorithm, and I understand that the best practice is to make sure that, for each fold, the dataset is normalized and oversampled inside a pipeline (to avoid data leakage and overfitting).
What I'm trying to do is identify the number of neighbors (n_neighbors) that gives me the best training accuracy. In the code I have set the candidate numbers of neighbors to list(range(1, 50)) and the number of cross-validation folds to cv=10.
My code is below:
# dataset reading & preprocessing libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
# oversampling
from imblearn.over_sampling import SMOTE
# KNN model related libraries
import cuml
from imblearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from cuml.neighbors import KNeighborsClassifier
#loading the dataset
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/dataset/IanDataset.csv")
#filling missing values with zeros
df = df.fillna(0)
# replace the byte-string object values with plain numeric strings
df["command response"].replace({"b'0'": "0", "b'1'": "1"}, inplace=True)
df["binary result"].replace({"b'0'": "0", "b'1'": "1"}, inplace=True)
#change the datatype of some features to be able to be used later
df["command response"] = pd.to_numeric(df["command response"]).astype(float)
df["binary result"] = pd.to_numeric(df["binary result"]).astype(int)
# dataset splitting
X = df.iloc[:, 0:17]
y_bin = df.iloc[:, 17]
# splitting the dataset into train and test for binary classification
X_train, X_test, y_bin_train, y_bin_test = train_test_split(X, y_bin, random_state=0, test_size=0.2)
# make a pipeline that normalizes, oversamples and runs the classifier, to be used by GridSearchCV
pipe = Pipeline([
    ('normalization', MinMaxScaler()),
    ('oversampling', SMOTE()),
    ('classifier', KNeighborsClassifier(metric='eculidean', output='input'))
])
#Using GridSearchCV
neighbors = list(range(1,50))
parameters = {
    'classifier__n_neighbors': neighbors
}
grid_search = GridSearchCV(pipe, parameters, cv=10)
grid_search.fit(X_train, y_bin_train)
print("Best Accuracy: {}" .format(grid_search.best_score_))
print("Best num of neighbors: {}" .format(grid_search.best_estimator_.get_params()['n_neighbors']))
At the step grid_search.fit(X_train, y_bin_train), the program repeatedly raises this error:
/usr/local/lib/python3.7/site-packages/sklearn/model_selection/_validation.py:619: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 598, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "/usr/local/lib/python3.7/dist-packages/imblearn/pipeline.py", line 266, in fit
self._final_estimator.fit(Xt, yt, **fit_params_last_step)
File "/usr/local/lib/python3.7/site-packages/cuml/internals/api_decorators.py", line 409, in inner_with_setters
return func(*args, **kwargs)
File "cuml/neighbors/kneighbors_classifier.pyx", line 176, in cuml.neighbors.kneighbors_classifier.KNeighborsClassifier.fit
File "/usr/local/lib/python3.7/site-packages/cuml/internals/api_decorators.py", line 409, in inner_with_setters
return func(*args, **kwargs)
File "cuml/neighbors/nearest_neighbors.pyx", line 397, in cuml.neighbors.nearest_neighbors.NearestNeighbors.fit
ValueError: Metric is not valid. Use sorted(cuml.neighbors.VALID_METRICSeculidean[brute]) to get valid options.
I'm not sure which side this error is coming from. Is it because I'm importing the KNN algorithm from the cuML library instead of sklearn? Or is there something wrong with my Pipeline and GridSearchCV implementation?
CodePudding user response:
This error indicates you've passed an invalid value for the metric
parameter (in both scikit-learn and cuML). You've misspelled "euclidean".
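Before fixing it, you can list the metric names cuML actually accepts, as the error message itself suggests. A quick check, assuming cuml.neighbors exposes the VALID_METRICS mapping referenced in the traceback:
import cuml.neighbors

# 'brute' is the algorithm in use here; printing its entry shows the
# accepted metric names, including 'euclidean'
print(sorted(cuml.neighbors.VALID_METRICS['brute']))
With the spelling corrected to metric='euclidean', the same pipeline and grid search fit cleanly: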
import cuml
from sklearn import datasets
from sklearn.preprocessing import MinMaxScaler
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from cuml.neighbors import KNeighborsClassifier
X, y = datasets.make_classification(
    n_samples=100
)
pipe = Pipeline([
    ('normalization', MinMaxScaler()),
    ('oversampling', SMOTE()),
    ('classifier', KNeighborsClassifier(metric='euclidean', output='input'))
])
parameters = {
    'classifier__n_neighbors': [1, 3, 6]
}
grid_search = GridSearchCV(pipe, parameters, cv=2)
grid_search.fit(X, y)
This now runs without the metric error and returns the fitted search object:
GridSearchCV(cv=2,
             estimator=Pipeline(steps=[('normalization', MinMaxScaler()),
                                       ('oversampling', SMOTE()),
                                       ('classifier', KNeighborsClassifier())]),
             param_grid={'classifier__n_neighbors': [1, 3, 6]})
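One further note on reading out the result: because the classifier sits inside a Pipeline, its parameters are namespaced by the step name, so best_estimator_.get_params()['n_neighbors'] from your original code would raise a KeyError. A small sketch using the standard scikit-learn attributes, with the pipeline-prefixed key:
# the winning grid entry is stored under its pipeline-prefixed key
print("Best Accuracy: {}".format(grid_search.best_score_))
print("Best num of neighbors: {}".format(grid_search.best_params_['classifier__n_neighbors']))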