I'm trying to create a KNN function from scratch and then compare it to scikit-learn's KNeighborsClassifier. I am using the iris dataset for testing.
Based on what I have learned, I have to take each data point individually and calculate the distance between it and the rest of the training data. The last step is to assign it the target value of the closest training point. For some reason, when I do this I get an error rate of 4%. Why is this the case?
from sklearn import datasets, metrics
import numpy as np

iris = datasets.load_iris()
X = iris.data
Y = iris.target

def PPV(data, target):
    target_res = []
    true = 0
    for i in range(len(target)):
        # leave observation i out of the training set (np.delete returns a new array)
        training_data = np.delete(data, i, 0)
        training_target = np.delete(target, i, 0)
        # predict the label of the nearest remaining training point
        nearest = np.argmin(metrics.pairwise.euclidean_distances([data[i]], training_data))
        target_res.append(training_target[nearest])
        # print(f"{i} has target prediction {training_target[nearest]}")
    for i in range(len(target)):
        if target[i] == target_res[i]:
            true = true + 1
    print(f"The predicted PPV target values are: {target_res}")
    print(f"PPV precision: {true*100/len(target)}%")

PPV(X, Y)
The output for the code above is:
The predicted PPV target values are: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
PPV precision: 96.0%
KNeighbors precision: 100.0%
Unless I am missing something, I should be able to get the same results as the KNeighborsClassifier algorithm for K=1, since they share the same principle.
CodePudding user response:
You are trying to classify each observation with a 1-Nearest Neighbor classifier after deleting that observation from the training set. Because the observation is no longer in the training set, there is no guarantee it will be correctly classified, so the scored accuracy can be lower than 100%.
If you are doing something like this:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import datasets, metrics

iris = datasets.load_iris()
X = iris.data
y = iris.target

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)
knn_results = knn.predict(X)  # we are predicting our own training data here
metrics.accuracy_score(y, knn_results)  # 1.0
You will get 100% accuracy because you are classifying observations using 1-NN with those same observations in the training set. The 1-NN classifier will find the perfectly matching point every time.
If you change the n_neighbors parameter or evaluate on fresh test data, the accuracy may no longer be 100% in this example.
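Incidentally, your leave-one-out procedure can be reproduced with scikit-learn's own cross-validation utilities. Below is a minimal sketch using LeaveOneOut and cross_val_score from sklearn.model_selection; each fold holds out exactly one observation and trains 1-NN on the rest, which mirrors the np.delete loop in your code, so it should land at the same 96% figure:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target

# each fold trains 1-NN on 149 points and tests on the single held-out point
knn = KNeighborsClassifier(n_neighbors=1)
scores = cross_val_score(knn, X, y, cv=LeaveOneOut())
print(scores.mean())  # ~0.96, matching the from-scratch loop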
Also, the scoring metric you are computing in your code is accuracy, not precision (PPV): https://en.wikipedia.org/wiki/Confusion_matrix
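If you actually want precision, scikit-learn provides it alongside accuracy; here is a small illustration with hypothetical label arrays (average='macro' is just one common choice for multiclass data):
from sklearn import metrics

# hypothetical true and predicted labels, standing in for your iris results
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

print(metrics.accuracy_score(y_true, y_pred))                    # fraction of all correct labels
print(metrics.precision_score(y_true, y_pred, average='macro'))  # per-class PPV, averaged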