I was trying to find the max 3 values in a list while implementing my kNN model. I first did it with the method that felt intuitive to me; the code was roughly as follows: `
first_k = X_train['distance'].sort_values().head(k)
prediction = first_k.value_counts().idxmax()
` Here first_k contains the first k elements of the sorted distance column, and prediction is what the model finally returns.
Another approach I found on the internet was this: `
prediction = y_train[X_train["distance"].nsmallest(n=k).index].mode()[0]
` The second approach yields the correct results, while my approach did not work as intended. Can someone explain why my approach did not work?
CodePudding user response:
The difference is the use of `.index` after the method `nsmallest(n=k)` in the alternative approach. What you are doing in your code is the following:
- Sort X using `distance` as the sorting key, then take the first k elements in the sorted dataset
- Count how often each distance value occurs and take the first occurrence of the most frequent distance
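To make this concrete, here is a minimal sketch on a made-up dataset (the values and labels are hypothetical, chosen only for illustration). It runs the code from the question unchanged and shows that the result is a distance, not a class label:

```python
import pandas as pd

# Hypothetical toy data: distances to the query point and class labels.
X_train = pd.DataFrame({"distance": [0.9, 0.2, 0.5, 0.2, 0.7, 0.1]})
y_train = pd.Series(["a", "b", "a", "b", "a", "a"])
k = 3

# Code from the question: first_k holds distance values, not labels.
first_k = X_train["distance"].sort_values().head(k)
prediction = first_k.value_counts().idxmax()

# y_train is never consulted, so prediction is a number from the distance column.
print(prediction)
```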
The alternative approach instead does the following steps:
- Recover the k smallest elements in the `distance` column
- Get the corresponding index values of the rows recovered in the previous step (for example, with `k=5` it could be something that, when printed, looks similar to `Int64Index([3, 9, 10, 1, 8], dtype='int64')`)
- Recover from `y` the labels with the same index values as the ones recovered in the previous step
- Get the most frequent label among them (the `mode`)
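The same steps can be written out one per line. This is only a sketch on hypothetical toy data; the intermediate variable names are mine, and I use `.loc` to make the label-based lookup explicit, which should behave like the `y_train[...]` form in the one-liner when both objects share the same index:

```python
import pandas as pd

# Hypothetical toy data sharing the default integer index.
X_train = pd.DataFrame({"distance": [0.9, 0.2, 0.5, 0.2, 0.7, 0.1]})
y_train = pd.Series(["a", "b", "a", "b", "a", "a"])
k = 3

nearest = X_train["distance"].nsmallest(n=k)      # step 1: k smallest distances
neighbour_index = nearest.index                   # step 2: row labels of those rows
neighbour_labels = y_train.loc[neighbour_index]   # step 3: class labels of those rows
prediction = neighbour_labels.mode()[0]           # step 4: most frequent label

print(list(neighbour_index), prediction)
```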
So, as you can see, the main difference is that your code returns the most frequent distance value, which is not a class label and says nothing about the most frequent class among the k neighbours that you have recovered.
Anyway, your code can be easily fixed:
first_k = X_train['distance'].sort_values().head(k).index
prediction = y_train[first_k].mode()[0]
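As a quick sanity check, the fix can be wrapped in a small function and compared with the `nsmallest` version. This is a sketch under assumptions: `knn_predict` is a hypothetical helper name, the data is made up, and `.loc` is used so the lookup is clearly by index label even when the index is not the default 0..n-1:

```python
import pandas as pd

def knn_predict(X_train, y_train, k):
    """Hypothetical helper: majority label among the k smallest distances."""
    first_k = X_train["distance"].sort_values().head(k).index
    return y_train.loc[first_k].mode()[0]

# Hypothetical toy data with a non-default index shared by X_train and y_train.
X_train = pd.DataFrame({"distance": [0.4, 0.8, 0.3, 0.6, 0.1]},
                       index=[10, 11, 12, 13, 14])
y_train = pd.Series(["x", "y", "x", "y", "y"], index=X_train.index)

fixed = knn_predict(X_train, y_train, k=3)
reference = y_train.loc[X_train["distance"].nsmallest(n=3).index].mode()[0]
print(fixed, reference)
```

Note that `mode()` returns its values in sorted order, so when two labels are tied, `[0]` picks the one that sorts first.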