Different approaches for finding the max 'k' values in a Python list for implementing knn

I tried to find the max 3 values in a list to implement my knn model. I first did it the way that felt intuitive to me; the code was roughly as follows: `

first_k = X_train['distance'].sort_values().head(k)
prediction = first_k.value_counts().idxmax()

` first_k contains the first k elements of the sorted distance column. prediction is what the model finally returns.

Another approach I found on the internet was this `

prediction = y_train[X_train["distance"].nsmallest(n=k).index].mode()[0]

` The second approach yields the correct results, while my approach does not work as intended. Can someone explain why my approach does not work?

CodePudding user response:

The difference is in the usage of .index after the method nsmallest(n=k) in the alternative approach. What you are doing in your code is the following:

Sort the distance column of X_train in ascending order, then take the first k elements of the sorted result
Count how often each distance value occurs and return the first occurrence of the most frequent distance
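To make the two steps above concrete, here is a minimal sketch on made-up data (the distance values are an assumption chosen for illustration, not taken from the question). It suggests that the original code returns a distance value rather than a class label:

```python
import pandas as pd

# Hypothetical toy data: distances from a query point to five training rows.
X_train = pd.DataFrame({"distance": [0.5, 0.1, 0.3, 0.3, 0.9]})
k = 3

# Step 1: the k smallest distances (the values are distances, the index is row labels).
first_k = X_train["distance"].sort_values().head(k)

# Step 2: value_counts() counts distance values, so idxmax() returns a distance.
prediction = first_k.value_counts().idxmax()
print(prediction)
```

Here prediction is a float taken from the distance column; y_train is never consulted, so no label can come out of it.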

The alternative approach instead does the following steps:

Recover the k smallest elements in the distance column
Get the index values of the rows recovered in the previous step (for example, with k=5 this could be an object that prints as something similar to Int64Index([3, 9, 10, 1, 8], dtype='int64'))
Look up in y_train the labels that have the same index values as the ones recovered in the previous step
Get the most frequent label in y (or the mode)
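The four steps of the alternative approach can be sketched one line at a time. The data below is hypothetical, the intermediate variable names (nearest, nearest_idx, nearest_labels) are introduced only for illustration, and .loc is used in place of plain [] to make the label-based lookup explicit:

```python
import pandas as pd

# Hypothetical toy data: five training rows with precomputed distances and labels.
X_train = pd.DataFrame({"distance": [0.5, 0.1, 0.3, 0.3, 0.9]})
y_train = pd.Series(["a", "b", "a", "a", "b"])
k = 3

nearest = X_train["distance"].nsmallest(n=k)   # 1. the k smallest distances
nearest_idx = nearest.index                     # 2. index values of those rows
nearest_labels = y_train.loc[nearest_idx]       # 3. labels at the same index values
prediction = nearest_labels.mode()[0]           # 4. most frequent label
print(prediction)
```

The key point is step 2: the index is the only link between the distances in X_train and the labels in y_train.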

So, as you can see, the main difference is that your code returns the most frequent distance, which is not necessarily related to the most frequent class among the k neighbours that you have recovered.

Anyway, your code can be easily fixed:

first_k = X_train['distance'].sort_values().head(k).index
prediction = y_train[first_k].mode()[0]
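
As a quick check of the fix, the sketch below wraps it in a small helper and runs it on data with a non-default index, as you might get after a shuffled train/test split. The function name predict_from_distances and the data are assumptions made for illustration; .loc is used instead of plain [] only to make the label-based lookup explicit.

```python
import pandas as pd


def predict_from_distances(distances: pd.Series, labels: pd.Series, k: int):
    """Return the most common label among the k rows with the smallest distance."""
    # Keep the index of the k nearest rows, not their distance values.
    nearest_idx = distances.sort_values().head(k).index
    # Look the labels up by index value, then take the mode.
    return labels.loc[nearest_idx].mode()[0]


# Hypothetical data whose index is not 0..n-1.
idx = [42, 7, 19, 3, 25]
distances = pd.Series([0.8, 0.2, 0.4, 0.1, 0.9], index=idx)
labels = pd.Series(["x", "y", "y", "x", "x"], index=idx)

prediction = predict_from_distances(distances, labels, k=3)

# The nsmallest-based variant from the question should agree.
alternative = labels.loc[distances.nsmallest(n=3).index].mode()[0]
print(prediction, alternative)
```

Both variants select rows 3, 7 and 19 here, so they return the same label.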