I am trying to convert the base R code in Introduction to Statistical Learning into the R tidymodels ecosystem. The book uses class::knn() and tidymodels uses kknn::kknn(). With a fixed k, I got different results from the two. So I stripped out tidymodels and compared class::knn() and kknn::kknn() directly, and I still got different results. class::knn() uses Euclidean distance, and kknn::kknn() uses Minkowski distance with a distance parameter of 2, which according to Wikipedia is the Euclidean distance. I set the kernel in kknn to "rectangular", which according to the documentation means the neighbors are unweighted. Shouldn't the results of KNN modeling with a fixed k be the same?
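(As a quick check, not from the book, base R's dist() confirms that the Minkowski distance with p = 2 is just the Euclidean distance:)
# two points (0, 0) and (3, 4); both methods should give 5
m <- rbind(c(0, 0), c(3, 4))
dist(m, method = "euclidean")
dist(m, method = "minkowski", p = 2)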
Here is (basically) the base R code from the book, using class::knn():
library(ISLR2)
# base R class
train <- (Smarket$Year < 2005)
Smarket.2005 <- Smarket[!train, ]
dim(Smarket.2005)
Direction.2005 <- Smarket$Direction[!train]
train.X <- cbind(Smarket$Lag1, Smarket$Lag2)[train, ]
test.X <- cbind(Smarket$Lag1, Smarket$Lag2)[!train, ]
train.Direction <- Smarket$Direction[train]
the_k <- 3 # 30 shows larger discrepancies
library(class)
knn.pred <- knn(train.X, test.X, train.Direction, k = the_k)
Here is my tidyverse code, using kknn::kknn():
# tidyverse kknn
library(tidyverse)
Smarket_train <- Smarket %>%
  filter(Year != 2005)
Smarket_test <- Smarket %>% # Smarket.2005
  filter(Year == 2005)
library(kknn)
the_knn <-
  kknn(
    Direction ~ Lag1 + Lag2, Smarket_train, Smarket_test, k = the_k,
    distance = 2, kernel = "rectangular"
  )
fit <- fitted(the_knn)
This shows the differences:
the_k
# class
table(Direction.2005, knn.pred)
# kknn
table(Smarket_test$Direction, fit)
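To see exactly which test observations the two methods disagree on (a small addition on top of the objects already created above), you can compare the two prediction vectors directly:
# rows where class::knn() and kknn::kknn() predict different classes
which(as.character(knn.pred) != as.character(fit))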
Did I make a stupid mistake in the coding? If not, can anybody explain the differences between class::knn() and kknn::kknn()?
CodePudding user response:
Alright, there is a lot going on in this one. First, we see from the documentation of class::knn() that the classification is decided by majority vote, with ties broken at random. So it appears we should start by looking at the output of class::knn() to see what happens.
I repeatedly called
which(knn(train.X, test.X, train.Direction, k = the_k) !=
      knn(train.X, test.X, train.Direction, k = the_k))
and after a while I got 28 and 66. So these are the observations in the test data set that have some randomness in them.
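A more systematic way to find these rows (my addition, not part of the original approach) is to run class::knn() several times and flag the test rows whose prediction changes between runs:
# run class::knn() 20 times; rows whose prediction is not constant across
# runs are the ones where a tie is being broken at random
preds <- replicate(20, as.character(knn(train.X, test.X, train.Direction, k = the_k)))
which(apply(preds, 1, function(x) length(unique(x)) > 1))
#> with enough repetitions this should point at observations 28 and 66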
To see why these two observations are troublesome, we can set prob = TRUE in class::knn() to get the predicted probabilities.
knn.pred <- knn(train.X, test.X, train.Direction, k = the_k, prob = TRUE)
attr(knn.pred, "prob")
#> [1] 0.6666667 0.6666667 0.6666667 0.6666667 0.6666667 1.0000000 0.6666667
#> [8] 0.6666667 0.6666667 0.6666667 0.6666667 0.6666667 0.6666667 1.0000000
#> [15] 0.6666667 0.6666667 0.6666667 0.6666667 0.6666667 0.6666667 0.6666667
#> [22] 0.6666667 0.6666667 0.6666667 0.6666667 0.6666667 0.6666667 0.5000000
#> [29] 0.6666667 0.6666667 1.0000000 0.6666667 0.6666667 0.6666667 0.6666667
#> [36] 1.0000000 0.6666667 0.6666667 0.6666667 1.0000000 1.0000000 1.0000000
#> [43] 0.6666667 0.6666667 0.6666667 0.6666667 1.0000000 0.6666667 1.0000000
#> [50] 1.0000000 0.6666667 1.0000000 0.6666667 0.6666667 1.0000000 1.0000000
#> [57] 0.6666667 0.6666667 0.6666667 1.0000000 0.6666667 0.6666667 0.6666667
#> [64] 0.6666667 1.0000000 0.5000000 0.6666667 1.0000000 0.6666667 1.0000000
...
And here we see that the predicted probabilities for observations 28 and 66 are both 0.5. But how can that be when k = 3?
To answer that, we will take a look at the nearest neighbors of these points. I'm going to use the RANN::nn2() function to calculate the distances between the training set and the testing set. Let us look at the first observation as an example; we calculate the distances and pull them out:
dists <- RANN::nn2(train.X, test.X)
dists$nn.dists[1, ]
#> [1] 0.01063015 0.05632051 0.06985700 0.08469357 0.08495881 0.08561542
#> [7] 0.10823123 0.12003333 0.12621014 0.12657014
The distances by themselves don't tell us much; what we want to know is which observations in the training set they correspond to, and what their classes are. We can pull this out with $nn.idx:
dists$nn.idx[1, ]
#> [1] 503 411 166 964 981 611 840 705 562 578
train.Direction[dists$nn.idx[1, 1:3]]
#> [1] Up Down Down
#> Levels: Down Up
And we see here that the three nearest neighbors of the first observation are Up, Down, and Down, giving a classification of Down.
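As a quick sanity check (my addition), tabulating the votes of these three neighbors reproduces the 2/3 probability that class::knn() reported for this observation:
# proportion of votes for each class among the 3 nearest neighbours
nn_classes <- train.Direction[dists$nn.idx[1, 1:3]]
prop.table(table(nn_classes))
#> nn_classes
#>      Down        Up
#> 0.6666667 0.3333333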
If we look at the 66th observation we see something different. Notice how the 3rd and 4th nearest neighbors have the exact same distance?
dists$nn.dists[66, ]
#> [1] 0.06500000 0.06754258 0.07465253 0.07465253 0.07746612 0.07778175
#> [7] 0.08905055 0.09651943 0.11036757 0.11928118
train.Direction[dists$nn.idx[66, 1:4]]
#> [1] Down Down Up Up
#> Levels: Down Up
And when we look at their classes, there are 2 Up and 2 Down. And this is where the discrepancy comes in: class::knn() counts all 4 of these observations as the "3 nearest neighbors", which gives a tie that is broken at random. kknn::kknn() takes the first 3 neighbors, disregarding the tie in distances, and predicts Down, since the first 3 neighbors are 2 Down and 1 Up.
predict(the_knn, type = "prob")[66, ]
#> Down Up
#> [1,] 0.6666667 0.3333333
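For completeness (this is my addition, not part of the original answer): class::knn() has a use.all argument that controls exactly this tie handling, as described in ?class::knn. A sketch of how you might experiment with it:
# use.all = TRUE (the default) includes all neighbours tied at the k-th
# distance, which is what produces the random 2-vs-2 vote for observation 66.
# use.all = FALSE samples among the tied neighbours so that exactly k are
# used -- note this is still random, so it won't exactly reproduce
# kknn::kknn(), which simply keeps the first k neighbours.
knn.pred.noties <- knn(train.X, test.X, train.Direction, k = the_k, use.all = FALSE)
table(Direction.2005, knn.pred.noties)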