I am trying to find a way to find the closest value in a vector, from elements in the SAME vector, but excluding the row in question. For example, suppose I have dataframe A with one column (column_1):
column_1
1
5
6
2
3
0
5
2
1
9
I want to add a second column which, for every element in column_1 finds the closest value in THAT SAME vector excluding the row in question. Desired output is below:
column_1 column_2
1 1
5 5
6 5
2 2
3 2
0 1
5 5
2 2
1 1
9 6
I have seen people discuss how to do this where the closest value for each element in a vector (a) is identified from another vector (b) via the following:
which(abs(a-b)==min(a-b))
Does anyone know how to modify the above, or do this in some other way, so that I can look within the same vector and exclude the row in question (example: the third row in column_1 is closest to 5 not 6, since I exclude its own row from the search vector. However, the fourth row in column_1 is closest to 2 since even when excluding the fourth row, there is another 2 value in the 8th row)
CodePudding user response:
You could sort the vector, then check only the number before and number after, and replace the original vector:
y <- sort(x)
z <- c(-Inf, y, Inf)
b <- cbind(head(z, -2), tail(z, -2))
x[order(x)] <- b[cbind(seq_along(y), max.col(-abs(b - y)))]
x
[1] 1 5 5 2 2 1 5 2 1 6
Notice that this method has the lowest complexity as compared to the above solutions. Ie it is the fastest:
onyambu <- function(x){
y <- sort(x)
z <- c(-Inf, y, Inf)
b <- cbind(head(z, -2), tail(z, -2))
x[order(x)] <- b[cbind(seq_along(y), max.col(-abs(b - y)))]
x
}
gregor <- function(x){
dist = outer(x, x, FUN = \(a, b) abs(a - b))
diag(dist) = Inf
mins = max.col(-dist, ties.method = 'first')
x[mins]
}
x <- rnorm(1000)
microbenchmark::microbenchmark(onyambu(x), gregor(x), check = 'equal', times = 1)
Unit: milliseconds
expr min lq mean median uq max neval
onyambu(x) 5.7839 5.7839 5.7839 5.7839 5.7839 5.7839 1
gregor(x) 3021.9425 3021.9425 3021.9425 3021.9425 3021.9425 3021.9425 1
CodePudding user response:
# sample data
x = c(1, 5, 6, 2, 3, 0, 5, 2, 1, 9)
# make a distance matrix and set diagonal to Inf
dist = outer(x, x, FUN = \(a, b) abs(a - b))
diag(dist) = Inf
# find the index of the min value on each row
# (which is the index of the max negative value
# so we can use the convenient max.col)
mins = max.col(-dist)
# show the result
y = x[mins]
cbind(x, y)
# x y
# [1,] 1 1
# [2,] 5 5
# [3,] 6 5
# [4,] 2 2
# [5,] 3 2
# [6,] 0 1
# [7,] 5 5
# [8,] 2 2
# [9,] 1 1
# [10,] 9 6
CodePudding user response:
Using the get.knn
function from the FNN
package:
x <- c(1, 5, 6, 2, 3, 0, 5, 2, 1, 9)
cbind(x, x[FNN::get.knn(x, 1)$nn.index])
#> x
#> [1,] 1 1
#> [2,] 5 5
#> [3,] 6 5
#> [4,] 2 2
#> [5,] 3 2
#> [6,] 0 1
#> [7,] 5 5
#> [8,] 2 2
#> [9,] 1 1
#> [10,] 9 6
CodePudding user response:
Here is a way. Compute all distances from each element of the column to all other elements. To set the diagonal element to Inf
is a trick to avoid index arithmetic. Then return the index to the minimum distance. This is used to assign the wanted values.
df1 <-
structure(list(
column_1 = c(1, 5, 6, 2, 3, 0, 5, 2, 1, 9)),
class = "data.frame", row.names = c(NA, -10L))
i <- sapply(seq_along(df1$column_1), \(i) {
d <- abs(df1$column_1[i] - df1$column_1)
d[i] <- Inf
which.min(d)
})
df1$column_2 <- df1$column_1[i]
df1
#> column_1 column_2
#> 1 1 1
#> 2 5 5
#> 3 6 5
#> 4 2 2
#> 5 3 2
#> 6 0 1
#> 7 5 5
#> 8 2 2
#> 9 1 1
#> 10 9 6
Created on 2022-11-14 with reprex v2.0.2
CodePudding user response:
We can use dist
max.col
to locate the closest neighbor
> x <- c(1, 5, 6, 2, 3, 0, 5, 2, 1, 9)
> cbind(x, y = x[max.col(-`diag<-`(as.matrix(dist(x)), Inf))])
x y
[1,] 1 1
[2,] 5 5
[3,] 6 5
[4,] 2 2
[5,] 3 2
[6,] 0 1
[7,] 5 5
[8,] 2 2
[9,] 1 1
[10,] 9 6