In my dataset, I have points, whose locations are given by X and Y, that are grouped by ID. I want to calculate the nearest neighbour (NN) distance of each point to the points in other groups. In other words, if the ID of a point is 1, the code should search for the NN distance (NND) among the points that satisfy ID != 1.
Pseudo-R code could look like this:
DT[DT, c("nn_dist", "nn_X", "nn_Y") := find_NNN(data.table(i.X, i.Y), .SD[ID != i.ID]), by = .EACHI]
To achieve this, I tried writing imperative code with loops and such, but it was too slow. I tried using the get.knnx function from the FNN library, but then I couldn't figure out how to get both the NN distance and the position of the NN.
How can I do this calculation on a relatively large (~10,000 rows) dataset?
Here is a tiny portion of the dataset I'm using:
structure(list(ID = c(1L, 1L, 2L, 2L), X = c(318L, 317L, 1273L,
1272L), Y = c(1L, 2L, 1L, 2L), t = c(1, 1, 1, 1), uid = c(1L,
2L, 1271L, 1272L)), row.names = c(NA, -4L), class = c("data.table",
"data.frame"))
The t column is the same for all of the points in the dataset.
uid is a unique id given to every point, regardless of its position or group, so max(uid) equals the number of rows in the dataset.
CodePudding user response:
You can create a small distance function that is passed x, y, and a data frame d of all candidate points (i.e. the x and y values of all the points in other ID groups), and returns the coordinates and uid of the nearest inter-group point:
dist <- function(x,y,d) d[order(d[,sqrt((X-x)^2 + (Y-y)^2)])][1]
Then apply the function to the frame
df[, c("nnX", "nnY", "nn"):=dist(X,Y, df[ID!=.BY$ID,.(X,Y,uid)]),by=.(ID,uid)]
Output:
ID X Y t uid nnX nnY nn
1: 1 318 1 1 1 1272 2 1272
2: 1 317 2 1 2 1272 2 1272
3: 2 1273 1 1 1271 318 1 1
4: 2 1272 2 1 1272 318 1 1
If you additionally want the distance to that nearest neighbor, you could either update the function and the call to the function like this:
dist <- function(x,y,d) {
  d[, nn_dist:=sqrt((X-x)^2 + (Y-y)^2)][order(nn_dist)][1]
}
df[, c("nnX", "nnY", "nn", "nn_dist"):=dist(X,Y, df[ID!=.BY$ID,.(X,Y,uid)]),by=.(ID,uid)]
Output:
ID X Y t uid nnX nnY nn nn_dist
1: 1 318 1 1 1 1272 2 1272 954.0005
2: 1 317 2 1 2 1272 2 1272 955.0000
3: 2 1273 1 1 1271 318 1 1 955.0000
4: 2 1272 2 1 1272 318 1 1 954.0005
or, you could use the first function and compute the distance at the end using df[, nn_dist := sqrt((X-nnX)^2 + (Y-nnY)^2)]
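Sorting every candidate just to keep the first row does more work than needed. As a minimal sketch of the same idea, the variant below picks the closest candidate with which.min instead. The helper name nearest_other is hypothetical (it is not part of data.table), and the toy data is copied from the question with the t column left out.

```r
library(data.table)

# Toy data copied from the question (t column omitted).
df <- data.table(
  ID  = c(1L, 1L, 2L, 2L),
  X   = c(318L, 317L, 1273L, 1272L),
  Y   = c(1L, 2L, 1L, 2L),
  uid = c(1L, 2L, 1271L, 1272L)
)

# nearest_other() is a hypothetical helper, not a data.table function.
# It returns the coordinates, uid and distance of the candidate in d
# that is closest to the single point (x, y).
nearest_other <- function(x, y, d) {
  dists <- sqrt((d$X - x)^2 + (d$Y - y)^2)
  i <- which.min(dists)  # index of the smallest distance, no sorting needed
  list(d$X[i], d$Y[i], d$uid[i], dists[i])
}

# One call per point; the candidates are all points whose ID differs.
df[, c("nnX", "nnY", "nn", "nn_dist") :=
     nearest_other(X, Y, df[ID != .BY$ID, .(X, Y, uid)]),
   by = .(ID, uid)]

print(df)
```

This still subsets the table once per point, so it is a sketch of the technique rather than a tuned solution. Subsetting the candidates once per ID group would likely reduce the overhead further.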
CodePudding user response:
I once had to do a similar job.
The nncross function in the spatstat library can do this work. I don't know how to use data.table, but the following example code should be easy to follow.
library(spatstat)
# create an example dataframe of 100 points that belong to 10 classes
points <- data.frame(X = rnorm(100, mean = 0, sd = 1),
                     Y = rnorm(100, mean = 0, sd = 1),
                     class_id = sample(1:10, size = 100, replace = TRUE))
# convert the dataframe of points to ppp object that can be used as input for nncross function
points.ppp <- as.ppp(points,
                     W = owin(xrange = range(points$X), yrange = range(points$Y))
)
# for each point, find its NN among the points that belong to another class. Passing the class ids as "iX" and "iY" makes nncross treat points with the same id as identical and skip them, so only cross-class neighbours are considered; see the help page of the nncross function for details.
NND_which <- nncross(points.ppp,
                     points.ppp,
                     iX = as.integer(points.ppp$marks),
                     iY = as.integer(points.ppp$marks)
)
# put the NN pairs as well as the distance between them in a new dataframe
NND_pair <- data.frame(NND_source_id = 1:nrow(points),
                       NND_target_id = NND_which$which,
                       NND_dist = NND_which$dist)
This function should be much faster than loops in R. But if you have an extremely large number of points, say billions, I recommend writing your own function in C++ with the help of Rcpp.
I don't know the FNN library; probably someone else who knows how to use it can compare which function is faster.
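For reference, here is a minimal sketch of how get.knnx from FNN could return both the distance and the position of the nearest neighbour in another group. get.knnx returns a list with nn.index (row numbers into the data argument) and nn.dist, so the index can be used to look up the neighbour's coordinates and uid. The column names nn_uid, nn_X, nn_Y and nn_dist are illustrative choices, the toy data is copied from the question, and the sketch assumes at least two distinct ID groups. No timing comparison against nncross is implied.

```r
library(FNN)

# Toy data copied from the question (t column omitted).
pts <- data.frame(
  ID  = c(1L, 1L, 2L, 2L),
  X   = c(318, 317, 1273, 1272),
  Y   = c(1, 2, 1, 2),
  uid = c(1L, 2L, 1271L, 1272L)
)

# Result columns (illustrative names), filled in group by group.
pts$nn_uid  <- NA_integer_
pts$nn_X    <- NA_real_
pts$nn_Y    <- NA_real_
pts$nn_dist <- NA_real_

# Loop over groups, not over points: one nearest-neighbour query per ID.
for (g in unique(pts$ID)) {
  in_g  <- pts$ID == g
  cand  <- pts[!in_g, ]                        # points in all other groups
  query <- as.matrix(pts[in_g, c("X", "Y")])   # points of the current group

  res <- get.knnx(data = as.matrix(cand[, c("X", "Y")]), query = query, k = 1)
  idx <- res$nn.index[, 1]                     # row numbers within cand

  pts$nn_uid[in_g]  <- cand$uid[idx]
  pts$nn_X[in_g]    <- cand$X[idx]
  pts$nn_Y[in_g]    <- cand$Y[idx]
  pts$nn_dist[in_g] <- res$nn.dist[, 1]
}

print(pts)
```

Because nn.index refers to rows of the candidate subset rather than of the full table, the lookup goes through cand before writing back into pts.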