Nearest neighbour distance of points to the point that are not in the same group

Time:05-10

In my dataset, I have points whose locations are given by X and Y and that are grouped by ID. I want to calculate the nearest neighbour (NN) distance from each point to the points in the other groups. In other words, if the ID of a point is 1, the code should search for the nearest neighbour only among the points that satisfy ID != 1.

A pseudo-R-code could look like this:

DT[DT, c("nn_dist", "nn_X", "nn_Y") := find_NNN(data.table(i.X, i.Y), .SD[ID != i.ID]), by = .EACHI]

To achieve this, I tried writing imperative code with explicit loops, but it was too slow. I also tried the get.knnx function from the FNN library, but I couldn't figure out how to get both the NN distance and the position of the NN.

How can I do this calculation on a relatively large (~10,000 rows) dataset?

Here is a tiny portion of the dataset I'm using

data.table::setDT(structure(list(ID = c(1L, 1L, 2L, 2L), X = c(318L, 317L, 1273L, 
1272L), Y = c(1L, 2L, 1L, 2L), t = c(1, 1, 1, 1), uid = c(1L, 
2L, 1271L, 1272L)), row.names = c(NA, -4L), class = "data.frame"))
  • the t column is the same for all of the points in the dataset
  • uid is a unique id given to every point, regardless of its position or group, so max(uid) equals the number of rows in the dataset

CodePudding user response:

You can create a small distance function that is passed x, y, and a data frame d of all candidate points (i.e. the X, Y and uid values of all the points in the other ID groups), and that returns the coordinates and uid of the nearest inter-group point:

dist <- function(x,y,d) d[order(d[,sqrt((X-x)^2 + (Y-y)^2)])][1]

Then apply the function to the frame

df[, c("nnX", "nnY", "nn"):=dist(X,Y, df[ID!=.BY$ID,.(X,Y,uid)]),by=.(ID,uid)]

Output:

   ID    X Y t  uid  nnX nnY   nn
1:  1  318 1 1    1 1272   2 1272
2:  1  317 2 1    2 1272   2 1272
3:  2 1273 1 1 1271  318   1    1
4:  2 1272 2 1 1272  318   1    1

If you additionally want the distance to that nearest neighbor, you could either update the function and the call to the function like this:

dist <- function(x,y,d) {
  d[, nn_dist:=sqrt((X-x)^2 + (Y-y)^2)][order(nn_dist)][1]
}
df[, c("nnX", "nnY", "nn", "nn_dist"):=dist(X,Y, df[ID!=.BY$ID,.(X,Y,uid)]),by=.(ID,uid)]

Output:

   ID    X Y t  uid  nnX nnY   nn  nn_dist
1:  1  318 1 1    1 1272   2 1272 954.0005
2:  1  317 2 1    2 1272   2 1272 955.0000
3:  2 1273 1 1 1271  318   1    1 955.0000
4:  2 1272 2 1 1272  318   1    1 954.0005

or, you could use the first function and compute the distance at the end using df[, nn_dist := sqrt((X-nnX)^2 + (Y-nnY)^2)].
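
Since only the single closest candidate is needed, a full sort per point is not strictly necessary. Here is a minimal sketch of the same idea using which.min instead of order. The helper name nearest is made up for this illustration, and df is rebuilt from the four sample rows in the question so that the snippet runs on its own:

```r
library(data.table)

# Sample rows from the question (the constant t column is omitted here)
df <- data.table(ID  = c(1L, 1L, 2L, 2L),
                 X   = c(318L, 317L, 1273L, 1272L),
                 Y   = c(1L, 2L, 1L, 2L),
                 uid = c(1L, 2L, 1271L, 1272L))

# Hypothetical helper: returns X, Y, uid and distance of the candidate in d
# that is closest to (x, y), without sorting all candidates
nearest <- function(x, y, d) {
  d2 <- (d$X - x)^2 + (d$Y - y)^2   # squared distances to every candidate
  i  <- which.min(d2)               # index of the closest candidate
  list(d$X[i], d$Y[i], d$uid[i], sqrt(d2[i]))
}

# One call per point; candidates are all points with a different ID
df[, c("nnX", "nnY", "nn", "nn_dist") :=
     nearest(X, Y, df[ID != .BY$ID, .(X, Y, uid)]),
   by = .(ID, uid)]
```

This still makes one pass over the other groups for every point, so the total work grows roughly with the square of the number of rows; it just avoids the extra cost of sorting.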

CodePudding user response:

I once had to do a similar job.

The nncross function in the spatstat library can do this work. I don't know how to use data.table, but the following example code should be easy to follow.

library(spatstat)

# create an example dataframe of 100 points that belong to 10 classes
points <- data.frame(X=rnorm(100, mean = 0, sd = 1),
                     Y=rnorm(100, mean = 0, sd = 1),
                     class_id = sample(1:10, size = 100, replace = TRUE))

# convert the dataframe of points to ppp object that can be used as input for nncross function
points.ppp <- as.ppp(points, 
                     W=owin( xrange = range(points$X),yrange = range(points$Y) )
                     )

# calculate for each points its NN point that belongs to another class. The "iX" and "iY" arguments are very important to tell the function to calculate only the cross-class distance, please see the help information of the nncross function for details.
NND_which <- nncross(points.ppp,
                 points.ppp,
                 iX = as.integer(points.ppp$marks), 
                 iY = as.integer(points.ppp$marks)
                 )

# put the NN pairs as well as the distance between them in a new dataframe
NND_pair <- data.frame(NND_source_id = 1:nrow(points),
                       NND_target_id = NND_which$which,
                       NND_dist = NND_which$dist)

This function should be much faster than loops in R. But if you have an extremely large number of points, say billions, I recommend writing your own function in C++ with the help of Rcpp.

I don't know the FNN library; perhaps someone who knows how to use it can compare which function is faster.
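
For comparison, here is a minimal sketch of how get.knnx from FNN might be used for this, with one query per group: the points of the current group are the query and the points of all other groups are the data. get.knnx returns a list with nn.index (row positions in the data matrix) and nn.dist, so the position of the neighbour can be looked up from the index. The helper xy and the loop structure are assumptions made for this illustration, and df is rebuilt from the four sample rows in the question:

```r
library(data.table)
library(FNN)

# Sample rows from the question (the constant t column is omitted here)
df <- data.table(ID  = c(1L, 1L, 2L, 2L),
                 X   = c(318L, 317L, 1273L, 1272L),
                 Y   = c(1L, 2L, 1L, 2L),
                 uid = c(1L, 2L, 1271L, 1272L))

# Hypothetical helper: coordinates as a numeric (double) matrix
xy <- function(d) {
  m <- as.matrix(d[, .(X, Y)])
  storage.mode(m) <- "double"
  m
}

for (g in unique(df$ID)) {
  cand <- df[ID != g]                        # candidates: every other group
  res  <- get.knnx(data = xy(cand), query = xy(df[ID == g]), k = 1)
  idx  <- res$nn.index[, 1]                  # row positions within cand
  df[ID == g, `:=`(nnX     = cand$X[idx],
                   nnY     = cand$Y[idx],
                   nn      = cand$uid[idx],
                   nn_dist = res$nn.dist[, 1])]
}
```

This runs one nearest-neighbour search per group rather than one per point, so it should suit a dataset with many points and comparatively few groups; I have not benchmarked it against nncross.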
