Home > database >  Minimum distances among grouped values
Minimum distances among grouped values

Time:04-28

I have 2-dimensional data that is grouped by time and ID, and for each unique timestamp, I'm trying to identify the nearest neighbor for one ID to any of the others within that timestamp (i.e. get nearest neighbor for all IDs within timestamp 1, then get nearest neighbor for all IDs within timestmap 2, and so forth). I have cartesian coordinates; I know that distm() returns a matrix, but don't know how to do this with non-Lat/Long coordinates. dplyr or other options welcomed.

dat<-structure(list(timestamp = structure(c(1585958400, 1585958400, 
1585958400, 1585958400, 1585958400, 1585962000, 1585962000, 1585962000
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), ID = structure(c(1L, 
12L, 25L, 47L, 51L, 12L, 47L, 50L), .Label = c("1", "10", "11", 
"12", "13", "14", "15", "16", "17", "18", "19", "2", "20", "21", 
"21b", "22", "23", "23b", "24", "25", "26", "27", "28", "29", 
"3", "30", "31", "32", "33", "34", "35", "36", "37", "38", "39", 
"4", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49", 
"5", "50", "51", "6", "7", "8", "9"), class = "factor"), x = c(740.693095051881, 
743.998405321748, 739.480351500548, 744.040204814706, 743.357515557653, 
744.432012727096, 744.604552094105, 746.065894928645), y = c(2300.91570604786, 
2304.38234085183, 2304.9102593082, 2304.33896409476, 2302.08781536681, 
2304.18977683083, 2304.07139807708, 2301.47198606318)), row.names = c(NA, 
8L), class = "data.frame")

> dat
            timestamp ID        x        y
1 2020-04-04 00:00:00  1 740.6931 2300.916
2 2020-04-04 00:00:00  2 743.9984 2304.382
3 2020-04-04 00:00:00  3 739.4804 2304.910
4 2020-04-04 00:00:00  5 744.0402 2304.339
5 2020-04-04 00:00:00  7 743.3575 2302.088
6 2020-04-04 01:00:00  2 744.4320 2304.190
7 2020-04-04 01:00:00  5 744.6046 2304.071
8 2020-04-04 01:00:00  6 746.0659 2301.472

What I need is a 5th column in that dataframe with the distance to the nearest neighbor. I do not need to know which respective IDs are the closest, I just need the value.

Edit:

Following @Dave2e 's suggestion, I've tried this. It's probably really hacky as I don't know how to make it neater and more streamlined. I hope I'm not missing anything and have a serious flaw in my thinking. I tried to come up with a dplyr solution, but I don't know how to temporarily create the distance matrix within a dplyr call per group.

#Option1: looping through subsets for each unique timestamp

dat.2<-data.frame()

for(i in 1:length(unique(dat$timestamp))){
  
  dat.sub<-dat[dat$timestamp==unique(dat$timestamp)[i],]
  d<-as.data.frame(as.matrix(dist(dat.sub[ , c("x", "y")], upper=TRUE)))
  d[d==0]<-NA #exclude distance to self
  dat.sub$min.d<-sapply(d, min, na.rm=T)
  
  dat.2<-rbind(dat.2, dat.sub)
}

> dat.2
            timestamp ID        x        y      min.d
1 2020-04-04 00:00:00  1 740.6931 2300.916 2.91083783
2 2020-04-04 00:00:00  2 743.9984 2304.382 0.06023903
3 2020-04-04 00:00:00  3 739.4804 2304.910 4.17459012
4 2020-04-04 00:00:00  5 744.0402 2304.339 0.06023903
5 2020-04-04 00:00:00  7 743.3575 2302.088 2.35238926
6 2020-04-04 01:00:00  2 744.4320 2304.190 0.20924474
7 2020-04-04 01:00:00  5 744.6046 2304.071 0.20924474
8 2020-04-04 01:00:00  6 746.0659 2301.472 2.98202376


#Option 2: using dplyr - don't know how

dat.2<-as.data.frame(dat%>%
                       group_by(timestamp)%>%
                       ...dplyr magic ... 
                       summarize? mutate? 
                     )

CodePudding user response:

Is this what you are looking for?

groups <-split(dat, dat$timestamp)
answer <-lapply(groups, function(dat){
   #calculate distance matrix
   d<-dist(dat[ , c("x", "y")], upper=TRUE, diag = TRUE)
   #convert to matrix
   dm<-as.matrix(d)

   #find the smallest value in each row less self
   closest <-sapply(1:nrow(dm), function(i){
      names(which.min(dm[i,-i]))
   })

dat$closest <-as.integer(closest)
dat
})
dplyr::bind_rows(answer)



timestamp ID        x        y closest
1 2020-04-04 00:00:00  1 740.6931 2300.916       5
2 2020-04-04 00:00:00  2 743.9984 2304.382       4
3 2020-04-04 00:00:00  3 739.4804 2304.910       1
4 2020-04-04 00:00:00  5 744.0402 2304.339       2
5 2020-04-04 00:00:00  7 743.3575 2302.088       7
6 2020-04-04 01:00:00  2 744.4320 2304.190       7
7 2020-04-04 01:00:00  5 744.6046 2304.071       6
8 2020-04-04 01:00:00  6 746.0659 2301.472       5

CodePudding user response:

I would try something like this:

library(data.table)

f<- function(x,y){ 
  d <-as.matrix(dist(data.table(x,y), upper=T))
  d[d==0] <- NA
  apply(d,1,min,na.rm=T)
}

setDT(dat)[, mindist:=f(x,y), by=timestamp]

Output:

             timestamp     ID        x        y    mindist
                <POSc> <fctr>    <num>    <num>      <num>
1: 2020-04-04 00:00:00      1 740.6931 2300.916 2.91083783
2: 2020-04-04 00:00:00      2 743.9984 2304.382 0.06023903
3: 2020-04-04 00:00:00      3 739.4804 2304.910 4.17459012
4: 2020-04-04 00:00:00      5 744.0402 2304.339 0.06023903
5: 2020-04-04 00:00:00      7 743.3575 2302.088 2.35238926
6: 2020-04-04 01:00:00      2 744.4320 2304.190 0.20924474
7: 2020-04-04 01:00:00      5 744.6046 2304.071 0.20924474
8: 2020-04-04 01:00:00      6 746.0659 2301.472 2.98202376
  • Related