I have 2-dimensional data that is grouped by time and ID, and for each unique timestamp, I'm trying to identify the nearest neighbor for one ID to any of the others within that timestamp (i.e. get nearest neighbor for all IDs within timestamp 1, then get nearest neighbor for all IDs within timestmap 2, and so forth). I have cartesian coordinates; I know that distm()
returns a matrix, but don't know how to do this with non-Lat/Long coordinates. dplyr
or other options welcomed.
dat<-structure(list(timestamp = structure(c(1585958400, 1585958400,
1585958400, 1585958400, 1585958400, 1585962000, 1585962000, 1585962000
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), ID = structure(c(1L,
12L, 25L, 47L, 51L, 12L, 47L, 50L), .Label = c("1", "10", "11",
"12", "13", "14", "15", "16", "17", "18", "19", "2", "20", "21",
"21b", "22", "23", "23b", "24", "25", "26", "27", "28", "29",
"3", "30", "31", "32", "33", "34", "35", "36", "37", "38", "39",
"4", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49",
"5", "50", "51", "6", "7", "8", "9"), class = "factor"), x = c(740.693095051881,
743.998405321748, 739.480351500548, 744.040204814706, 743.357515557653,
744.432012727096, 744.604552094105, 746.065894928645), y = c(2300.91570604786,
2304.38234085183, 2304.9102593082, 2304.33896409476, 2302.08781536681,
2304.18977683083, 2304.07139807708, 2301.47198606318)), row.names = c(NA,
8L), class = "data.frame")
> dat
timestamp ID x y
1 2020-04-04 00:00:00 1 740.6931 2300.916
2 2020-04-04 00:00:00 2 743.9984 2304.382
3 2020-04-04 00:00:00 3 739.4804 2304.910
4 2020-04-04 00:00:00 5 744.0402 2304.339
5 2020-04-04 00:00:00 7 743.3575 2302.088
6 2020-04-04 01:00:00 2 744.4320 2304.190
7 2020-04-04 01:00:00 5 744.6046 2304.071
8 2020-04-04 01:00:00 6 746.0659 2301.472
What I need is a 5th column in that dataframe with the distance to the nearest neighbor. I do not need to know which respective IDs are the closest, I just need the value.
Edit:
Following @Dave2e 's suggestion, I've tried this. It's probably really hacky as I don't know how to make it neater and more streamlined. I hope I'm not missing anything and have a serious flaw in my thinking.
I tried to come up with a dplyr
solution, but I don't know how to temporarily create the distance matrix within a dplyr call per group.
#Option1: looping through subsets for each unique timestamp
dat.2<-data.frame()
for(i in 1:length(unique(dat$timestamp))){
dat.sub<-dat[dat$timestamp==unique(dat$timestamp)[i],]
d<-as.data.frame(as.matrix(dist(dat.sub[ , c("x", "y")], upper=TRUE)))
d[d==0]<-NA #exclude distance to self
dat.sub$min.d<-sapply(d, min, na.rm=T)
dat.2<-rbind(dat.2, dat.sub)
}
> dat.2
timestamp ID x y min.d
1 2020-04-04 00:00:00 1 740.6931 2300.916 2.91083783
2 2020-04-04 00:00:00 2 743.9984 2304.382 0.06023903
3 2020-04-04 00:00:00 3 739.4804 2304.910 4.17459012
4 2020-04-04 00:00:00 5 744.0402 2304.339 0.06023903
5 2020-04-04 00:00:00 7 743.3575 2302.088 2.35238926
6 2020-04-04 01:00:00 2 744.4320 2304.190 0.20924474
7 2020-04-04 01:00:00 5 744.6046 2304.071 0.20924474
8 2020-04-04 01:00:00 6 746.0659 2301.472 2.98202376
#Option 2: using dplyr - don't know how
dat.2<-as.data.frame(dat%>%
group_by(timestamp)%>%
...dplyr magic ...
summarize? mutate?
)
CodePudding user response:
Is this what you are looking for?
groups <-split(dat, dat$timestamp)
answer <-lapply(groups, function(dat){
#calculate distance matrix
d<-dist(dat[ , c("x", "y")], upper=TRUE, diag = TRUE)
#convert to matrix
dm<-as.matrix(d)
#find the smallest value in each row less self
closest <-sapply(1:nrow(dm), function(i){
names(which.min(dm[i,-i]))
})
dat$closest <-as.integer(closest)
dat
})
dplyr::bind_rows(answer)
timestamp ID x y closest
1 2020-04-04 00:00:00 1 740.6931 2300.916 5
2 2020-04-04 00:00:00 2 743.9984 2304.382 4
3 2020-04-04 00:00:00 3 739.4804 2304.910 1
4 2020-04-04 00:00:00 5 744.0402 2304.339 2
5 2020-04-04 00:00:00 7 743.3575 2302.088 7
6 2020-04-04 01:00:00 2 744.4320 2304.190 7
7 2020-04-04 01:00:00 5 744.6046 2304.071 6
8 2020-04-04 01:00:00 6 746.0659 2301.472 5
CodePudding user response:
I would try something like this:
library(data.table)
f<- function(x,y){
d <-as.matrix(dist(data.table(x,y), upper=T))
d[d==0] <- NA
apply(d,1,min,na.rm=T)
}
setDT(dat)[, mindist:=f(x,y), by=timestamp]
Output:
timestamp ID x y mindist
<POSc> <fctr> <num> <num> <num>
1: 2020-04-04 00:00:00 1 740.6931 2300.916 2.91083783
2: 2020-04-04 00:00:00 2 743.9984 2304.382 0.06023903
3: 2020-04-04 00:00:00 3 739.4804 2304.910 4.17459012
4: 2020-04-04 00:00:00 5 744.0402 2304.339 0.06023903
5: 2020-04-04 00:00:00 7 743.3575 2302.088 2.35238926
6: 2020-04-04 01:00:00 2 744.4320 2304.190 0.20924474
7: 2020-04-04 01:00:00 5 744.6046 2304.071 0.20924474
8: 2020-04-04 01:00:00 6 746.0659 2301.472 2.98202376