Efficient/Fastest way to compute minimal distance between groups in a data frame-CodePudding

I have a data frame looks like:

x y group
1 2  1 
1 3  1
1 4  2
1 5  2
1 6  3
...

For each group, I would like to find the distance to its 'nearest' group. Here, nearest is defined as the group which has the shortest distance to that group; and distance is defined as the shortest distance between all members from those two groups. For example, the distances between all members within group 1 to all members within group 2 is:

(1,2) -> (1,4) = 2
(1,2) -> (1,5) = 3
(1,3) -> (1,4) = 1
(1,3) -> (1,5) = 2

1 is the shortest, therefore the distance between group 1 and 2 is 1. Same idea, the distances between all members within group 1 to all members within group is:

(1,2) -> (1,6) = 4
(1,3) -> (1,6) = 3

therefore the distance between group 1 and 3 is 3. Since 3 > 1, therefore the nearest neighbor to group 1 is group 2, and the distance is 1. I would like to apply this metric to a really large dataset and I am able to achieve this idea using nested-for loops, but apparently it is very slow. Is there any faster solution? Appreciated!

CodePudding user response：

Here is an approach that loops over pairs of groups but is at least vectorized within pairs:

d <- data.frame(x = 1L, y = 2:6, group = c(1L, 1L, 2L, 2L, 3L))
m <- do.call(rbind, d[c("x", "y")])
l <- lapply(split(seq_len(ncol(m)), d$group), function(j) m[, j, drop = FALSE])
rm(m); gc()

distance <- function(x, y) {
    j <- rep(seq_len(ncol(x)), each = ncol(y))
    min(sqrt(colSums((x[, j, drop = FALSE] - as.vector(y))^2)))
}

D <- outer(l, l, Vectorize(distance))
D
##   1 2 3
## 1 0 1 3
## 2 1 0 1
## 3 3 1 0

I would avoid outer, though, since it doesn't take advantage of the properties of the distance function, namely that distance(x, x) == 0 and distance(x, y) == distance(y, x) for all groups x and y. To obtain the outer result more efficiently, I would do:

D <- matrix(0, length(l), length(l))
D[lower.tri(D)] <- combn(length(l), 2L, function(i) distance(l[[i[1L]]], l[[i[2L]]]))
D <- D   t(D)
D
##      [,1] [,2] [,3]
## [1,]    0    1    3
## [2,]    1    0    1
## [3,]    3    1    0

CodePudding user response：

You can compute the distance between each pair of x, y points using stats::dist(). After a little manipulation of the results using {broom} and {dplyr}, you can find the minimum distance within each pair of groups.

library(dplyr)
library(broom)

df <- data.frame(
  x = rep(1, 5),
  y = 2:6,
  group = c(1, 1, 2, 2, 3)
)

item_groups <- df %>% 
  transmute(item = factor(row_number()), group)

dist(df[c("x", "y")]) %>% 
  broom::tidy() %>% 
  left_join(item_groups, by = c("item1" = "item")) %>% 
  left_join(item_groups, by = c("item2" = "item"), suffix = c(".1", ".2")) %>% 
  group_by(group.1, group.2) %>% 
  filter(group.1 != group.2, distance == min(distance))

#> # A tibble: 3 x 5
#> # Groups:   group.1, group.2 [3]
#>   item1 item2 distance group.1 group.2
#>   <fct> <fct>    <dbl>   <dbl>   <dbl>
#> 1 2     3            1       1       2
#> 2 2     5            3       1       3
#> 3 4     5            1       2       3

^{Created on 2022-03-01 by the reprex package (v2.0.1)}

CodePudding user response：

Does this help?

library(tidyverse)
data <- tribble(
      ~x, ~y, ~group,
      1,2, 1,
      1,3, 1,
      1,4, 2,
      1,5, 2,
      1,6, 3
    )
    data %>% 
      mutate(sum_of_x_y = x y) %>% 
      group_by(group)%>% 
      summarize(min_group =  min(sum_of_x_y))

# group min_group
# <dbl> <dbl>
# 1 3           
# 2 5           
# 3 7

CodePudding user response：

Here's another way

g = length(unique(df$grp))

matrix(
  df[, `:=`(con = 1)][df,allow.cartesian=T,on="con"] %>% 
  .[,dist:=sqrt((x-i.x)^2   (y-i.y)^2)] %>% 
  .[, min(dist), by=.(grp,i.grp)] %>% 
  .[order(grp, i.grp),V1],g,g)

Output:

     [,1] [,2] [,3]
[1,]    0    1    3
[2,]    1    0    1
[3,]    3    1    0

If you have too many points to do the full cartesian join, you can do this, where you do for each of the pairs:

df[,con:=1]

func <- function(df) {
  df[df,allow.cartesian=T,on="con"] %>% 
    .[,dist:=sqrt((x-i.x)^2   (y-i.y)^2)] %>% 
    .[grp!=i.grp, min(dist), by=.(grp,i.grp)][1,V1]
}

grps = unique(df$grp)
vals = apply(combn(grps,2), 2, \(p) func(df[grp %in% p]))
M = matrix(0, length(grps),length(grps))
M[lower.tri(M)] <- vals
M[upper.tri(M)] <- vals

     [,1] [,2] [,3]
[1,]    0    1    3
[2,]    1    0    1
[3,]    3    1    0