Compare overlap of groups pairwise using tidyverse

I have a tidy data.frame in this format:

library(tidyverse)
df <- data.frame(
  name  = c("Clarence", "Clarence", "Clarence", "Shelby", "Shelby", "Patricia", "Patricia"),
  fruit = c("Apple", "Banana", "Grapes", "Apple", "Apricot", "Banana", "Grapes")
)
df

#      name   fruit
#1 Clarence   Apple
#2 Clarence  Banana
#3 Clarence  Grapes
#4   Shelby   Apple
#5   Shelby Apricot
#6 Patricia  Banana
#7 Patricia  Grapes

I want to compare the overlap between groups in a pairwise manner (i.e. if both people have an apple, that counts as an overlap of 1), so that I end up with a data frame that looks like this:

df2 = data.frame(names = c("Clarence-Shelby", "Clarence-Patricia", "Shelby-Patricia"), n_overlap  = c(1, 2, 0))
df2

#              names n_overlap
#1   Clarence-Shelby       1
#2 Clarence-Patricia       2
#3   Shelby-Patricia       0

Is there an elegant way to do this in the tidyverse framework? My real dataset is much larger than this and will be grouped on multiple columns.

CodePudding user response：

Try this:

combinations <- apply(combn(unique(df$name), 2), 2, function(z) paste(sort(z), collapse = "-"))
combinations
# [1] "Clarence-Shelby"   "Clarence-Patricia" "Patricia-Shelby"  

library(dplyr)
df %>%
  group_by(fruit) %>%
  summarize(names = paste(sort(unique(name)), collapse = "-")) %>%
  right_join(tibble(names = combinations), by = "names") %>%
  group_by(names) %>%
  summarize(n_overlap = sum(!is.na(fruit)))
# # A tibble: 3 x 2
#   names             n_overlap
#   <chr>                 <int>
# 1 Clarence-Patricia         2
# 2 Clarence-Shelby           1
# 3 Patricia-Shelby           0
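
One caveat worth checking on a larger dataset: the pasted key is built per fruit, so a fruit shared by three or more people yields a key such as "A-B-C" that matches none of the pairs, and those overlaps go uncounted. A minimal sketch that sidesteps this by intersecting each pair's fruit sets directly, assuming the same `df` as in the question (the names `sets`, `pairs` and `result` are only illustrative):

```r
library(tibble)
library(purrr)

df <- data.frame(
  name  = c("Clarence", "Clarence", "Clarence", "Shelby", "Shelby", "Patricia", "Patricia"),
  fruit = c("Apple", "Banana", "Grapes", "Apple", "Apricot", "Banana", "Grapes"),
  stringsAsFactors = FALSE
)

# One character vector of fruits per person (list names are sorted alphabetically)
sets <- split(df$fruit, df$name)

# Every unordered pair of people, as a list of length-2 character vectors
pairs <- combn(names(sets), 2, simplify = FALSE)

# Count the fruits common to both members of each pair
result <- tibble(
  names     = map_chr(pairs, paste, collapse = "-"),
  n_overlap = map_int(pairs, function(p) length(intersect(sets[[p[1]]], sets[[p[2]]])))
)
result
```

Because each pair is compared on its own, the count does not depend on how many other people happen to share the same fruit.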

CodePudding user response：

If pairs with zero overlap are not needed, one solution is:

> df %>% inner_join(df,by="fruit") %>% filter(name.x<name.y) %>% count(name.x,name.y)
    name.x   name.y n
1 Clarence Patricia 2
2 Clarence   Shelby 1

If you also need the pairs that have no overlap:

> a = df %>% inner_join(df,by="fruit") %>% filter(name.x<name.y) %>% count(name.x,name.y)
> b = as.data.frame(t(combn(sort(unique(df$name)),2)))
> colnames(b)=colnames(a)[1:2]
> a %>% full_join(b) %>% replace_na(list(n=0))
Joining, by = c("name.x", "name.y")
    name.x   name.y n
1 Clarence Patricia 2
2 Clarence   Shelby 1
3 Patricia   Shelby 0
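
The same idea can also be written as one pipeline that returns the `names` / `n_overlap` layout from the question. This is only a sketch under a few assumptions: dplyr 1.1.0 or later (for the `relationship` argument, which silences the many-to-many join warning), character rather than factor columns, and illustrative object names (`overlaps`, `all_pairs`, `result`).

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  name  = c("Clarence", "Clarence", "Clarence", "Shelby", "Shelby", "Patricia", "Patricia"),
  fruit = c("Apple", "Banana", "Grapes", "Apple", "Apricot", "Banana", "Grapes"),
  stringsAsFactors = FALSE
)

# Self-join on fruit, keep each unordered pair once, count shared fruits
overlaps <- df %>%
  inner_join(df, by = "fruit", relationship = "many-to-many") %>%
  filter(name.x < name.y) %>%
  count(name.x, name.y, name = "n_overlap")

# Every unordered pair of names, including those that share nothing
all_pairs <- expand_grid(name.x = unique(df$name), name.y = unique(df$name)) %>%
  filter(name.x < name.y)

# Attach the counts, turn missing counts into 0, and build the pair label
result <- all_pairs %>%
  left_join(overlaps, by = c("name.x", "name.y")) %>%
  mutate(n_overlap = coalesce(n_overlap, 0L)) %>%
  unite("names", name.x, name.y, sep = "-")
result
```

Starting from the full list of pairs and left-joining the counts means the zero-overlap rows are present from the beginning, so no separate `full_join` step is needed.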