Compare overlap of groups pairwise using tidyverse

Time:05-25

I have a tidy data.frame in this format:

library(tidyverse)
df = data.frame(
  name  = c("Clarence", "Clarence", "Clarence", "Shelby", "Shelby", "Patricia", "Patricia"),
  fruit = c("Apple", "Banana", "Grapes", "Apple", "Apricot", "Banana", "Grapes")
)
df

#      name   fruit
#1 Clarence   Apple
#2 Clarence  Banana
#3 Clarence  Grapes
#4   Shelby   Apple
#5   Shelby Apricot
#6 Patricia  Banana
#7 Patricia  Grapes

I want to count the overlap between groups pairwise (i.e. if both people have an apple, that counts as an overlap of 1), so that I end up with a data frame that looks like this:

df2 = data.frame(names = c("Clarence-Shelby", "Clarence-Patricia", "Shelby-Patricia"), n_overlap  = c(1, 2, 0))
df2

#              names n_overlap
#1   Clarence-Shelby         1
#2 Clarence-Patricia         2
#3   Shelby-Patricia         0

Is there an elegant way to do this in the tidyverse framework? My real dataset is much larger than this and will be grouped on multiple columns.

CodePudding user response:

Try this: build every pair of names first, then count the fruits whose set of owners matches each pair.

combinations <- apply(combn(unique(df$name), 2), 2, function(z) paste(sort(z), collapse = "-"))
combinations
# [1] "Clarence-Shelby"   "Clarence-Patricia" "Patricia-Shelby"  

library(dplyr)
df %>%
  group_by(fruit) %>%
  summarize(names = paste(sort(unique(name)), collapse = "-")) %>%
  right_join(tibble(names = combinations), by = "names") %>%
  group_by(names) %>%
  summarize(n_overlap = sum(!is.na(fruit)))
# # A tibble: 3 x 2
#   names             n_overlap
#   <chr>                 <int>
# 1 Clarence-Patricia         2
# 2 Clarence-Shelby           1
# 3 Patricia-Shelby           0
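
One caveat with the approach above: it pastes together all names that share a fruit, so a fruit held by three or more people produces a string like "A-B-C" that matches no pair and is silently dropped. If that can happen in your data, a minimal sketch along these lines may be safer. It compares each pair's fruit sets directly with `intersect()`. The names `fruit_sets`, `pair_list` and `result` are illustrative, and `stringsAsFactors = FALSE` is set only so the example behaves the same on older R versions.

```r
df <- data.frame(
  name  = c("Clarence", "Clarence", "Clarence", "Shelby", "Shelby", "Patricia", "Patricia"),
  fruit = c("Apple", "Banana", "Grapes", "Apple", "Apricot", "Banana", "Grapes"),
  stringsAsFactors = FALSE
)

# Named list: one character vector of fruits per person
fruit_sets <- split(df$fruit, df$name)

# Every unordered pair of names, as a list of length-2 character vectors
pair_list <- combn(sort(names(fruit_sets)), 2, simplify = FALSE)

result <- data.frame(
  # "Clarence-Patricia", "Clarence-Shelby", ...
  names = vapply(pair_list, paste, character(1), collapse = "-"),
  # Number of distinct fruits the two people have in common
  n_overlap = vapply(
    pair_list,
    function(p) length(intersect(fruit_sets[[p[1]]], fruit_sets[[p[2]]])),
    integer(1)
  ),
  stringsAsFactors = FALSE
)

result
```

Because `intersect()` works on sets, duplicated name/fruit rows do not inflate the count, and pairs with no shared fruit come out as 0 without a separate join.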

CodePudding user response:

If pairs with zero overlap do not need to appear in the result, a self-join on fruit is enough:

> df %>% inner_join(df,by="fruit") %>% filter(name.x<name.y) %>% count(name.x,name.y)
    name.x   name.y n
1 Clarence Patricia 2
2 Clarence   Shelby 1

If you really need non-overlapping pairs:

> a = df %>% inner_join(df,by="fruit") %>% filter(name.x<name.y) %>% count(name.x,name.y)
> b = as.data.frame(t(combn(sort(unique(df$name)),2)))
> colnames(b)=colnames(a)[1:2]
> a %>% full_join(b) %>% replace_na(list(n=0))
Joining, by = c("name.x", "name.y")
    name.x   name.y n
1 Clarence Patricia 2
2 Clarence   Shelby 1
3 Patricia   Shelby 0
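
The two steps above can also be folded into one script that returns the `names` / `n_overlap` layout asked for in the question. This is only a sketch under a few assumptions: `name` is a character column (hence the explicit `stringsAsFactors = FALSE`), and the object names `pairs`, `overlaps` and `result` are made up for illustration. Recent dplyr versions may warn about a many-to-many relationship on the self-join; that is expected here.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  name  = c("Clarence", "Clarence", "Clarence", "Shelby", "Shelby", "Patricia", "Patricia"),
  fruit = c("Apple", "Banana", "Grapes", "Apple", "Apricot", "Banana", "Grapes"),
  stringsAsFactors = FALSE
)

# All unordered pairs of names, one row per pair
pairs <- as.data.frame(t(combn(sort(unique(df$name)), 2)), stringsAsFactors = FALSE)
colnames(pairs) <- c("name.x", "name.y")

# Shared fruits per pair: self-join on fruit, keep each pair once
overlaps <- df %>%
  inner_join(df, by = "fruit") %>%
  filter(name.x < name.y) %>%
  count(name.x, name.y, name = "n_overlap")

# Attach counts to every pair; pairs with no shared fruit get 0
result <- pairs %>%
  left_join(overlaps, by = c("name.x", "name.y")) %>%
  mutate(n_overlap = coalesce(n_overlap, 0L)) %>%
  unite("names", name.x, name.y, sep = "-")

result
```

Starting from `pairs` and using `left_join()` keeps the zero-overlap rows without a `full_join()` plus `replace_na()`, and passing `by` explicitly avoids the "Joining, by" message.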