Find similar character in string per group in R

I have the following dataframe called df (dput below):

  group string
1     1     Bc
2     1    EPc
3     1    Lkc
4     2    ABR
5     2     mA
6     2    Amt
7     3    Yrt
8     3    rtU
9     3    rti

I would like to find the characters that appear in every string within each group. For example, group 1 has the character c in every string. Here is the desired output:

  group similar
1     1       c
2     2       A
3     3      rt

So I was wondering if anyone knows how to find similar characters across every string per group in R?

dput df:

df <- structure(list(group = c("1", "1", "1", "2", "2", "2", "3", "3", 
"3"), string = c("Bc", "EPc", "Lkc", "ABR", "mA", "Amt", "Yrt", 
"rtU", "rti")), class = "data.frame", row.names = c(NA, -9L))

CodePudding user response：

We could split the string into characters and use intersect (with the help of Reduce):

base:

# strsplit() is the base R splitter; str_split() would need stringr loaded
aggregate(cbind(similar = string) ~ group,
          data = df,
          FUN = \(x) paste0(Reduce(intersect, strsplit(x, "")), collapse = ""))

dplyr:

library(dplyr)

df |>
    group_by(group) |>
    summarise(similar = paste0(Reduce(intersect, strsplit(string, "")), collapse = ""))

Output:

  group similar
  <chr> <chr>  
1 1     c      
2 2     A      
3 3     rt
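
To see why the Reduce(intersect, ...) idiom works, it may help to trace a single group by hand. The following is a minimal base R sketch, not part of the answer above; common_chars is a hypothetical helper name introduced only for illustration.

```r
# Hypothetical helper: characters shared by every string in the vector x
common_chars <- function(x) {
  chars <- strsplit(x, "")            # list with one character vector per string
  shared <- Reduce(intersect, chars)  # fold intersect() pairwise across the list
  paste0(shared, collapse = "")       # glue the surviving characters back together
}

# Group 3 from the question, step by step
strsplit(c("Yrt", "rtU", "rti"), "")  # list of c("Y","r","t"), c("r","t","U"), c("r","t","i")
common_chars(c("Yrt", "rtU", "rti"))  # "rt"
common_chars(c("ABR", "mA", "Amt"))   # "A"
```

Because intersect() drops duplicates and keeps the order of its first argument, the result follows the character order of the first string in the group.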

CodePudding user response：

Here is an option with tidyverse: split 'string' into single characters with separate_rows, group by 'group' and 'string', filter to keep the characters that occur once per string, take the distinct rows, then group by 'group' and paste the characters together.

library(dplyr)
library(tidyr)
library(stringr)
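df %>%
   # one row per character; the regex matches the empty position between two characters
   separate_rows(string, sep = "(?<=.)(?=.)") %>% 
   group_by(group, string) %>%
   # keep characters whose count equals n_distinct(df$group), which is 3 here
   # and matches the three strings in every group of this sample
   filter(n() == n_distinct(df$group)) %>% 
   distinct() %>% 
   group_by(group) %>% 
   # paste the remaining characters back into one string per group
   summarise(string = str_c(string, collapse = ""))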

-output

# A tibble: 3 × 2
  group string
  <chr> <chr> 
1 1     c     
2 2     A     
3 3     rt
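
The filter above compares each character's count with n_distinct(df$group), which is 3 and happens to equal the number of strings in every group of the sample data. For data where group sizes differ, or where a string repeats a character, a possible variant is to count in how many distinct strings each character appears. This is a hedged base R sketch rather than the answerer's method; df2 and shared_by_group are made-up names, and df2 is invented sample data.

```r
# Invented sample with unequal group sizes and a repeated character ("Bcc")
df2 <- data.frame(
  group  = c("1", "1", "2", "2", "2", "2"),
  string = c("Bcc", "EPc", "ABR", "mA", "Amt", "xA")
)

# Hypothetical helper: for each group, keep the characters present in every string
shared_by_group <- function(data) {
  res <- tapply(data$string, data$group, function(x) {
    chars <- lapply(strsplit(x, ""), unique)  # unique characters of each string
    counts <- table(unlist(chars))            # number of strings containing each character
    paste0(names(counts)[counts == length(x)], collapse = "")
  })
  data.frame(group = names(res), similar = as.vector(res))
}

shared_by_group(df2)
```

Note that table() sorts its names, so the shared characters come back in sorted order rather than in order of appearance.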