I have the following data frame called df (dput below):
group string
1 1 Bc
2 1 EPc
3 1 Lkc
4 2 ABR
5 2 mA
6 2 Amt
7 3 Yrt
8 3 rtU
9 3 rti
I would like to find the characters that appear in every string within each group. For example, in group 1 the character c appears in every string. Here is the desired output:
group similar
1 1 c
2 2 A
3 3 rt
Does anyone know how to find the characters common to every string per group in R?
dput of df:
df <- structure(list(group = c("1", "1", "1", "2", "2", "2", "3", "3",
"3"), string = c("Bc", "EPc", "Lkc", "ABR", "mA", "Amt", "Yrt",
"rtU", "rti")), class = "data.frame", row.names = c(NA, -9L))
CodePudding user response:
We could split each string into characters and use intersect (with the help of Reduce):
base R:
aggregate(cbind(similar = string) ~ group,
          data = df,
          FUN = \(x) paste0(Reduce(intersect, strsplit(x, "")), collapse = ""))
dplyr:
library(dplyr)
df |>
group_by(group) |>
summarise(similar = paste0(Reduce(intersect, strsplit(string, "")), collapse = ""))
Output:
group similar
  <chr> <chr>
1 1 c
2 2 A
3 3 rt
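To see what Reduce(intersect, ...) does inside each group, here is a minimal sketch for group 3 only. The names `x`, `chars`, `common` and `similar` are illustrative and not part of the answer above.

```r
# Strings of group 3 from the question
x <- c("Yrt", "rtU", "rti")

# strsplit() returns a list with one character vector per string
chars <- strsplit(x, "")

# Reduce() folds intersect() over the list, i.e.
# intersect(intersect(chars[[1]], chars[[2]]), chars[[3]])
common <- Reduce(intersect, chars)

# Collapse the shared characters back into a single string
similar <- paste0(common, collapse = "")
similar
# [1] "rt"
```

Since intersect() compares exact characters, the match is case-sensitive, and a group with no shared character collapses to an empty string.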
CodePudding user response:
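Here is an option with tidyverse: split 'string' into one character per row with separate_rows, group by 'group' and 'string', filter to keep the characters that occur in every string of their group, drop the duplicates with distinct, then paste the remaining characters together per group.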
library(dplyr)
library(tidyr)
library(stringr)
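df %>%
  group_by(group) %>%
  mutate(id = row_number(), n_strings = n()) %>%   # tag each string, count strings per group
  ungroup() %>%
  separate_rows(string, sep = "(?<=.)(?=.)") %>%   # one character per row
  distinct() %>%                                   # count a character once per string
  group_by(group, string) %>%
  filter(n() == first(n_strings)) %>%              # character occurs in every string of the group
  ungroup() %>%
  distinct(group, string) %>%
  group_by(group) %>%
  summarise(string = str_c(string, collapse = ""))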
-output
# A tibble: 3 × 2
group string
<chr> <chr>
1 1 c
2 2 A
3 3 rt
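The counting idea behind this answer can also be sketched in base R: a character is shared when the number of strings containing it equals the number of strings in the group. The helper name `shared_chars` is hypothetical, and unique() is applied per string on the assumption that a repeated character should count only once.

```r
df <- data.frame(
  group  = c("1", "1", "1", "2", "2", "2", "3", "3", "3"),
  string = c("Bc", "EPc", "Lkc", "ABR", "mA", "Amt", "Yrt", "rtU", "rti")
)

# Hypothetical helper: characters present in every element of `x`
shared_chars <- function(x) {
  # One vector of distinct characters per string
  per_string <- lapply(strsplit(x, ""), unique)
  # Count in how many strings each character occurs
  counts <- table(unlist(per_string))
  # Keep characters whose count equals the number of strings
  keep <- names(counts)[counts == length(x)]
  # Report them in the order they appear in the first string
  first <- per_string[[1]]
  paste0(first[first %in% keep], collapse = "")
}

# Apply the helper to the strings of each group
result <- tapply(df$string, df$group, shared_chars)
result
```

For the question's data this should give c, A and rt for groups 1, 2 and 3, matching the output above.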