I have data like this:
df1<- structure(list(test = c("SNTM1", "STTTT2", "STOLA", "STOMQ",
"STR2", "SUPTY1", "TBNHSG", "TEYAH", "TMEIL1", "TMEIL2", "TMEIL3",
"TNIL", "TREUK", "TTRK", "TRRFK", "UBA52", "YIPF1")), class = "data.frame", row.names = c(NA,
-17L))
df2<-structure(list(test = c("SNTLK", "STTTFSG", "STOIU", "STOMQ",
"STR25", "SUPYHGS", "TBHYDG", "TEHDYG", "TMEIL1", "YIPF1")), class = "data.frame", row.names = c(NA,
-10L))
and
df3<- structure(list(test = c("SNTLKM", "STTTFSGTT", "GFD", "STOMQ",
"TRS", "BRsts", "TMHS", "RSEST", "TRSF", "YIPF1")), class = "data.frame", row.names = c(NA,
-10L))
I want to know how many strings are common across all three data frames.
If it were only two, I could do it with match or a join, but I want to know how many are shared between df1 and df2 and df3, or any combination of them.
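For two data frames I could count the shared strings with something like this (base R); it's the extension to three that I'm missing:
length(intersect(df1$test, df2$test))  ## shared strings via intersect
sum(df1$test %in% df2$test)            ## the same count via a match-style lookup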
CodePudding user response:
Example (if only identical strings count as duplicates):
library(dplyr)
df1 <- data.frame(test = c("A", "B", "C", "C"))
df2 <- data.frame(test = c("B", "C", "D"))
df3 <- data.frame(test = c("C", "D", "E"))
bind_rows(df1, df2, df3, .id = "origin") %>%
  group_by(origin) %>%
  distinct(test) %>%   ## remove within-data-frame duplicates
  group_by(test) %>%
  summarise(replicates = n()) %>%
  filter(replicates > 1)
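This keeps every string that occurs in at least two of the data frames (here B, C and D). If only strings present in all three should count, you can compare the count against the number of data frames; a minimal variation of the same idea (not part of the original answer):
bind_rows(df1, df2, df3, .id = "origin") %>%
  distinct(origin, test) %>%            ## one row per string per data frame
  count(test, name = "replicates") %>%  ## in how many data frames does it occur?
  filter(replicates == 3)               ## present in all three; returns only "C"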
CodePudding user response:
Here is an update in case only identical strings are wanted:
library(dplyr)
bind_rows(list(df1 = df1, df2 = df2, df3 = df3), .id = 'id') %>%
  filter(duplicated(test) | duplicated(test, fromLast = TRUE))
   id   test
1 df1  STOMQ
2 df1 TMEIL1
3 df1  YIPF1
4 df2  STOMQ
5 df2 TMEIL1
6 df2  YIPF1
7 df3  STOMQ
8 df3  YIPF1
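To answer "how many are shared" directly, you can also count in how many data frames each string occurs; a sketch along the same lines (not in the original answer):
bind_rows(list(df1 = df1, df2 = df2, df3 = df3), .id = 'id') %>%
  distinct(id, test) %>%           ## guard against within-data-frame duplicates
  count(test, name = "n_dfs") %>%  ## number of data frames containing the string
  filter(n_dfs > 1)                ## n_dfs == 3 marks strings shared by all three
With the data above, STOMQ and YIPF1 occur in all three data frames and TMEIL1 in two.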
First answer: Here is a suggestion:
First bring all data frames into a list with an identifier and arrange by the string. Now you can check visually:
library(dplyr)
x <- bind_rows(list(df1 = df1, df2 = df2, df3 = df3), .id = 'id') %>%
  arrange(test)
To automate the process you have to use some kind of string distance. There are several different measures out there and I can't tell which one is better or more appropriate; one example is the Jaccard index (https://en.wikipedia.org/wiki/Jaccard_index).
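To get a feel for what such a distance returns, here is a small sketch with the stringdist package (the example strings and the q-gram size are arbitrary; 0 means identical, larger values mean more dissimilar):
library(stringdist)
stringdist("SNTM1", "SNTLK", method = "jaccard", q = 2)  ## Jaccard distance on 2-grams
stringdist("SNTM1", "SNTLK", method = "jw")              ## Jaro-Winkler distance, used below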
Here we use the Jaro-Winkler distance (learned here: How to group similar strings together in a database in R).
In the group column you can find the similar strings.
You can define what "similar" means by changing the threshold on the "jw" distance: try changing it from 0.4 to 0.1 and you will see that the groups change:
library(tidyverse)
library(stringdist)
map_dfr(x$test, ~ {
  i <- which(stringdist(., x$test, "jw") < 0.40)  ## indices of strings similar to the current one
  tibble(index = i, title = x$test[i])
}, .id = "group") %>%
  distinct(index, .keep_all = TRUE) %>%  ## keep each string in its first group only
  mutate(group = as.integer(group)) %>%
  bind_cols(df_id = x$id)                ## attach the data frame of origin
   group index title  df_id
   <int> <int> <chr>  <chr>
 1     1     1 BRsts  df3
 2     2     2 GFD    df3
 3     3     3 RSEST  df3
 4     3    31 TRS    df2
 5     3    32 TRSF   df3
 6     4     4 SNTLK  df1
 7     4     5 SNTLKM df2
 8     4     6 SNTM1  df1
 9     4     8 STOLA  df1
10     4    12 STR2   df2
# ... with 27 more rows
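If you store the result of the pipeline above in a variable, say grouped (the name is just for illustration), you can then look for the similarity groups that span all three data frames:
grouped %>%                           ## hypothetical name for the result above
  group_by(group) %>%
  filter(n_distinct(df_id) == 3) %>%  ## groups with members in df1, df2 and df3
  ungroup()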