I have data like this:
df1<- structure(list(test = c("SNTM1", "STTTT2", "STOLA", "STOMQ",
"STR2", "SUPTY1", "TBNHSG", "TEYAH", "TMEIL1", "TMEIL2", "TMEIL3",
"TNIL", "TREUK", "TTRK", "TRRFK", "UBA52", "YIPF1")), class = "data.frame", row.names = c(NA,
-17L))
df2<-structure(list(test = c("SNTLK", "STTTFSG", "STOIU", "STOMQ",
"STR25", "SUPYHGS", "TBHYDG", "TEHDYG", "TMEIL1", "YIPF1")), class = "data.frame", row.names = c(NA,
-10L))
and
df3<- structure(list(test = c("SNTLKM", "STTTFSGTT", "GFD", "STOMQ",
"TRS", "BRsts", "TMHS", "RSEST", "TRSF", "YIPF1")), class = "data.frame", row.names = c(NA,
-10L))
I want to know how many strings are common across all three data frames.
If it were only two, I could do it with match or a join, but I want to know how many are shared between df1 and df2 and df3, or any combination of them.
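For two data frames I could count the shared strings with something like this (base R); it's the extension to three that I'm missing:
length(intersect(df1$test, df2$test))  ## shared strings via intersect
sum(df1$test %in% df2$test)            ## the same count via a match-style lookup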
CodePudding user response:
Example (if only identical strings count as duplicates):
library(dplyr)
df1 <- data.frame(test = c("A", "B", "C", "C"))
df2 <- data.frame(test = c("B", "C", "D"))
df3 <- data.frame(test = c("C", "D", "E"))
bind_rows(df1, df2, df3, .id = "origin") %>%
  group_by(origin) %>%
  distinct(test) %>%   ## remove within-data-frame duplicates
  group_by(test) %>%
  summarise(replicates = n()) %>%
  filter(replicates > 1)
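This keeps every string that occurs in at least two of the data frames (here B, C and D). If only strings present in all three should count, you can compare the count against the number of data frames; a minimal variation of the same idea (not part of the original answer):
bind_rows(df1, df2, df3, .id = "origin") %>%
  distinct(origin, test) %>%            ## one row per string per data frame
  count(test, name = "replicates") %>%  ## in how many data frames does it occur?
  filter(replicates == 3)               ## present in all three; returns only "C"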
CodePudding user response:
Here is an update in case only identical strings are wanted:
library(dplyr)
bind_rows(list(df1 = df1, df2 = df2, df3 = df3), .id = 'id') %>%
  filter(duplicated(test) | duplicated(test, fromLast = TRUE))
   id   test
1 df1  STOMQ
2 df1 TMEIL1
3 df1  YIPF1
4 df2  STOMQ
5 df2 TMEIL1
6 df2  YIPF1
7 df3  STOMQ
8 df3  YIPF1
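To answer "how many are shared" directly, you can also count in how many data frames each string occurs; a sketch along the same lines (not in the original answer):
bind_rows(list(df1 = df1, df2 = df2, df3 = df3), .id = 'id') %>%
  distinct(id, test) %>%           ## guard against within-data-frame duplicates
  count(test, name = "n_dfs") %>%  ## number of data frames containing the string
  filter(n_dfs > 1)                ## n_dfs == 3 marks strings shared by all three
With the data above, STOMQ and YIPF1 occur in all three data frames and TMEIL1 in two.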
First answer: Here is a suggestion:
First bring all data frames into a list with an identifier and arrange by the string. Now you can check visually:
library(dplyr)
x <- bind_rows(list(df1 = df1, df2 = df2, df3 = df3), .id = 'id') %>%
  arrange(test)
To automate the process you have to use some kind of string distance. There are several different measures out there and I can't tell which one is better or more appropriate; one example is the Jaccard index (https://en.wikipedia.org/wiki/Jaccard_index).
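To get a feel for what such a distance returns, here is a small sketch with the stringdist package (the example strings and the q-gram size are arbitrary; 0 means identical, larger values mean more dissimilar):
library(stringdist)
stringdist("SNTM1", "SNTLK", method = "jaccard", q = 2)  ## Jaccard distance on 2-grams
stringdist("SNTM1", "SNTLK", method = "jw")              ## Jaro-Winkler distance, used below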
Here we use the Jaro-Winkler distance (learned here: How to group similar strings together in a database in R).
In the group column you can find the similar strings.
You can define what "similar" means by changing the threshold on the "jw" distance: try changing it from 0.4 to 0.1 and you will see that the groups change:
library(tidyverse)
library(stringdist)
map_dfr(x$test, ~ {
  i <- which(stringdist(., x$test, "jw") < 0.40)  ## indices of strings similar to the current one
  tibble(index = i, title = x$test[i])
}, .id = "group") %>%
  distinct(index, .keep_all = TRUE) %>%  ## keep each string in its first group only
  mutate(group = as.integer(group)) %>%
  bind_cols(df_id = x$id)                ## attach the data frame of origin
   group index title  df_id
   <int> <int> <chr>  <chr>
 1     1     1 BRsts  df3
 2     2     2 GFD    df3
 3     3     3 RSEST  df3
 4     3    31 TRS    df2
 5     3    32 TRSF   df3
 6     4     4 SNTLK  df1
 7     4     5 SNTLKM df2
 8     4     6 SNTM1  df1
 9     4     8 STOLA  df1
10     4    12 STR2   df2
# ... with 27 more rows
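If you store the result of the pipeline above in a variable, say grouped (the name is just for illustration), you can then look for the similarity groups that span all three data frames:
grouped %>%                           ## hypothetical name for the result above
  group_by(group) %>%
  filter(n_distinct(df_id) == 3) %>%  ## groups with members in df1, df2 and df3
  ungroup()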