How to find similar strings between two sets of strings

I have two sets of data and I am trying to find the similar strings across several files. As an example, here are two of the data frames:

df1<-structure(list(test = c("SNTM1", "STTTT2", "STOLA", "STOMQ", 
"STR2", "SUPTY1", "TBNHSG", "TEYAH", "TMEIL1", "TMEIL2", "TMEIL3", 
"TNIL", "TREUK", "TTRK", "TRRFK", "UBA52", "YIPF1")), class = "data.frame", row.names = c(NA, 
-17L))


df2<- structure(list(test = c("SNTLK", "STTTFSG", "STOIU", "STOMQ", 
"STR25", "SUPYHGS", "TBHYDG", "TEHDYG", "TMEIL1", "YIPF1")), class = "data.frame", row.names = c(NA, 
-10L))

I look for the similar strings like this:

library(dplyr)
semi_join(df1, df2, by = "test")

or even this:

match(df1$test, df2$test)

or a few other ways. Sometimes the matching simply does not work and I cannot figure out why. It is possibly because of the structure of the character strings or upper/lower case. I have been struggling with this, but the strings will not match. The list is huge, which is why I cannot paste all of it here.

I also tried converting both columns to character:

df1$test <- as.character(df1$test)
df2$test <- as.character(df2$test)

but I still cannot figure it out. Any ideas?

CodePudding user response：

Try fuzzyjoin:

We can join df1 and df2 using fuzzy string matching on their test columns.

The max_dist argument sets the maximum string distance at which two values still count as a match.

See: ?stringdist_left_join
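
To make max_dist more concrete, here is a rough base-R sketch of a Jaccard distance on single characters. It assumes the default q-gram size of 1 that stringdist uses for this method; the helper name jaccard_dist is made up for illustration and is not part of fuzzyjoin or stringdist.

```r
# Jaccard distance on single characters (q = 1), written in base R.
# Hypothetical helper for illustration; not the package's own implementation.
jaccard_dist <- function(a, b) {
  x <- unique(strsplit(a, "")[[1]])  # distinct characters of a
  y <- unique(strsplit(b, "")[[1]])  # distinct characters of b
  # 1 minus (shared characters / all characters seen in either string)
  1 - length(intersect(x, y)) / length(union(x, y))
}

jaccard_dist("STR2", "STR25")     # 1 - 4/5, inside a 0.35 cutoff
jaccard_dist("TMEIL2", "TMEIL1")  # 1 - 5/7, inside a 0.35 cutoff
jaccard_dist("SNTM1", "SNTLK")    # 1 - 3/7, outside a 0.35 cutoff
```

Because this distance only compares sets of characters, it ignores character order and repeats, so a small max_dist can still pair strings that are anagrams of each other. Other method values (for example "osa" or "lv") may suit identifiers better.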

library(dplyr)
library(fuzzyjoin)

# Keep every row of df1; attach df2 rows within Jaccard distance 0.35
fuzzyjoin::stringdist_left_join(x = df1, y = df2, by = "test", method = "jaccard",
                                max_dist = 0.35, distance_col = "dist")

   test.x test.y      dist
1   SNTM1   <NA>        NA
2  STTTT2   <NA>        NA
3   STOLA   <NA>        NA
4   STOMQ  STOMQ 0.0000000
5    STR2  STR25 0.2000000
6  SUPTY1   <NA>        NA
7  TBNHSG   <NA>        NA
8   TEYAH   <NA>        NA
9  TMEIL1 TMEIL1 0.0000000
10 TMEIL2 TMEIL1 0.2857143
11 TMEIL3 TMEIL1 0.2857143
12   TNIL   <NA>        NA
13  TREUK   <NA>        NA
14   TTRK   <NA>        NA
15  TRRFK   <NA>        NA
16  UBA52   <NA>        NA
17  YIPF1  YIPF1 0.0000000
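
If exact matches are what you want and the misses come from case differences or stray whitespace, as the question suspects, normalizing both columns before matching may already be enough. A minimal base-R sketch follows; the vectors a and b and the helper normalize are made-up stand-ins, not your real data.

```r
# Hypothetical stand-ins for the real columns: note the trailing space,
# the lower-case entry, and the leading space.
a <- c("STOMQ ", "tmeil1", "YIPF1", "UBA52")
b <- c("STOMQ", "TMEIL1", " yipf1")

# Trim leading/trailing whitespace and force upper case before comparing.
normalize <- function(x) toupper(trimws(as.character(x)))

raw_hits   <- a[a %in% b]                                   # exact match on raw values
clean_hits <- normalize(a)[normalize(a) %in% normalize(b)]  # match after cleaning

raw_hits
clean_hits

# With the data frames from the question this would look like:
# intersect(normalize(df1$test), normalize(df2$test))
```

By default trimws() only strips spaces, tabs, and line breaks, so other invisible characters (such as non-breaking spaces) would need separate handling. Comparing nchar() of a value with what you see printed is one quick way to spot them.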