I have two sets of data and I am trying to find the matching strings across several files. As an example, I show two of the data frames here:
df1<-structure(list(test = c("SNTM1", "STTTT2", "STOLA", "STOMQ",
"STR2", "SUPTY1", "TBNHSG", "TEYAH", "TMEIL1", "TMEIL2", "TMEIL3",
"TNIL", "TREUK", "TTRK", "TRRFK", "UBA52", "YIPF1")), class = "data.frame", row.names = c(NA,
-17L))
df2<- structure(list(test = c("SNTLK", "STTTFSG", "STOIU", "STOMQ",
"STR25", "SUPYHGS", "TBHYDG", "TEHDYG", "TMEIL1", "YIPF1")), class = "data.frame", row.names = c(NA,
-10L))
I find the similar strings like this
semi_join(df1, df2, by = "test")
or even this
match(df1$test, df2$test)
or a few other ways. Sometimes the match just does not work and I cannot figure out why. It is possibly because of the character structure or upper/lower case. I have been killing myself over it, but the strings won't match. The list is huge, which is why I cannot paste all of it here.
I also tried
df1$test <- as.character(df1$test)
df2$test <- as.character(df2$test)
but I still cannot figure it out. Any ideas?
CodePudding user response:
Try fuzzyjoin: we could join df1 and df2 based on fuzzy string matching of their columns.
With max_dist we can define the maximum distance allowed for two strings to be joined.
See: ?stringdist_left_join
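To get a feel for what max_dist is compared against, here is a minimal base R sketch of a Jaccard distance on single characters. The helper name jaccard_dist is made up for illustration, and it assumes the join's method = 'jaccard' works on single characters (q = 1) by default; it is not the package's own implementation.

```r
# Hypothetical helper: Jaccard distance between the sets of
# single characters in two strings (assumes q = 1).
jaccard_dist <- function(a, b) {
  chars_a <- unique(strsplit(a, "")[[1]])
  chars_b <- unique(strsplit(b, "")[[1]])
  shared  <- length(intersect(chars_a, chars_b))
  total   <- length(union(chars_a, chars_b))
  # 0 means identical character sets, 1 means nothing in common
  1 - shared / total
}

d_str   <- jaccard_dist("STR2", "STR25")     # 4 shared of 5 distinct characters
d_tmeil <- jaccard_dist("TMEIL2", "TMEIL1")  # 5 shared of 7 distinct characters
print(c(d_str, d_tmeil))
```

Both pairs fall under a max_dist of .35, so they would be joined. Note that this distance ignores character order and repeats, so anagrams get a distance of 0.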
library(dplyr)
library(fuzzyjoin)
fuzzyjoin::stringdist_left_join(x=df1, y=df2, max_dist =.35, by='test', method ='jaccard', distance_col = "dist")
   test.x test.y      dist
1   SNTM1   <NA>        NA
2  STTTT2   <NA>        NA
3   STOLA   <NA>        NA
4   STOMQ  STOMQ 0.0000000
5    STR2  STR25 0.2000000
6  SUPTY1   <NA>        NA
7  TBNHSG   <NA>        NA
8   TEYAH   <NA>        NA
9  TMEIL1 TMEIL1 0.0000000
10 TMEIL2 TMEIL1 0.2857143
11 TMEIL3 TMEIL1 0.2857143
12   TNIL   <NA>        NA
13  TREUK   <NA>        NA
14   TTRK   <NA>        NA
15  TRRFK   <NA>        NA
16  UBA52   <NA>        NA
17  YIPF1  YIPF1 0.0000000
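If the strings are really the same and only differ in case or stray whitespace, as the question suspects, an exact match on a cleaned-up key may be enough and avoids false positives such as TMEIL2 matching TMEIL1. A minimal base R sketch, assuming that is the cause; messy1, messy2 and normalize_key are made-up names and the data is invented for illustration:

```r
# Hypothetical inputs with mixed case and stray spaces
messy1 <- data.frame(test = c("stomq ", "TMEIL1", "Yipf1", "UBA52"))
messy2 <- data.frame(test = c("STOMQ", " tmeil1", "YIPF1"))

# Hypothetical helper: coerce to character, trim leading and
# trailing whitespace, and upper-case everything
normalize_key <- function(x) toupper(trimws(as.character(x)))

messy1$key <- normalize_key(messy1$test)
messy2$key <- normalize_key(messy2$test)

# Exact match on the cleaned key (same idea as a semi join / anti join)
matched   <- messy1[messy1$key %in% messy2$key, ]
unmatched <- messy1[!(messy1$key %in% messy2$key), ]
print(matched)
print(unmatched)
```

With its defaults, trimws only strips ordinary spaces, tabs and line breaks, so other invisible characters (for example non-breaking spaces) would still need separate handling.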