%in% but looking for similar strings in R

I want to find which strings in a column of one data frame are similar to the strings in a column of another data frame. For example, in df1 I have this:

nombres
Acesco Corporation
Exito S.A
AMI 
Renault

and in df2 I have this:

nombres
Acesco
Exito 
AMI 
Renault

I want a function similar to %in% that gives an output like this: Acesco, Exito, AMI

CodePudding user response：

We can use:

txt1 <- c('nombres',
'Acesco Corporation',
'Exito S.A',
'AMI ',
'Renault')

txt2 <- c(
'nombres',
'Acesco',
'Exito',
'AMI',
'Renault')

dist_matrix <- data.frame(t(adist(txt1, txt2))) # columns correspond to txt1 after transposing
txt2[sapply(dist_matrix, which.min)]

[1] "nombres" "Acesco"  "Exito"   "AMI"     "Renault"

Here adist computes the distance between every pair of strings in txt1 and txt2, and which.min picks, for each txt1 entry, the txt2 entry with the smallest distance.

The (generalized) Levenshtein (or edit) distance between two strings s and t is the minimal possibly weighted number of insertions, deletions and substitutions needed to transform s into t.
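To make the definition concrete, here is a small sketch that applies adist to the pairs from the question. It assumes the default settings of adist (unit costs, case-sensitive), so each distance is simply the number of characters that have to be deleted from the longer name.

```r
# Edit distances for the name pairs from the question (default unit costs).
adist("Acesco Corporation", "Acesco")  # 12: delete " Corporation"
adist("Exito S.A", "Exito")            # 4: delete " S.A"
adist("AMI ", "AMI")                   # 1: delete the trailing space
adist("Renault", "Renault")            # 0: identical strings
```

Note that the raw distance grows with the length of the removed suffix, so a long company suffix can produce a large distance even when the core name is identical.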

CodePudding user response：

I found a way that may work. It is not as good as the one from @gaut, but it may do the job:

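# grep takes a single pattern, so loop over the names in df2 and search df1
lapply(df2$nombres, function(x) grep(trimws(x), df1$nombres, fixed = TRUE))

For each name in df2, this gives the positions in df1 where that name is found.

Hope this helps!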
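The same substring idea can be wrapped so that it returns a logical vector, the way %in% does. This is only a sketch: `%in_sub%` is a hypothetical name, not a base R operator, and "Bavaria" is an extra entry added here purely to show a non-match.

```r
# Hypothetical helper (not part of base R): TRUE for each element of x
# that occurs as a literal substring of at least one element of table.
`%in_sub%` <- function(x, table) {
  x <- trimws(x)  # ignore stray leading/trailing spaces such as "AMI "
  vapply(x, function(p) any(grepl(p, table, fixed = TRUE)),
         logical(1), USE.NAMES = FALSE)
}

df1 <- data.frame(nombres = c("Acesco Corporation", "Exito S.A", "AMI ", "Renault"))
# "Bavaria" is not in the question; it is added only as a non-matching example
df2 <- data.frame(nombres = c("Acesco", "Exito ", "AMI ", "Renault", "Bavaria"))

found <- df2$nombres %in_sub% df1$nombres
found               # TRUE TRUE TRUE TRUE FALSE
df2$nombres[found]  # the df2 names that appear inside some df1 name
```

Matching with fixed = TRUE treats names such as "S.A" literally rather than as regular expressions. Unlike adist, this only catches names that are contained in the other string, not misspellings.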

CodePudding user response：

I would just extend @lasagna's answer by taking the diagonal of the matrix and then binding it to the original data frame so it can be used in the next steps:

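m <- as.matrix(dist_matrix)     # plain matrix again; avoids needing %>% from magrittr
mydists <- diag(m)              # distance between txt1[i] and txt2[i]
dist_matrix$mydist <- mydists   # keep the paired distances for later steps
dist_matrix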
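The diagonal is only meaningful when element i of both vectors refers to the same company. As a sketch that does not rely on that ordering, the closest match and its distance can be collected per row instead. The column names closest and distance are illustrative choices, and the 'nombres' header entry is left out of the vectors here.

```r
txt1 <- c("Acesco Corporation", "Exito S.A", "AMI ", "Renault")
txt2 <- c("Acesco", "Exito", "AMI", "Renault")

d <- adist(txt1, txt2)          # rows: txt1, columns: txt2
best <- apply(d, 1, which.min)  # index of the closest txt2 entry for each txt1 row

matches <- data.frame(
  nombres  = txt1,
  closest  = txt2[best],
  distance = d[cbind(seq_along(txt1), best)]  # pick d[i, best[i]] for each row
)
matches
```

Filtering matches on distance (or on distance relative to the string length) would then give an %in%-like yes/no answer; the right threshold depends on the data.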