I want to find which strings in a column of one data frame are similar to strings in a column of another data frame. For example, in df1
I have this:
nombres
Acesco Corporation
Exito S.A
AMI
Renault
and in df2
I have this:
nombres
Acesco
Exito
AMI
Renault
I want a function similar to %in% that gives an output like this:
Acesco, Exito, AMI
CodePudding user response:
We can use:
txt1 <- c('nombres',
'Acesco Corporation',
'Exito S.A',
'AMI ',
'Renault')
txt2 <- c(
'nombres',
'Acesco',
'Exito',
'AMI',
'Renault')
dist_matrix <- data.frame(t(adist(txt1, txt2))) # columns correspond to txt1 after transposing
txt2[sapply(dist_matrix, which.min)]
[1] "nombres" "Acesco" "Exito" "AMI" "Renault"
Here adist computes the distance between every pair of strings in its two arguments.
The (generalized) Levenshtein (or edit) distance between two strings s and t is the minimal possibly weighted number of insertions, deletions and substitutions needed to transform s into t.
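To make the definition concrete, here is a small sketch using two pairs from the sample data. The variable name d is only for illustration.

```r
# adist() returns a matrix: rows follow the first argument, columns the second.
d <- adist(c("Exito S.A", "AMI "), c("Exito", "AMI"))

d[1, 1]  # 4: delete the four characters " S.A"
d[2, 2]  # 1: delete the trailing space
```

The smaller the value, the closer the two strings, which is why which.min picks the best candidate above.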
CodePudding user response:
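I found a way that may work. It is not as good as the one from @gaut, but it may do the job:
lapply(df2$nombres, function(x) grep(x, df1$nombres, fixed = TRUE))
For each name in df2, this gives the positions in df1 where that name is found. fixed = TRUE makes grep treat the name as literal text rather than a regular expression.
Hope it helps!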
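If you want the names themselves rather than positions, something closer to %in%, a minimal sketch could look like this. It assumes both data frames have a character column called nombres, as in the question; found and matched are just illustrative names.

```r
df1 <- data.frame(nombres = c("Acesco Corporation", "Exito S.A", "AMI", "Renault"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(nombres = c("Acesco", "Exito", "AMI", "Renault"),
                  stringsAsFactors = FALSE)

# For each name in df2, check whether it occurs literally inside any name in df1.
found <- vapply(df2$nombres,
                function(x) any(grepl(x, df1$nombres, fixed = TRUE)),
                logical(1))

# Keep only the df2 names that were found somewhere in df1.
matched <- df2$nombres[found]
matched
```

With the sample data all four names are returned, Renault included, because it appears unchanged in both data frames. This is plain substring matching, so it will miss misspellings; for those, the adist approach above is a better fit.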
CodePudding user response:
I would just extend @lasagna's answer by taking the diagonal of the distance matrix and binding it to the original data frame so it can be used in later steps:
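library(magrittr)  # provides the %>% pipe
df <- dist_matrix %>% as.matrix()
# The diagonal holds the distance between txt1[i] and txt2[i],
# so this only makes sense when both vectors are in the same order.
mydists <- diag(df)
dist_matrix$mydist <- mydists
dist_matrix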
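The diagonal only pairs up rows that are already in the same order. If that is not guaranteed, one possible sketch is to look up the closest match per row and keep its distance alongside it. The names best and result, and the column names, are hypothetical choices for illustration.

```r
txt1 <- c("Acesco Corporation", "Exito S.A", "AMI ", "Renault")
txt2 <- c("Acesco", "Exito", "AMI", "Renault")

d <- adist(txt1, txt2)            # rows: txt1, columns: txt2
best <- apply(d, 1, which.min)    # column index of the closest txt2 entry per row

# Pair each original name with its closest match and the distance to it.
result <- data.frame(
  original = txt1,
  closest  = txt2[best],
  distance = d[cbind(seq_along(txt1), best)],
  stringsAsFactors = FALSE
)
result
```

From here you could filter on distance to drop pairs that are too far apart to count as a match; the cutoff would depend on your data.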