Find the distance between groups of string in R-CodePudding

I have a very large dataset, which looks like this.

I have two types of data frames

my reference data.frame

ref=c("cake","brownies")

and my experimental data.frame

expr=c("cak","cakee","cake", "rownies","browwnies")

I want to match the ref and expr data.frames and find the levenstein distance between them. The output could look like this...

ref   expr      distance 
cake  cak         1
cake  cakee       1
cake  cake        0
cake  rownies    ...

after I have measured their levenstein distance I want to cluster any string that has distance less than 3 to one cluster and my data to maybe look like

ref        expr      distance  cluster
cake       cak         1         1
cake       cakee       1         1
cake       cake        0         1
brownies   rownies     1         2 
brownies   browwnies   1         2

any help or advice on how to move on is appreciate it. At the moment I am trying a lot of R packages to find the distance between data.frame such as

library("DescTools")

but they do not seem to work well.

CodePudding user response：

Here are 2 ways I'd approach it, one that's strictly supervised and more manual, and another that takes a less supervised route. The package stringdist has a bunch of different distance metrics, where "lv" is Levenshtein. I added an additional observation "poundcake" to test with a word that's too far from the reference words.

Option 1

Get a matrix of the distances between each experimental string and one of the reference strings. This could have issues if you have 2 similar reference strings, or if an experimental word is equally close to 2 references, but it works for this simple case. Then reshape the matrix into a data frame, and count along reference words to get cluster numbers. Filter for cases where the distance is less than your threshold.

library(dplyr)
library(stringdist)

max_dist <- 3

ref <- c("cake", "brownies")
expr <- c("cak", "cakee", "cake", "poundcake", "rownies","browwnies")

mtx <- stringdistmatrix(ref, expr, method = "lv", useNames = "strings")

mtx
#>          cak cakee cake poundcake rownies browwnies
#> cake       1     1    0         5       6         8
#> brownies   8     7    7         8       1         1

df1 <- as.data.frame(mtx) %>%
  tibble::rownames_to_column("ref") %>%
  tidyr::pivot_longer(-ref, names_to = "expr", values_to = "dist") %>%
  mutate(clust = as.numeric(forcats::as_factor(ref))) # could also use data.table::rleid

df1 %>%
  filter(dist <= max_dist)
#> # A tibble: 5 × 4
#>   ref      expr       dist clust
#>   <chr>    <chr>     <dbl> <dbl>
#> 1 cake     cak           1     1
#> 2 cake     cakee         1     1
#> 3 cake     cake          0     1
#> 4 brownies rownies       1     2
#> 5 brownies browwnies     1     2

Option 2

This might work for more complex cases. I've used it for correcting the spelling of people's names, where I have an incomplete set of correct labels to work from. Combine all the words into 1 vector, get a distance matrix (this time it will be square), then create clusters from hierarchical clustering using the threshold as the height to cut the tree. You can then match the reference for each word to get labels for the clusters.

The downside here is that you have rows for reference words that weren't experimental—note for example that "brownies" was never spelled correctly in the experimental strings, but now you have that observation.

all_words <- c(ref, expr)
hc <- hclust(stringdistmatrix(all_words, method = "lv", useNames = "strings"))

df2 <- data.frame(word = c(ref, expr), 
                  clust = cutree(hc, h = max_dist)) %>%
  mutate(r = ref[clust])

df2 %>%
  filter(!is.na(r))
#>        word clust        r
#> 1      cake     1     cake
#> 2  brownies     2 brownies
#> 3       cak     1     cake
#> 4     cakee     1     cake
#> 5      cake     1     cake
#> 6   rownies     2 brownies
#> 7 browwnies     2 brownies