Is there an R function to do a pairwise Levenshtein distance calculation of two vectors of strings?

Time:03-30

I have two vectors of strings:

a <- c('Alpha', 'Beta', 'Gamma', 'Delta')
b <- c('Epsilon', 'Zeta', 'Eta', 'Theta')

and I would like to compute the Levenshtein distance or edit distance for each pair of strings.

If I use

stringdist(a, b, method="lv")

the output is a vector holding the Levenshtein distance between each string in vector a and the string at the same position in vector b (i.e., Alpha vs Epsilon, Beta vs Zeta, etc.).

What I need instead is a comparison of each string in one vector against ALL the strings in the other vector (i.e., Alpha vs Epsilon, Alpha vs Zeta, Alpha vs Eta, Alpha vs Theta, Beta vs Epsilon, etc.).

Thanks

CodePudding user response:

There is a straightforward way to do this using stringdistmatrix and some reshaping:

library(stringdist)
library(tidyverse)

a <- c('Alpha', 'Beta', 'Gamma', 'Delta')
b <- c('Epsilon', 'Zeta', 'Eta', 'Theta')

stringdistmatrix(a, b, method = "lv", useNames = "strings") %>%
  as_tibble(rownames = "a") %>%
  pivot_longer(-1, names_to = "b", values_to = "dist")
#> # A tibble: 16 x 3
#>    a     b        dist
#>    <chr> <chr>   <dbl>
#>  1 Alpha Epsilon     7
#>  2 Alpha Zeta        4
#>  3 Alpha Eta         4
#>  4 Alpha Theta       4
#>  5 Beta  Epsilon     7
#>  6 Beta  Zeta        1
#>  7 Beta  Eta         2
#>  8 Beta  Theta       2
#>  9 Gamma Epsilon     7
#> 10 Gamma Zeta        4
#> 11 Gamma Eta         4
#> 12 Gamma Theta       4
#> 13 Delta Epsilon     6
#> 14 Delta Zeta        2
#> 15 Delta Eta         3
#> 16 Delta Theta       3
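
If loading the tidyverse only for the reshaping step is undesirable, the same long format can probably be reached with base R on top of stringdistmatrix. The following is a minimal sketch, not part of the original answer; the object names m and long are purely illustrative.

```r
library(stringdist)

a <- c("Alpha", "Beta", "Gamma", "Delta")
b <- c("Epsilon", "Zeta", "Eta", "Theta")

# 4 x 4 distance matrix: rows follow a, columns follow b
m <- stringdistmatrix(a, b, method = "lv", useNames = "strings")

# Name the two dimensions so the reshaped columns are called a and b
names(dimnames(m)) <- c("a", "b")

# as.table() + as.data.frame() unrolls the matrix into one row per pair
long <- as.data.frame(as.table(m), responseName = "dist",
                      stringsAsFactors = FALSE)

# Optional: reorder so that all pairs for the first string in a come first
long <- long[order(match(long$a, a), match(long$b, b)), ]
rownames(long) <- NULL
long
```

The final reordering is only cosmetic: as.data.frame() on a table walks the matrix column by column, so without it the rows would be grouped by the strings in b rather than by those in a.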

CodePudding user response:

A base R option using adist and expand.grid:

> cbind(expand.grid(a = a, b = b), lv = c(adist(a, b)))
       a       b lv
1  Alpha Epsilon  7
2   Beta Epsilon  7
3  Gamma Epsilon  7
4  Delta Epsilon  6
5  Alpha    Zeta  4
6   Beta    Zeta  1
7  Gamma    Zeta  4
8  Delta    Zeta  2
9  Alpha     Eta  4
10  Beta     Eta  2
11 Gamma     Eta  4
12 Delta     Eta  3
13 Alpha   Theta  4
14  Beta   Theta  2
15 Gamma   Theta  4
16 Delta   Theta  3

or

> cbind(rev(expand.grid(b = b, a = a)), lv = c(t(adist(a, b))))
       a       b lv
1  Alpha Epsilon  7
2  Alpha    Zeta  4
3  Alpha     Eta  4
4  Alpha   Theta  4
5   Beta Epsilon  7
6   Beta    Zeta  1
7   Beta     Eta  2
8   Beta   Theta  2
9  Gamma Epsilon  7
10 Gamma    Zeta  4
11 Gamma     Eta  4
12 Gamma   Theta  4
13 Delta Epsilon  6
14 Delta    Zeta  2
15 Delta     Eta  3
16 Delta   Theta  3
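
Both variants rely on the distance values coming out in the same order as the rows produced by expand.grid. The sketch below is an illustrative check of that alignment, not something from the original answer; the names d, grid, idx and their "2" counterparts are made up for the example, and the match() lookups assume that neither vector contains duplicate strings.

```r
a <- c("Alpha", "Beta", "Gamma", "Delta")
b <- c("Epsilon", "Zeta", "Eta", "Theta")

d <- adist(a, b)  # rows follow a, columns follow b

# Variant 1: c(d) reads the matrix column by column, so the index into a
# changes fastest. expand.grid also varies its first argument fastest.
grid <- expand.grid(a = a, b = b, stringsAsFactors = FALSE)
idx  <- cbind(match(grid$a, a), match(grid$b, b))  # (row, column) per pair
all(c(d) == d[idx])

# Variant 2: c(t(d)) reads d row by row, so the index into b changes
# fastest. Putting b first in expand.grid reproduces that order.
grid2 <- expand.grid(b = b, a = a, stringsAsFactors = FALSE)
idx2  <- cbind(match(grid2$a, a), match(grid2$b, b))
all(c(t(d)) == d[idx2])
```

If either comparison came back FALSE, the lv column would be paired with the wrong strings, which is why the transposed variant also swaps the argument order in expand.grid.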