Surprisingly, the following two strings do not match. I was expecting the distance to be 5, but even with max_dist = 7 I get no match:

library(fuzzyjoin)
one <- as.data.frame("Other field crops (non-organic)")
names(one) <- "A"
two <-  as.data.frame("other_field_crops_non_organic")
names(two) <- "A"

stringdist_left_join(one, two, by = "A", method = "lcs", max_dist = 7, ignore_case=TRUE)

                              A.x  A.y
1 Other field crops (non-organic) <NA>

Only at max_dist = 10 do I get a match:

stringdist_left_join(one, two, by = "A", method = "lcs", max_dist = 10, ignore_case=TRUE)
                              A.x                           A.y
1 Other field crops (non-organic) other_field_crops_non_organic

Could someone explain why this distance is larger than 9? Does it have to do with the brackets? And if so, how can I work around this without removing the brackets?

EDIT

library(fuzzyjoin)
one <- as.data.frame("Other field crops non-organic")
names(one) <- "A"
two <-  as.data.frame("other_field_crops_non_organic")
names(two) <- "A"

stringdist_left_join(one, two, by = "A", method = "lcs", max_dist = 5, ignore_case=TRUE)
                            A.x  A.y
1 Other field crops non-organic <NA>

Even without the brackets I cannot get the distance within 5.

Answer:

The problem comes down to the method you are using to calculate the string distance. You are using the lcs (longest common substring) method, which in effect only allows deletions and insertions rather than substitutions. From the docs:

The longest common substring (method='lcs') is defined as the longest string that can be obtained by pairing characters from a and b while keeping the order of characters intact. The lcs-distance is defined as the number of unpaired characters. The distance is equivalent to the edit distance allowing only deletions and insertions, each with weight one.

So every space that has to become an underscore costs 2 under lcs, one deletion plus one insertion:

stringdist('abc def', 'abc_def', method = 'lcs')
#> [1] 2

This is in contrast to the default 'osa' method, which, like the Levenshtein distance and the R function adist, allows direct substitutions at a cost of only 1 each:

stringdist('abc def', 'abc_def', method = 'osa')
#> [1] 1
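To make the definition quoted above concrete, here is a minimal base R sketch that derives the lcs distance from the length of the longest common subsequence. Note that lcs_distance is a hypothetical helper written for illustration, not a function from stringdist, and it makes no attempt to be fast.

```r
# Hypothetical helper (not part of stringdist): lcs distance via dynamic programming.
lcs_distance <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)
  # len[i + 1, j + 1] holds the length of the longest common subsequence
  # of the first i characters of a and the first j characters of b.
  len <- matrix(0L, nrow = n + 1, ncol = m + 1)
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      len[i + 1, j + 1] <- if (x[i] == y[j]) {
        len[i, j] + 1L
      } else {
        max(len[i, j + 1], len[i + 1, j])
      }
    }
  }
  # Every character left unpaired costs one insertion or one deletion.
  n + m - 2L * len[n + 1, m + 1]
}

lcs_distance("abc def", "abc_def")
#> [1] 2
lcs_distance("other field crops (non-organic)", "other_field_crops_non_organic")
#> [1] 10
lcs_distance("other field crops non-organic", "other_field_crops_non_organic")
#> [1] 8
```

The last call corresponds to the EDIT in the question: even without the brackets, the four separators on each side stay unpaired, so the lcs distance is 8 and max_dist = 5 cannot match.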

You can see how the different stringdist methods compare on your two strings. To simplify further, let's make both lowercase, since you are already specifying ignore_case in your left join:

library(stringdist)

a <- "other field crops (non-organic)"
b <- "other_field_crops_non_organic"
methods <- c("osa", "lv", "dl", "hamming", "lcs", 
             "qgram", "cosine", "jaccard", "jw", "soundex")

sapply(methods, function(x) stringdist(a, b, method = x))
#>        osa         lv         dl    hamming        lcs      qgram     cosine 
#>  6.0000000  6.0000000  6.0000000        Inf 10.0000000 10.0000000  0.2025635 
#>    jaccard         jw    soundex 
#>  0.2500000  0.1104931  0.0000000

You can see that the Hamming distance is infinite, since your strings are of different lengths, and that osa (the default method) gives only 6. lcs, however, requires 10: 4 removals of underscores, 3 additions of spaces, one addition of a hyphen, and two additions of parentheses. If this string pair is representative of your data, you might want to switch to "osa".
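As a sketch of what that switch could look like, the join below reuses the question's data with method = "osa". The max_dist = 6 threshold is an assumption taken from the distance table above; a threshold suited to your full data set may well differ.

```r
library(fuzzyjoin)

one <- data.frame(A = "Other field crops (non-organic)", stringsAsFactors = FALSE)
two <- data.frame(A = "other_field_crops_non_organic", stringsAsFactors = FALSE)

# The osa distance between the lower-cased strings is 6, so max_dist = 6
# (an assumed threshold) is just enough for this pair to match.
res <- stringdist_left_join(one, two, by = "A", method = "osa",
                            max_dist = 6, ignore_case = TRUE)
res
```

This keeps the brackets in the original column, as asked in the question. The trade-off is that a looser threshold may also pair up labels that are genuinely different.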

Created on 2022-04-14 by the reprex package (v2.0.1)

Answer:

Can you clean the text before joining? If the only problem is special characters, getting rid of them first might make the join easier.

library(fuzzyjoin)
library(stringdist)
library(stringr)

## sample data
one <- as.data.frame("Other field crops (non-organic)")
names(one) <- "A"
two <-  as.data.frame("other_field_crops_non_organic")
names(two) <- "A"
##

# remove special chars, make lower-case, single-space between strings
#  you might want to use purrr or *apply for multiple columns
one$A <- str_replace_all(one$A, "[^[:alnum:]]", " ") %>% 
  tolower() %>% 
  str_squish()
two$A <- str_replace_all(two$A, "[^[:alnum:]]", " ") %>% 
  tolower() %>%
  str_squish()


stringdist(one$A, two$A)
#> [1] 0

stringdist_left_join(one, two, by = "A", method = "lcs", max_dist = 7, ignore_case=TRUE)
#>                             A.x                           A.y
#> 1 other field crops non organic other field crops non organic
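If the original labels (brackets included) need to survive, one option is to put the cleaned text in a separate key column and join on that instead of overwriting A. The following is a minimal base R sketch of that idea; normalise_label and the key column are hypothetical names introduced here, and it uses an exact merge() rather than a fuzzy join.

```r
# Hypothetical helper: lower-case the text, collapse every run of
# non-alphanumeric characters into a single space, and trim the ends.
normalise_label <- function(x) {
  trimws(gsub("[^[:alnum:]]+", " ", tolower(x)))
}

one <- data.frame(A = "Other field crops (non-organic)", stringsAsFactors = FALSE)
two <- data.frame(A = "other_field_crops_non_organic", stringsAsFactors = FALSE)

# Cleaned keys live next to the untouched original columns.
one$key <- normalise_label(one$A)
two$key <- normalise_label(two$A)

# Exact left join on the cleaned key; A.x and A.y keep the original spelling.
merged <- merge(one, two, by = "key", all.x = TRUE, suffixes = c(".x", ".y"))
merged
```

The same key column could also be passed to stringdist_left_join if some labels still differ by a few characters after cleaning.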

Created on 2022-04-14 by the reprex package (v2.0.1)