Correct the misspelling for the part that is misspelled (even if the part is within a whole word)

Time:05-03

I want to correct a misspelling by replacing only the misspelled part. Here is some example code. The first part sets up a reference data frame with the wrong and correct spellings.

library(stringr)
corrected <- data.frame(stringsAsFactors=FALSE,
                        Wrong_spell = c("abdmen", "abdomane", "abdome", "abdumen", "abodmen",
                                        "adnomen", "aabdominal", "abddominal"),
                        Correct_spell = c("abdomen", "abdomen", "abdomen", "abdomen", "abdomen",
                                          "abdomen", "abdominal", "abdominal") )

These are the elements to be corrected:

reported <- c("abdmen pain", "abdomane pain", "abdumenXXX pain")

When I run this code I get the following three elements:

regex_pattern <- setNames(corrected$Correct_spell, paste0("\\b", corrected$Wrong_spell, "\\b"))
str_replace_all(reported, regex_pattern)

> str_replace_all(reported, regex_pattern)
[1] "abdomen pain"    "abdomen pain"    "abdumenXXX pain"

I would like the code to replace just the part that matches the misspelling, so that the third element becomes "abdomenXXX pain". It corrected the first two elements, but the third is unchanged, because the pattern only matches whole words within each element. I'm not sure it's possible, but if you have any ideas or potential fixes, please point me to where I need to look. Any help is greatly appreciated. Thanks in advance.
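
A quick check, added for illustration, of why the third element is skipped: "\\b" only matches between a word character and a non-word character, and in "abdumenXXX" the "n" is followed by "X", so there is no boundary after the misspelling. The variable names below are made up for this sketch.

```r
library(stringr)

third <- "abdumenXXX pain"

# Boundary required on both sides: there is none between "n" and "X".
whole_word <- str_detect(third, "\\babdumen\\b")

# Boundary required only at the start: the misspelling is found.
prefix_only <- str_detect(third, "\\babdumen")

print(c(whole_word = whole_word, prefix_only = prefix_only))
```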

CodePudding user response:

The following works on your example data, but I'm not sure whether it will work for your real dataset. I strip "al" from the entries in corrected$Wrong_spell, join them into a single alternation pattern, and replace any match with "abdomen". Note that this always inserts "abdomen", never "abdominal":

str_replace_all(reported, 
                paste(gsub("al", "",corrected$Wrong_spell), collapse ="|"), 
                "abdomen")

Output:

[1] "abdomen pain"    "abdomen pain"    "abdomenXXX pain"
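
A possible generalisation, offered only as a sketch: keep the full misspellings, try the longest ones first, and look up the correction for whichever one matched, so that "abdominal" variants are mapped to "abdominal" rather than "abdomen". The names ord, pattern, lookup and fixed are hypothetical, and the sketch assumes a stringr version in which str_replace_all() accepts a function as the replacement.

```r
library(stringr)

corrected <- data.frame(
  Wrong_spell   = c("abdmen", "abdomane", "abdome", "abdumen", "abodmen",
                    "adnomen", "aabdominal", "abddominal"),
  Correct_spell = c("abdomen", "abdomen", "abdomen", "abdomen", "abdomen",
                    "abdomen", "abdominal", "abdominal"),
  stringsAsFactors = FALSE
)
reported <- c("abdmen pain", "abdomane pain", "abdumenXXX pain")

# Longest misspellings first, so "abdomane" is tried before its prefix "abdome".
ord <- order(nchar(corrected$Wrong_spell), decreasing = TRUE)

# One alternation of all misspellings; no "\\b", so matches inside words count.
pattern <- paste(corrected$Wrong_spell[ord], collapse = "|")

# Named vector: misspelling -> correction.
lookup <- setNames(corrected$Correct_spell, corrected$Wrong_spell)

# Replace each match with the correction stored for that exact misspelling.
fixed <- str_replace_all(reported, pattern, function(m) unname(lookup[m]))
print(fixed)
```

One caveat applies to this sketch and to the one-liner above: without word boundaries, a misspelling that is a prefix of a correct word, such as "abdome", also matches inside a correctly spelled "abdomen" and would turn it into "abdomenn". Real data would need a guard for that case.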

CodePudding user response:

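You could use regexpr in outer. Note that this replaces the whole first word of each element, so trailing characters such as "XXX" are dropped.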

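f <- \(x, y) {
    # split each element into words on whitespace
    s <- strsplit(x, '\\s+')
    # match position of each misspelling (rows) in each first word (columns); -1 if absent
    k <- outer(y[, 1], sapply(s, `[`, 1), Vectorize(regexpr))
    # elements whose first word starts with exactly one misspelling
    j <- which(colSums(k == 1) == 1)
    # row of the matching misspelling; drop=FALSE keeps a matrix when only one element matches
    i <- apply(k[, j, drop=FALSE], 2, which.max)
    # overwrite the first word with the correct spelling
    s[j] <- Map(`[<-`, s[j], 1, y[i, 2])
    vapply(s, paste, collapse=' ', character(1))
}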


f(reported, corrected)
# [1] "abdomen pain1"   "abdominal pain2" "abdomen pain3"   "abdomen pain4"   "abdominal pain5"
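
To make the outer/regexpr step easier to follow, here is a small sketch on made-up inputs (wrong and first_words are illustrative names, not part of the answer). It builds the same kind of matrix that f uses internally.

```r
wrong <- c("abdmen", "abdumen")
first_words <- c("abdmen", "abdumenXXX", "pain")

# Rows are misspellings, columns are words. Each cell is the match position
# returned by regexpr(), or -1 when the misspelling does not occur in the word.
k <- outer(wrong, first_words, Vectorize(regexpr))

# TRUE where the word starts with the misspelling.
starts_with <- k == 1
print(starts_with)
```

A column with exactly one TRUE identifies a word that begins with a single known misspelling, which is the condition f checks with colSums(k == 1) == 1.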

*Data:*

corrected <- structure(list(Wrong_spell = c("abdmen", "abdomane", "abdome", 
"abdumen", "abodmen", "adnomen", "aabdominal", "abddominal"), 
    Correct_spell = c("abdomen", "abdomen", "abdomen", "abdomen", 
    "abdomen", "abdomen", "abdominal", "abdominal")), class = "data.frame", row.names = c(NA, 
-8L))

reported <- c("abdmen pain1", "abddominal pain2", "abdumenXXX pain3", "abdomen pain4", 
"abddominal pain5")