I am trying to automatically spell-check a string column of a data.table/data.frame.
Looking around, I found several approaches that all give an "out of bounds" error when hunspell_suggest
returns no suggestions for a word (that is, an empty result, e.g. for "pippasnjfjsfiadjg"); see the approaches here (the accepted answer here yields NA, so it does work in principle) and here.
We seem to need unlist
to identify these empty suggestions and then exclude them from the part of the code that picks the first suggestion, but I cannot figure out how.
library(dplyr)
library(stringi)
library(hunspell)
df1 <- data.frame("Index" = 1:7,
                  "Text" = c("pippasnjfjsfiadjg came to dinner with us tonigh.",
                             "Wuld you like to trave with me?",
                             "There is so muh to undestand.",
                             "Sentences cone in many shaes and sizes.",
                             "Learnin R is fun",
                             "yesterday was Friday",
                             "bing search engine"),
                  stringsAsFactors = FALSE)
# Get bad words.
badwords <- hunspell(df1$Text) %>% unlist
# Extract the first suggestion for each bad word.
suggestions <- sapply(hunspell_suggest(badwords), "[[", 1)
out <- mutate(df1, Text = stri_replace_all_fixed(str = Text,
                                                 pattern = badwords,
                                                 replacement = suggestions,
                                                 vectorize_all = FALSE))
Answer:
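You'll want to filter the bad words and their suggestions to get rid of the words that have no suggestions:
badwords <- hunspell(df1$Text) %>% unlist()
# Note the use of '[' rather than '[[': for a word with no suggestions,
# '[' returns NA, whereas '[[' throws a "subscript out of bounds" error.
suggestions <- sapply(hunspell_suggest(badwords), '[', 1)
# Keep only the bad words that have at least one suggestion.
keep <- !is.na(suggestions)
badwords <- badwords[keep]
suggestions <- suggestions[keep]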
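To see why '[' works where '[[' does not, here is a minimal base-R sketch that needs no packages. The list mock_suggestions is a hypothetical stand-in for what hunspell_suggest(badwords) returns (one character vector per word, empty when there are no suggestions), and the gsub loop is only an assumed base-R substitute for stri_replace_all_fixed(..., vectorize_all = FALSE), not the code from the question.

```r
# Hypothetical stand-in for hunspell_suggest(badwords): one character
# vector per word, empty when the dictionary has no suggestion.
badwords <- c("pippasnjfjsfiadjg", "tonigh", "Wuld")
mock_suggestions <- list(character(0),
                         c("tonight", "tong"),
                         c("Would", "Wild"))

# '[[' cannot take element 1 of an empty vector, so sapply() errors;
# tryCatch() captures the message instead of stopping the script.
strict <- tryCatch(sapply(mock_suggestions, "[[", 1),
                   error = function(e) conditionMessage(e))

# '[' returns NA for a missing element, so every word gets one value.
first <- sapply(mock_suggestions, "[", 1)

# Drop the words that have no suggestion.
keep <- !is.na(first)
badwords_kept <- badwords[keep]
suggestions_kept <- first[keep]

# Assumed base-R substitute for the stringi call: apply each
# replacement in turn to the text.
text <- "pippasnjfjsfiadjg came to dinner with us tonigh."
for (i in seq_along(badwords_kept)) {
  text <- gsub(badwords_kept[i], suggestions_kept[i], text, fixed = TRUE)
}
print(text)
```

The word without suggestions is left untouched in the text, while the remaining bad words are replaced by their first suggestion.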