Spelling correction using a reference in one data frame to fix text in another (r)

Time:06-29

I created a data frame (from a CSV file) that will be used to correct spelling errors in the text I'm working with:

df1 <- data.frame(
  old_text = c("typo1",
               "typo2",
               "typo3"), 
  fixed_text = c("typo1_fixed", 
                 "typo2_fixed", 
                 "typo3_fixed"))

I'm now trying to go through the actual text (located in a separate data frame) and if there's a typo, fix it:

df2 <- data.frame(
  text= c("typo1", "Hi", "typo2", "Bye", "typo3"))

I've tried mapply but it doesn't work:

df2$text[grepl(df1$old_text, df2$text)] = mapply(function(x,y) gsub(x,y,df2$text[grepl(df1$old_text, df2$text)]), df1$old_text, df1$new_text)

"Error in mapply(function(x, y) gsub(x, y, df2$text[grepl(df1$old_text,  : 
  zero-length inputs cannot be mixed with those of non-zero length"

Any help would be appreciated!

CodePudding user response:

With stringr::str_replace_all you can use a named vector of patterns and replacements:

library(stringr)
df2$result = str_replace_all(string = df2$text, pattern = setNames(df1$fixed_text, nm = df1$old_text))
df2
#    text      result
# 1 typo1 typo1_fixed
# 2    Hi          Hi
# 3 typo2 typo2_fixed
# 4   Bye         Bye
# 5 typo3 typo3_fixed  

With base R I'd use a for loop. Your mapply error comes from a typo: df1$new_text should be df1$fixed_text. The nonexistent column is NULL, which is the zero-length input the message complains about. Fixing that leads to new problems, though, because grepl uses only the first element of a pattern vector, and it's hard to have mapply modify a single column multiple times. A for loop is quick to write; see Method 2 below.
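To make that diagnosis concrete, here is a small sketch that recreates the question's df1 and df2 and inspects the pieces of the failing call. The name `res` is only for illustration.

```r
df1 <- data.frame(
  old_text   = c("typo1", "typo2", "typo3"),
  fixed_text = c("typo1_fixed", "typo2_fixed", "typo3_fixed")
)
df2 <- data.frame(text = c("typo1", "Hi", "typo2", "Bye", "typo3"))

# df1 has no column named new_text, so `$` returns NULL (length 0).
# mapply cannot recycle a zero-length argument against a length-3 one.
is.null(df1$new_text)
# [1] TRUE

# With the column name corrected, mapply runs, but each call applies only
# ONE pattern to the whole column, so the result is a 5 x 3 matrix
# (one column per typo) rather than a single corrected vector.
res <- mapply(function(x, y) gsub(x, y, df2$text),
              df1$old_text, df1$fixed_text)
dim(res)
# [1] 5 3
```

Each column of `res` has exactly one typo fixed, which is why the replacements need to be applied one after another to the same vector instead.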

If you are searching for exact full-string matches, as in this example, you don't need regex at all. You don't need regex to see that "a" == "a"; you only need regex functions to see that "abc" contains "a". See Method 3 below.

# Method 1
library(stringr)
df2$result1 = str_replace_all(string = df2$text, pattern = setNames(df1$fixed_text, nm = df1$old_text))

# Method 2
df2$result2 = df2$text 
for(i in seq_len(nrow(df1))) {  # seq_len() is safe even if df1 has zero rows
  df2$result2 = gsub(pattern = df1$old_text[i], replacement = df1$fixed_text[i], x = df2$result2)
}

# Method 3
df2$results3 = df2$text
matches = match(df2$text, df1$old_text) 
df2$results3[!is.na(matches)] = df1$fixed_text[na.omit(matches)]

df2
#    text     result1     result2    results3
# 1 typo1 typo1_fixed typo1_fixed typo1_fixed
# 2    Hi          Hi          Hi          Hi
# 3 typo2 typo2_fixed typo2_fixed typo2_fixed
# 4   Bye         Bye         Bye         Bye
# 5 typo3 typo3_fixed typo3_fixed typo3_fixed
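The three methods agree on this data only because every typo is the whole string. A minimal sketch of where they diverge, using made-up values (`old`, `fixed`, and `text` are illustrative, not from the question):

```r
old   <- "typo1"
fixed <- "typo1_fixed"
text  <- c("typo1", "typo10", "a typo1 here")

# Regex replacement (as in Method 2) rewrites every occurrence,
# including ones inside longer strings.
by_gsub <- gsub(old, fixed, text)
by_gsub
# [1] "typo1_fixed"        "typo1_fixed0"       "a typo1_fixed here"

# Exact matching (as in Method 3) only touches elements that equal
# the typo in full.
by_match <- text
m <- match(text, old)
by_match[!is.na(m)] <- fixed[m[!is.na(m)]]
by_match
# [1] "typo1_fixed"  "typo10"       "a typo1 here"
```

Which behavior is right depends on whether df2$text holds single words or longer passages.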

(And even if you are searching within strings, if you are doing exact matches without regex special characters, you can use the stringr::fixed() function or the fixed = TRUE argument for gsub to speed things up.)
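Beyond speed, literal matching also changes the result when a typo contains regex metacharacters. A short base R sketch with made-up strings (`x` is illustrative):

```r
x <- c("a.b", "axb")

# Default: the pattern is a regular expression, so "." matches any character.
as_regex <- gsub("a.b", "FIXED", x)
as_regex
# [1] "FIXED" "FIXED"

# fixed = TRUE: the pattern is taken literally, so only "a.b" matches.
as_literal <- gsub("a.b", "FIXED", x, fixed = TRUE)
as_literal
# [1] "FIXED" "axb"
```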

CodePudding user response:

A base R option using ifelse and setNames:

transform(
  transform(
    df2,
    result = with(df1, setNames(fixed_text, old_text))[text]
  ),
  result = ifelse(is.na(result), text, result)
)

gives

   text      result
1 typo1 typo1_fixed
2    Hi          Hi
3 typo2 typo2_fixed
4   Bye         Bye
5 typo3 typo3_fixed
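The same idea, unpacked on plain vectors as a rough sketch (`lookup`, `x`, `hit`, and `out` are illustrative names): indexing a named vector by a name it doesn't contain yields NA, and ifelse then falls back to the original text.

```r
lookup <- c(typo1 = "typo1_fixed", typo2 = "typo2_fixed")
x <- c("typo1", "Hi", "typo2")

# Look each string up by name; unknown names come back as NA.
hit <- unname(lookup[x])
hit
# [1] "typo1_fixed" NA            "typo2_fixed"

# Keep the original wherever there was no correction.
out <- ifelse(is.na(hit), x, hit)
out
# [1] "typo1_fixed" "Hi"          "typo2_fixed"
```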