I created a data frame (from a CSV file) that will be used to correct spelling errors in the text I'm working with:
df1 <- data.frame(
old_text = c("typo1",
"typo2",
"typo3"),
fixed_text = c("typo1_fixed",
"typo2_fixed",
"typo3_fixed"))
I'm now trying to go through the actual text (located in a separate data frame) and if there's a typo, fix it:
df2 <- data.frame(
text= c("typo1", "Hi", "typo2", "Bye", "typo3"))
I've tried mapply but it doesn't work:
df2$text[grepl(df1$old_text, df2$text)] = mapply(function(x,y) gsub(x,y,df2$text[grepl(df1$old_text, df2$text)]), df1$old_text, df1$new_text)
"Error in mapply(function(x, y) gsub(x, y, df2$text[grepl(df1$old_text, :
zero-length inputs cannot be mixed with those of non-zero length"
Any help would be appreciated!
CodePudding user response:
With stringr::str_replace_all you can use a named vector of patterns and replacements:
library(stringr)
df2$result = str_replace_all(string = df2$text, pattern = setNames(df1$fixed_text, nm = df1$old_text))
df2
# text result
# 1 typo1 typo1_fixed
# 2 Hi Hi
# 3 typo2 typo2_fixed
# 4 Bye Bye
# 5 typo3 typo3_fixed
With base R I'd use a for loop. Your mapply error is caused by a typo (df1$new_text should be df1$fixed_text), but fixing that will lead to new errors because of the grepl call, which only uses the first element of a vector pattern. It's also hard to have mapply modify a single column multiple times. A for loop is quick to write - see Method 2 below.
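To see where the error message comes from, here is a small check, assuming df1 is defined as in the question. df1$new_text does not exist, so it evaluates to NULL (length zero), and mapply will not combine that with the three patterns in df1$old_text. The names res and the test string "typo1" are only for illustration.

```r
df1 <- data.frame(
  old_text = c("typo1", "typo2", "typo3"),
  fixed_text = c("typo1_fixed", "typo2_fixed", "typo3_fixed")
)

# A column that does not exist comes back as NULL, which has length 0
is.null(df1$new_text)   # TRUE
length(df1$new_text)    # 0

# mapply cannot mix a length-3 input with a length-0 input;
# capture the error message instead of stopping the script
res <- tryCatch(
  mapply(function(x, y) gsub(x, y, "typo1"), df1$old_text, df1$new_text),
  error = function(e) conditionMessage(e)
)
res
```

The captured message should be the same "zero-length inputs" error quoted in the question.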
If you are searching for exact full-string matches as in this example, you don't need regex at all. You don't need regex to see that "a" == "a"; you only need regex functions to see that "abc" contains "a". See Method 3 below.
# Method 1
library(stringr)
df2$result1 = str_replace_all(string = df2$text, pattern = setNames(df1$fixed_text, nm = df1$old_text))
# Method 2
df2$result2 = df2$text
for(i in 1:nrow(df1)) {
df2$result2 = gsub(pattern = df1$old_text[i], replacement = df1$fixed_text[i], x = df2$result2)
}
# Method 3
df2$results3 = df2$text
matches = match(df2$text, df1$old_text)
df2$results3[!is.na(matches)] = df1$fixed_text[na.omit(matches)]
df2
# text result1 result2 results3
# 1 typo1 typo1_fixed typo1_fixed typo1_fixed
# 2 Hi Hi Hi Hi
# 3 typo2 typo2_fixed typo2_fixed typo2_fixed
# 4 Bye Bye Bye Bye
# 5 typo3 typo3_fixed typo3_fixed typo3_fixed
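The methods agree on the example data, but they can diverge when a typo also appears inside a longer string. The following sketch uses made-up data (the typos "teh" and "adn" and the word "tehran" are hypothetical, not from the question) to contrast the regex loop with the exact-match approach:

```r
# Hypothetical correction table and text, for illustration only
df1 <- data.frame(
  old_text = c("teh", "adn"),
  fixed_text = c("the", "and"),
  stringsAsFactors = FALSE
)
text <- c("teh", "adn", "tehran", "Hi")

# Method 2 style: gsub also replaces matches inside longer strings
via_gsub <- text
for (i in seq_len(nrow(df1))) {
  via_gsub <- gsub(df1$old_text[i], df1$fixed_text[i], via_gsub)
}

# Method 3 style: only whole-string matches are replaced
via_match <- text
m <- match(text, df1$old_text)
via_match[!is.na(m)] <- df1$fixed_text[m[!is.na(m)]]

via_gsub   # "the" "and" "theran" "Hi"
via_match  # "the" "and" "tehran" "Hi"
```

Which behavior is right depends on the data: for a text column where each row is a single token, the exact match is the safer choice; for free text, a regex with word boundaries may be needed.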
(And even if you are searching within strings, if you are doing exact matches without regex special characters you can use the stringr::fixed() function or the fixed = TRUE argument for gsub to speed things up.)
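Besides speed, fixed = TRUE also changes what gets matched when a pattern happens to contain regex metacharacters. A minimal base R sketch, using hypothetical strings that are not from the question:

```r
# Hypothetical strings, for illustration only
text <- c("e.g", "egg", "a+b")

# As a regex, "." matches any character, so "e.g" also matches "egg"
dot_regex <- gsub("e.g", "eg", text)
# With fixed = TRUE the pattern is taken literally
dot_fixed <- gsub("e.g", "eg", text, fixed = TRUE)

# As a regex, "a+b" means "one or more a, then b", which is not in "a+b"
plus_regex <- gsub("a+b", "a-b", text)
plus_fixed <- gsub("a+b", "a-b", text, fixed = TRUE)

dot_regex   # "eg"  "eg"  "a+b"
dot_fixed   # "eg"  "egg" "a+b"
plus_regex  # "e.g" "egg" "a+b"
plus_fixed  # "e.g" "egg" "a-b"
```

So if the typo list comes from a CSV file and may contain characters such as ".", "+", or "(", literal matching is probably what you want.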
CodePudding user response:
A base R option using ifelse and setNames:
transform(
transform(
df2,
result = with(df1, setNames(fixed_text, old_text))[text]
),
result = ifelse(is.na(result), text, result)
)
gives
text result
1 typo1 typo1_fixed
2 Hi Hi
3 typo2 typo2_fixed
4 Bye Bye
5 typo3 typo3_fixed
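The same lookup can be written step by step, which may make it easier to see what the nested transform calls do. This is only a sketch of the idea; the intermediate names lookup and looked_up are mine. stringsAsFactors = FALSE is set explicitly because R versions before 4.0 turn strings into factors by default, and indexing a named vector with a factor uses its integer codes rather than its labels.

```r
df1 <- data.frame(
  old_text = c("typo1", "typo2", "typo3"),
  fixed_text = c("typo1_fixed", "typo2_fixed", "typo3_fixed"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  text = c("typo1", "Hi", "typo2", "Bye", "typo3"),
  stringsAsFactors = FALSE
)

# Named vector: names are the typos, values are the corrections
lookup <- setNames(df1$fixed_text, df1$old_text)

# Indexing by name returns NA for strings that have no entry
looked_up <- unname(lookup[df2$text])

# Keep the original text wherever the lookup found nothing
df2$result <- ifelse(is.na(looked_up), df2$text, looked_up)
df2
```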