I want to replace words in a vector based on original and replacement words in another dataframe. As an example:
A vector of strings to be altered:
my_words <- c("example r", "example River", "example R", "anthoer river",
"now a creek", "and another Ck", "example river tributary")
A dataframe of words to be replaced and the corresponding replacement words:
my_replace <- data.frame(
original = c("r", "River", "R", "river", "Ck", "creek", "Creek"),
replacement = c("R", "R", "R", 'R', "C", "C", "C"))
I want to replace any occurrence of one of the words in my_replace$original with the corresponding value in my_replace$replacement in the vector my_words. I tried using stringr::str_replace_all(), but it replaced every instance of the letter or word rather than only whole words (e.g. "another" became "anotheR"), which is undesirable.
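To make the whole-word problem concrete, here is a minimal base R sketch. It uses gsub() rather than stringr so that it runs without extra packages, and the names x, substring_result and whole_word_result are only illustrative:

```r
# Base R illustration of substring versus whole-word matching.
x <- c("example r", "and another Ck")

# Without word boundaries, every "r" is replaced, including the one inside "another".
substring_result <- gsub("r", "R", x, perl = TRUE)

# With \\b on both sides, only a standalone "r" is replaced.
whole_word_result <- gsub("\\br\\b", "R", x, perl = TRUE)

print(substring_result)
print(whole_word_result)
```

The first call reproduces the unwanted "anotheR"; the second leaves "another" untouched because \\b only matches at the edge of a word.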
Pseudocode of what I want to do:
str_replace_all(my_words, my_replace$original, my_replace$replacement)
Desired output:
"example R", "example R", "example R", "another R", "now a C", "and another C", "example R tributary"
I did find a solution using a for loop, but given that my dataset is large, the for loop is too slow. Any advice is much appreciated.
CodePudding user response:
Here is one gsub approach that handles every replacement with a single pattern:
my_words <- c("example r", "example River", "example R", "anthoer river",
"now a creek", "and another Ck", "example river tributary")
output <- gsub("\\b([rR])(?:iver)?\\b|\\b([cC])(?:ree)?k\\b", "\\U\\1\\U\\2", my_words, perl=TRUE)
output
[1] "example R" "example R" "example R"
[4] "anthoer R" "now a C" "and another C"
[7] "example R tributary"
Since the replacement for every river and creek variant is just R or C, respectively, we can capture the first letter of each possible match and replace the whole match with the uppercase version of that letter.
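As a small sketch of how the capture groups behave, the snippet below applies the same pattern to a few test strings of my own ("rivers" and "tributary" are not from the question). Only one of the two groups takes part in any given match, so the other back-reference contributes an empty string:

```r
# Same pattern as above: group 1 captures r/R, group 2 captures c/C.
pattern <- "\\b([rR])(?:iver)?\\b|\\b([cC])(?:ree)?k\\b"

# Hypothetical test inputs; the last two contain no whole-word match.
x <- c("river", "Ck", "rivers", "tributary")

# \\U upper-cases whichever group matched; the unmatched group is empty.
result <- gsub(pattern, "\\U\\1\\U\\2", x, perl = TRUE)
print(result)
```

Note that this trick relies on every replacement being the upper-cased first letter of the match. For an arbitrary replacement table, a lookup-based approach would be needed instead.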
CodePudding user response:
You need to build a dynamic, word-boundary-based pattern out of the words in my_replace$original and then use stringr::str_replace_all to replace each match with the corresponding value. Note that the original phrases need to be sorted by length in descending order so that longer strings match first:
my_words <- c("example r", "example River", "example R", "anthoer river", "now a creek", "and another Ck", "example river tributary")
my_replace <- data.frame(original = c("r", "River", "R", "river", "Ck", "creek", "Creek"), replacement = c("R", "R", "R", 'R', "C", "C", "C"))
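library(stringr)
# Sort a character vector by descending length so longer alternatives come first
sort.by.length.desc <- function (v) v[order(-nchar(v))]
# Alternation of all original words, wrapped in word boundaries
regex <- paste0("\\b(", paste(sort.by.length.desc(my_replace$original), collapse="|"), ")\\b")
# Look up the replacement for each matched word; match() works whether the function receives one match or a vector of matches
str_replace_all(my_words, regex, function(word) my_replace$replacement[match(word, my_replace$original)])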
Output:
[1] "example R" "example R" "example R" "anthoer R" "now a C" "and another C" "example R tributary"
The regex will be \b(River|river|creek|Creek|Ck|r|R)\b; it matches any of the listed words as a whole word.
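For comparison, here is a rough base R sketch of the same idea (one alternation pattern plus a lookup) using gregexpr() and regmatches<-. The names lookup, alternatives, pattern and result are illustrative, and the sketch assumes the original words contain no regex metacharacters:

```r
original    <- c("r", "River", "R", "river", "Ck", "creek", "Creek")
replacement <- c("R", "R", "R", "R", "C", "C", "C")
my_words <- c("example r", "example River", "example R", "anthoer river",
              "now a creek", "and another Ck", "example river tributary")

# Named vector: names are the original words, values are their replacements.
lookup <- setNames(replacement, original)

# Longest alternatives first, wrapped in word boundaries.
alternatives <- original[order(-nchar(original))]
pattern <- paste0("\\b(", paste(alternatives, collapse = "|"), ")\\b")

# Locate every whole-word match, then swap each match for its lookup value.
result <- my_words
m <- gregexpr(pattern, result, perl = TRUE)
regmatches(result, m) <- lapply(regmatches(result, m),
                                function(w) unname(lookup[w]))
print(result)
```

Because the lookup is done by name on a vector of matches, there is no explicit loop over the replacement table.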
CodePudding user response:
library(stringi)
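# Wrap each original word in word boundaries; vectorize_all = FALSE applies every pattern to every string in turn
stri_replace_all_regex(my_words, "\\b" %s+% my_replace$original %s+% "\\b", my_replace$replacement, vectorize_all = FALSE)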
[1] "example R" "example R" "example R" "anthoer R" "now a C" "and another C" "example R tributary"