Replace multiple words in multiple strings

I want to replace words in a vector based on original and replacement words in another dataframe. As an example:

A vector of strings to be altered:

my_words <- c("example r", "example River", "example R", "anthoer river",
        "now a creek", "and another Ck", "example river tributary")

A dataframe of words to be replaced and the corresponding replacement words:

my_replace <- data.frame(
  original = c("r", "River", "R", "river", "Ck", "creek", "Creek"),
  replacement = c("R", "R", "R", 'R', "C", "C", "C"))

I want to replace any occurrence of one of the words in my_replace$original with the corresponding value in my_replace$replacement in the vector my_words. I tried using stringr::str_replace_all(), but it replaced every occurrence of the letter or word rather than only whole words (e.g. "another" became "anotheR"), which is undesirable.

Pseudocode of what I want to do:

str_replace_all(my_words, my_replace$original, my_replace$replacement)

Desired output:

"example R", "example R", "example R", "another R", "now a C", "and another C", "example R tributary"

I did find a solution using a for loop, but given my dataset is large, the for loop option is too slow. Any advice much appreciated.

CodePudding user response：

Here is a gsub() approach that handles every case with a single regex replacement:

my_words <- c("example r", "example River", "example R", "anthoer river",
    "now a creek", "and another Ck", "example river tributary")

output <- gsub("\\b([rR])(?:iver)?\\b|\\b([cC])(?:ree)?k\\b", "\\U\\1\\U\\2", my_words, perl=TRUE)
output

[1] "example R"           "example R"           "example R"
[4] "anthoer R"           "now a C"             "and another C"
[7] "example R tributary"

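Since the replacements for all river and creek occurrences are just R and C, respectively, we can capture the first letter of each possible match and then replace it with the uppercase version of that letter. Whichever capture group did not take part in the match contributes an empty string to the replacement.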
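To see the building blocks in isolation, here is a minimal sketch that uses only the river half of the pattern. The test_words vector is made up for illustration and is not part of the question's data. With perl = TRUE, \\U upper-cases what follows it in the replacement, and the word boundaries should leave "rivers" and "another" untouched.

```r
# Hypothetical inputs chosen to exercise the word boundaries
test_words <- c("r", "river", "River", "rivers", "another")

# \\b([rR])(?:iver)?\\b matches a standalone "r"/"R" or "river"/"River";
# \\U\\1 inserts the captured first letter in upper case (needs perl = TRUE)
river_only <- gsub("\\b([rR])(?:iver)?\\b", "\\U\\1", test_words, perl = TRUE)
print(river_only)
# expected: "R" "R" "R" "rivers" "another"
```

Note that this trick only works because every replacement happens to be the upper-cased first letter of the original word; an arbitrary lookup table needs one of the approaches in the other answers.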

CodePudding user response：

You need to build a dynamic, word-boundary-based pattern out of the words in my_replace$original and then use stringr::str_replace_all to replace each match with the corresponding value. Note that the original phrases need to be sorted by length in descending order so that longer strings are tried first:

my_words <- c("example r", "example River", "example R", "anthoer river", "now a creek", "and another Ck", "example river tributary")
my_replace <- data.frame(original = c("r", "River", "R", "river", "Ck", "creek", "Creek"), replacement = c("R", "R", "R", 'R', "C", "C", "C"))
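library(stringr)
# Sort longest-first so that longer alternatives are tried before shorter ones
sort.by.length.desc <- function (v) v[order( -nchar(v)) ]
regex <- paste0("\\b(",paste(sort.by.length.desc(my_replace$original), collapse="|"), ")\\b")
# match() looks up each matched word and also copes with several matches in one string
str_replace_all(my_words, regex, function(word) my_replace$replacement[match(word, my_replace$original)])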

Output:

[1] "example R"           "example R"           "example R"           "anthoer R"           "now a C"             "and another C"       "example R tributary"
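For readers without stringr, the same idea (one alternation pattern plus a lookup of each matched word) can be sketched in base R with gregexpr() and regmatches<-. This is only an illustration under assumptions: the names lookup, alts, pattern and base_result are introduced here, and the words are assumed to contain no regex metacharacters (otherwise they would need escaping first).

```r
my_words <- c("example r", "example River", "example R", "anthoer river",
              "now a creek", "and another Ck", "example river tributary")
my_replace <- data.frame(
  original = c("r", "River", "R", "river", "Ck", "creek", "Creek"),
  replacement = c("R", "R", "R", "R", "C", "C", "C"),
  stringsAsFactors = FALSE)

# Named lookup vector: names are the original words, values the replacements
lookup <- setNames(my_replace$replacement, my_replace$original)

# One alternation wrapped in word boundaries, longest alternatives first
alts <- my_replace$original[order(-nchar(my_replace$original))]
pattern <- paste0("\\b(?:", paste(alts, collapse = "|"), ")\\b")

# Locate all matches, then overwrite each one with its looked-up replacement
base_result <- my_words
m <- gregexpr(pattern, base_result, perl = TRUE)
regmatches(base_result, m) <- lapply(regmatches(base_result, m),
                                     function(found) unname(lookup[found]))
print(base_result)
```

Because the regex engine scans each string only once, this avoids looping over the words one pattern at a time.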

CodePudding user response：

library(stringi)

stri_replace_all_regex(my_words, "\\b" %s+% my_replace$original %s+% "\\b", my_replace$replacement, vectorize_all = FALSE)

[1] "example R" "example R" "example R" "anthoer R" "now a C" "and another C" "example R tributary"
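With vectorize_all = FALSE, stringi applies each pattern/replacement pair in turn to every string, rather than pairing the i-th pattern with the i-th string. A rough base R equivalent, offered only as a sketch (the names patterns and seq_result are introduced here), folds over the pairs with Reduce(); each gsub() call is still vectorized over all of my_words, so the loop runs once per pattern, not once per string.

```r
my_words <- c("example r", "example River", "example R", "anthoer river",
              "now a creek", "and another Ck", "example river tributary")
my_replace <- data.frame(
  original = c("r", "River", "R", "river", "Ck", "creek", "Creek"),
  replacement = c("R", "R", "R", "R", "C", "C", "C"),
  stringsAsFactors = FALSE)

# Wrap every original word in word boundaries
patterns <- paste0("\\b", my_replace$original, "\\b")

# Apply the pairs one after another; acc holds the partially replaced strings
seq_result <- Reduce(
  function(acc, i) gsub(patterns[i], my_replace$replacement[i], acc, perl = TRUE),
  seq_along(patterns),
  my_words)
print(seq_result)
```

Since the replacements are applied sequentially, a later pattern can in principle match text produced by an earlier one; with this lookup table that is harmless, because "R" and "C" map to themselves or are not in the table.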