How to perform "find and replace" with multiple patterns to be found in a string in R

I am trying to switch genders of words in a string in R. For example, if I have the sentence "My gf has a mother who talks to my father and his bf", I want it to read "My bf has a father who talks to my mother and her gf".

I have a list of gender pairs stored as key-value pairs; right now it is just a data frame that looks like the one below. My naive approach was to iterate through the list and do a string replace of each key with its value. The obvious problem is that this swaps everything in the sentence and then swaps it all back, as you can see in the example code below.

library(stringr)

key_vals = data.frame(first_word = c("bf", "gf", "mother", "father", "his", "her"), second_word = c("gf", "bf", "father", "mother", "her", "his"))

ex = "My gf has a mother who talks to my father and his bf"

for(i in 1:nrow(key_vals)){
   ex = str_replace_all(ex, key_vals$first_word[i], key_vals$second_word[i])
}

My other idea was making two lists, one which had all male keys and all female values, and one which was the opposite. Then if I split up the sentence into individual words, for each word I could do an if statement like "if a male string is present, replace it with a female string, elif a female string is present, replace it with a male string, else do nothing". However, I can't figure out how to get just the words alone in a way I can then easily recombine into a working sentence. String split based on regex etc. just deletes the words, so I'm really struggling.

Another problem is that if, for example, there is something like "mother", it might get replaced to be "mothis", since I'm using a stupid way of matching strings which doesn't first identify the words, so it seems like I need to split it into words in any case.

This feels like it should be much more straightforward than it has been for me! Any help would be very appreciated.

CodePudding user response：

We may use gsubfn

library(gsubfn)
gsubfn("(\\w+)", setNames(as.list(key_vals[[2]]), key_vals[[1]]), ex)
[1] "My bf has a father who talks to my mother and her gf"
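
If you would rather not add a dependency, the same single-pass idea can be sketched in base R. This is only a sketch, not the method used by gsubfn itself: it builds one alternation pattern with word boundaries, finds all matches in the original string first, and then replaces each match through `regmatches<-`. The names lookup, pattern and swapped are illustrative, and key_vals and ex are re-created here so the block runs on its own.

```r
# Re-create the question's data so the sketch is self-contained
key_vals <- data.frame(
  first_word  = c("bf", "gf", "mother", "father", "his", "her"),
  second_word = c("gf", "bf", "father", "mother", "her", "his"),
  stringsAsFactors = FALSE
)
ex <- "My gf has a mother who talks to my father and his bf"

# Named vector: names are the words to find, values are their replacements
lookup <- setNames(key_vals$second_word, key_vals$first_word)

# One pattern for all keys; \\b keeps "her" from matching inside "mother"
pattern <- paste0("\\b(", paste(names(lookup), collapse = "|"), ")\\b")

# Locate every match in the original string, then replace each one by its
# looked-up value; nothing is replaced twice, so nothing is swapped back
m <- gregexpr(pattern, ex, perl = TRUE)
swapped <- ex
regmatches(swapped, m) <- list(unname(lookup[regmatches(ex, m)[[1]]]))
swapped
```

Because the matches are located before any replacement happens, the order of the rows in key_vals does not matter here.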

CodePudding user response：

Change for loop part to this:

plyr::mapvalues(str_split(ex, ' ')[[1]], key_vals$first_word, key_vals$second_word) %>%
    str_flatten(' ')
The following `from` values were not present in `x`: her
[1] "My bf has a father who talks to my mother and her gf"
ex
[1] "My gf has a mother who talks to my father and his bf"

I think the message can be ignored, as it is just reporting that "her" does not occur in the sentence stored in ex (passing warn_missing = FALSE to mapvalues suppresses it). The code first splits the string into a vector of words, then replaces the individual words, and finally pastes them back together.

CodePudding user response：

Here is a base R option using strsplit + match, like below

with(
  key_vals,
  {
    v <- unlist(strsplit(ex, "(?<=\\w)(?=\\W)|(?<=\\W)(?=\\w)", perl = TRUE))
    p <- second_word[match(v, first_word)]
    paste0(ifelse(is.na(p), v, p), collapse = "")
  }
)

and it yields

[1] "My bf has a father who talks to my mother and her gf"
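
One practical consequence of splitting at word/non-word boundaries rather than on spaces is that punctuation ends up in separator tokens, so a word followed by a comma or a full stop is still matched. Here is a small sketch of that, assuming a made-up sentence (ex2 is hypothetical, not from the question) and re-creating key_vals so it runs on its own:

```r
key_vals <- data.frame(
  first_word  = c("bf", "gf", "mother", "father", "his", "her"),
  second_word = c("gf", "bf", "father", "mother", "her", "his"),
  stringsAsFactors = FALSE
)
ex2 <- "My gf, her mother."  # hypothetical sentence with punctuation

# Split at every boundary between word and non-word characters;
# spaces and punctuation are kept as tokens instead of being dropped
v <- unlist(strsplit(ex2, "(?<=\\w)(?=\\W)|(?<=\\W)(?=\\w)", perl = TRUE))

# Look up each token; tokens without an entry (separators included) give NA
p <- key_vals$second_word[match(v, key_vals$first_word)]

# Keep the original token where nothing matched, then glue with no separator
res <- paste0(ifelse(is.na(p), v, p), collapse = "")
res
```

Since the separators are tokens too, pasting with collapse = "" reproduces the original spacing and punctuation exactly.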

CodePudding user response：

This does what you need.

library(stringr)

# I've updated the column names, for clarity
key_vals <- data.frame(words = c("bf", "gf", "mother", "father", "his", "her"), swapped_words = c("gf", "bf", "father", "mother", "her", "his"))

# used str_split to break the sentence into multiple words
ex <- "My gf has a mother who talks to my father and his bf"
words <- stringr::str_split(ex, " ")[[1]] #break into words

# do a left join between the two tables (all.x = TRUE keeps every word of the sentence)
dict <- merge(data.frame(words=words), key_vals, by = "words", all.x = TRUE, incomparables = NA)

# now we basically apply the dictionary to the string, using an apply function
# we also use paste(..., collapse = " ") to make them into one sentence again
words <- paste(sapply(words, function(x) {
    if (!x %in% key_vals$words)
        return (x)
    return(dict$swapped_words[dict$words == x][1])  # [1]: the word may occur more than once in the sentence
}), collapse=" ")

CodePudding user response：

Rather than relying on a data frame of replacements, you could use a named vector, which is similar to a dictionary of values:

replacements <- key_vals$second_word
names(replacements) <- key_vals$first_word

 bf       gf   mother   father      his      her 
 "gf"     "bf" "father" "mother"    "her"    "his" 

ex_split <- str_split(ex, ' ')[[1]]
swapped <- replacements[ex_split]
final <- paste0(ifelse(!is.na(swapped), swapped, ex_split), collapse = ' ')

"My bf has a father who talks to my mother and her gf"

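After creating ex_split, you could also substitute and glue everything together with Reduce. Note that Reduce uses the first word as its starting value without looking it up, so a sentence that begins with a word to swap would keep that word unchanged: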

Reduce(function(x, y) paste(x, ifelse(!is.na(replacements[y]), replacements[y], y)), ex_split)
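
If you need this in more than one place, the named-vector lookup can be wrapped in a small function. This is a rough sketch under the same assumption as above (words separated by single spaces, no punctuation attached); swap_words is a hypothetical helper name, and replacements is written out by hand so the block runs on its own. A vectorised lookup treats every word the same way, the first one included.

```r
# Same dictionary as above, written out as a named character vector
replacements <- c(bf = "gf", gf = "bf", mother = "father",
                  father = "mother", his = "her", her = "his")

# swap_words is a hypothetical helper, not part of any package
swap_words <- function(sentence, replacements) {
  words <- strsplit(sentence, " ", fixed = TRUE)[[1]]
  swapped <- unname(replacements[words])  # NA where a word has no entry
  paste(ifelse(is.na(swapped), words, swapped), collapse = " ")
}

res1 <- swap_words("My gf has a mother who talks to my father and his bf",
                   replacements)
res2 <- swap_words("her bf called", replacements)  # sentence starting with a key
res1
res2
```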