How to remove the first words of specific rows that appear in another column?

Is there a way to remove the first n words of the column "content" when there are words present in the "keyword" column?

I am working with a data frame similar to this:

keyword <- c("Mr. Jones", "My uncle Sam", "Tom", "", "The librarian")
content <- c("Mr. Jones is drinking coffee", "My uncle Sam is sitting in the kitchen with my uncle Richard", "Tom is playing with Tom's family's dog", "Cassandra is jogging for her first time", "The librarian is jogging with her")
data <- data.frame(keyword, content)
data

In some cases, the "keyword" string appears at the start of the "content" string. In others, the "keyword" string is empty and only "content" is filled.

What I want to achieve is to remove the first appearance of the word combination in "keyword" from "content" in the same row. Unfortunately, I have only managed to write code that deletes all the matching words. But as you can see, some words (like "uncle" or "Tom") appear more than once in a cell. I'd like to delete only the first appearance and keep all later ones in the same cell.

My next-best solution was to use the following code:

data$content <- mapply(function(x, y) gsub(x, "", y), gsub(" ", "|", data$keyword), data$content)

This code was designed to remove all of the words from "content" that are present in "keyword" of the same row. (It was initially posted here).

Another option that I tried was to design a function for this: I first created a new variable which counted the number of words that are included in the "keyword" string of the corresponding line:

numw <- lengths(gregexpr("\\S+", data$keyword))
data <- cbind(data, numw)

Second, I tried to write a function that removes the first n words of content[i], with n = numw[i]:

shorten <- function(v, z){
  v <- gsub(".*^\\w+", z, v)
}

shorten(data$content, data$numw)

Unfortunately, I cannot get the function to work; it produces the following error message:

Error in gsub(".*^\\w+", z, v) : invalid 'replacement' argument

So, I'd be incredibly grateful if someone could help me write a function that handles this properly.

CodePudding user response：

Here is a solution based on str_remove. Because str_remove gives a warning when the pattern is '', the first mutate call replaces empty keywords with NA. Then, if keyword is not NA, it is stripped from content; if it is NA, content is kept as is.

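library(tidyverse)

data |>
  # turn empty keywords into NA so they can be skipped below
  mutate(keyword = na_if(keyword, '')) |>
  # remove only the first match of keyword; leave rows without a keyword untouched
  mutate(content = case_when(
    !is.na(keyword) ~ str_remove(content, keyword),
    is.na(keyword) ~ content))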
#>         keyword                                          content
#> 1     Mr. Jones                               is drinking coffee
#> 2  My uncle Sam  is sitting in the kitchen with my uncle Richard
#> 3           Tom               is playing with Tom's family's dog
#> 4          <NA>          Cassandra is jogging for her first time
#> 5 The librarian                              is jogging with her
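
One caveat with this approach: str_remove treats keyword as a regular expression, so the "." in "Mr. Jones" matches any character, and the space that followed the keyword stays at the start of content. Below is a minimal base R sketch, assuming the keyword should be matched literally and only at the very start of content. strip_prefix is a hypothetical helper name, not a function from any package.

```r
keyword <- c("Mr. Jones", "My uncle Sam", "Tom", "", "The librarian")
content <- c("Mr. Jones is drinking coffee",
             "My uncle Sam is sitting in the kitchen with my uncle Richard",
             "Tom is playing with Tom's family's dog",
             "Cassandra is jogging for her first time",
             "The librarian is jogging with her")
data <- data.frame(keyword, content, stringsAsFactors = FALSE)

# strip_prefix (hypothetical helper): drop `prefix` from the start of `text`
# when it is non-empty and `text` literally begins with it.
strip_prefix <- function(text, prefix) {
  has_prefix <- nzchar(prefix) & startsWith(text, prefix)
  # keep everything after the prefix for the matching rows only
  text[has_prefix] <- substring(text[has_prefix], nchar(prefix[has_prefix]) + 1)
  # remove the space left behind at the start
  trimws(text, which = "left")
}

data$content <- strip_prefix(data$content, data$keyword)
data$content
```

Since only the leading characters are cut, later occurrences such as "Tom's" or "my uncle Richard" are never touched.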

CodePudding user response：

Note that the "g" in gsub stands for "global": it replaces every match in the string, whereas sub replaces only the first occurrence.

So you can use sub with paste(..., collapse = "|") and remove the leading white space with trimws(..., "l"):

trimws(sub(paste(data$keyword, collapse = "|"), "", data$content), "l")

Output:

#[1] "is drinking coffee"                              
#[2] "is sitting in the kitchen with my uncle Richard"
#[3] "is playing with Tom's family's dog"              
#[4] "Cassandra is jogging for her first time"        
#[5] "is jogging with her"
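
Keep in mind that the collapsed pattern is the same for every row and is interpreted as a regular expression. If each row should only lose its own keyword, matched literally, a row-wise variant could look like the following sketch. remove_first is a made-up helper name, and the guard for empty keywords is my own assumption about how those rows should be handled.

```r
keyword <- c("Mr. Jones", "My uncle Sam", "Tom", "", "The librarian")
content <- c("Mr. Jones is drinking coffee",
             "My uncle Sam is sitting in the kitchen with my uncle Richard",
             "Tom is playing with Tom's family's dog",
             "Cassandra is jogging for her first time",
             "The librarian is jogging with her")

# remove_first (hypothetical helper): for each row, delete the first literal
# occurrence of that row's keyword from that row's content.
remove_first <- function(keyword, content) {
  out <- mapply(function(k, txt) {
    if (!nzchar(k)) return(txt)    # nothing to remove for empty keywords
    sub(k, "", txt, fixed = TRUE)  # sub() replaces only the first match
  }, keyword, content, USE.NAMES = FALSE)
  trimws(out, which = "left")      # drop the space left at the start
}

result <- remove_first(keyword, content)
result
```

With fixed = TRUE the "." in "Mr. Jones" is taken literally, and because sub stops after the first match, the second "Tom" and "uncle" are kept.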