Is there a way to remove the first n words of the column "content" when there are words present in the "keyword" column?
I am working with a data frame similar to this:
keyword <- c("Mr. Jones", "My uncle Sam", "Tom", "", "The librarian")
content <- c("Mr. Jones is drinking coffee", "My uncle Sam is sitting in the kitchen with my uncle Richard", "Tom is playing with Tom's family's dog", "Cassandra is jogging for her first time", "The librarian is jogging with her")
data <- data.frame(keyword, content)
data
In some cases, the "keyword" string also appears as the first few words of the "content" string. In others, the "keyword" string is empty and only "content" is filled.
What I want to achieve here is to remove the first appearance of the word combination in "keyword" that appears in the same row in "content". Unfortunately, I am only able to create code that deletes all the matching words. But as you can see, some words (like "uncle" or "Tom") appear more than one time in a cell. I'd like to only delete the first appearance and keep all that come after in the same cell.
My next-best solution was to use the following code:
data$content <- mapply(function(x, y) gsub(x, "", y), gsub(" ", "|", data$keyword), data$content)
This code was designed to remove all of the words from "content" that are present in "keyword" of the same row. (It was initially posted here).
Another option that I tried was to design a function for this: I first created a new variable which counted the number of words that are included in the "keyword" string of the corresponding line:
numw <- lengths(gregexpr("\\S+", data$keyword))
data <- cbind(data, numw)
Second, I tried to formulate a function to remove the first n words of content[i], with n = numw[i]
shorten <- function(v, z){
v <- gsub(".*^\\w ", z, v)
}
shorten(data$content, data$numw)
Unfortunately, I am not able to make the function work and the following error message will be generated:
Error in gsub(".*^\w ", z, v) : invalid 'replacement' argument
So, I'd be incredibly grateful if someone could help me write a function that deals with this issue more appropriately.
CodePudding user response:
Here is a solution based on str_remove. Because str_remove gives a warning if the pattern is '', the first mutate call replaces empty keywords with NA. Then, if keyword is not NA, the keyword is stripped from content; otherwise content is taken as is.
library(tidyverse)
data |>
  mutate(keyword = na_if(keyword, '')) |>
  mutate(content = case_when(
    !is.na(keyword) ~ str_remove(content, keyword),
    is.na(keyword) ~ content))
#>         keyword                                          content
#> 1     Mr. Jones                               is drinking coffee
#> 2  My uncle Sam  is sitting in the kitchen with my uncle Richard
#> 3           Tom               is playing with Tom's family's dog
#> 4          <NA>          Cassandra is jogging for her first time
#> 5 The librarian                              is jogging with her
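Since str_remove treats the keyword as a regular expression (so the "." in "Mr. Jones" matches any character), a row-wise base R variant with literal matching may be a useful comparison. This is only a minimal sketch under that assumption: remove_first is a hypothetical helper name, not something from this thread, and it also trims the leading space left behind.

```r
# Rebuild the example data from the question.
keyword <- c("Mr. Jones", "My uncle Sam", "Tom", "", "The librarian")
content <- c("Mr. Jones is drinking coffee",
             "My uncle Sam is sitting in the kitchen with my uncle Richard",
             "Tom is playing with Tom's family's dog",
             "Cassandra is jogging for her first time",
             "The librarian is jogging with her")
data <- data.frame(keyword, content, stringsAsFactors = FALSE)

# Hypothetical helper (an assumption, not from the thread): remove only the
# first literal occurrence of `k` in `txt`. Rows with an empty or missing
# keyword are returned unchanged.
remove_first <- function(k, txt) {
  if (is.na(k) || !nzchar(k)) return(txt)
  # fixed = TRUE matches `k` as plain text; sub() replaces the first match only.
  trimws(sub(k, "", txt, fixed = TRUE), which = "left")
}

# Apply the helper pairwise to each (keyword, content) row.
data$content <- unname(mapply(remove_first, data$keyword, data$content))
data$content
```

Because each keyword is only applied to its own row, later occurrences such as "my uncle Richard" or "Tom's" are left alone.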
CodePudding user response:
Note that the g in gsub stands for "global": it replaces every match in the string, whereas sub replaces only the first occurrence. So you can use sub with paste(..., collapse = "|") and remove the leading white space with trimws(..., "l"):
trimws(sub(paste(data$keyword, collapse = "|"), "", data$content), "l")
Output:
#[1] "is drinking coffee"
#[2] "is sitting in the kitchen with my uncle Richard"
#[3] "is playing with Tom's family's dog"
#[4] "Cassandra is jogging for her first time"
#[5] "is jogging with her"
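The question's other idea, removing the first n words where n is the number of words in the keyword, can also be sketched with sub. This is a rough illustration rather than a definitive solution: drop_first_words is a hypothetical helper name, and the approach assumes the keyword always sits at the very start of content, since it counts words without checking that they actually match.

```r
keyword <- c("Mr. Jones", "My uncle Sam", "Tom", "", "The librarian")
content <- c("Mr. Jones is drinking coffee",
             "My uncle Sam is sitting in the kitchen with my uncle Richard",
             "Tom is playing with Tom's family's dog",
             "Cassandra is jogging for her first time",
             "The librarian is jogging with her")

# Number of whitespace-separated words in each keyword.
# strsplit("", ...) yields character(0), so an empty keyword counts as 0.
numw <- lengths(strsplit(trimws(keyword), "\\s+"))

# Hypothetical helper (an assumption, not from the thread): strip the first
# `n` words of `txt` together with the whitespace that follows them.
drop_first_words <- function(txt, n) {
  if (n == 0) return(txt)
  # Build a pattern such as "^\\s*(\\S+\\s+){2}" for n = 2.
  sub(sprintf("^\\s*(\\S+\\s+){%d}", n), "", txt)
}

# Apply row by row, pairing each content string with its own word count.
shortened <- unname(mapply(drop_first_words, content, numw))
shortened
```

The pattern is anchored with ^, so only the leading words are touched and repeated words later in the string are kept.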