I am trying to clean up a column containing long speeches from a debate. Right now, every row starts with a new speaker. However, things like subheaders remain at the end of each speech, which is not desirable.
Here is some example data:
library(dplyr)    # also provides tibble() and %>%
library(stringr)  # str_remove()
speeches <- tibble(subheader = c("3.Discussion", "8.Voting"),
                   full_speech = c("I close this part. 3.Discussion Let's start with",
                                   "I think we can vote now")
)
Desired Outcome:
subheader     full_speech
3.Discussion  I close this part.
8.Voting      I think we can vote now
What I tried so far:
speeches %>%
  mutate(full_speech = str_remove(full_speech, subheader))
But of course this only deletes the subheader itself, not the text that follows it.
CodePudding user response:
We can paste the subheader together with .* so that the pattern also matches any characters that follow the subheader:
library(dplyr)
library(stringr)
speeches %>%
  # \\s+ consumes the whitespace before the subheader, .* everything after it
  mutate(full_speech = str_remove(full_speech, str_c("\\s+",
    subheader, ".*")))
-output
# A tibble: 2 × 2
  subheader    full_speech
  <chr>        <chr>
1 3.Discussion I close this part.
2 8.Voting     I think we can vote now
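One caveat with pasting the subheader into a regex: characters such as the . in "3.Discussion" are treated as regex metacharacters (here it matches any character). That is harmless for this data, but a subheader containing something like ( or ? could break the pattern. A minimal sketch that avoids this by locating the subheader as a literal string with fixed() and cutting the speech off there; the helper column start and the name cleaned are only for illustration:

```r
library(dplyr)
library(stringr)

speeches <- tibble(
  subheader   = c("3.Discussion", "8.Voting"),
  full_speech = c("I close this part. 3.Discussion Let's start with",
                  "I think we can vote now")
)

cleaned <- speeches %>%
  mutate(
    # position where the literal subheader starts (NA if it does not occur)
    start = str_locate(full_speech, fixed(subheader))[, "start"],
    # keep everything before the subheader; leave the speech alone if no match
    full_speech = if_else(
      is.na(start),
      full_speech,
      str_trim(str_sub(full_speech, 1, start - 1))
    )
  ) %>%
  select(-start)

cleaned
```

For the example data this should give the same two rows as the regex version above.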
CodePudding user response:
An approach using gsub and paste0 to construct the pattern from subheader. rowwise() is used because gsub is not vectorised over its pattern argument.
library(dplyr)
speeches %>%
  rowwise() %>%
  mutate(full_speech = gsub(
    paste0(" ", subheader, ".*", collapse = ""), "", full_speech)) %>%
  ungroup()
# A tibble: 2 × 2
  subheader    full_speech
  <chr>        <chr>
1 3.Discussion I close this part.
2 8.Voting     I think we can vote now
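If you would rather not depend on dplyr at all, the same row-by-row idea can be sketched in base R with mapply, which pairs each subheader with its own speech. This is only a sketch: cut_at_header and speeches_df are made-up names, and the match is literal (fixed = TRUE) rather than a regex.

```r
speeches_df <- data.frame(
  subheader   = c("3.Discussion", "8.Voting"),
  full_speech = c("I close this part. 3.Discussion Let's start with",
                  "I think we can vote now"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: cut `speech` off at the first literal occurrence of `header`.
cut_at_header <- function(header, speech) {
  pos <- regexpr(header, speech, fixed = TRUE)  # -1 when header is absent
  if (pos < 0) speech else trimws(substr(speech, 1, pos - 1))
}

# mapply() calls the helper once per row; unname() drops the names it adds
speeches_df$full_speech <- unname(
  mapply(cut_at_header, speeches_df$subheader, speeches_df$full_speech)
)

speeches_df
```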