I have a character vector in a data frame in R which contains inbound email text. Most of the rows contain 'Dear x,' where x is any intended recipient and x can vary. There could also be typos such as the incorrect use of lowercase. Either way, the common feature is that they start with the word 'dear' (upper or lowercase) and end in a comma.
df <- data.frame(
  emails = c("Dear dave, I have seen what you...",
             "Dear Mr Smith, I recieved your reply...",
             "dear stu, I note that you have not..."),
  account = c(534, 434, 544)
)
df
                                   emails account
1      Dear dave, I have seen what you...     534
2 Dear Mr Smith, I recieved your reply...     434
3   dear stu, I note that you have not...     544
I am looking to trim off the email intro to just start with the main body of text so it looks like the one below.
                       emails account
1     I have seen what you...     534
2    I recieved your reply...     434
3 I note that you have not...     544
CodePudding user response:
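We can use sub() here. The pattern matches 'Dear' or 'dear' at the start of the string, then one or more space-separated words, then the comma and any whitespace after it:
df$emails <- sub("^[Dd]ear(?: \\S+)+,\\s*", "", df$emails)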
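If the body of an email can itself contain commas, a variant that cannot run past the first comma may be safer. The following is only a minimal sketch, not the code from the answer above: it swaps in the character class [^,]* and adds ignore.case = TRUE so that spellings such as 'DEAR' are caught as well. The vector name emails and the third sample string are made up for illustration.

```r
# Hypothetical sample vector; the third element has a comma in its body
# and an all-caps greeting.
emails <- c("Dear dave, I have seen what you...",
            "Dear Mr Smith, I recieved your reply...",
            "DEAR stu, I note, that you have not...")

# ^dear  : greeting at the very start (case-insensitive via ignore.case)
# [^,]*  : any run of non-comma characters, so the match stops at the first comma
# ,\\s*  : the comma itself plus any whitespace that follows it
trimmed <- sub("^dear[^,]*,\\s*", "", emails, ignore.case = TRUE)
print(trimmed)
```

Because sub() replaces only the first match and the pattern is anchored with ^, strings that do not open with a greeting should pass through unchanged.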
CodePudding user response:
In case you'd like a tidyverse / stringr option: the ? after .* makes the match lazy, so it stops at the first comma rather than the last one.
library(tidyverse)
tribble(
  ~emails, ~account,
  "Dear dave, I have seen what you...", 534,
  "Dear Mr Smith, I recieved your reply...", 434,
  "dear stu, I note, that you have not...", 544
) |>
  mutate(emails = str_remove(emails, "[Dd]ear.*?, "))
#> # A tibble: 3 × 2
#>   emails                       account
#>   <chr>                          <dbl>
#> 1 I have seen what you...          534
#> 2 I recieved your reply...         434
#> 3 I note, that you have not...     544
Created on 2022-12-26 with reprex v2.0.2
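One caveat with the stringr pattern above is that it is not anchored, so an email with no greeting but with 'dear' somewhere in its body would also be trimmed. A hedged sketch of an anchored variant follows; it assumes stringr is installed, and the second sample string and the variable names are invented for illustration.

```r
library(stringr)

# Hypothetical inputs: one with a greeting, one with "dear" only mid-sentence.
emails <- c("dear stu, I note, that you have not...",
            "Thanks, my dear friend, for the reply")

# Unanchored pattern: removes the first "dear ... , " wherever it occurs.
unanchored <- str_remove(emails, "[Dd]ear.*?, ")

# Anchored, case-insensitive pattern: only strips a greeting at the start.
anchored <- str_remove(emails, regex("^dear.*?,\\s*", ignore_case = TRUE))

print(unanchored)
print(anchored)
```

With the anchor in place, the second string should come back untouched, while the unanchored version cuts 'dear friend, ' out of the middle of the sentence.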