I work with gmailr and I need to extract email addresses from angle brackets <> (sometimes several in one row), but when there are no brackets (e.g. [email protected]) I need to keep those elements as they are.
This is an example:
x2 <- c("John Smith <[email protected]> <[email protected]>","[email protected]" ,
"<[email protected]>")
I need output like:
[1] "[email protected]" "[email protected]"
[2] "[email protected]"
[3] "[email protected]"
I tried this, intending to merge the two results:
library("qdapRegex")
y1 <- ex_between(x2, "<", ">", extract = FALSE)
y2 <- rm_between(x2, "<", ">", extract = TRUE )
A sample of the code I run on my data:
from <- sapply(msgs_meta, gm_from)
from[sapply(from, is.null)] <- NA
from1 <- rm_bracket(from)
from2 <- ex_bracket(from)
gmail_DK <- gmail_DK %>%
mutate(from = unlist(y1)) %>%
mutate(from = unlist(y2))
but when I apply this to my data (emails from only one day) and unlist the result, I get:
Error in mutate():
! Problem while computing cc = unlist(cc2).
x cc must be size 103 or 1, not 104.
Run rlang::last_error() to see where the error occurred.
I suppose that with data from more days the difference would be even bigger, so I would prefer not to go this way.
An answer in R is preferred, but if you know how to do it in, for example, PowerQuery, that would be great too.
CodePudding user response:
We may also use base R: split the strings at the space that follows the > (strsplit) and then capture the substring between the < and > in sub. In the replacement, we specify the backreference (\\1) of the captured group. [^>]+ implies one or more characters that are not a >.
sub(".*<([^>]+)>", "\\1", unlist(strsplit(x2,
    "(?<=>)\\s+", perl = TRUE)))
[1] "[email protected]" "[email protected]"
[3] "[email protected]" "[email protected]"
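The call above returns one flat vector, so the link between each address and its original row is lost. If the per-row grouping from the question's desired output matters, the same split-and-sub idea can probably be applied inside lapply. A minimal sketch, assuming made-up example.com addresses in place of the real data; extract_emails is a hypothetical helper name, not part of any package:

```r
# Hypothetical input: example.com addresses stand in for the real data
x2 <- c("John Smith <john@example.com> <j.smith@example.com>",
        "plain@example.com",
        "<boxed@example.com>")

# extract_emails is an illustrative helper, not a function from any package
extract_emails <- function(x) {
  # split each string at whitespace that directly follows a ">"
  parts <- strsplit(x, "(?<=>)\\s+", perl = TRUE)
  # within each element, keep only the text between "<" and ">";
  # strings without brackets do not match and are returned unchanged
  lapply(parts, function(p) sub(".*<([^>]+)>", "\\1", p))
}

res <- extract_emails(x2)
lengths(res)  # number of addresses found per input element
```

The result is a list with one entry per input element, so it keeps the same length as the input and could be stored as a list-column instead of being unlisted.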
CodePudding user response:
Clunky but OK?
library(magrittr)  ## provides %>% (also attached by dplyr)
(x2
## split into single words/tokens
%>% strsplit(" ")
%>% unlist()
## find e-mail-like strings, with or without brackets
%>% stringr::str_extract("<?[\\w-.]+@[\\w-.]+>?")
## drop elements with no e-mail component
%>% na.omit()
## strip brackets
%>% stringr::str_remove_all("[<>]")
)
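Flattening gives more elements than there are input rows whenever a row holds several addresses, which is the kind of size mismatch mutate() reports in the question. One possible workaround is to collapse the addresses of each row into a single string, so there is exactly one value per row. A rough base R sketch, assuming made-up example.com addresses and a deliberately simple pattern (not a full e-mail validator); the "; " separator is also just an assumption:

```r
# Hypothetical input, including an NA like the one produced for missing senders
x2 <- c("John Smith <john@example.com> <j.smith@example.com>",
        "plain@example.com",
        "<boxed@example.com>",
        NA)

# simplistic pattern: runs of non-space, non-bracket characters around "@"
pattern <- "[^\\s<>]+@[^\\s<>]+"

# gregexpr locates every match per element; regmatches extracts them as a list
found <- regmatches(x2, gregexpr(pattern, x2, perl = TRUE))

# collapse each element to one string; elements with no match become NA
collapsed <- vapply(
  found,
  function(e) if (length(e) > 0) paste(e, collapse = "; ") else NA_character_,
  character(1)
)

length(collapsed) == length(x2)  # one value per input row
```

Because collapsed has the same length as the input, it should fit into mutate() without the size error; the addresses can be split apart again later with strsplit(collapsed, "; ").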