Extract emails in brackets-CodePudding

I work with gmailr and need to extract email addresses from angle brackets <> (sometimes several in one row). When an element has no brackets (e.g. [email protected]), I need to keep it as is.

This is an example:

x2 <- c("John Smith <[email protected]>  <[email protected]>","[email protected]" ,
        "<[email protected]>")

I need output like:

[1] "[email protected]"       "[email protected]"
[2] "[email protected]"
[3] "[email protected]"

I tried this in order to merge the two results:

library("qdapRegex")
y1 <- ex_between(x2, "<", ">", extract = FALSE)
y2 <- rm_between(x2, "<", ">", extract = TRUE )

A sample of the code I use on my data:

from <- sapply(msgs_meta, gm_from)
from[sapply(from, is.null)] <- NA
from1 <- rm_bracket(from)
from2 <- ex_bracket(from)

gmail_DK <- gmail_DK %>% 
  mutate(from = unlist(y1)) %>%
  mutate(from = unlist(y2))

But when I apply these functions to my data (only one day of emails) and unlist, I get:

Error in mutate():
! Problem while computing cc = unlist(cc2).
x cc must be size 103 or 1, not 104.
Run rlang::last_error() to see where the error occurred.

I suppose that with data from more days the difference would be bigger, so I would prefer not to go this way.

An answer in R is preferred, but if you know how to do it in, for example, Power Query, that would be great too.

CodePudding user response：

We may also use base R: split the strings at the whitespace that follows a > (strsplit), then capture the substring between < and > with sub. In the replacement, the backreference (\\1) refers to the captured group, and [^>]+ matches one or more characters that are not a >.

sub(".*<([^>]+)>", "\\1", unlist(strsplit(x2, 
       "(?<=>)\\s+", perl = TRUE)))
[1] "[email protected]"    "[email protected]"  
[3]  "[email protected]"    "[email protected]"
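
The call above returns one flat vector, so the per-row grouping shown in the question's desired output is lost. A minimal sketch that keeps one list element per input string, using the same strsplit/sub idea, might look like this. The addresses are hypothetical placeholders (the real ones are obscured in the post), and `parts` and `emails` are names made up for illustration.

```r
# Hypothetical placeholder addresses; the originals are obscured in the post.
x2 <- c("John Smith <john@example.com>  <js@example.org>",
        "plain@example.com",
        "<boxed@example.net>")

# Split each string at whitespace that follows ">", keeping one list
# element per input string (no unlist here).
parts <- strsplit(x2, "(?<=>)\\s+", perl = TRUE)

# Within each piece, drop everything up to "<" and the closing ">".
# Pieces without brackets do not match and are returned unchanged.
emails <- lapply(parts, function(p) sub(".*<([^>]+)>", "\\1", p))

emails
```

Because `emails` has the same length as `x2`, it could be stored as a list-column next to the original rows instead of being unlisted, which is what triggers the size mismatch in mutate().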

CodePudding user response：

Clunky but OK?

(x2 
   ## split into single words/tokens
   %>% strsplit(" ")
   %>% unlist()
   ## find e-mail-like strings, with or without brackets
   %>% stringr::str_extract("<?[\\w-.]+@[\\w-.]+>?") 
   ## drop elements with no e-mail component
   %>% na.omit()   
   ## strip brackets
   %>% stringr::str_remove_all("[<>]")
)
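
If the result has to fit back into a data frame column with one value per row, a possible alternative is to extract the address-like substrings directly and collapse them per row. This is only a sketch in base R: the addresses are hypothetical placeholders, the pattern is a deliberately loose approximation of an email address, and the "; " separator and the names `found` and `from_clean` are assumptions for illustration.

```r
# Hypothetical placeholder addresses; the originals are obscured in the post.
x2 <- c("John Smith <john@example.com>  <js@example.org>",
        "plain@example.com",
        "<boxed@example.net>")

# Loose e-mail-like pattern: "<" and ">" are not in the character classes,
# so brackets are never part of a match.
pattern <- "[[:alnum:]._-]+@[[:alnum:].-]+"

# gregexpr finds all matches per string; regmatches returns them as a list
# with one element per input string.
found <- regmatches(x2, gregexpr(pattern, x2))

# Collapse each row's matches into a single string so the result has
# exactly one value per input row.
from_clean <- vapply(found, paste, character(1), collapse = "; ")

from_clean
```

Since `from_clean` always has the same length as `x2`, assigning it inside mutate() should not run into the "must be size 103 or 1" error, at the cost of storing several addresses in one string.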