I have a table of over 100K names and email addresses. I would like to filter the table to keep only those emails I think are not spam.
For example, I have addresses such as:
[email protected]
[email protected]
[email protected]
I would now like to filter out those addresses that have only digits before the @ symbol, as well as those that have only digits after the @ but before the .com suffix.
I know I can extract them using str_split and grepl, but I can't fit them into a filter query to remove them from the table.
pattern <- "[email protected]"
str_split(pattern, '@') # this splits the address at the @ symbol
str_split(string = str_split(pattern, '@')[[1]][2], pattern = "\\.") # this splits the domain name at the dot separating the suffix from the numbers
as.numeric(str_split(string = str_split(pattern, '@')[[1]][2], pattern = "\\.")[[1]][1]) # this checks whether the string extracted above contains only digits; if not, as.numeric() returns NA (with a warning)
But how do I combine this in a tidyverse query?
thanks
P.S. I know this is a far-fetched question, but is there some kind of spam filter for email addresses that one can use within R?
CodePudding user response:
I think this pattern should help you identify the spam emails according to your conditions.
^\\d+@|@\\d+\\.com
To use it in filter you may use grepl, or str_detect from stringr.
data %>% filter(grepl('^\\d+@|@\\d+\\.com', email))
To get the rows which are not spam, negate the condition using !.
data %>% filter(!grepl('^\\d+@|@\\d+\\.com', email))
Example:
x <- c('[email protected]', '[email protected]', '[email protected]', '[email protected]')
grepl('^\\d+@|@\\d+\\.com', x)
#[1] TRUE TRUE TRUE FALSE
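The same filter can also be written with str_detect, which the answer mentions but does not show. This is only a minimal sketch: the data frame name emails and the sample addresses are made up for illustration, and the pattern is the one from the answer with the + quantifiers in place.

```r
library(dplyr)
library(stringr)

# Hypothetical sample data; these addresses are invented for illustration.
emails <- data.frame(
  email = c("12345@example.com", "alice@98765.com",
            "bob42@example.com", "carol@example.org"),
  stringsAsFactors = FALSE
)

# Matches when everything before the @ is digits,
# or when the domain label directly before ".com" is digits only.
spam_pattern <- "^\\d+@|@\\d+\\.com"

# Keep only the rows that do NOT match the spam pattern.
kept <- emails %>% filter(!str_detect(email, spam_pattern))
kept$email
```

Note that the pattern only looks at .com domains, as in the question; other suffixes would need the pattern to be adjusted.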
CodePudding user response:
It's a rather simple solution and I think there might be a cleaner way without creating all these extra columns:
library(stringr) # provides str_split_fixed
library(dplyr)
adress <- c("[email protected]","[email protected]","[email protected]")
adf <- as.data.frame(adress)
adf[c("Before","After")] <- str_split_fixed(adf$adress, '@',2) # split each address at the @ into the parts before and after it
adf[c("After2","com")] <- str_split_fixed(adf$After,"\\.",2) # split the part after the @ at the first dot
adf <- adf %>% filter(grepl('[a-zA-Z]', Before)) # keep rows with at least one letter before the @
adf <- adf %>% filter(grepl('[a-zA-Z]', After2)) # keep rows with at least one letter between the @ and the first dot
adf$adress
[1] "[email protected]"
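One possible way to avoid the helper columns is to extract both parts inside filter itself with base R's sub. This is a sketch under the same logic as above (keep a row only if each part contains at least one letter), using invented sample addresses rather than the ones from the question.

```r
library(dplyr)

# Hypothetical sample data; these addresses are invented for illustration.
adf <- data.frame(
  adress = c("12345@example.com", "alice@98765.com", "bob42@example.com"),
  stringsAsFactors = FALSE
)

clean <- adf %>%
  filter(
    # part before the @ must contain at least one letter
    grepl("[a-zA-Z]", sub("@.*$", "", adress)),
    # domain label between the @ and the first dot must contain at least one letter
    grepl("[a-zA-Z]", sub("^[^@]*@([^.]*).*$", "\\1", adress))
  )
clean$adress
```

Because the check is "contains a letter" rather than "consists only of digits", a part made of digits and punctuation would also be dropped.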