Checking duplicates in email data regardless of letter case

Time:05-20

I am checking for duplicates in email data, but my code only matches values with identical case. I want to change it so that it works irrespective of upper and lower case.

I have around 2 crore (20 million) rows and want to check for duplicate names, employee IDs, and emails, mutating the data frame as shown below.

I don't want to change the required output, only the code, so that it also treats values differing in upper and lower case as duplicates.

For example, here it is not flagging "[email protected]" and "gb,[email protected]" as duplicates.

df <- data.frame(EMP_ID = c(88111,"BBB4477","BBB4058","BBB5832","BBB0338","BBB1814","BBB6543",875430,875970,"BBB0243","BBB1943","BBB9344","BBB9701","BBB1814","BBB8648","BBB4373","BBB7270","BBB6165","BBB7460","BBB7528","BBB6092"),
                 name = c("link adam","dy tt","link adam","gbesada","dy tt","slew lang","dy tt","gbesada","jachaval","allo nyyn","mbautis","grand fring","jali","kintom dang","namoti","shan mig","NA","NA","NA","NA",NA),
                 email = c("[email protected]","[email protected]","[email protected]","gb,[email protected]","[email protected]","[email protected]","[email protected]","gb,[email protected]","[email protected]","[email protected]","[email protected]","[email protected]","[email protected]","[email protected]","[email protected]","[email protected]","[email protected]","[email protected]",NA,"NA",NA))

valuesToIgnore <- c("", NA)
colss <- c("EMP_ID","name","email")

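library(dplyr)

# For each column, add 1 to the logical "is duplicated" flag to pick
# '' (index 1) or '<column> duplicated' (index 2)
df1 <- df %>%
  mutate(across(all_of(colss), ~ c('', paste(cur_column(), 'duplicated'))[1 + ((duplicated(.) | duplicated(., fromLast = TRUE)) & !is.na(.))],
                .names = "{c(1,2,3)}. unique {col}")) %>% as.data.frame()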

CodePudding user response:

Does this solve the issue? I added tolower() to the code so that values are compared in lower case. Without a desired output I can't check for sure, though, so if this doesn't solve your issue, please add the expected output and I will modify the answer.

# Compare lower-cased values so that case differences do not hide duplicates
df1 <- df %>%
  mutate(across(all_of(colss), ~ c('', paste(cur_column(), 'duplicated'))[1 + ((duplicated(tolower(.)) | duplicated(tolower(.), fromLast = TRUE)) & !is.na(.))],
                .names = "{c(1,2,3)}. unique {col}")) %>% as.data.frame()

output:

head(df1, 10)
#      EMP_ID      name               email  1. unique EMP_ID  2. unique name  3. unique email
#1    88111 link adam [email protected]                   name duplicated                 
#2  BBB4477     dy tt      [email protected]                   name duplicated email duplicated
#3  BBB4058 link adam [email protected]                   name duplicated                 
#4  BBB5832   gbesada   gb,[email protected]                   name duplicated email duplicated
#5  BBB0338     dy tt      [email protected]                   name duplicated email duplicated
#6  BBB1814 slew lang  [email protected] EMP_ID duplicated                 email duplicated
#7  BBB6543     dy tt      [email protected]                   name duplicated                 
#8   875430   gbesada   gb,[email protected]                   name duplicated email duplicated
#9   875970  jachaval   [email protected]                                                   
#10 BBB0243 allo nyyn       [email protected]  
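
To see the effect of tolower() in isolation, here is a minimal base R sketch. The addresses and the helper name flag_dups are hypothetical and not taken from the question's data. The ignore argument mirrors the question's valuesToIgnore, which is defined there but never used, so applying it this way is an assumption about the intent.

```r
# Hypothetical sample values (not from the question's data frame)
emails <- c("Ann@Example.com", "ann@example.com", "bob@example.com", "", "", NA, NA)

# Hypothetical helper: TRUE for every member of a duplicate group,
# comparing case-insensitively and skipping values listed in `ignore`
flag_dups <- function(v, ignore = c("", NA)) {
  key <- tolower(v)  # harmonize case before comparing; NA stays NA
  dup <- duplicated(key) | duplicated(key, fromLast = TRUE)
  dup & !(v %in% ignore)
}

# Case-sensitive check: the first two addresses are NOT seen as duplicates
case_sensitive <- duplicated(emails) | duplicated(emails, fromLast = TRUE)

# Case-insensitive check: only the first two addresses are flagged
case_insensitive <- flag_dups(emails)
print(data.frame(emails, case_sensitive, case_insensitive))
```

Adding 1 to such a logical vector turns it into an index of 1 or 2, which is how a label vector like c('', 'email duplicated') can be subscripted to produce the output columns.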
Tags: r