I am checking for duplicates in email data, but the check currently only matches values that have the same case. I want it to work irrespective of upper and lower case.
I have around 2 crore (20 million) rows and want to flag duplicate names, employee IDs, and emails by mutating the data frame as shown below.
I don't want to change the required output, only the code, so that the comparison also treats upper- and lower-case variants as duplicates.
For example, it currently does not report "[email protected]" and "gb,[email protected]" as duplicates.
df <- data.frame(EMP_ID = c(88111,"BBB4477","BBB4058","BBB5832","BBB0338","BBB1814","BBB6543",875430,875970,"BBB0243","BBB1943","BBB9344","BBB9701","BBB1814","BBB8648","BBB4373","BBB7270","BBB6165","BBB7460","BBB7528","BBB6092"),
name = c("link adam","dy tt","link adam","gbesada","dy tt","slew lang","dy tt","gbesada","jachaval","allo nyyn","mbautis","grand fring","jali","kintom dang","namoti","shan mig","NA","NA","NA","NA",NA),
email = c("[email protected]","[email protected]","[email protected]","gb,[email protected]","[email protected]","[email protected]","[email protected]","gb,[email protected]","[email protected]","[email protected]","[email protected]","[email protected]","[email protected]","[email protected]","[email protected]","[email protected]","[email protected]","[email protected]",NA,"NA",NA))
valuesToIgnore <- c("", NA)
colss <- c("EMP_ID","name","email")
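library(dplyr)
# Index 1 gives '', index 2 gives "<column> duplicated"; NA values are never flagged.
df1 <- df %>%
  mutate(across(all_of(colss),
                ~ c('', paste(cur_column(), 'duplicated'))[1 + ((duplicated(.) | duplicated(., fromLast = TRUE)) & !is.na(.))],
                .names = "{c(1,2,3)}. unique {col}")) %>% as.data.frame()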
CodePudding user response:
Does this solve the issue? I added tolower() to the code so that all values are compared in lower case. Without a desired output I can't check for sure, though, so if this doesn't solve your issue, please add the desired output and I will modify the answer.
# Same logic as in the question, but values are lower-cased before comparison.
df1 <- df %>%
  mutate(across(all_of(colss),
                ~ c('', paste(cur_column(), 'duplicated'))[1 + ((duplicated(tolower(.)) | duplicated(tolower(.), fromLast = TRUE)) & !is.na(.))],
                .names = "{c(1,2,3)}. unique {col}")) %>% as.data.frame()
output:
head(df1, 10)
#   EMP_ID  name      email                1. unique EMP_ID  2. unique name  3. unique email
#1  88111   link adam [email protected]                      name duplicated
#2  BBB4477 dy tt     [email protected]                      name duplicated email duplicated
#3  BBB4058 link adam [email protected]                      name duplicated
#4  BBB5832 gbesada   gb,[email protected]                   name duplicated email duplicated
#5  BBB0338 dy tt     [email protected]                      name duplicated email duplicated
#6  BBB1814 slew lang [email protected]    EMP_ID duplicated                 email duplicated
#7  BBB6543 dy tt     [email protected]                      name duplicated
#8  875430  gbesada   gb,[email protected]                   name duplicated email duplicated
#9  875970  jachaval  [email protected]
#10 BBB0243 allo nyyn [email protected]
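To see the idea in isolation, here is a minimal base R sketch of the same case-insensitive check on a single vector. The sample addresses and the helper name `is_dup_ci` are made up for illustration and are not part of the question's data or of any package.

```r
# Hypothetical sample values, not taken from the question's data.
emails <- c("Anna@Example.com", "anna@example.com", "bob@example.com", NA, NA)

# Hypothetical helper: TRUE where a value occurs more than once, ignoring case.
is_dup_ci <- function(x) {
  key <- tolower(x)  # compare on a lower-cased copy; NA stays NA
  (duplicated(key) | duplicated(key, fromLast = TRUE)) & !is.na(x)
}

flags <- is_dup_ci(emails)

# Turn the logical flags into the same kind of label used in the answer:
# index 1 is the empty string, index 2 is the "duplicated" label.
labels <- c("", "email duplicated")[1 + flags]

print(data.frame(emails, labels))
```

Because only the comparison key is lower-cased, the original values in the column stay unchanged, which should keep the output format the same as before.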