I have a large dataset, containing a couple million rows and several columns. Two columns should be unique ID's. However there are multiple duplicates. This is a small sample of my dataset (the first 10 rows, currently without any duplicates in N or C):
inschrijf_last Pseudoniem_N Pseudoniem_C Verwijder
10 2020-10-01 UG2N006a768jvr18l2i1jahj8lba0gkp UG2C00780vgkrb6ducv16n4hm0jcvb7g 0
14 2020-10-01 UG2N005htkh5ptahdqidvptb85h9c3sc UG2C006mc1mjciccnvtdoke65lu75c71 0
19 <NA> UG2N0035cscm05jojivdt85av0sh34vo UG2C000r2dhjudrt679v15t4a5a9htnl 1
25 2020-10-01 UG2N005evrtnube8939bojvnmeevkkkc UG2C0035j36uuosv0l4250ctj1m4rm2u 0
26 <NA> UG2N007352gn5be17ppemn19dhf7h1fj UG2C 1
27 2020-10-01 UG2N0023reb9qr1t8lr8k11hnolu49b3 UG2C00361k53ni7u3dfge25neob23lrk 0
30 2016-01-01 UG2N002tv1i0s76h4ljehu2tj03g61a8 UG2C004jvu3f3ig9pbmi2tga0uc7db30 0
33 2020-10-01 UG2N0073i8mi4rmq8dqlrse6vk6t81sg UG2C001g3tla577jnh3s617thjh97631 0
34 2018-01-01 UG2N005ut027j93cbmhoa70nmnadjp6g UG2C001tubiclm75f9nad25pads9eata 0
35 <NA> UG2N0076de059qhm989ge9e7agjosmek UG2C005r9p6arkr3foi4b4ma5nbl6s2p 1
Columns Pseudoniem_N and Pseudoniem_C are "character". I would like identify and mark 3 optional duplicates:
1: If Pseudoniem_N = duplicate -> identify and mark
2: If pseudoniem_C = duplicate -> identify and mark
3: If pseudoniem N AND C both duplicate -> identify and mark
Preferably in three different variables/columns
Desired result I would like to add these extra variables marking the (character) duplicates N, C and N C. Below i made a small example.
inschrijf_last Pseudoniem_N Pseudoniem_C Duplicate N Duplicate C Duplicate N C
10 2020-10-01 aaaaaaaa aaaabbbb 1 1 1
14 2020-10-01 aaaaaaaa aaaabbbb 1 1 1
19 <NA> bbbbbbbb bbbbaaaa 2
25 2020-10-01 bbbbbbbb aaaacccc 2 2
26 <NA> cccccccc aaaacccc 2
etc.
Thank you very much in advance!
CodePudding user response:
Based on what you want I think this is what you're looking for :
my_df <- data.frame("ID" = seq.int(from = 1, to = 6, by = 1),
"Date" = rep("01/01/2021", 6),
"Pseudoniem_N" = c("aaa", "aaa", "bbb", "bbb", "ccc", "ddd"),
"Pseudoniem_C" = c("aab", "aab", "bba", "aac", "aac", "ddd"),
stringsAsFactors = FALSE)
my_df <- my_df %>% group_by(Pseudoniem_N) %>% mutate("Dup_N" = cur_group_id())
my_df <- my_df %>% group_by(Pseudoniem_C) %>% mutate("Dup_C" = cur_group_id())
my_df <- my_df %>% group_by(Pseudoniem_N, Pseudoniem_C) %>% mutate("Dup_NC" = cur_group_id())
my_var <- which(!(duplicated(my_df$Pseudoniem_N) | duplicated(my_df$Pseudoniem_N, fromLast = TRUE)))
if (length(my_var)) {
my_df[my_var, "Dup_N"] <- NA
}
my_var <- which(!(duplicated(my_df$Pseudoniem_C) | duplicated(my_df$Pseudoniem_C, fromLast = TRUE)))
if (length(my_var)) {
my_df[my_var, "Dup_C"] <- NA
}
my_var <- which(!(duplicated(my_df[, c("Pseudoniem_N", "Pseudoniem_C")]) | duplicated(my_df[, c("Pseudoniem_N", "Pseudoniem_C")], fromLast = TRUE)))
if (length(my_var)) {
my_df[my_var, "Dup_NC"] <- NA
}