I would like to create an id grouping two columns, and taking into account the duplicates. To make it more clear here below an example from my dataset:
data<-data.frame(event =c("1", "1", "2", "2", "3", "3", "3"), code = c("64 2 16 204", "64 204", "70 2 05 002", "70 002", "60 1 15 001","64 2 16 204", "60 1 15 076"))
In the data above, "64 2 16 204" appears twice in two different events - ie., event 1 and 3. Hence I would like all the codes in event 1 and 3 to have the same id.
The output wanted is then:
data<-data.frame(event=c("1", "1", "2", "2", "3", "3", "3"), code = c("64 2 16 204", "64 204", "70 2 05 002", "70 002", "60 1 15 001","64 2 16 204", "60 1 15 076"), group_id =c("1", "1", "2", "2", "1", "1", "1"))
Thank you for your help,
CodePudding user response:
Here is something raw:
tmp = aggregate(event ~ code, data = data, \(x) sort(unique(x)))
tmp = tmp[lengths(tmp$event) > 1, "event"]
data$group_id = data$event
for (i in seq_along(tmp)) {
for (ev in tmp[[i]][-1]) {
data[data$event == ev, "group_id"] = tmp[[i]][1]
}
}
# event code group_id
# 1 1 64 2 16 204 1
# 2 1 64 204 1
# 3 2 70 2 05 002 2
# 4 2 70 002 2
# 5 3 60 1 15 001 1
# 6 3 64 2 16 204 1
# 7 3 60 1 15 076 1