uniqid | client_id | hh_id | group_id |
---|---|---|---|
u1 | c1 | h1 | 1 |
u1 | c2 | h1 | 1 |
u1 | c3 | h2 | 1 |
u2 | c4 | h1 | 1 |
u2 | c5 | h2 | 1 |
u3 | c6 | h3 | 2 |
u3 | c7 | h3 | 2 |
u3 | c8 | h4 | 2 |
Let's say a Household can have more than 1 individual as a part of it, with each of them having one master unique-id in the system. But each individual can have more than 1 client-id generated in the system due to the process/workflow. Also there are rare cases where the same client is mapped to more than 1 household.
Expected outcome is to group all the related individuals together into one group, say g1, such that all the individuals who are part of a household (or overlapping with other households), are there in one place.
Dataset:
df <- data.frame(list(uniqid = c("u1", "u1", "u1", "u2", "u2", "u3", "u3", "u3"),
client_id = c("c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"),
hh_id = c("h1", "h1", "h2", "h1", "h2", "h3", "h3", "h4"),
group_id = c(1,1,1,1,1,2,2,2)))
Group_id is the expected output, one unique id per group of related individuals (or household).
I have tried this approach to group the individuals, solving part of the problem, but this would miss out other household-ids the individual is mapped to.
df %>% group_by(hh_id) %>%
arrange(hh_id, uniqid) %>%
mutate(hh_group = str_c(uniqid, collapse = ""))
CodePudding user response:
This is a graph relation. Use the following:
library(igraph)
df$groups <-components(graph_from_data_frame(df[c('uniqid', 'hh_id')]))$membership[df$uniqid]
df
uniqid client_id hh_id group_id groups
1 u1 c1 h1 1 1
2 u1 c2 h1 1 1
3 u1 c3 h2 1 1
4 u2 c4 h1 1 1
5 u2 c5 h2 1 1
6 u3 c6 h3 2 2
7 u3 c7 h3 2 2
8 u3 c8 h4 2 2