How to group rows using more than 1 key when there is no consistent 1:1 mapping between keys?-CodePudding

uniqid	client_id	hh_id	group_id
u1	c1	h1	1
u1	c2	h1	1
u1	c3	h2	1
u2	c4	h1	1
u2	c5	h2	1
u3	c6	h3	2
u3	c7	h3	2
u3	c8	h4	2

Let's say a Household can have more than 1 individual as a part of it, with each of them having one master unique-id in the system. But each individual can have more than 1 client-id generated in the system due to the process/workflow. Also there are rare cases where the same client is mapped to more than 1 household.

Expected outcome is to group all the related individuals together into one group, say g1, such that all the individuals who are part of a household (or overlapping with other households), are there in one place.

Dataset:

df <- data.frame(list(uniqid = c("u1", "u1", "u1", "u2", "u2", "u3", "u3", "u3"), 
                  client_id = c("c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"), 
                  hh_id = c("h1", "h1", "h2", "h1", "h2", "h3", "h3", "h4"), 
                  group_id = c(1,1,1,1,1,2,2,2)))

Group_id is the expected output, one unique id per group of related individuals (or household).

I have tried this approach to group the individuals, solving part of the problem, but this would miss out other household-ids the individual is mapped to.

df %>% group_by(hh_id) %>% 
  arrange(hh_id, uniqid) %>% 
  mutate(hh_group = str_c(uniqid, collapse = ""))

CodePudding user response：

This is a graph relation. Use the following:

library(igraph)
df$groups <-components(graph_from_data_frame(df[c('uniqid', 'hh_id')]))$membership[df$uniqid]
df
  uniqid client_id hh_id group_id groups
1     u1        c1    h1        1      1
2     u1        c2    h1        1      1
3     u1        c3    h2        1      1
4     u2        c4    h1        1      1
5     u2        c5    h2        1      1
6     u3        c6    h3        2      2
7     u3        c7    h3        2      2
8     u3        c8    h4        2      2