Unique ID for interconnected cases-CodePudding

I have the following data frame, that shows which cases are interconnected:

   DebtorId DupDebtorId
1:        1           2
2:        1           3
3:        1           4
4:        5           1
5:        5           2
6:        5           3
7:        6           7
8:        7           6

My goal is to assign a unique group ID to each group of cases. The desired output is:

   DebtorId group
1:        1     1
2:        2     1
3:        3     1
4:        4     1
5:        5     1
6:        6     2
7:        7     2

My train of thought:

library(data.table)

example <- data.table(
  DebtorId = c(1,1,1,5,5,5,6,7),
  DupDebtorId = c(2,3,4,1,2,3,7,6)
)

unique_pairs <- example[!duplicated(t(apply(example, 1, sort))),] #get unique pairs of DebtorID and DupDebtorID
unique_pairs[, group := .GRP, by=.(DebtorId)] #assign a group ID for each DebtorId
unique_pairs[, num := rowid(group)]
groups <- dcast(unique_pairs, group   DebtorId ~ num, value.var = 'DupDebtorId') #format data to wide for each group ID

#create new data table with unique cases to assign group ID
newdt <- data.table(DebtorId = sort(unique(c(example$DebtorId, example$DupDebtorId))), group = NA)
newdt$group <- as.numeric(newdt$group)

#loop through the mapped groups, selecting the first instance of group ID for the case
for (i in 1:nrow(newdt)) {
  a <- newdt[i]$DebtorId
  b <- min(which(groups[,-1] == a, arr.ind=TRUE)[,1])
  newdt[i]$group <- b
}

Output:

   DebtorId group
1:        1     1
2:        2     1
3:        3     1
4:        4     1
5:        5     2
6:        6     3
7:        7     3

There are 2 problems in my approach:

From the output, you can see that it fails to recognize that case 5 belongs to group 1;
The final loop is agonizingly slow, which would render it useless for my use case of 1M rows in my original data, and going the traditional := way does not work with which()

I'm not sure whether my approach could be optimized, or there is a better way of doing this altogether.

CodePudding user response：

This functionality already exists in igraph, so if you don't need to do it yourself, we can build a graph from your data frame and then extract cluster membership. stack() is just an easy way to convert a named vector to data frame.

library(igraph)

g <- graph.data.frame(df)
df_membership <- clusters(g)$membership
stack(df_membership)
#>   values ind
#> 1      1   1
#> 2      1   5
#> 3      2   6
#> 4      2   7
#> 5      1   2
#> 6      1   3
#> 7      1   4

Above, values corresponds to group and ind to DebtorId.