Home > database >  How to quickly sort and reassign parent/child style hierarchy in R?
How to quickly sort and reassign parent/child style hierarchy in R?

Time:06-29

I have a dataset consisting of two variables, and over 100000 entries. A small subset is:

master_id id
1 1
2 2
2 3
2 4
4 5
4 6
6 7
8 8
8 9
9 10

As code:

df <- c(master_id = c(1, 2, 2, 2, 4, 4, 6, 8, 8, 9),
        id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))

Both columns refer to the same values, and the data frame is intended to indicate a list of unique instances (ids) and which are duplicates of each other.

In df above, instances (id) 2, 3, 4, are all identical, and therefore the lowest id (2) is assigned to all (master_id), which will become the unique instance that is taken forward.

However, further down the list, we can see that instances 4, 5, 6 are also the same, and as 4 has already been assigned to master_id = 2, 5 and 6 should also be assigned in the same way. Following in the same way, instances 6 and 7 are the same, and therefore by moving through the hierarchy to the highest level, 7 should be assigned master_id = 2 as well.

The desired final output would be:

master_id id
1 1
2 2
2 3
2 4
2 5
2 6
2 7
8 8
8 9
8 10

How would I go about achieving this hierarchy manipulation/sorting in the most efficient way? As the dataset contains over 100000 rows, I want to avoid loops where possible.

Thanks.

CodePudding user response:

library(igraph)
g <- components(graph_from_data_frame(df, FALSE))$membership
h <- stack(g[as.character(df$master_id)])
transform(df, master_id=ave(h$ind, h$values, FUN = \(x)x[1]))

   master_id id
1          1  1
2          2  2
3          2  3
4          2  4
5          2  5
6          2  6
7          2  7
8          8  8
9          8  9
10         8 10

CodePudding user response:

In base R:

n = 10L
# Assume ids are 1:n
find_master <- function(x) {
  curm = df$master_id[x]
  while (df$master_id[curm] != curm) curm = df$master_id[curm]
  df$master_id[x] <<- curm
}
for (i in seq_len(n)) find_master(i)
df

#    master_id id
# 1          1  1
# 2          2  2
# 3          2  3
# 4          2  4
# 5          2  5
# 6          2  6
# 7          2  7
# 8          8  8
# 9          8  9
# 10         8 10

Data:

df <- data.frame(master_id = c(1, 2, 2, 2, 4, 4, 6, 8, 8, 9),
        id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
  • Related