I have a dataset consisting of two variables, and over 100000 entries. A small subset is:
master_id | id |
---|---|
1 | 1 |
2 | 2 |
2 | 3 |
2 | 4 |
4 | 5 |
4 | 6 |
6 | 7 |
8 | 8 |
8 | 9 |
9 | 10 |
As code:
df <- c(master_id = c(1, 2, 2, 2, 4, 4, 6, 8, 8, 9),
id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
Both columns refer to the same values, and the data frame is intended to indicate a list of unique instances (id
s) and which are duplicates of each other.
In df
above, instances (id
) 2, 3, 4, are all identical, and therefore the lowest id
(2) is assigned to all (master_id
), which will become the unique instance that is taken forward.
However, further down the list, we can see that instances 4, 5, 6 are also the same, and as 4 has already been assigned to master_id = 2
, 5 and 6 should also be assigned in the same way. Following in the same way, instances 6 and 7 are the same, and therefore by moving through the hierarchy to the highest level, 7 should be assigned master_id = 2
as well.
The desired final output would be:
master_id | id |
---|---|
1 | 1 |
2 | 2 |
2 | 3 |
2 | 4 |
2 | 5 |
2 | 6 |
2 | 7 |
8 | 8 |
8 | 9 |
8 | 10 |
How would I go about achieving this hierarchy manipulation/sorting in the most efficient way? As the dataset contains over 100000 rows, I want to avoid loops where possible.
Thanks.
CodePudding user response:
library(igraph)
g <- components(graph_from_data_frame(df, FALSE))$membership
h <- stack(g[as.character(df$master_id)])
transform(df, master_id=ave(h$ind, h$values, FUN = \(x)x[1]))
master_id id
1 1 1
2 2 2
3 2 3
4 2 4
5 2 5
6 2 6
7 2 7
8 8 8
9 8 9
10 8 10
CodePudding user response:
In base R:
n = 10L
# Assume ids are 1:n
find_master <- function(x) {
curm = df$master_id[x]
while (df$master_id[curm] != curm) curm = df$master_id[curm]
df$master_id[x] <<- curm
}
for (i in seq_len(n)) find_master(i)
df
# master_id id
# 1 1 1
# 2 2 2
# 3 2 3
# 4 2 4
# 5 2 5
# 6 2 6
# 7 2 7
# 8 8 8
# 9 8 9
# 10 8 10
Data:
df <- data.frame(master_id = c(1, 2, 2, 2, 4, 4, 6, 8, 8, 9),
id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))