I have a data frame with 100 columns but this operation require only two column. I am keeping only tow columns like this:
org <- data.frame(
ID = c( "ID1", "ID2", "ID3", "ID4", "ID5", "ID1", "ID2",
"ID1", "ID2", "ID1", "ID2", "ID1", "ID2", "ID3",
"ID1", "ID5", "ID1", "ID2", "ID1", "ID2", "ID3"),
Key = c(1, 2, 3, 5, 7, 1, 2, 1, 8, 3, 9, 4,
11, 15, 4, 17, 11, 15, 17, 4, 18)
)
Notice that "ID1" is assigned to keys 1, 3, 4, 11, and 17. So this keys are in a group1
Here "3" is the key for ID3 also.
And ID3 has key "15" and "18". Thus "15" and "18" will also be in the same group as 1, 3, 4, 11, 17, 15, 18. The association of "ID"s to key (vice versa : key to ID ) is not unique.
I want to find the all the groups and their keys.
I found a solution on my previous question as :
t <- table(org$ID_id,org$key)
new_group <- list()
for (i in rownames(t)) {
row_values <- names(t[i,][t[i,]>0])
if(length(new_group)==0){
new_group[[i]] <- row_values # add first key values to group 1
} else{
create_new_group <- TRUE
for (list_item in seq_len(length(new_group))) {
if(max(row_values %in% new_group[[list_item]]) == 1){# If key values (some or all) exist in current group
new_group[[list_item]] <- unique(c(new_group[[list_item]], row_values))
create_new_group <- FALSE
}
}
if(create_new_group){
new_group[[length(new_group) 1]] <- row_values
}
}
}
This works great. However, my data frame has 8 millions of observation and this is not efficient. I would appreciate any help on making this code efficient.
CodePudding user response:
You can try the code below
g <- simplify(graph_from_data_frame(org))
lapply(
V(g)[names(V(g)) %in% org$ID],
function(k) neighbors(g, k)
)
which gives
$ID1
5/17 vertices, named, from 3a33432:
[1] 1 3 4 11 17
$ID2
6/17 vertices, named, from 3a33432:
[1] 2 8 9 4 11 15
$ID3
3/17 vertices, named, from 3a33432:
[1] 3 15 18
$ID4
1/17 vertex, named, from 3a33432:
[1] 5
$ID5
2/17 vertices, named, from 3a33432:
[1] 7 17
CodePudding user response:
library(igraph)
g = simplify(graph.data.frame(org, directed = FALSE))
grp = decompose(g)
lapply(grp, function(x){
m = get.edgelist(x)
list(id = unique(m[,1]),
val = unique(m[,2]))
})
#[[1]]
#[[1]]$id
#[1] "ID1" "ID2" "ID3" "ID5"
#[[1]]$val
#[1] "1" "3" "4" "11" "17" "2" "8" "9" "15" "18" "7"
#[[2]]
#[[2]]$id
#[1] "ID4"
#[[2]]$val
#[1] "5"