I have an igraph
network graph with 103,887 nodes and 4,795,466 ties.
This can be structured as an edgelist in a data.table
with almost 9 million rows.
I can find the common neighbors in this network, following @chinsoon12's answer here. See the example below.
This works beautifully for smaller networks, but runs into problems in my use-case because the merge results in more than 2^31 rows.
Questions:
- Are there efficient alternatives on how to deal with this?
- Can I split the data and do the computation in steps? The results will be used to query about common neighbors.
Example - modified from @chinsoon12's answer:
library(data.table)
library(igraph)
set.seed(1234)
g <- random.graph.game(10, p=0.10)
adjSM <- as(get.adjacency(g), "dgTMatrix")
adjDT <- data.table(V1=adjSM@i 1, V2=adjSM@j 1)
res <- adjDT[adjDT, nomatch=0, on="V2", allow.cartesian=TRUE
][V1 < i.V1, .(Neighbours=paste(V2, collapse=",")),
by=c("V1","i.V1")][order(V1)]
res
V1 i.V1 Neighbours
1: 4 5 8
2: 4 10 8
3: 5 10 8
CodePudding user response:
common neighbors
Can I split the data and do the computation in steps?
You can split by V1 to avoid running into the big-merge issue:
neighDT = adjDT[, if (.N > 1) {
cb = combn(V2, 2)
.(a = cb[1, ], b = cb[2, ])
}, by=.(neighbor = V1)]
which gives
neighbor a b
1: 8 4 5
2: 8 4 10
3: 8 5 10
(The OP found gRbase::combnPrim
to be faster than combn
here.)
How can we collapse all the common neighbors (separated with a comma) for the same combination into one observation?
neighDT_agg = neighDT[order(neighbor),
.(neighbors = toString(neighbor))
, keyby=.(a,b)]
The order
ensures that the string is sorted alphabetically. The keyby
ensures that the table is sorted by pairs {a,b} and facilitates a simple fast lookup for multiple pairs at once:
# single query
neighDT_agg[.(4,10), neighbors]
# [1] "8"
# multi query
pairs_queryDT = data.table(a = c(4,5,8), b = c(5,10,10))
neighDT_agg[pairs_queryDT, neighbors]
[1] "8" "8" NA
I have an igraph network graph with 103,887 nodes and 4,795,466 ties.
Each call to combn
will be making a 2-by-choose(.N, 2)
matrix. If a node is connected to all other nodes, then it is a common neighbor to all pairs of other nodes and you'll be facing choose(103887-1, 2)
of these pairs. I guess this is more an issue with the way the problem is defined than with the approach to solving it.
The results will be used to query about common neighbors.
For the approach above, you'll need to compute the full neighbors table first.
If you just have a few ad hoc queries about intersecting neighbors:
find_neighbors <- function(a, b){
adjDT[.(c(a, b)), on=.(V1), V2[duplicated(V2)]]
}
find_neighbors(4, 10)
# [1] 8
This can similarly be wrapped in toString
to collapse the values.
CodePudding user response:
I think you can use ego
over g
directly to generate res
, e.g.,
do.call(
rbind,
lapply(
Filter(function(x) length(x) > 2, ego(g, 1)),
function(x) {
setNames(
data.frame(cbind(t(combn(x[-1], 2)), x[1])),
c("V1", "V2", "Neighbours")
)
}
)
)
which gives
V1 V2 Neighbours
1 4 5 8
2 4 10 8
3 5 10 8