R - joining more than 2^31 rows with data.table-CodePudding

I have an igraph network graph with 103,887 nodes and 4,795,466 ties.

This can be structured as an edgelist in a data.table with almost 9 million rows.

I can find the common neighbors in this network, following @chinsoon12's answer here. See the example below.

This works beautifully for smaller networks, but runs into problems in my use-case because the merge results in more than 2^31 rows.

Questions:

Are there efficient alternatives on how to deal with this?
Can I split the data and do the computation in steps? The results will be used to query about common neighbors.

Example - modified from @chinsoon12's answer:

library(data.table)
library(igraph)

set.seed(1234)
g <- random.graph.game(10, p=0.10)

adjSM <- as(get.adjacency(g), "dgTMatrix")
adjDT <- data.table(V1=adjSM@i 1, V2=adjSM@j 1)

res <- adjDT[adjDT, nomatch=0, on="V2", allow.cartesian=TRUE
][V1 < i.V1, .(Neighbours=paste(V2, collapse=",")),
  by=c("V1","i.V1")][order(V1)]

res

   V1 i.V1 Neighbours
1:  4    5          8
2:  4   10          8
3:  5   10          8

CodePudding user response：

common neighbors

Can I split the data and do the computation in steps?

You can split by V1 to avoid running into the big-merge issue:

neighDT = adjDT[, if (.N > 1) {
    cb = combn(V2, 2)
    .(a = cb[1, ], b = cb[2, ])
}, by=.(neighbor = V1)]

which gives

   neighbor a  b
1:        8 4  5
2:        8 4 10
3:        8 5 10

(The OP found gRbase::combnPrim to be faster than combn here.)

How can we collapse all the common neighbors (separated with a comma) for the same combination into one observation?

neighDT_agg = neighDT[order(neighbor), 
  .(neighbors = toString(neighbor))
, keyby=.(a,b)]

The order ensures that the string is sorted alphabetically. The keyby ensures that the table is sorted by pairs {a,b} and facilitates a simple fast lookup for multiple pairs at once:

# single query
neighDT_agg[.(4,10), neighbors]
# [1] "8"

# multi query
pairs_queryDT = data.table(a = c(4,5,8), b = c(5,10,10))
neighDT_agg[pairs_queryDT, neighbors]
[1] "8" "8" NA

I have an igraph network graph with 103,887 nodes and 4,795,466 ties.

Each call to combn will be making a 2-by-choose(.N, 2) matrix. If a node is connected to all other nodes, then it is a common neighbor to all pairs of other nodes and you'll be facing choose(103887-1, 2) of these pairs. I guess this is more an issue with the way the problem is defined than with the approach to solving it.

The results will be used to query about common neighbors.

For the approach above, you'll need to compute the full neighbors table first.

If you just have a few ad hoc queries about intersecting neighbors:

find_neighbors <- function(a, b){
    adjDT[.(c(a, b)), on=.(V1), V2[duplicated(V2)]]
}

find_neighbors(4, 10)
# [1] 8

This can similarly be wrapped in toString to collapse the values.

CodePudding user response：

I think you can use ego over g directly to generate res, e.g.,

do.call(
  rbind,
  lapply(
    Filter(function(x) length(x) > 2, ego(g, 1)),
    function(x) {
      setNames(
        data.frame(cbind(t(combn(x[-1], 2)), x[1])),
        c("V1", "V2", "Neighbours")
      )
    }
  )
)

which gives

  V1 V2 Neighbours
1  4  5          8
2  4 10          8
3  5 10          8