How to calculate the number of common neighbours for all edges in a graph?-CodePudding

I have a network defined by a list of edges. The network is large and sparse. For each pair of connected vertices, I would like to calculate the number of common neighbours. This post discusses how to do this for a single pair of vertices, but it strikes me as inefficient to loop over all edges to calculate this statistic for each edge in the graph. Instead, the statistic I'm after can be calculated from the product of the adjacency matrix with itself, as follows:

library(igraph)
library(data.table)
set.seed(1111)

E <- data.table(i = sample(as.character(1:5e4), 1e5, replace = T),
                j = sample(as.character(1:5e4), 1e5, replace = T))
G <- simplify(graph_from_data_frame(E, directed = F))  # remove loops and multiples
N <- as_adjacency_matrix(G) %*% as_adjacency_matrix(G)

However, I don't know how to efficiently get the information out of the resulting matrix N, without looping over all the cells, which would look like this:

extract_entries <- function(x, M) {
 
  nl <- M@p[x]   1  # index from 1, not 0
  nu <- M@p[x 1]
  j.col <- M@Dimnames[[1]][M@i[nl:nu]   1]
  i.col <- M@Dimnames[[2]][x]
  nb.col <- M@x[nl:nu]
  
  data.table(i = i.col, j = j.col, nb = nb.col)
  
}
  
system.time(E.nb <- rbindlist(lapply(1:N@Dim[1], extract_entries, N), fill = T))

   # user  system elapsed 
  #  8.29    0.02    8.31 

E <- E.nb[E, on = c('i', 'j')][is.na(nb), nb := 0]

Even in the reproducible example above, looping is slow, and the true graph might have millions of vertices and tens of millions of edges. My end goal is to add a column to the data frame E with the number of common neighbours for each edge, as illustrated in the MWE.

My question is: is there a (much) more efficient way of extracting the number of common neighbours for each pair of vertices and merging this information back into the list of edges?

I have seen that the package diagramme_R includes a function that calculates the number of common neighbours, however it again appears intended to be used for a limited number of edges, and wouldn't solve the problem of adding the information on the number of common neighbours back to the original data frame.

CodePudding user response：

You're pretty much there. Just a couple things: converting N to a triangular matrix lets us build E.nb without lapply. Also, the i-j ordering is messing up the final E->E.nb join. Temporarily sorting each row fixes this.

library(igraph)
library(data.table)
library(Matrix)
set.seed(1111)

E <- data.table(i = sample(as.character(1:5e4), 1e5, replace = TRUE),
                j = sample(as.character(1:5e4), 1e5, replace = TRUE))

f <- function(E) {
  # temporarily sort each row of E
  blnSort <- E[[1]] > E[[2]]
  E[blnSort, 2:1 := .SD, .SDcols = 1:2]
  G <- simplify(graph_from_data_frame(E, directed = FALSE))  # remove loops and multiples
  N <- as(tril(as_adjacency_matrix(G) %*% as_adjacency_matrix(G), -1), "dtTMatrix")
  E.nb <- data.table(i = N@Dimnames[[1]][N@i   1], j = N@Dimnames[[2]][N@j   1], nb = N@x)[
    i > j, 2:1 := .SD, .SDcols = 1:2 # sort each row
  ]
  E <- E.nb[E, on = c('i', 'j')][is.na(nb), nb := 0]
  E[blnSort, 2:1 := .SD, .SDcols = 1:2] # undo the row sorting
}

system.time(E <- f(E))
#>    user  system elapsed 
#>    0.38    0.13    0.41

Each triangle should add 1 to three separate rows of E$nb. Check that the sum of the nb column is three times the number of triangles:

sum(E$nb) # 13 triangles
#> [1] 39
length(triangles(simplify(graph_from_data_frame(E, directed = FALSE))))
#> [1] 39