I'm trying to build a network graph from a DataFrame of 1300 pharmaceutical molecules, each with a target molecule (a string) and 31 molecular descriptors (int/float). I set the nodes to be the index of the DataFrame. The initial code is given below:
import networkx as nx
import pandas as pd
import numpy as np
df_data = pd.read_csv("QSAR_2.csv")
df_targets = df_data["Target"]
df_descriptors = df_data.iloc[:,2:-1]
G = nx.Graph()
G.add_nodes_from(df_descriptors.index.values.tolist())
Now I need to add edges between nodes whose correlation is above a threshold c. Getting the correlation matrix is easy:
df_corr = df_descriptors.T.corr()
But now I need to check the condition correlation > c for every cell and, where it holds, take the (x, y) position of that cell and add it as an edge with G.add_edge(x, y).
I could make this work with a nested loop, but I guess there is a far simpler and faster way of implementing this. Does anyone know the solution?
CodePudding user response:
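From a correlation matrix cc_matrix, you can extract the indices of the entries with a correlation above your threshold c by using edge_list = np.argwhere(cc_matrix > c). You can then add those edges to your graph with G.add_edges_from(edge_list). Note that the diagonal of a correlation matrix is always 1 and the matrix is symmetric, so a plain comparison also returns every node paired with itself (self-loops) and each pair twice; restricting the mask to the upper triangle with np.triu(..., k=1) avoids both.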
See full example below:
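import networkx as nx
import numpy as np

# Create a random correlation matrix (10 variables, 10 observations each)
a = np.random.choice(10, size=(10, 10))
cc_matrix = np.corrcoef(a)

# Create the graph; add all nodes first so isolated ones are kept
G = nx.Graph()
G.add_nodes_from(range(len(cc_matrix)))
c = 0.5  # correlation threshold

# Keep only entries strictly above the diagonal (k=1):
# this skips self-loops and avoids listing each pair twice
edge_list = np.argwhere(np.triu(cc_matrix > c, k=1))
G.add_edges_from(edge_list.tolist())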
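np.argwhere returns integer positions, not index labels. That is fine when the DataFrame has a default 0..n-1 index, as in the question, but if the index holds other labels the positions need to be mapped back. Below is a minimal sketch of that mapping, not a definitive implementation: the small random df_descriptors and the mol_0, mol_1, ... labels are made up for illustration and stand in for the real descriptor table.

```python
import networkx as nx
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_descriptors: 6 molecules x 4 descriptors
rng = np.random.default_rng(0)
df_descriptors = pd.DataFrame(
    rng.normal(size=(6, 4)),
    index=[f"mol_{i}" for i in range(6)],
)

c = 0.5  # correlation threshold

# Molecule-to-molecule correlation, as in the question
df_corr = df_descriptors.T.corr()
corr = df_corr.to_numpy()
labels = df_corr.index

# Positions of all pairs strictly above the diagonal
i, j = np.triu_indices(len(labels), k=1)

# Keep only the pairs whose correlation exceeds the threshold
mask = corr[i, j] > c

# Map integer positions back to index labels
edges = list(zip(labels[i[mask]], labels[j[mask]]))

G = nx.Graph()
G.add_nodes_from(labels)  # keep molecules without any edge as isolated nodes
G.add_edges_from(edges)
```

Selecting the upper-triangle pairs first and thresholding afterwards keeps the approach valid for any value of c, including negative thresholds.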