How to check on a condition for each cell in a dataframe

Time:03-09

I'm trying to build a network graph from a DataFrame of 1300 pharmaceutical molecules. Each row has a target molecule stored as a string and 31 molecular descriptors of int/float type. I set the nodes to be the same as the index of the DataFrame. A sample of the DataFrame and my initial code are given below:

[Image: sample of the DataFrame]

import networkx as nx
import pandas as pd
import numpy as np

df_data = pd.read_csv("QSAR_2.csv")
df_targets = df_data["Target"]
df_descriptors = df_data.iloc[:,2:-1]

G = nx.Graph()
G.add_nodes_from(df_descriptors.index.values.tolist())

Now I need to add edges between nodes whose correlation is above a threshold c. Getting the correlation matrix is easy with:

df_corr = df_descriptors.T.corr()
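To illustrate what this line computes, here is a minimal sketch on made-up data (the toy frame, its labels, and its values are hypothetical and not taken from QSAR_2.csv). DataFrame.corr() correlates columns with each other, so transposing first turns each molecule into a column and gives a molecule-by-molecule matrix:

```python
import pandas as pd

# Hypothetical toy data: 3 molecules (rows) x 4 descriptors (columns).
toy = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0],
     [2.0, 4.0, 6.0, 8.0],
     [4.0, 3.0, 2.0, 1.0]],
    index=["mol_a", "mol_b", "mol_c"],
    columns=["d1", "d2", "d3", "d4"],
)

# corr() works column-wise, so transpose to compare molecules (rows).
toy_corr = toy.T.corr()

# The result is square and labelled by the original row index.
print(toy_corr.shape)
print(toy_corr.index.tolist())
```

In this toy case mol_a and mol_b are perfectly correlated and mol_a and mol_c are perfectly anti-correlated, so the rows and columns of the result line up with the node labels used in the graph.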

But now I need to check the condition correlation > c for every cell and, where it holds, take the cell's (x, y) position and add it as an edge with G.add_edge(x, y).

I could make this work with a nested loop, but I guess there is a far simpler and faster way of implementing this. Does anyone know the solution?

CodePudding user response:

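Given a correlation matrix cc_matrix, you can extract the indices of all cells whose correlation exceeds your threshold c with edge_list = np.argwhere(cc_matrix > c), and then add them to your graph with G.add_edges_from(edge_list). Note that np.argwhere returns positional (row, column) indices, and that the diagonal of a correlation matrix is 1, so for c < 1 the result also contains (i, i) pairs, which become self-loops in the graph.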

See full example below:

import networkx as nx
import numpy as np

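# Create a random correlation matrix
a = np.random.choice(10, size=(10, 10))
cc_matrix = np.corrcoef(a)

# Create the graph
G = nx.Graph()
c = 0.5  # correlation threshold

# (row, column) positions of every cell above the threshold
edge_list = np.argwhere(cc_matrix > c)
G.add_edges_from(edge_list)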
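Applied to a pandas correlation matrix like the one in the question, the same idea could look roughly like the sketch below. It is not the asker's actual data: df_descriptors here is a random stand-in, and the labels mol_0 ... mol_5 and the threshold value are assumptions for illustration. The sketch also restricts the search to the strict upper triangle with np.triu, which skips the diagonal and the mirrored (j, i) duplicates, and maps the positional indices back to the DataFrame's index labels:

```python
import networkx as nx
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_descriptors: 6 molecules x 8 descriptors.
rng = np.random.default_rng(0)
df_descriptors = pd.DataFrame(
    rng.normal(size=(6, 8)),
    index=[f"mol_{i}" for i in range(6)],
)

c = 0.5  # assumed threshold
df_corr = df_descriptors.T.corr()

# Strict upper triangle only: ignores self-correlations on the diagonal
# and the mirrored (j, i) copy of every (i, j) pair.
mask = np.triu(df_corr.to_numpy() > c, k=1)
rows, cols = np.nonzero(mask)

# Translate positional indices back to the DataFrame's index labels.
labels = df_corr.index
edges = list(zip(labels[rows], labels[cols]))

G = nx.Graph()
G.add_nodes_from(labels)
G.add_edges_from(edges)
```

If the DataFrame uses the default integer index, the label mapping is a no-op and the positions from np.nonzero can be used as node names directly.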