I have a data frame with two columns. The two columns contain integer numbers. The second column contains numbers that are linked to the first column. In case there is no link between the two columns, the number in the second column will have zero value. Here is an example of the table.
The expected output is a list of connections between the two columns. Using the attached table as an example, the output will be [[2, 3, 4, 5], [6, 7, 8]]
This question is similar but not the same as finding transitive relation between two columns in pandas.
CodePudding user response:
You could approach this as a graph, treating the dataframe as an edge list. You can then retrieve the connected nodes with networkx
:
import pandas as pd
import networkx as nx
df = pd.DataFrame({'a': range(1, 11), 'b': [0, 4, 2, 5, 0, 7, 8, 0, 0, 0]})
g = nx.from_pandas_edgelist(df[df['b'] != 0], source='a', target='b')
print(list(nx.connected_components(g)))
Output:
[{2, 3, 4, 5}, {8, 6, 7}]
CodePudding user response:
Not really a Pandas answer, but here's one approach (with help from here for finding runs of consecutive integers):
df = pd.DataFrame({'a': range(1, 11),
'b': [0, 4, 2, 5, 0, 7, 8, 0, 0, 0]})
from itertools import groupby
from operator import itemgetter
zero_locs = df['b'].to_numpy().nonzero()[0]
connections = []
for k,g in groupby(enumerate(zero_locs), lambda x: x[0]-x[1]):
group = (map(itemgetter(1),g))
group = list(map(int,group))
group.append(group[-1] 1)
connections.append(list(df['a'][group]))
connections # [[2, 3, 4, 5], [6, 7, 8]]