I am trying to go through a hierarchy dataframe and record every possible routes into another dataframe. These routes can have variable depth.
Original dataframe (df). The highest column means that the value in the parent column is not a child of any:
parent | child | highest |
---|---|---|
a | b | 1 |
b | c | 0 |
b | d | 0 |
d | e | 0 |
End goal dataframe:
level 3 | level 2 | level 1 | level 0 |
---|---|---|---|
a | b | c | |
a | b | d | e |
This what I currently have
def search(parent):
for i in range(df.shape[0]):
if(df.iloc[i,0] == parent):
search(df.iloc[i,1])
for i in range(df.shape[0]):
if(df.iloc[i,2] == 1):
search(df.iloc[i,0])
I am able to go through the hierarchy but I do not know how to save it in the format I want.
CodePudding user response:
You can use networkx
to solve the problem. Note if you use networkx
you don't need the highest
columns. The main function to find all paths is all_simple_paths
# Python env: pip install networkx
# Anaconda env: conda install networkx
# Create network from your dataframe
G = nx.from_pandas_edgelist(df, source='parent', target='child',
create_using=nx.DiGraph)
# Find roots of your graph (a root is a node with no input)
roots = [node for node, degree in G.in_degree() if degree == 0]
# Find leaves of your graph (a leaf is a node with no output)
leaves = [node for node, degree in G.out_degree() if degree == 0]
# Find all paths
paths = []
for root in roots:
for leaf in leaves:
for path in nx.all_simple_paths(G, root, leaf):
paths.append(path)
# Create a new dataframe
out = pd.DataFrame(paths).fillna('')
out.columns = reversed(out.add_prefix('level ').columns)
Output:
>>> out
level 3 level 2 level 1 level 0
0 a b c
1 a b d e
CodePudding user response:
This can be done using the networkx library. You do not need the highest
column. It can be deduced from the graph.
import pandas as pd
import networkx as nx
df = pd.DataFrame([ \
('a', 'b', 1), \
('b', 'c', 0), \
('b', 'd', 0), \
('d', 'e', 0)], columns=['parent', 'child', 'highest'])
parents = df.parent.values
childs = df.child
G = nx.DiGraph()
for i in range(len(parents)):
parent = parents[i]
child = childs[i]
if not parent in G.nodes:
G.add_node(parent)
if not child in G.nodes:
G.add_node(child)
G.add_edge(parent, child)
roots = [node for node in G.nodes if G.in_degree(node) == 0]
root = roots[0]
leaves = [node for node in G.nodes if G.out_degree(node) == 0]
all_paths = [list(nx.simple_paths.all_simple_paths(G, root, leaf)) for leaf in leaves]
# Flatten
all_paths = [tuple(path) for paths in all_paths for path in paths]
paths_df = pd.DataFrame(all_paths)
I will leave for you to rename the columns as per your liking.