I have a dataframe with the columns parent_id, parent_name, id, name and last_category. It looks like this:
parent_id  parent_name     id  name  last_category
      NaN          NaN      1     b              0
        1            b     11    b1              0
       11           b1    111    b2              0
      111           b2   1111    b3              0
     1111           b3  11111    b4              1
      NaN          NaN      2     a              0
        2            a     22    a1              0
       22           a1    222    a2              0
      222           a2   2222    a3              1
For every row where last_category is 1, I want to build the hierarchical path from the root category down to that row. The new dataframe (df_last) should look like this:
name_path              id_path
b / b1 / b2 / b3 / b4  1 / 11 / 111 / 1111 / 11111
a / a1 / a2 / a3       2 / 22 / 222 / 2222
How can I do this?
CodePudding user response:
A solution using only numpy and pandas:
import numpy as np
import pandas as pd

# It's easier if we index the dataframe by `id`
# (this assumes the IDs are unique)
df = df.set_index("id")
# `parents[i]` returns the parent ID of `i`
parents = df["parent_id"].to_dict()
paths = {}
# Find all nodes with last_category == 1
for child_id in df.query("last_category == 1").index:
    current = child_id
    path = [child_id]
    # Iteratively travel up the hierarchy until the parent is NaN
    while True:
        pid = parents[current]
        if np.isnan(pid):
            break
        path.append(pid)
        current = pid
    # The path to the child node is the reverse of
    # the path we traveled
    paths[int(child_id)] = np.array(path[::-1], dtype="int")
And constructing the result data frame:
result = pd.DataFrame({
    id_: (
        " / ".join(df.loc[pids, "name"]),
        " / ".join(pids.astype("str"))
    )
    for id_, pids in paths.items()
}, index=["name_path", "id_path"]).T
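To try this end to end, here is a minimal self-contained sketch that rebuilds the sample data from the question and wraps the same upward walk in a function. The helper name `build_paths` is made up for illustration, and the sketch assumes that IDs are unique, that every non-NaN parent_id appears in the id column, and that the hierarchy contains no cycles (otherwise the loop would never end).

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; NaN marks a root category.
df = pd.DataFrame({
    "parent_id": [np.nan, 1, 11, 111, 1111, np.nan, 2, 22, 222],
    "parent_name": [None, "b", "b1", "b2", "b3", None, "a", "a1", "a2"],
    "id": [1, 11, 111, 1111, 11111, 2, 22, 222, 2222],
    "name": ["b", "b1", "b2", "b3", "b4", "a", "a1", "a2", "a3"],
    "last_category": [0, 0, 0, 0, 1, 0, 0, 0, 1],
})


def build_paths(frame):  # hypothetical helper name, not from the answer above
    """Return one row per last_category == 1 node with its root-to-node paths."""
    indexed = frame.set_index("id")
    parents = indexed["parent_id"].to_dict()
    names = indexed["name"].to_dict()
    rows = []
    for leaf in indexed.index[indexed["last_category"] == 1]:
        path = []
        node = leaf
        # Climb until the parent is missing (NaN), collecting IDs on the way.
        while not pd.isna(node):
            path.append(int(node))
            node = parents[int(node)]
        path.reverse()  # we walked leaf -> root, the result should be root -> leaf
        rows.append({
            "name_path": " / ".join(names[i] for i in path),
            "id_path": " / ".join(str(i) for i in path),
        })
    return pd.DataFrame(rows, columns=["name_path", "id_path"])


df_last = build_paths(df)
print(df_last.to_string(index=False))
```

Looking names up in a plain dict instead of calling `df.loc` per path is only a convenience here; both give the same strings.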
CodePudding user response:
You can use networkx to find the path between each root node and each leaf node with the all_simple_paths function.
# Python env: pip install networkx
# Anaconda env: conda install networkx
import networkx as nx
import pandas as pd

# Create network from your dataframe
G = nx.from_pandas_edgelist(df, source='parent_id', target='id',
                            create_using=nx.DiGraph)
nx.set_node_attributes(G, df.set_index('id')[['name']].to_dict('index'))
# Find roots of your graph (a root is a node with no incoming edge)
roots = [node for node, degree in G.in_degree() if degree == 0]
# Find leaves of your graph (a leaf is a node with no outgoing edge)
leaves = [node for node, degree in G.out_degree() if degree == 0]
# Find all paths
paths = []
for root in roots:
    for leaf in leaves:
        for path in nx.all_simple_paths(G, root, leaf):
            # [1:] to remove the NaN parent_id at the start of each path
            paths.append({'id_path': ' / '.join(str(n) for n in path[1:]),
                          'name_path': ' / '.join(G.nodes[n]['name'] for n in path[1:])})
out = pd.DataFrame(paths)
Output:
>>> out
                       id_path              name_path
0  1 / 11 / 111 / 1111 / 11111  b / b1 / b2 / b3 / b4
1          2 / 22 / 222 / 2222       a / a1 / a2 / a3
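Note that the code above picks its end points by graph shape (nodes with no children), which happens to match last_category == 1 in the sample data. If the two can differ in your data, a possible variant is to take the targets from the last_category column and walk up through `G.predecessors`. This is only a sketch: it assumes each node has at most one parent and that there are no cycles, and it drops the NaN rows before building the graph so that no NaN node is created.

```python
import networkx as nx
import pandas as pd

# Sample data copied from the question; NaN marks a root category.
nan = float("nan")
df = pd.DataFrame({
    "parent_id": [nan, 1, 11, 111, 1111, nan, 2, 22, 222],
    "id": [1, 11, 111, 1111, 11111, 2, 22, 222, 2222],
    "name": ["b", "b1", "b2", "b3", "b4", "a", "a1", "a2", "a3"],
    "last_category": [0, 0, 0, 0, 1, 0, 0, 0, 1],
})

# Keep only real parent -> child edges, so the graph contains no NaN node.
edges = df.dropna(subset=["parent_id"]).astype({"parent_id": int})
G = nx.from_pandas_edgelist(edges, source="parent_id", target="id",
                            create_using=nx.DiGraph)
G.add_nodes_from(df["id"])  # keeps roots that have no children at all
names = df.set_index("id")["name"].to_dict()

rows = []
for target in df.loc[df["last_category"] == 1, "id"]:
    path = [target]
    # Follow the single incoming edge upwards until a root is reached.
    while True:
        preds = list(G.predecessors(path[-1]))
        if not preds:
            break
        path.append(preds[0])
    path.reverse()
    rows.append({"id_path": " / ".join(str(n) for n in path),
                 "name_path": " / ".join(names[n] for n in path)})

out = pd.DataFrame(rows)
print(out)
```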