How to create a hierarchical path from the id column of a DataFrame in Python?

I have a DataFrame with the columns parent_id, parent_name, id, name and last_category. df looks like this:

parent_id   parent_name id      name    last_category
NaN         NaN         1       b       0
1           b           11      b1      0
11          b1          111     b2      0
111         b2          1111    b3      0
1111        b3          11111   b4      1
NaN         NaN         2       a       0
2           a           22      a1      0
22          a1          222     a2      0
222         a2          2222    a3      1

For every row where last_category is 1, I want to build the hierarchical path from the root category down to that row. The new DataFrame (df_last) should look like this:

name_path                id_path
b / b1 / b2 / b3 / b4    1 / 11 / 111 / 1111 / 11111
a / a1 / a2 / a3         2 / 22 / 222 / 2222

How can I do this?

CodePudding user response:

A solution using only numpy and pandas:

import numpy as np
import pandas as pd

# It's easier if we index the dataframe with the `id`
# I assume this ID is unique
df = df.set_index("id")

# `parents[i]` returns the parent ID of `i`
parents = df["parent_id"].to_dict()

paths = {}

# Find all nodes with last_category == 1
for leaf_id in df.query("last_category == 1").index:
    path = [leaf_id]
    current = leaf_id

    # Iteratively travel up the hierarchy until the parent is NaN (a root)
    while True:
        pid = parents[current]
        if np.isnan(pid):
            break
        path.append(pid)
        current = pid

    # The path to the leaf node is the reverse of the path we traveled.
    # Casting to int undoes the float dtype that NaN forces on parent_id.
    paths[int(leaf_id)] = np.array(path[::-1], dtype="int")

Then construct the result DataFrame:

result = pd.DataFrame({
    id_: (
        " / ".join(df.loc[pids, "name"]),
        " / ".join(pids.astype("str"))
    )
    for id_, pids in paths.items()
}, index=["name_path", "id_path"]).T
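For anyone who wants to run the walk-up idea end to end, here is a minimal sketch on the sample data from the question. The helper name `build_paths`, its `sep` parameter and the cycle check are my own additions, not part of the answer above, and it assumes `id` is unique and every non-NaN `parent_id` also appears as an `id`.

```python
import pandas as pd

nan = float("nan")

# Sample data copied from the question.
df = pd.DataFrame({
    "parent_id": [nan, 1, 11, 111, 1111, nan, 2, 22, 222],
    "parent_name": [None, "b", "b1", "b2", "b3", None, "a", "a1", "a2"],
    "id": [1, 11, 111, 1111, 11111, 2, 22, 222, 2222],
    "name": ["b", "b1", "b2", "b3", "b4", "a", "a1", "a2", "a3"],
    "last_category": [0, 0, 0, 0, 1, 0, 0, 0, 1],
})


def build_paths(frame, sep=" / "):
    """Hypothetical helper: walk from each flagged row up to its root."""
    indexed = frame.set_index("id")            # assumes `id` is unique
    parents = indexed["parent_id"].to_dict()   # id -> parent id (NaN at a root)
    names = indexed["name"].to_dict()          # id -> name

    rows = []
    for leaf in indexed[indexed["last_category"] == 1].index:
        path, seen, node = [], set(), leaf
        while not pd.isna(node):
            if node in seen:
                # Guard (an addition): malformed data could loop forever.
                raise ValueError(f"cycle detected at id {node}")
            seen.add(node)
            path.append(int(node))
            # A KeyError here means a parent_id that never occurs as an id.
            node = parents[node]
        path.reverse()                         # root first, flagged row last
        rows.append({
            "name_path": sep.join(names[i] for i in path),
            "id_path": sep.join(str(i) for i in path),
        })
    return pd.DataFrame(rows, columns=["name_path", "id_path"])


df_last = build_paths(df)
print(df_last)
```

Using `pd.isna` instead of `np.isnan` keeps the sketch to a single import and also tolerates `None` parents.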

CodePudding user response:

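You can use networkx to find the path between each root node and each leaf node with the all_simple_paths function. Note that this picks the leaves of the graph rather than reading the last_category column; on the sample data the two coincide.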

# Python env: pip install networkx
# Anaconda env: conda install networkx
import networkx as nx
import pandas as pd

# Create network from your dataframe
G = nx.from_pandas_edgelist(df, source='parent_id', target='id',
                            create_using=nx.DiGraph)
nx.set_node_attributes(G, df.set_index('id')[['name']].to_dict('index'))

# Find roots of your graph (a root is a node with no incoming edge;
# here the roots are the NaN parent_id placeholders)
roots = [node for node, degree in G.in_degree() if degree == 0]

# Find leaves of your graph (a leaf is a node with no outgoing edge)
leaves = [node for node, degree in G.out_degree() if degree == 0]

# Find all paths
paths = []
for root in roots:
    for leaf in leaves:
        for path in nx.all_simple_paths(G, root, leaf):
            # [1:] drops the NaN parent_id placeholder at the start
            nodes = path[1:]
            paths.append({'id_path': ' / '.join(str(n) for n in nodes),
                          'name_path': ' / '.join(G.nodes[n]['name'] for n in nodes)})

out = pd.DataFrame(paths)

Output:

>>> out
                       id_path              name_path
0  1 / 11 / 111 / 1111 / 11111  b / b1 / b2 / b3 / b4
1          2 / 22 / 222 / 2222       a / a1 / a2 / a3
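If the flagged rows are not guaranteed to be leaves, the graph can be queried by last_category instead. Below is a minimal sketch, assuming the data forms a forest (each node has at most one parent). It is a variation on the answer above, not the same method: the edges are built only from rows that have a parent, so no NaN node is created, and `out_flagged` is a name chosen just for this illustration.

```python
import networkx as nx
import pandas as pd

nan = float("nan")

# Sample data from the question (parent_name omitted, it is not needed here).
df = pd.DataFrame({
    "parent_id": [nan, 1, 11, 111, 1111, nan, 2, 22, 222],
    "id": [1, 11, 111, 1111, 11111, 2, 22, 222, 2222],
    "name": ["b", "b1", "b2", "b3", "b4", "a", "a1", "a2", "a3"],
    "last_category": [0, 0, 0, 0, 1, 0, 0, 0, 1],
})

# Keep only rows that have a parent, and restore the integer dtype.
edges = df.dropna(subset=["parent_id"]).astype({"parent_id": int})

G = nx.DiGraph()
G.add_nodes_from(df["id"])                      # also keeps isolated roots
G.add_edges_from(zip(edges["parent_id"], edges["id"]))

names = df.set_index("id")["name"]

rows = []
for target in df.loc[df["last_category"] == 1, "id"]:
    # In a forest, exactly one node among the target and its ancestors
    # has no incoming edge: that node is the root.
    candidates = nx.ancestors(G, target) | {target}
    root = next(n for n in candidates if G.in_degree(n) == 0)
    path = nx.shortest_path(G, root, target)
    rows.append({
        "id_path": " / ".join(str(n) for n in path),
        "name_path": " / ".join(names.loc[n] for n in path),
    })

out_flagged = pd.DataFrame(rows)
print(out_flagged)
```

Because the root is looked up per flagged node, this avoids the roots-times-leaves double loop, at the cost of one `nx.ancestors` call per flagged row.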