I spent almost half of my day trying to solve this...
I want to take the values in the Parent and Child columns and turn each chain of them into a row.
The values form a tree: a node that appears as a child in one row appears as the parent at the next step.
My sample data looks like this:
  | Parent | Child |
--------------------
0 | a | b |
1 | b | c |
2 | b | d |
3 | c | e |
4 | c | f |
5 | f | g |
6 | d | h |
and I want to change it into this:
| Col1 | Col2 | Col3 | Col4 | Col5 | Col6 |
----------------------------------------------------------
0 | a | b | c | f | g | nan |
1 | a | b | c | e | nan | nan |
2 | a | b | d | h | nan | nan |
I have tried using a loop to search for the next items, but it does not work.
Any help would be appreciated.
CodePudding user response:
You can approach this as a graph problem using networkx.
Create all the edges, find the roots and leaves, and compute the paths with all_simple_paths:
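import networkx as nx
import pandas as pd

# Build a directed graph with one edge per (Parent, Child) row of df
G = nx.from_pandas_edgelist(df, source='Parent', target='Child',
                            create_using=nx.DiGraph)
# Roots have no incoming edges; leaves have no outgoing edges
roots = [n for n, d in G.in_degree() if d == 0]
leaves = [n for n, d in G.out_degree() if d == 0]
# One row per root-to-leaf path; shorter paths end in missing values
df2 = pd.DataFrame([p for r in roots for p in nx.all_simple_paths(G, r, leaves)])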
output:
   0  1  2  3     4
0  a  b  c  e  None
1  a  b  c  f     g
2  a  b  d  h  None
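If you would rather not depend on networkx, the same root-to-leaf idea can be sketched in plain Python. This is only an illustration, not what all_simple_paths does internally: the helper names root_to_leaf_paths and pad_rows are made up for this example, the edge list is copied from the sample data in the question, and the code assumes the data really is a tree or forest (no cycles).

```python
from collections import defaultdict


def root_to_leaf_paths(edges):
    """Return every root-to-leaf path for (parent, child) pairs.

    Hypothetical helper; assumes the edges form a tree or forest (no cycles).
    """
    children = defaultdict(list)
    has_parent = set()
    nodes, seen = [], set()  # first-seen order keeps the output deterministic
    for parent, child in edges:
        children[parent].append(child)
        has_parent.add(child)
        for n in (parent, child):
            if n not in seen:
                seen.add(n)
                nodes.append(n)

    # A root never appears as a child
    roots = [n for n in nodes if n not in has_parent]
    paths = []

    def walk(node, prefix):
        path = prefix + [node]
        if not children.get(node):  # no children: this node is a leaf
            paths.append(path)
            return
        for child in children[node]:
            walk(child, path)

    for root in roots:
        walk(root, [])
    return paths


def pad_rows(paths, fill=None):
    """Pad every path on the right so all rows have the same length."""
    width = max((len(p) for p in paths), default=0)
    return [p + [fill] * (width - len(p)) for p in paths]


# Edge list taken from the Parent/Child sample in the question
edges = [("a", "b"), ("b", "c"), ("b", "d"), ("c", "e"),
         ("c", "f"), ("f", "g"), ("d", "h")]
rows = pad_rows(root_to_leaf_paths(edges))
columns = [f"Col{i + 1}" for i in range(len(rows[0]))]
```

Passing these to pd.DataFrame(rows, columns=columns) should give the Col1, Col2, ... layout from the question, with as many columns as the longest path has nodes.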