Finding all dependencies in a dataframe

I have a data frame :

       Parent   Child1  Child2  Child3  Child4  Child5  Child6
0         A       A1      B2      -1     -1       -1     -1
1         B       B1      -1      -1     -1       -1     -1
2         A1      -1      -1      C1     -1       -1     C2
3         D       -1      C2      -1     A1       -1     -1
4         C1      -1      -1      -1     -1       -1     -1
5         C2      -1      -1      -1     -1       -1     -1
6         B1      -1      -1      -1     -1       -1     -1
7         B2      B3      B4      -1     -1       -1     -1
8         B3      -1      -1      -1     -1       -1     -1
9         B4      -1      -1      -1     -1       -1     -1

source :

import pandas as pd

df = pd.DataFrame({'Parent': ['A','B','A1','D','C1','C2','B1','B2','B3','B4'], 'Child1': ['A1','B1','-1','-1','-1','-1','-1','B3','-1','-1'], 'Child2': ['B2','-1','-1','C2','-1','-1','-1','B4','-1','-1'], 'Child3': ['-1','-1','C1','-1','-1','-1','-1','-1','-1','-1'], 'Child4': ['-1','-1','-1','A1','-1','-1','-1','-1','-1','-1'], 'Child5': ['-1','-1','-1','-1','-1','-1','-1','-1','-1','-1'], 'Child6': ['-1','-1','C2','-1','-1','-1','-1','-1','-1','-1']})

Now I have an input list with a couple of parents, e.g. parent_list = ['A','B'], and I need to find all the descendants of these parents.
For 'A' there are two children: 'A1' and 'B2'. 'A1' in turn has two children, 'C1' and 'C2', and both of those are childless (a row whose children are all '-1' is childless). Moving on, 'B2' has two children, 'B3' and 'B4', and both are childless. Finally, 'B' has only one child, 'B1', which is also childless.
So the final family list for ['A','B'] should be ['A','B','A1','B2','C1','C2','B3','B4','B1'].

Here is how far I was able to come :

    parent_list= ['A','B']
    tmp_list = []
    output_list = []
    child_list= []

    for i in parent_list:
      output_list.append(i) if i not in output_list else output_list 
      parent_list.remove(i)
      tmp_list = df.loc[df['Parent']  == i, ['Child1','Child2','Child3','Child4','Child5','Child6']].values.flatten().tolist()
      while '-1' in tmp_list: tmp_list.remove('-1')
      if  tmp_list:
        parent_list = parent_list + tmp_list

However, my code only runs for i = 'A' in parent_list and then stops. I am not sure why it does not iterate any further. When I check parent_list after the first iteration it contains what I expect, but the loop does not continue. What am I doing wrong?
Also, if there is a better way of approaching this problem, please suggest it.

CodePudding user response：

The for loop only runs for 'A' because you are modifying parent_list while iterating over it. The iterator starts at index 0 ('A'); you then remove 'A', so 'B' shifts down to index 0. When the loop advances to index 1 there is nothing left, so the iteration ends.
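To make the effect visible without pandas, here is a minimal sketch of the loop from the question. The children_of dict is a hypothetical stand-in for the DataFrame lookup, and original is an extra name added only to keep a handle on the list the loop is iterating over.

```python
# Hypothetical stand-in for the df.loc[...] lookup in the question.
children_of = {'A': ['A1', 'B2'], 'B': ['B1']}

parent_list = ['A', 'B']
original = parent_list  # the list object the for loop actually iterates over
visited = []

for i in parent_list:
    visited.append(i)
    # Mutates the list being iterated: 'B' moves from index 1 to index 0.
    parent_list.remove(i)
    # Rebinds the name to a NEW list; the loop still walks the old one.
    parent_list = parent_list + children_of.get(i, [])

print(visited)      # ['A']
print(original)     # ['B']
print(parent_list)  # ['B', 'A1', 'B2']
```

The last two prints show that the name parent_list and the list the loop iterates over are different objects after the first pass, which is why the printed parent_list looks right even though the loop has already finished.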

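The simplest fix is to drop the line parent_list.remove(i), which is not needed and causes this problem. Note, though, that parent_list = parent_list + tmp_list binds the name to a new list while the for loop keeps iterating over the original one, so the loop would then visit 'A' and 'B' but still not the children that were appended.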

I think a better solution would be to use a recursive function to get the children. Something like:

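def get_children(parent):
    # Direct children of `parent`, with the '-1' placeholders dropped
    child_cols = ['Child1','Child2','Child3','Child4','Child5','Child6']
    child_list = df.loc[df['Parent'] == parent, child_cols].values.flatten().tolist()
    child_list = [c for c in child_list if c != '-1']
    # Iterate over a copy so that extending child_list does not affect the loop
    for i in list(child_list):
        child_list.extend(get_children(i))
    # cast to set and back to remove duplicate children (order is not preserved)
    return list(set(child_list))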

I tested this and it returns all descendants of a single parent, though in no particular order because of the set conversion. If you want the whole family, call this function for each entry in parent_list and combine parent_list with the returned lists.
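If the order of the result matters, an iterative traversal with a queue is one possible alternative to the recursion. The sketch below is not the method from the answer above: get_family and children_of are hypothetical names, and it assumes each value in the Parent column appears only once. It lists the given parents first and then each parent's descendants breadth-first.

```python
from collections import deque

import pandas as pd

df = pd.DataFrame({
    'Parent': ['A', 'B', 'A1', 'D', 'C1', 'C2', 'B1', 'B2', 'B3', 'B4'],
    'Child1': ['A1', 'B1', '-1', '-1', '-1', '-1', '-1', 'B3', '-1', '-1'],
    'Child2': ['B2', '-1', '-1', 'C2', '-1', '-1', '-1', 'B4', '-1', '-1'],
    'Child3': ['-1', '-1', 'C1', '-1', '-1', '-1', '-1', '-1', '-1', '-1'],
    'Child4': ['-1', '-1', '-1', 'A1', '-1', '-1', '-1', '-1', '-1', '-1'],
    'Child5': ['-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1'],
    'Child6': ['-1', '-1', 'C2', '-1', '-1', '-1', '-1', '-1', '-1', '-1'],
})

child_cols = [c for c in df.columns if c != 'Parent']

# Map each parent to its real children, dropping the '-1' placeholders.
# Assumption: every Parent value is unique (a repeated one would overwrite).
children_of = {
    parent: [c for c in children if c != '-1']
    for parent, *children in df[['Parent'] + child_cols].itertuples(index=False, name=None)
}


def get_family(parents):
    """Return the parents followed by all their descendants, without duplicates."""
    family = list(parents)
    seen = set(parents)
    for seed in parents:
        queue = deque([seed])
        while queue:
            node = queue.popleft()
            for child in children_of.get(node, []):
                if child not in seen:  # also guards against cycles
                    seen.add(child)
                    family.append(child)
                    queue.append(child)
    return family


family = get_family(['A', 'B'])
print(family)  # ['A', 'B', 'A1', 'B2', 'C1', 'C2', 'B3', 'B4', 'B1']
```

Because the queue is a separate object from the list being returned, nothing is mutated while it is iterated, and the seen set keeps the traversal from looping if the data ever contains a cycle.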

CodePudding user response：

We can melt the dataframe, build a directed graph with networkx, and then use nx.descendants to find all the descendants of each node in parent_list:

import networkx as nx

s = df.melt('Parent').astype(str).query("value != '-1'")
G = nx.from_pandas_edgelist(s, 'Parent', 'value', create_using=nx.DiGraph())
family = parent_list + [d for n in parent_list for d in nx.descendants(G, n)]

>>> family

['A', 'B', 'C1', 'C2', 'B3', 'B2', 'B4', 'A1', 'B1']
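Since nx.descendants returns a set, the order of the descendants in the output above is not guaranteed. If a traversal order is wanted, one option is to walk the graph with nx.bfs_edges instead. This is only a sketch of a variation on the answer, not part of it; the names edges and ordered are introduced here for illustration.

```python
import networkx as nx
import pandas as pd

df = pd.DataFrame({
    'Parent': ['A', 'B', 'A1', 'D', 'C1', 'C2', 'B1', 'B2', 'B3', 'B4'],
    'Child1': ['A1', 'B1', '-1', '-1', '-1', '-1', '-1', 'B3', '-1', '-1'],
    'Child2': ['B2', '-1', '-1', 'C2', '-1', '-1', '-1', 'B4', '-1', '-1'],
    'Child3': ['-1', '-1', 'C1', '-1', '-1', '-1', '-1', '-1', '-1', '-1'],
    'Child4': ['-1', '-1', '-1', 'A1', '-1', '-1', '-1', '-1', '-1', '-1'],
    'Child5': ['-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1'],
    'Child6': ['-1', '-1', 'C2', '-1', '-1', '-1', '-1', '-1', '-1', '-1'],
})
parent_list = ['A', 'B']

# One row per (Parent, child) pair, with the '-1' placeholders removed.
edges = df.melt('Parent').astype(str).query("value != '-1'")
G = nx.from_pandas_edgelist(edges, 'Parent', 'value', create_using=nx.DiGraph())

# Start with the parents, then append each parent's descendants breadth-first.
ordered = list(parent_list)
for n in parent_list:
    ordered.extend(child for _, child in nx.bfs_edges(G, n))

# dict.fromkeys drops duplicates while keeping first-seen order.
family = list(dict.fromkeys(ordered))
print(family)
```

The result contains the same nodes as the nx.descendants version, with the given parents first and each parent's descendants in breadth-first order.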