I have a data frame:

      Parent Child1 Child2 Child3 Child4 Child5 Child6
    0      A     A1     B2     -1     -1     -1     -1
    1      B     B1     -1     -1     -1     -1     -1
    2     A1     -1     -1     C1     -1     -1     C2
    3      D     -1     C2     -1     A1     -1     -1
    4     C1     -1     -1     -1     -1     -1     -1
    5     C2     -1     -1     -1     -1     -1     -1
    6     B1     -1     -1     -1     -1     -1     -1
    7     B2     B3     B4     -1     -1     -1     -1
    8     B3     -1     -1     -1     -1     -1     -1
    9     B4     -1     -1     -1     -1     -1     -1
Source:

    import pandas as pd

    df = pd.DataFrame({
        'Parent': ['A', 'B', 'A1', 'D', 'C1', 'C2', 'B1', 'B2', 'B3', 'B4'],
        'Child1': ['A1', 'B1', '-1', '-1', '-1', '-1', '-1', 'B3', '-1', '-1'],
        'Child2': ['B2', '-1', '-1', 'C2', '-1', '-1', '-1', 'B4', '-1', '-1'],
        'Child3': ['-1', '-1', 'C1', '-1', '-1', '-1', '-1', '-1', '-1', '-1'],
        'Child4': ['-1', '-1', '-1', 'A1', '-1', '-1', '-1', '-1', '-1', '-1'],
        'Child5': ['-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1'],
        'Child6': ['-1', '-1', 'C2', '-1', '-1', '-1', '-1', '-1', '-1', '-1'],
    })
Now, I have an input list with a couple of parents, like parent_list = ['A','B']. I need to find all the children of all these parents, including the children of those children.
So for 'A' there are two children: 'A1' and 'B2'. 'A1' in turn has two children, 'C1' and 'C2', but both of those are childless (a node is childless if all of its children are '-1'). Moving on, 'B2' has two children, 'B3' and 'B4', and both are childless. Finally, 'B' has only one child, 'B1', which is also childless.
So the final family list for ['A','B'] is going to be ['A','B','A1','B2','C1','C2','B3','B4','B1'].
Here is how far I got:

    parent_list = ['A', 'B']
    tmp_list = []
    output_list = []
    child_list = []
    for i in parent_list:
        output_list.append(i) if i not in output_list else output_list
        parent_list.remove(i)
        tmp_list = df.loc[df['Parent'] == i, ['Child1','Child2','Child3','Child4','Child5','Child6']].values.flatten().tolist()
        while '-1' in tmp_list: tmp_list.remove('-1')
        if tmp_list:
            parent_list = parent_list + tmp_list
However, my code only runs for i = 'A' in parent_list and then stops. I am not sure why it doesn't iterate any further. When I check parent_list after the first iteration I do see what I want to see, but the loop doesn't continue. What am I doing wrong?
Also, if there are any better ways of approaching this problem, please suggest them.
CodePudding user response:
The for loop only runs for 'A' because you are modifying parent_list while iterating over it. The loop tracks its position by index: it starts on 'A' at index 0, you then remove 'A', which shifts 'B' down to index 0, and when the loop advances to index 1 there is nothing left, so it stops.
The simplest fix in your case is to remove the line parent_list.remove(i), since it isn't necessary and is what ends the loop early. Note that the loop will then visit 'A' and 'B' but still not their children: reassigning parent_list inside the loop builds a new list, while the for loop keeps iterating over the original one.
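The skipped iteration can be reproduced on a plain list, and one way around it is to keep a separate work queue instead of editing the list being looped over. The sketch below is only an illustration: children_of is a hypothetical dictionary standing in for the DataFrame lookup (it mirrors the rows in the question with the '-1' placeholders already dropped), and collect_family is a made-up helper name.

```python
from collections import deque

# Reproduce the problem: removing items from the list being iterated.
items = ['A', 'B']
visited = []
for item in items:
    visited.append(item)
    items.remove(item)  # shifts 'B' to index 0, so the loop never reaches it
print(visited)  # ['A']

# Hypothetical stand-in for the DataFrame: parent -> real children.
children_of = {
    'A': ['A1', 'B2'],
    'B': ['B1'],
    'A1': ['C1', 'C2'],
    'D': ['C2', 'A1'],
    'B2': ['B3', 'B4'],
}

def collect_family(parents):
    """Return the parents plus all their descendants, breadth-first."""
    queue = deque(parents)  # work queue, separate from the input list
    family = []
    seen = set()
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already handled (also guards against cycles)
        seen.add(node)
        family.append(node)
        queue.extend(children_of.get(node, []))
    return family

print(collect_family(['A', 'B']))
# ['A', 'B', 'A1', 'B2', 'B1', 'C1', 'C2', 'B3', 'B4']
```

The members match the family list in the question; only the order differs, because this version visits nodes level by level.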
I think a better solution would be to use a recursive function to get the children. Something like:
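    def get_children(parent):
        child_cols = ['Child1', 'Child2', 'Child3', 'Child4', 'Child5', 'Child6']
        # flatten the child columns of this parent's row into one list
        child_list = df.loc[df['Parent'] == parent, child_cols].values.flatten().tolist()
        # drop the '-1' placeholders that mark an empty slot
        child_list = [c for c in child_list if c != '-1']
        # loop over a copy so the list is not extended while it is being iterated
        for i in list(child_list):
            child_list.extend(get_children(i))
        # cast to set and back to remove duplicate children (order is not preserved)
        return list(set(child_list))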
I tested this and it more or less achieves what you want for a single parent. If you want the whole family you can just iterate through your parent_list using this function and then combine the parent_list with the returned lists
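The combining step could look roughly like the sketch below. It assumes the df from the question and no cycles in the data; get_family is a hypothetical helper name, and get_children is repeated here in a slightly different form (using dict.fromkeys instead of a set, so discovery order is kept) only to make the example self-contained.

```python
import pandas as pd

df = pd.DataFrame({
    'Parent': ['A', 'B', 'A1', 'D', 'C1', 'C2', 'B1', 'B2', 'B3', 'B4'],
    'Child1': ['A1', 'B1', '-1', '-1', '-1', '-1', '-1', 'B3', '-1', '-1'],
    'Child2': ['B2', '-1', '-1', 'C2', '-1', '-1', '-1', 'B4', '-1', '-1'],
    'Child3': ['-1', '-1', 'C1', '-1', '-1', '-1', '-1', '-1', '-1', '-1'],
    'Child4': ['-1', '-1', '-1', 'A1', '-1', '-1', '-1', '-1', '-1', '-1'],
    'Child5': ['-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1'],
    'Child6': ['-1', '-1', 'C2', '-1', '-1', '-1', '-1', '-1', '-1', '-1'],
})
CHILD_COLS = [f'Child{i}' for i in range(1, 7)]

def get_children(parent):
    """Return all descendants of `parent`, depth-first, without duplicates."""
    direct = df.loc[df['Parent'] == parent, CHILD_COLS].values.flatten().tolist()
    direct = [c for c in direct if c != '-1']  # drop empty slots
    found = []
    for child in direct:
        found.append(child)
        found.extend(get_children(child))  # assumes the data has no cycles
    # dict.fromkeys removes duplicates while keeping first-seen order
    return list(dict.fromkeys(found))

def get_family(parents):
    """Hypothetical helper: the parents followed by all of their descendants."""
    family = list(parents)  # copy, so the caller's list is left untouched
    for parent in parents:
        family.extend(get_children(parent))
    return list(dict.fromkeys(family))

print(get_family(['A', 'B']))
# ['A', 'B', 'A1', 'C1', 'C2', 'B2', 'B3', 'B4', 'B1']
```

This yields the same members as the family list in the question, in depth-first rather than level-by-level order.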
CodePudding user response:
We can melt the dataframe, create a directed graph with the help of networkx, and then use nx.descendants to find all the children of each parent node in parent_list:

    import networkx as nx

    s = df.melt('Parent').astype(str).query("value != '-1'")
    G = nx.from_pandas_edgelist(s, 'Parent', 'value', create_using=nx.DiGraph())
    family = parent_list + [d for n in parent_list for d in nx.descendants(G, n)]

    >>> family
    ['A', 'B', 'C1', 'C2', 'B3', 'B2', 'B4', 'A1', 'B1']
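Because nx.descendants returns a set, the order of the descendants in the output above is not guaranteed and can differ between runs. If a traversal-based order matters, one possible variation (a sketch, not part of the original answer) is to walk the graph with nx.dfs_preorder_nodes and de-duplicate with dict.fromkeys. The df and parent_list are rebuilt here only so the example runs on its own.

```python
import networkx as nx
import pandas as pd

df = pd.DataFrame({
    'Parent': ['A', 'B', 'A1', 'D', 'C1', 'C2', 'B1', 'B2', 'B3', 'B4'],
    'Child1': ['A1', 'B1', '-1', '-1', '-1', '-1', '-1', 'B3', '-1', '-1'],
    'Child2': ['B2', '-1', '-1', 'C2', '-1', '-1', '-1', 'B4', '-1', '-1'],
    'Child3': ['-1', '-1', 'C1', '-1', '-1', '-1', '-1', '-1', '-1', '-1'],
    'Child4': ['-1', '-1', '-1', 'A1', '-1', '-1', '-1', '-1', '-1', '-1'],
    'Child5': ['-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1'],
    'Child6': ['-1', '-1', 'C2', '-1', '-1', '-1', '-1', '-1', '-1', '-1'],
})
parent_list = ['A', 'B']

# One (Parent, value) row per real parent -> child edge
edges = df.melt('Parent').astype(str).query("value != '-1'")
G = nx.from_pandas_edgelist(edges, 'Parent', 'value', create_using=nx.DiGraph)

# dfs_preorder_nodes yields the start node first, then everything reachable from it
ordered = [n for p in parent_list for n in nx.dfs_preorder_nodes(G, p)]

# Keep the input parents at the front and drop repeats without reordering
family = list(dict.fromkeys(parent_list + ordered))
print(family)
```

The result contains the same nine members as before, with the input parents first.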