Create list of everyone in family tree SQL or Python-CodePudding

I have a table that has all parents linked to children. A parent can have one or more children. A child can have one of more parents. Any item without any link is not in the link table. Here is an example of what it would look like.

Parent	Child
A	B
A	C
A	D
D	E
G	H
I	J
J	K
L	M
O	M
N	P
Q	R
R	P

My ideal output would be a list of lists of all item within a family. For example:

Family 1 = [A, B, C, D, E]  
Family 2 = [G, H]  
Family 3 = [I, J, K]  
Family 4 = [L, M, O]  
Family 5 = [P, N, Q, R]

Things I have tried:

Recursive query in SQL to find topmost ancestor and then find descendants. Problem: Because the topmost ancestor could be even 3 parents, it is becoming difficult to drill down
In Python loop through even single row. For first row label that one link a family. Row two would then check if any of the items linked are in family 1, if so it adds it to family, if not it becomes a new family. Repeat for every row. Problem: family 5 presents an interesting issue so [P,N] would be a family then [Q,R] but when coming upon the link R > P either I need to account for links that match two families and merge them or merge at end.

I have around 15000 linkages I want to do this for and am looking for something efficient. I am open to both python and SQL and the use of recursion or other methods. Thanks!

CodePudding user response：

Suppose you have this dataframe:

   Parent Child
0       A     B
1       A     C
2       A     D
3       D     E
4       G     H
5       I     J
6       J     K
7       L     M
8       O     M
9       N     P
10      Q     R
11      R     P

Then you can use networkx to obtain all connected components:

import networkx as nx

G = nx.Graph()
for p, c in zip(df.Parent, df.Child):
    G.add_edge(p, c)

for c in nx.connected_components(G):
    print(c)

Prints:

{'A', 'C', 'D', 'B', 'E'}
{'G', 'H'}
{'K', 'J', 'I'}
{'M', 'L', 'O'}
{'P', 'R', 'N', 'Q'}

CodePudding user response：

Here's a simple BFS algorithm (no recursion involved):

edges = 'AB AC AD DE GH IJ JK LM OM NP QR RP'.split()

nodes = [n for n in ''.join(edges)]
seen = set()
groups = []


for node in nodes:
    if node in seen:
        continue

    groups.append([])
    queue = [node]

    while queue:
        n = queue.pop(0)
        if n in seen:
            continue

        groups[-1].append(n)
        seen.add(n)

        for a, b in edges:
            if a == n:
                queue.append(b)
            if b == n:
                queue.append(a)

print(groups)
# [['A', 'B', 'C', 'D', 'E'], ['G', 'H'], ['I', 'J', 'K'], ['L', 'M', 'O'], ['N', 'P', 'R', 'Q']]

Should be fast enough, but easy to optimize if needed (e.g. using an adjacency matrix instead of the list of edges).