Home > other >  Create list of everyone in family tree SQL or Python
Create list of everyone in family tree SQL or Python

Time:11-05

I have a table that has all parents linked to children. A parent can have one or more children. A child can have one of more parents. Any item without any link is not in the link table. Here is an example of what it would look like.

Parent Child
A B
A C
A D
D E
G H
I J
J K
L M
O M
N P
Q R
R P

My ideal output would be a list of lists of all item within a family. For example:

Family 1 = [A, B, C, D, E]  
Family 2 = [G, H]  
Family 3 = [I, J, K]  
Family 4 = [L, M, O]  
Family 5 = [P, N, Q, R]  

Things I have tried:

  1. Recursive query in SQL to find topmost ancestor and then find descendants. Problem: Because the topmost ancestor could be even 3 parents, it is becoming difficult to drill down
  2. In Python loop through even single row. For first row label that one link a family. Row two would then check if any of the items linked are in family 1, if so it adds it to family, if not it becomes a new family. Repeat for every row. Problem: family 5 presents an interesting issue so [P,N] would be a family then [Q,R] but when coming upon the link R > P either I need to account for links that match two families and merge them or merge at end.

I have around 15000 linkages I want to do this for and am looking for something efficient. I am open to both python and SQL and the use of recursion or other methods. Thanks!

CodePudding user response:

Suppose you have this dataframe:

   Parent Child
0       A     B
1       A     C
2       A     D
3       D     E
4       G     H
5       I     J
6       J     K
7       L     M
8       O     M
9       N     P
10      Q     R
11      R     P

Then you can use networkx to obtain all connected components:

import networkx as nx

G = nx.Graph()
for p, c in zip(df.Parent, df.Child):
    G.add_edge(p, c)

for c in nx.connected_components(G):
    print(c)

Prints:

{'A', 'C', 'D', 'B', 'E'}
{'G', 'H'}
{'K', 'J', 'I'}
{'M', 'L', 'O'}
{'P', 'R', 'N', 'Q'}

CodePudding user response:

Here's a simple BFS algorithm (no recursion involved):

edges = 'AB AC AD DE GH IJ JK LM OM NP QR RP'.split()

nodes = [n for n in ''.join(edges)]
seen = set()
groups = []


for node in nodes:
    if node in seen:
        continue

    groups.append([])
    queue = [node]

    while queue:
        n = queue.pop(0)
        if n in seen:
            continue

        groups[-1].append(n)
        seen.add(n)

        for a, b in edges:
            if a == n:
                queue.append(b)
            if b == n:
                queue.append(a)

print(groups)
# [['A', 'B', 'C', 'D', 'E'], ['G', 'H'], ['I', 'J', 'K'], ['L', 'M', 'O'], ['N', 'P', 'R', 'Q']]

Should be fast enough, but easy to optimize if needed (e.g. using an adjacency matrix instead of the list of edges).

  • Related