I have a table that has all parents linked to children. A parent can have one or more children. A child can have one of more parents. Any item without any link is not in the link table. Here is an example of what it would look like.
Parent | Child |
---|---|
A | B |
A | C |
A | D |
D | E |
G | H |
I | J |
J | K |
L | M |
O | M |
N | P |
Q | R |
R | P |
My ideal output would be a list of lists of all item within a family. For example:
Family 1 = [A, B, C, D, E]
Family 2 = [G, H]
Family 3 = [I, J, K]
Family 4 = [L, M, O]
Family 5 = [P, N, Q, R]
Things I have tried:
- Recursive query in SQL to find topmost ancestor and then find descendants. Problem: Because the topmost ancestor could be even 3 parents, it is becoming difficult to drill down
- In Python loop through even single row. For first row label that one link a family. Row two would then check if any of the items linked are in family 1, if so it adds it to family, if not it becomes a new family. Repeat for every row. Problem: family 5 presents an interesting issue so [P,N] would be a family then [Q,R] but when coming upon the link R > P either I need to account for links that match two families and merge them or merge at end.
I have around 15000 linkages I want to do this for and am looking for something efficient. I am open to both python and SQL and the use of recursion or other methods. Thanks!
CodePudding user response:
Suppose you have this dataframe:
Parent Child
0 A B
1 A C
2 A D
3 D E
4 G H
5 I J
6 J K
7 L M
8 O M
9 N P
10 Q R
11 R P
Then you can use networkx
to obtain all connected components:
import networkx as nx
G = nx.Graph()
for p, c in zip(df.Parent, df.Child):
G.add_edge(p, c)
for c in nx.connected_components(G):
print(c)
Prints:
{'A', 'C', 'D', 'B', 'E'}
{'G', 'H'}
{'K', 'J', 'I'}
{'M', 'L', 'O'}
{'P', 'R', 'N', 'Q'}
CodePudding user response:
Here's a simple BFS algorithm (no recursion involved):
edges = 'AB AC AD DE GH IJ JK LM OM NP QR RP'.split()
nodes = [n for n in ''.join(edges)]
seen = set()
groups = []
for node in nodes:
if node in seen:
continue
groups.append([])
queue = [node]
while queue:
n = queue.pop(0)
if n in seen:
continue
groups[-1].append(n)
seen.add(n)
for a, b in edges:
if a == n:
queue.append(b)
if b == n:
queue.append(a)
print(groups)
# [['A', 'B', 'C', 'D', 'E'], ['G', 'H'], ['I', 'J', 'K'], ['L', 'M', 'O'], ['N', 'P', 'R', 'Q']]
Should be fast enough, but easy to optimize if needed (e.g. using an adjacency matrix instead of the list of edges).