I have a dataframe like this:
VAL1 VAL2
A A
B B
E E
F F
G G
H H
I I
J J
A B
A C
B A
B C
C A
C B
D E
E D
F E
E F
G H
H G
I J
J I
I H
H I
K K
I would like to cluster the VAL1 and VAL2 values into groups.
For instance:
A is in the same row as B and C, so I group A, B and C within the same group.
D is in the same row as E, and E is in the same row as F, so I group D, E and F within the same group.
G is in the same row as H, H is in the same row as I, and I is in the same row as J, so I group G, H, I and J within the same group.
K has no shared row, so I group it alone.
I should then get:
Groups VALs
G1 A
G1 B
G1 C
G2 D
G2 E
G2 F
G3 G
G3 H
G3 I
G3 J
G4 K
Here is the dataframe, if it helps:
{'VAL1': {0: 'A', 1: 'B', 2: 'E', 3: 'F', 4: 'G', 5: 'H', 6: 'I', 7: 'J', 8: 'A', 9: 'A', 10: 'B', 11: 'B', 12: 'C', 13: 'C', 14: 'D', 15: 'E', 16: 'F', 17: 'E', 18: 'G', 19: 'H', 20: 'I', 21: 'J', 22: 'I', 23: 'H', 24: 'K'}, 'VAL2': {0: 'A', 1: 'B', 2: 'E', 3: 'F', 4: 'G', 5: 'H ', 6: 'I', 7: 'J', 8: 'B', 9: 'C', 10: 'A', 11: 'C', 12: 'A ', 13: 'B', 14: 'E', 15: 'D', 16: 'E', 17: 'F', 18: 'H', 19: 'G', 20: 'J', 21: 'I', 22: 'H', 23: 'I', 24: 'K'}}
CodePudding user response:
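Treat each row as an edge between VAL1 and VAL2, compute the graph's connected components with networkx, flatten them into a list L of (group, value) tuples, and then convert that list to a DataFrame: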
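import networkx as nx
import pandas as pd

# Create the graph from the dataframe: each row is an edge between VAL1 and VAL2
g = nx.Graph()
g.add_edges_from(df[['VAL1','VAL2']].itertuples(index=False))

# Each connected component is one group
new = list(nx.connected_components(g))

# Flatten to (group label, value) pairs, numbering groups from 1
L = [(f'G{cid + 1}', node) for cid, component in enumerate(new) for node in component]
df = pd.DataFrame(L, columns=['Groups','VALs'])
print(df)
  Groups VALs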
0 G1 A
1 G1 B
2 G1 C
3 G2 D
4 G2 F
5 G2 E
6 G3 G
7 G3 I
8 G3 J
9 G3 H
10 G4 K
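Note that connected_components returns sets, so the order of values inside a group can differ between runs, as in the G2 and G3 rows above. If networkx is not available, or a stable ordering is wanted, the same grouping can be sketched with a small union-find in plain Python. This is only an illustration under a few assumptions: group_pairs is a hypothetical helper name, the pairs are written out by hand here (they could equally come from df[['VAL1','VAL2']].itertuples(index=False)), and whitespace is stripped from each value on the assumption that entries such as 'H ' in the posted dict are meant to be 'H'.

```python
def group_pairs(pairs):
    """Return {value: group label} for values linked through shared rows.

    Hypothetical helper: a plain union-find over the (VAL1, VAL2) pairs.
    """
    parent = {}

    def find(x):
        # Register unseen values as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps chains short
            x = parent[x]
        return x

    for a, b in pairs:
        # Assumption: stray whitespace around a value is not meaningful.
        ra, rb = find(a.strip()), find(b.strip())
        if ra != rb:
            # Keep the smaller value as root, so a root is its group's minimum.
            parent[max(ra, rb)] = min(ra, rb)

    # Number groups by their smallest member so the labels are deterministic.
    roots = sorted({find(v) for v in list(parent)})
    label = {root: f"G{i}" for i, root in enumerate(roots, start=1)}
    return {v: label[find(v)] for v in sorted(parent)}


# The 25 rows from the question, one character per value.
val1 = "ABEFGHIJAABBCCDEFEGHIJIHK"
val2 = "ABEFGHIJBCACABEDEFHGJIHIK"
pairs = list(zip(val1, val2))

groups = group_pairs(pairs)
for value, group in groups.items():
    print(group, value)
```

The resulting dictionary can be turned back into the requested layout with pd.DataFrame(list(groups.items()), columns=['VALs', 'Groups']) if a DataFrame is needed.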