Cluster values within two columns in groups in pandas

Time:12-28

I have a dataframe like this:

VAL1 VAL2
A    A
B    B
E    E
F    F
G    G
H    H 
I    I
J    J
A    B
A    C
B    A
B    C
C    A 
C    B
D    E
E    D
F    E
E    F
G    H
H    G
I    J
J    I
I    H
H    I
K    K

I would like to cluster the VAL1 and VAL2 values into groups.

For instance:

  1. A is in the same row as B and C, so I group A, B and C together.
  2. D is in the same row as E, and E is in the same row as F, so I group D, E and F together.
  3. G is in the same row as H, H is in the same row as I, and I is in the same row as J, so I group G, H, I and J together.
  4. K shares a row with no other value, so it forms a group on its own.

I should then get:

Groups VALs
G1     A
G1     B
G1     C
G2     D
G2     E
G2     F
G3     G
G3     H
G3     I
G3     J
G4     K

Here is the dataframe as a dictionary, in case it helps:

{'VAL1': {0: 'A', 1: 'B', 2: 'E', 3: 'F', 4: 'G', 5: 'H', 6: 'I', 7: 'J', 8: 'A', 9: 'A', 10: 'B', 11: 'B', 12: 'C', 13: 'C', 14: 'D', 15: 'E', 16: 'F', 17: 'E', 18: 'G', 19: 'H', 20: 'I', 21: 'J', 22: 'I', 23: 'H', 24: 'K'}, 'VAL2': {0: 'A', 1: 'B', 2: 'E', 3: 'F', 4: 'G', 5: 'H ', 6: 'I', 7: 'J', 8: 'B', 9: 'C', 10: 'A', 11: 'C', 12: 'A ', 13: 'B', 14: 'E', 15: 'D', 16: 'E', 17: 'F', 18: 'H', 19: 'G', 20: 'J', 21: 'I', 22: 'H', 23: 'I', 24: 'K'}}
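
A minimal sketch of rebuilding the frame from this data. Two VAL2 entries in the dictionary ('H ' at index 5 and 'A ' at index 12) carry a trailing space; the sketch assumes that is accidental and strips it, since otherwise 'H ' and 'H' would count as different values. The names `val1`, `val2` and `labels` are only illustrative.

```python
import pandas as pd

# Same 25 rows as the dictionary above, written as two lists for brevity.
# The trailing spaces in 'H ' (row 5) and 'A ' (row 12) are kept on purpose.
val1 = list("ABEFGHIJAABBCCDEFEGHIJIHK")
val2 = ["A", "B", "E", "F", "G", "H ", "I", "J", "B", "C", "A", "C", "A ",
        "B", "E", "D", "E", "F", "H", "G", "J", "I", "H", "I", "K"]
df = pd.DataFrame({"VAL1": val1, "VAL2": val2})

# Assumption: the trailing spaces are not meaningful, so strip both columns.
df = df.apply(lambda col: col.str.strip())

# Distinct values across both columns after stripping.
labels = sorted(set(df["VAL1"]) | set(df["VAL2"]))
print(labels)
```

If the spaces are in fact meaningful, the stripping step should simply be dropped.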

CodePudding user response:

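Treat each row as an edge between VAL1 and VAL2, get the connected components of the resulting graph, build a list L of (group, value) pairs from them, and then convert that list to a DataFrame: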

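import networkx as nx
import pandas as pd

# Create the graph from the dataframe: each row is an edge VAL1 - VAL2
g = nx.Graph()

g.add_edges_from(df[['VAL1','VAL2']].itertuples(index=False))

# Each connected component is a set of values that belong to one group
new = list(nx.connected_components(g))

# Number the components from 1; the order of values inside a set is not guaranteed
L = [(f'G{cid + 1}', node) for cid, component in enumerate(new) for node in component]
df = pd.DataFrame(L, columns=['Groups','VALs'])
print (df)
   Groups VALs
0      G1    A
1      G1    B
2      G1    C
3      G2    D
4      G2    F
5      G2    E
6      G3    G
7      G3    I
8      G3    J
9      G3    H
10     G4    K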
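
If networkx is not available, the same grouping can probably be obtained with a small union-find (disjoint-set) structure. This is only a sketch of an alternative technique, not the method used above. The function name `cluster_pairs` and the reduced `pairs` list are made up for the illustration, and groups are numbered by their smallest member, which happens to match the expected numbering for this data.

```python
import pandas as pd

def cluster_pairs(pairs):
    """Group labels that are linked, directly or transitively, by any pair."""
    parent = {}

    def find(x):
        # Register unseen labels as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for label in parent:
        groups.setdefault(find(label), []).append(label)
    # Sort members within each group, then groups by their smallest member.
    return sorted(sorted(members) for members in groups.values())

# A reduced set of rows from the question; (K, K) is a row linking K only to itself.
pairs = [("A", "B"), ("A", "C"), ("D", "E"), ("E", "F"),
         ("G", "H"), ("H", "I"), ("I", "J"), ("K", "K")]

rows = [(f"G{i}", value)
        for i, members in enumerate(cluster_pairs(pairs), start=1)
        for value in members]
result = pd.DataFrame(rows, columns=["Groups", "VALs"])
print(result)
```

With a real frame, `pairs` could be replaced by `df[['VAL1','VAL2']].itertuples(index=False)`, as in the networkx version.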