Pandas create groups from column values

I have a dataframe df as follows:

Col1    Col2
A1      A1
B1      A1
B1      B1
C1      C1
D1      A1
D1      B1
D1      D1
E1      A1

I am trying to achieve the following:

Col1    Group
A1      A1
B1      A1
D1      A1
E1      A1
C1      C1

That is, every value in df that is related to another value, directly or indirectly, should be grouped under a single value. In the example above, the pairs (A1, A1), (B1, A1), (B1, B1), (D1, A1), (D1, B1), (D1, D1), (E1, A1) can all be linked, directly or indirectly, to A1 (the first in alphabetical order), so they are all assigned the group id A1, and so on.

I am not sure how to do this.

CodePudding user response:

This can be approached using a graph.

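In your graph, each distinct value is a node and each row of df is an edge between its Col1 and Col2 values. A1, B1, D1 and E1 end up in one connected component, and C1 is on its own.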

You can use networkx to find the connected_components:

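import networkx as nx
import pandas as pd

# Each row of df becomes an edge between its Col1 and Col2 values
G = nx.from_pandas_edgelist(df, source='Col1', target='Col2')

# Map every node to the alphabetically first node of its component
d = {}
for g in nx.connected_components(G):
    g = sorted(g)
    for x in g:
        d[x] = g[0]

out = pd.Series(d)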

output:

A1    A1
B1    A1
D1    A1
E1    A1
C1    C1
dtype: object
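
If adding networkx as a dependency is not an option, the same connected-component grouping can be sketched with a small union-find (disjoint-set) structure. This is an alternative technique, not the method used above. The function name group_by_smallest is hypothetical, and the pairs are typed in by hand from the question's example rather than read from df.

```python
def group_by_smallest(pairs):
    """Map every value to the smallest value in its connected component."""
    parent = {}

    def find(x):
        # Register unseen values as their own root.
        parent.setdefault(x, x)
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller label as the root, so the root is the group id.
            if rb < ra:
                ra, rb = rb, ra
            parent[rb] = ra

    return {x: find(x) for x in parent}


# (Col1, Col2) pairs copied from the example in the question
pairs = [("A1", "A1"), ("B1", "A1"), ("B1", "B1"), ("C1", "C1"),
         ("D1", "A1"), ("D1", "B1"), ("D1", "D1"), ("E1", "A1")]
groups = group_by_smallest(pairs)
print(groups)
```

With the dataframe from the question, the pairs could instead come from zip(df['Col1'], df['Col2']), and pd.Series(groups).rename_axis('Col1').reset_index(name='Group') should give the two-column Col1/Group layout asked for.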