Data Clustering using Pandas-CodePudding

I have a dataset

Name    Subset    Type
A00     IU00-A    OP
A00     IT00      PP
B01     IT-01A    PP
B01     IU        OP
B03     IM-09-B   LP
B03     IM03A     OP
B03     IT-09     OP
D09     IT        OP
D09     IM        LP
D09     IM        OP

So here I need to group the Name column such that Subset and Type are similar. We have to only consider the first alphabetical part of the subset column and ignore rest. for eg IM-09-B, IM03A can be considered as IM.

Output needed

Subset Cluster    Type Cluster    Name
IU,IT             OP,PP           A00,B01
IM,IM,IT          LP, OP, OP      B03, D09

Here the first cluster instance is formed coz IU is OP and IT is PP in both cases, similar for the second instance.

CodePudding user response：

You could use groupby agg(join) twice; once to make groups out of "Names"; then again to make groups out of "Subset"-"Type" pairs. Since we're joining strings, we first use sort_values so that when we join later, each group will have the same order of elements.

out = (df.assign(Subset=df['Subset'].str[:2])
       .sort_values(by=df.columns.tolist())
       .groupby('Name').agg(', '.join)
       .add_suffix(' Cluster').reset_index()
       .set_index(['Subset Cluster', 'Type Cluster'])
       .groupby(level=[0,1]).agg(', '.join).reset_index())

Output:

  Subset Cluster Type Cluster      Name
0     IM, IM, IT   LP, OP, OP  B03, D09
1         IT, IU       OP, PP  A00, B01