Grouping dataframe wrt three columns in Pandas-CodePudding

I have a dataset

Name    Subset    Type    System
A00     IU00-A    OP      A
A00     IT00      PP      A
B01     IT-01A    PP      B
B01     IU        OP      B
B03     IM-09-B   LP      A
B03     IM03A     OP      A
B03     IT-09     OP      A
D09     IT        OP      A
D09     IM        LP      A
D09     IM        OP      A

So here I need to group the Name column such that Subset, System and Type are similar. We have to only consider the first alphabetical part of the subset column and ignore rest. for eg IM-09-B, IM03A can be considered as IM.

Output needed

Subset Cluster    Type Cluster    Name          System
IU,IT             OP,PP           A00           A
IM,IM,IT          LP, OP, OP      B03, D09      A
IU,IT             OP,PP           B01           B

Here the first cluster instance is formed coz IU is OP and IT is PP in both cases, similar for the second instance.

CodePudding user response：

You could do a double groupby where in the first round, you groupby "Name" to group "Subset", "Type", and "System"; then again by these columns to group "Name"s.

out = df.assign(Subset=df['Subset'].str[:2])\
        .sort_values(by=df.columns.tolist())\
        .groupby('Name', as_index=False)\
        .agg(**{'Subset Cluster': ('Subset', ', '.join), 
                'Type Cluster': ('Type', ', '.join), 
                'System': ('System', 'first')})\
        .groupby(['Subset Cluster', 'Type Cluster', 'System'], as_index=False)\
        .agg(', '.join)

Output:

  Subset Cluster Type Cluster System      Name
0     IM, IM, IT   LP, OP, OP      A  B03, D09
1         IT, IU       PP, OP      A       A00
2         IT, IU       PP, OP      B       B01