I have a dataset
Name Subset Type System
A00 IU00-A OP A
A00 IT00 PP A
B01 IT-01A PP B
B01 IU OP B
B03 IM-09-B LP A
B03 IM03A OP A
B03 IT-09 OP A
D09 IT OP A
D09 IM LP A
D09 IM OP A
So here I need to group the Name column such that Subset, System and Type are similar. We have to only consider the first alphabetical part of the subset column and ignore rest. for eg IM-09-B, IM03A can be considered as IM.
Output needed
Subset Cluster Type Cluster Name System
IU,IT OP,PP A00 A
IM,IM,IT LP, OP, OP B03, D09 A
IU,IT OP,PP B01 B
Here the first cluster instance is formed coz IU is OP and IT is PP in both cases, similar for the second instance.
CodePudding user response:
You could do a double groupby
where in the first round, you groupby
"Name" to group "Subset", "Type", and "System"; then again by these columns to group "Name"s.
out = df.assign(Subset=df['Subset'].str[:2])\
.sort_values(by=df.columns.tolist())\
.groupby('Name', as_index=False)\
.agg(**{'Subset Cluster': ('Subset', ', '.join),
'Type Cluster': ('Type', ', '.join),
'System': ('System', 'first')})\
.groupby(['Subset Cluster', 'Type Cluster', 'System'], as_index=False)\
.agg(', '.join)
Output:
Subset Cluster Type Cluster System Name
0 IM, IM, IT LP, OP, OP A B03, D09
1 IT, IU PP, OP A A00
2 IT, IU PP, OP B B01