I have a dataset
Name Subset Type
A00 IU00-A OP
A00 IT00 PP
B01 IT-01A PP
B01 IU OP
B03 IM-09-B LP
B03 IM03A OP
B03 IT-09 OP
D09 IT OP
D09 IM LP
D09 IM OP
So here I need to group the Name column such that Subset and Type are similar. We have to only consider the first alphabetical part of the subset column and ignore rest. for eg IM-09-B, IM03A can be considered as IM.
Output needed
Subset Cluster Type Cluster Name
IU,IT OP,PP A00,B01
IM,IM,IT LP, OP, OP B03, D09
Here the first cluster instance is formed coz IU is OP and IT is PP in both cases, similar for the second instance.
CodePudding user response:
You could use groupby
agg(join)
twice; once to make groups out of "Names"; then again to make groups out of "Subset"-"Type" pairs. Since we're join
ing strings, we first use sort_values
so that when we join later, each group will have the same order of elements.
out = (df.assign(Subset=df['Subset'].str[:2])
.sort_values(by=df.columns.tolist())
.groupby('Name').agg(', '.join)
.add_suffix(' Cluster').reset_index()
.set_index(['Subset Cluster', 'Type Cluster'])
.groupby(level=[0,1]).agg(', '.join).reset_index())
Output:
Subset Cluster Type Cluster Name
0 IM, IM, IT LP, OP, OP B03, D09
1 IT, IU OP, PP A00, B01