I have the following dataframe "data" composed of ID and associated cluster number :
ID cluster
FP_101 1
FP_102 1
SP_209 3
SP_300 3
SP_209 1
FP_45 90
SP_50 90
FP_398 100
...
I would like to print clusters which contain more than one ID starting by SP and/or FP. I think that I have the two parts of the answer but just do not know of to combine them in propre way :
- data = data[data['ID'].str.startswith('FP')] (same for SP)
- selection fonction : data = data.groupby(['cluster']).filter(lambda x: x['ID'].nunique() > 1)
The result should give from the previous example :
ID cluster
FP_101 1
FP_102 1
SP_209 1
SP_209 3
SP_300 3
How can I combine arrange these fonction to obtain this result ?
CodePudding user response:
This is my understanding of your question; let me know if it helps:
- Separating SP & FP
df['Prefix'] = df['ID'].apply(lambda x: x.split('_')[0])
- Grouping by clusters
df2 = df.groupby(['cluster', 'Prefix'], as_index = False).agg({'ID':['nunique','unique']})
- Filtering
df2.columns = df2.columns.to_flat_index().str.join('') df2[df2['IDnunique']>1]