I have a table (shown in the linked image), and I want to delete rows that have both 'Pfam' and 'SMART' analysis under the same protein accession code. At the same time, I want to keep entries that contain only 'Pfam' analysis without 'SMART'. I've written a bit of code, but unfortunately it doesn't work.
if (df_filtered['analysis']=='Pfam')&(df_filtered['analysis']=='SMART'):
    df_filtered.drop(index=df_filtered[df_filtered['analysis']=='Pfam'].index)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Could someone help me, please?
Answer:
If I understand correctly, let's say we have the following dataframe:
>>> import pandas as pd
>>> df = pd.DataFrame({'group': list('AABCDD'), 'analysis': ['SMART', 'Pfam', 'SMART', 'Pfam', 'SMART', 'Pfam']})
>>> df
  group analysis
0     A    SMART
1     A     Pfam
2     B    SMART
3     C     Pfam
4     D    SMART
5     D     Pfam
You want to remove only the rows with analysis 'SMART' whose group also contains a row with analysis 'Pfam'. So only rows 0 and 4 are removed here:
df['nunique'] = df.groupby('group').analysis.transform('nunique')
df[~((df['analysis'] == 'SMART') & (df['nunique'] > 1))]
Output:
  group analysis  nunique
1     A     Pfam        2
2     B    SMART        1
3     C     Pfam        1
5     D     Pfam        2
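The original `if` fails because comparing a column to a string yields a boolean Series with one value per row, and Python's `if` needs a single True/False. As a rough sketch of the same idea without the helper column, the variant below builds boolean masks and checks explicitly whether each group contains 'Pfam'. The names `is_smart`, `is_pfam`, and `group_has_pfam` are made up for illustration, and it assumes the same toy dataframe as above; unlike the `nunique` check, it should also behave when a group contains analysis types other than these two.

```python
import pandas as pd

# Same toy data as in the answer above (an assumption, not the asker's real table).
df = pd.DataFrame({
    'group': list('AABCDD'),
    'analysis': ['SMART', 'Pfam', 'SMART', 'Pfam', 'SMART', 'Pfam'],
})

# Elementwise comparisons return boolean Series; combine them with & / | / ~
# instead of a Python `if`, which needs a single True/False.
is_smart = df['analysis'] == 'SMART'
is_pfam = df['analysis'] == 'Pfam'

# For every row: True if any row in the same group is 'Pfam'.
group_has_pfam = is_pfam.groupby(df['group']).transform('any')

# Drop 'SMART' rows only where the group also has a 'Pfam' row.
result = df[~(is_smart & group_has_pfam)]
print(result)
```

If the intent is the other way around (drop the 'Pfam' rows in groups that also have 'SMART'), swapping the roles of the two masks should give that result.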