Remove rows from a Pandas dataframe where a value appears more than once

Time:02-15

I have a dataframe:

ID  EM_results  pa_id_1850
 0           0           1 
 1           1           1 
 2           0           1 
 3           0           2 
 4           0           2 
 5           1           3 
 6           1           3 
 7           0           3 

I want to remove all rows belonging to any pa_id_1850 group in which EM_results equals 1 more than once. In this case, the expected result is:

ID  EM_results  pa_id_1850
 0           0           1 
 1           1           1 
 2           0           1 
 3           0           2 
 4           0           2 

I have tried something like:

grouped = df.groupby('pa_id_1850')    
grouped.filter(lambda x: (x['EM_results'] == 1) <= 1)

But I can't quite get it to work.

CodePudding user response:

Your attempt is close, but the function passed to filter must return a single boolean per group, not a Series. One way to achieve this is to filter on the sum of the EM_results values in each group, assuming those values are only ever 0 or 1:

grouped = df.groupby('pa_id_1850')    
grouped.filter(lambda x: (x['EM_results'].sum()) <= 1)

# Output:

   ID  EM_results  pa_id_1850
0   0           0           1
1   1           1           1
2   2           0           1
3   3           0           2
4   4           0           2
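
If EM_results might hold values other than 0 and 1, the sum is no longer a count of ones. A minimal sketch of a variant that counts the matches explicitly with eq(1) is below; the dataframe is rebuilt from the question's sample data, and the names g and result are only illustrative:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": range(8),
    "EM_results": [0, 1, 0, 0, 0, 1, 1, 0],
    "pa_id_1850": [1, 1, 1, 2, 2, 3, 3, 3],
})

# Count the rows equal to 1 in each group instead of summing raw values,
# so the check does not rely on EM_results being strictly 0 or 1.
result = df.groupby("pa_id_1850").filter(
    lambda g: g["EM_results"].eq(1).sum() <= 1
)
print(result)
```

On the sample data this keeps the same five rows as the sum-based version, since the values there are all 0 or 1.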

CodePudding user response:

Compute a boolean mask with the help of transform, then apply it to the dataframe.

>>> mask = df.groupby('pa_id_1850')['EM_results'].transform('sum').le(1)
>>> df = df[mask]
>>> df
   ID  EM_results  pa_id_1850
0   0           0           1
1   1           1           1
2   2           0           1
3   3           0           2
4   4           0           2
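
To see why this works, it can help to look at the intermediate values: transform broadcasts each group's sum back onto every row of that group, so the mask lines up with the original index. A small self-contained sketch, assuming the question's sample data (the names counts and kept are only illustrative):

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": range(8),
    "EM_results": [0, 1, 0, 0, 0, 1, 1, 0],
    "pa_id_1850": [1, 1, 1, 2, 2, 3, 3, 3],
})

# Per-row copy of the group's total: one value for every original row.
counts = df.groupby("pa_id_1850")["EM_results"].transform("sum")

# True for rows whose group contains at most one 1.
mask = counts.le(1)

# Boolean indexing keeps only the rows where the mask is True.
kept = df[mask]
print(counts.tolist())
print(mask.tolist())
```

Like the sum-based filter, this assumes EM_results only contains 0 or 1.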