Home > Blockchain >  Pandas group by id if condition is met and drop rows that are not meeting condition
Pandas group by id if condition is met and drop rows that are not meeting condition

Time:10-07

I have a dataframe where there is multiple user_id rows with different status, if there is at least one row for user_id where status = 'VALID', I then want to remove the rows with status = 'DELETED' for the user_id. Original dataframe looks like this:

user_id alert_source status
1 research deleted
1 research deleted
1 research valid
2 research deleted
2 research valid

I want my clean dataframe look like this

user_id alert_source status
1 research valid
2 research valid

Same goes if all the status values would be equal to 'deleted'. I want to remove the duplicates and have just one row of that user_id to be the desired status – in this case 'deleted'.

I tried googling, but for this scenario I didn't find any relatable sources that could've helped me.

CodePudding user response:

You can just use drop_duplicates method twice:

df = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2, 2, 3],
        "alert_source": ['research', 'research', 'research', 'research', 'research', 'research'],
        "status": ['deleted', 'deleted', 'valid', 'deleted', 'valid', 'deleted']
    }
)
df = df.drop_duplicates(subset=["user_id", "status"])
df = df.drop_duplicates(subset="user_id", keep='last')

Which will result in:

   user_id alert_source   status
2        1     research    valid
4        2     research    valid
5        3     research  deleted

But with this approach (i.e. with using keep='last' keyword argument) you have to make sure that status column is sorted alphabetically.

CodePudding user response:

One way to do this is to first drop_duplicates keeping both user_id and status as the subset. This will give you all rows with different user_ids, except where there is both 'valid' and 'deleted' present. Count and identify which user_ids have more than one entry and then delete the ones with status='deleted'. This way, the code will work without requirement for sorting.

Data

>> df
   user_id alert_source status
0   1   research    deleted
1   1   research    valid
2   1   research    deleted
3   2   research    deleted
4   2   research    valid
5   3   research    deleted
6   3   research    deleted
7   3   research    deleted

Code

df=df.drop_duplicates(subset=['user_id', 'status'])  ## Remove duplicates for subset of user id and status
repeats = df.user_id.value_counts() ## Count values for user_ids
repeats = repeats[repeats > 1].index.to_list() ## Get the IDs for > 1 entry
df=df.drop(df[(df.user_id.isin(repeats)) & (df.status == 'deleted')].index) ##Remove those with > 1 and status == deleted

print(df)

Output


  user_id alert_source  status
1   1   research    valid
4   2   research    valid
5   3   research    deleted
  • Related