Pandas group by id if condition is met and drop rows that are not meeting condition-CodePudding

I have a dataframe where there is multiple user_id rows with different status, if there is at least one row for user_id where status = 'VALID', I then want to remove the rows with status = 'DELETED' for the user_id. Original dataframe looks like this:

user_id	alert_source	status
1	research	deleted
1	research	deleted
1	research	valid
2	research	deleted
2	research	valid

I want my clean dataframe look like this

user_id	alert_source	status
1	research	valid
2	research	valid

Same goes if all the status values would be equal to 'deleted'. I want to remove the duplicates and have just one row of that user_id to be the desired status – in this case 'deleted'.

I tried googling, but for this scenario I didn't find any relatable sources that could've helped me.

CodePudding user response：

You can just use drop_duplicates method twice:

df = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2, 2, 3],
        "alert_source": ['research', 'research', 'research', 'research', 'research', 'research'],
        "status": ['deleted', 'deleted', 'valid', 'deleted', 'valid', 'deleted']
    }
)
df = df.drop_duplicates(subset=["user_id", "status"])
df = df.drop_duplicates(subset="user_id", keep='last')

Which will result in:

   user_id alert_source   status
2        1     research    valid
4        2     research    valid
5        3     research  deleted

But with this approach (i.e. with using keep='last' keyword argument) you have to make sure that status column is sorted alphabetically.

CodePudding user response：

One way to do this is to first drop_duplicates keeping both user_id and status as the subset. This will give you all rows with different user_ids, except where there is both 'valid' and 'deleted' present. Count and identify which user_ids have more than one entry and then delete the ones with status='deleted'. This way, the code will work without requirement for sorting.

Data

>> df
   user_id alert_source status
0   1   research    deleted
1   1   research    valid
2   1   research    deleted
3   2   research    deleted
4   2   research    valid
5   3   research    deleted
6   3   research    deleted
7   3   research    deleted

Code

df=df.drop_duplicates(subset=['user_id', 'status'])  ## Remove duplicates for subset of user id and status
repeats = df.user_id.value_counts() ## Count values for user_ids
repeats = repeats[repeats > 1].index.to_list() ## Get the IDs for > 1 entry
df=df.drop(df[(df.user_id.isin(repeats)) & (df.status == 'deleted')].index) ##Remove those with > 1 and status == deleted

print(df)

Output


  user_id alert_source  status
1   1   research    valid
4   2   research    valid
5   3   research    deleted