I have a dataframe where there is multiple user_id rows with different status, if there is at least one row for user_id where status = 'VALID', I then want to remove the rows with status = 'DELETED' for the user_id. Original dataframe looks like this:
user_id | alert_source | status |
---|---|---|
1 | research | deleted |
1 | research | deleted |
1 | research | valid |
2 | research | deleted |
2 | research | valid |
I want my clean dataframe look like this
user_id | alert_source | status |
---|---|---|
1 | research | valid |
2 | research | valid |
Same goes if all the status values would be equal to 'deleted'. I want to remove the duplicates and have just one row of that user_id to be the desired status – in this case 'deleted'.
I tried googling, but for this scenario I didn't find any relatable sources that could've helped me.
CodePudding user response:
You can just use drop_duplicates
method twice:
df = pd.DataFrame(
{
"user_id": [1, 1, 1, 2, 2, 3],
"alert_source": ['research', 'research', 'research', 'research', 'research', 'research'],
"status": ['deleted', 'deleted', 'valid', 'deleted', 'valid', 'deleted']
}
)
df = df.drop_duplicates(subset=["user_id", "status"])
df = df.drop_duplicates(subset="user_id", keep='last')
Which will result in:
user_id alert_source status
2 1 research valid
4 2 research valid
5 3 research deleted
But with this approach (i.e. with using keep='last'
keyword argument) you have to make sure that status
column is sorted alphabetically.
CodePudding user response:
One way to do this is to first drop_duplicates keeping both user_id and status as the subset. This will give you all rows with different user_ids, except where there is both 'valid' and 'deleted' present. Count and identify which user_ids have more than one entry and then delete the ones with status='deleted'. This way, the code will work without requirement for sorting.
Data
>> df
user_id alert_source status
0 1 research deleted
1 1 research valid
2 1 research deleted
3 2 research deleted
4 2 research valid
5 3 research deleted
6 3 research deleted
7 3 research deleted
Code
df=df.drop_duplicates(subset=['user_id', 'status']) ## Remove duplicates for subset of user id and status
repeats = df.user_id.value_counts() ## Count values for user_ids
repeats = repeats[repeats > 1].index.to_list() ## Get the IDs for > 1 entry
df=df.drop(df[(df.user_id.isin(repeats)) & (df.status == 'deleted')].index) ##Remove those with > 1 and status == deleted
print(df)
Output
user_id alert_source status
1 1 research valid
4 2 research valid
5 3 research deleted