I have a DataFrame with an "id" column and a boolean-like column (yes and no). I need to delete the duplicated ids but keep the rows where the boolean is yes.
ID | .... | buffer |
---|---|---|
br1 | .... | yes. |
br1 | .... | no. |
br2 | .... | no. |
br3 | .... | yes |
br4 | .... | no. |
br4 | .... | yes. |
I tried this:
df1=df[~df[['external_id']].duplicated() | df['buffers'].eq('si')]
where "buffers" is the boolean column.
It deletes some rows, but not all of them: I still have ids repeated with both yes and no.
I'm working with more than 800,000 rows.
CodePudding user response:
I think filtering for "yes" first and then dropping the duplicates will solve the problem.
For filtering yes:
data = data[data["buffer"] == "yes."]
For duplicates:
data = data.drop_duplicates(subset=['ID'], keep="last")
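Can you try these out? The two steps together are supposed to work. Note that the comparison has to match the stored value exactly, so adjust "yes." if your data has no trailing period.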
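If IDs that only have "no" rows (such as br2 in the question's table) should survive, filtering on "yes" first would drop them. Here is a minimal sketch of an alternative: sort so that "yes" rows come first, then keep one row per ID. It assumes the column names `ID` and `buffer` from the table, and that values may or may not carry a trailing period as shown there. The helper column `_is_yes` is a hypothetical name used only for illustration.

```python
import pandas as pd

# Sample data mirroring the table in the question (trailing periods kept as shown).
df = pd.DataFrame({
    "ID": ["br1", "br1", "br2", "br3", "br4", "br4"],
    "buffer": ["yes.", "no.", "no.", "yes", "no.", "yes."],
})

# Normalise the flag: strip any trailing period, then compare with "yes".
is_yes = df["buffer"].str.rstrip(".").eq("yes")

result = (
    df.assign(_is_yes=is_yes)                      # temporary helper column
      .sort_values("_is_yes", ascending=False)     # "yes" rows first
      .drop_duplicates(subset="ID", keep="first")  # one row per ID, "yes" preferred
      .drop(columns="_is_yes")
      .sort_index()                                # restore the original row order
)

print(result)
```

With the sample data, each ID appears once: the "yes" row is kept for br1, br3, and br4, and the single "no" row is kept for br2.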
CodePudding user response:
First, drop the duplicated rows:
df1 = df.drop_duplicates(keep='first')
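This uses drop_duplicates() to drop rows whose values are identical in every column, keeping the first occurrence of each. Because no subset is given, two rows with the same ID but different buffer values are not treated as duplicates.
Now apply boolean filtering: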
df1 = df1[df1['buffer'] == 'yes.']
This keeps only the rows whose buffer value is 'yes.'.
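The mask in the question, `~duplicated() | eq(...)`, keeps the first occurrence of every ID whether or not it is a "yes", which is why pairs such as br4 still show up with both values. A sketch of a mask that avoids this without sorting is shown below. It assumes the column names `ID` and `buffer` from the table and tolerates the optional trailing period. The names `id_has_yes`, `kept`, and `deduped` are illustrative, not part of any answer above.

```python
import pandas as pd

# Sample data mirroring the table in the question.
df = pd.DataFrame({
    "ID": ["br1", "br1", "br2", "br3", "br4", "br4"],
    "buffer": ["yes.", "no.", "no.", "yes", "no.", "yes."],
})

# True where the flag is "yes", ignoring a trailing period.
is_yes = df["buffer"].str.rstrip(".").eq("yes")

# True for every row whose ID has at least one "yes" row.
id_has_yes = is_yes.groupby(df["ID"]).transform("any")

# Keep "yes" rows, plus all rows of IDs that never have a "yes".
kept = df[is_yes | ~id_has_yes]

# Collapse any remaining repeats (e.g. two "yes" rows for the same ID).
deduped = kept.drop_duplicates(subset="ID", keep="first")

print(deduped)
```

Both steps are vectorised, so this should remain practical for a frame with several hundred thousand rows, though that has not been measured here.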