How do I delete duplicated rows, and choosing which ones to keep?


I have a DataFrame with an "id" column and a column holding a boolean-like value ("yes"/"no"). I need to delete the duplicated ids but keep the rows where that value is "yes".

ID    buffer
br1   yes
br1   no
br2   no
br3   yes
br4   no
br4   yes
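
For reference, a minimal sketch that rebuilds this sample as a DataFrame (the column names 'ID' and 'buffer' follow the table above and are only stand-ins for the real data):

import pandas as pd

# Sample data matching the table above
df = pd.DataFrame({
    'ID': ['br1', 'br1', 'br2', 'br3', 'br4', 'br4'],
    'buffer': ['yes', 'no', 'no', 'yes', 'no', 'yes'],
})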

I tried this:

df1 = df[~df[['external_id']].duplicated() | df['buffers'].eq('si')]

where "buffers" is the boolean column. It is deleting rows, but not all of them: I still have ids repeated with both yes and no. I'm working with more than 800,000 rows.
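
Using the sample column names above as a stand-in (the real code uses 'external_id', 'buffers' and 'si'), the mask keeps the first occurrence of every id plus every 'yes' row, so an id whose first row is 'no' and whose later row is 'yes' survives twice:

# Same logic as the attempt, written against the sample columns
# (assumption: 'ID'/'buffer'/'yes' stand in for 'external_id'/'buffers'/'si')
mask = ~df[['ID']].duplicated() | df['buffer'].eq('yes')
print(df[mask])
# Rows kept: 0 (br1/yes), 2 (br2/no), 3 (br3/yes), 4 (br4/no), 5 (br4/yes).
# br4 is still repeated: row 4 passes ~duplicated() (first br4 seen) and
# row 5 passes eq('yes'), which is why ids still show up with both values.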

CodePudding user response:

I think filtering for "yes" first and then cleaning up the duplicates will solve the problem:

For filtering yes: data = data[data["buffer"] == "yes"]

For duplicates: data = data.drop_duplicates(subset=['ID'], keep='last')

Can you try these two out? They should work.
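
Put together on the sample data, a minimal sketch of those two steps (note that ids which only ever have 'no', such as br2, are dropped entirely):

# Step 1: keep only the 'yes' rows
data = df[df['buffer'] == 'yes']
# Step 2: drop any remaining duplicated ids, keeping the last occurrence
data = data.drop_duplicates(subset=['ID'], keep='last')
# Result: br1/yes, br3/yes, br4/yes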

CodePudding user response:

First, drop the duplicated rows:

df1 = df.drop_duplicates(keep='first')

This uses drop_duplicates() to drop rows whose values are identical to an earlier row, keeping only the first occurrence.

Now apply the boolean filter:

df1 = df1[df1['buffer'] == 'yes']

This filters and keeps only the rows whose buffer value is 'yes'.
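
On the sample data, the two lines together look like this (as with the first answer, ids that never have a 'yes' row, such as br2, do not appear in the result):

# Drop rows that are exact duplicates of an earlier row, keeping the first
# (on the sample there are no fully identical rows, so this step changes nothing here)
df1 = df.drop_duplicates(keep='first')
# Keep only the rows whose buffer value is 'yes'
df1 = df1[df1['buffer'] == 'yes']
# Result: br1/yes, br3/yes, br4/yes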
