How can I write a Python function that loops through a list of dataframes and drops duplicate rows?-CodePudding

I have a "Main" dataframe which is being continuously built. Each month I merge new data into that one, however I want to remove rows that exist in both dataframes.

Example "Main"

Name	Date	Begin Time	End Time
Bob	03/10/2022	11:04	14:10
Dirk	05/12/2022	13:15	16:56
Steve	01/13/2022	11:11	13:13

"Other"

Name	Date	Begin Time	End Time
Rog	03/14/2022	11:44	14:30
Dirk	05/12/2022	13:15	16:56
Steve	01/13/2022	11:11	13:13

Expected Result

Name	Date	Begin Time	End Time
Bob	03/10/2022	11:04	14:10
Rog	03/14/2022	11:44	14:30

So far I have been able to accomplish this in a messy way, but there has to be a neater way to do this...

def merger(month, month2):
   a = pd.merge(df1, month, how = 'left', indicator = True)
   a = a[a['_merge'] != 'both'].drop('_merge', axis = 1)
   b = pd.merge(a, month2, how = 'left', indicator = True)
   b = b[b['_merge'] != 'both'].drop('_merge', axis = 1)
   return b

It's hideous and I have to keep adding to the function as each month arrives, but I am fairly new to Python and coding in general. Ideally I would like to write a function that loops through a list of dataframes as I can have up to 12 months at a time. It is important that it removes rows where the indicator = 'both' before merging the next dataframe.

Any help?

CodePudding user response：

IIUC, this can be handled easily with df.drop_duplicates: Thanks to @Mustafa Aydın for reminding keep=False

new_df = pd.concat([main, other])
new_df = new_df.drop_duplicates(keep=False)

In my opinion, pd.concat suits better here as you can concat as many dataframes as you want:

larger_df = pd.concat(df_list)