I have a "Main" dataframe which is being continuously built. Each month I merge new data into that one, however I want to remove rows that exist in both dataframes.
Example "Main"
Name | Date | Begin Time | End Time |
---|---|---|---|
Bob | 03/10/2022 | 11:04 | 14:10 |
Dirk | 05/12/2022 | 13:15 | 16:56 |
Steve | 01/13/2022 | 11:11 | 13:13 |
"Other"
Name | Date | Begin Time | End Time |
---|---|---|---|
Rog | 03/14/2022 | 11:44 | 14:30 |
Dirk | 05/12/2022 | 13:15 | 16:56 |
Steve | 01/13/2022 | 11:11 | 13:13 |
Expected Result
Name | Date | Begin Time | End Time |
---|---|---|---|
Bob | 03/10/2022 | 11:04 | 14:10 |
Rog | 03/14/2022 | 11:44 | 14:30 |
So far I have been able to accomplish this in a messy way, but there has to be a neater way to do this...
def merger(month, month2):
a = pd.merge(df1, month, how = 'left', indicator = True)
a = a[a['_merge'] != 'both'].drop('_merge', axis = 1)
b = pd.merge(a, month2, how = 'left', indicator = True)
b = b[b['_merge'] != 'both'].drop('_merge', axis = 1)
return b
It's hideous and I have to keep adding to the function as each month arrives, but I am fairly new to Python and coding in general. Ideally I would like to write a function that loops through a list of dataframes as I can have up to 12 months at a time. It is important that it removes rows where the indicator = 'both' before merging the next dataframe.
Any help?
CodePudding user response:
IIUC, this can be handled easily with df.drop_duplicates
:
Thanks to @Mustafa Aydın for reminding keep=False
new_df = pd.concat([main, other])
new_df = new_df.drop_duplicates(keep=False)
In my opinion, pd.concat
suits better here as you can concat as many dataframes as you want:
larger_df = pd.concat(df_list)