I have two CSV files that I need to compare. The EF file has all users eligible for our company benefits program. The UL file has all users currently in our system. When the EF file is compared to the UL file, most 'EmployeeId's will appear in both. I am trying to delete both copies of each duplicate, so I am only left with the users who are on the EF file but NOT on the UL file. I have attached what I have so far. I feel like this is not too complicated; I am just getting lost in the logic.
def merger(ef_in, ul_in):
    df_ef = pd.read_csv(ef_in, sep=',')
    df_ul = pd.read_csv(ul_in, sep=',')
    df = pd.concat(
        map(pd.read_csv, [ef_in, ul_in]), ignore_index=True)
    for i in df_ef["EmployeeId"]:
        if i in df_ul["EmployeeId"]:
            df_ef = df_ef.drop(df["EmployeeId"])
    print(df_ef)
    # df.to_csv("finalized.csv", header=True, index=False)

eff = "Test_Ready_EF.csv"
ull = "Test_Ready_UL.csv"
merger(eff, ull)
I have played around with this and googled, but I am unsure how to delete BOTH copies of each duplicate. I was hoping I could use pd.merge() and then follow it with df.drop_duplicates(), but that did not seem to work.
CodePudding user response:
You can accomplish this by setting the keep parameter to False in drop_duplicates and subsetting by EmployeeID. This works in all versions of pandas at least since 1.0.
import pandas as pd

# Toy stand-ins for the EF (eligible) and UL (in-system) files
df_ef = pd.DataFrame({'EmployeeID': ['AAA', 'CCC', 'EEE'], 'Benefits': [1, 0, 1], 'Other Column': [1, 2, 3]})
df_ul = pd.DataFrame({'EmployeeID': ['AAA', 'BBB', 'CCC', 'DDD', 'EEE'], 'Pizza Type': ['pepperoni', 'sausage', 'cheese', 'cheese', 'hates pizza'], 'Salary': [55, 67, 34, 56, 76]})

# Stack both frames, then drop every row whose EmployeeID appears more than once
df = pd.concat([df_ef, df_ul], axis=0, ignore_index=True, sort=False)
df.drop_duplicates(subset=['EmployeeID'], keep=False, inplace=True)
print(df)
  EmployeeID  Benefits  Other Column Pizza Type  Salary
4        BBB       NaN           NaN    sausage    67.0
6        DDD       NaN           NaN     cheese    56.0
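Note that dropping both copies of each duplicate leaves the rows unique to *either* file. Since the question asks specifically for EF users who are NOT on the UL file, an anti-join on the ID column may be closer to what is wanted. A minimal sketch with made-up data (column names here are illustrative, not taken from the real files):

```python
import pandas as pd

# Hypothetical stand-ins for the EF and UL files
df_ef = pd.DataFrame({'EmployeeId': ['AAA', 'BBB', 'CCC', 'DDD'],
                      'Benefits': [1, 0, 1, 0]})
df_ul = pd.DataFrame({'EmployeeId': ['AAA', 'CCC'],
                      'Status': ['active', 'active']})

# Anti-join: keep only EF rows whose EmployeeId never appears in UL
only_ef = df_ef[~df_ef['EmployeeId'].isin(df_ul['EmployeeId'])]
print(only_ef)
```

This keeps only the BBB and DDD rows, and unlike the concat approach it never mixes UL-only rows (or UL-only columns) into the result.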