I have two CSV files that I need to compare. The EF file has all users eligible for our company benefits program. The UL file has all users currently in our system. When the EF file is compared to the UL file, most 'EmployeeId's will appear in both. I am trying to delete both copies of each duplicate, so I am only left with the users who are on the EF file but NOT on the UL file. I have attached what I have so far. I feel like this is not too complicated; I am just getting lost in the logic.
def merger(ef_in, ul_in):
    df_ef = pd.read_csv(ef_in, sep=',')
    df_ul = pd.read_csv(ul_in, sep=',')
    df = pd.concat(
        map(pd.read_csv, [ef_in, ul_in]), ignore_index=True)
    for i in df_ef["EmployeeId"]:
        if i in df_ul["EmployeeId"]:
            df_ef = df_ef.drop(df["EmployeeId"])
    print(df_ef)
    # df.to_csv("finalized.csv", header=True, index=False)

eff = "Test_Ready_EF.csv"
ull = "Test_Ready_UL.csv"
merger(eff, ull)
I have played around with this and googled, but I am unsure how to delete BOTH copies of each duplicate. I was hoping I could use pd.merge() and then follow it with df.drop_duplicates(), but that did not seem to work.
CodePudding user response:
You can accomplish this by setting the keep parameter to False in drop_duplicates and subsetting by EmployeeID. This works in all versions of pandas at least since 1.0.
import pandas as pd

# Toy stand-ins for the EF (eligible) and UL (in-system) files
df_ef = pd.DataFrame({'EmployeeID': ['AAA', 'CCC', 'EEE'], 'Benefits': [1, 0, 1], 'Other Column': [1, 2, 3]})
df_ul = pd.DataFrame({'EmployeeID': ['AAA', 'BBB', 'CCC', 'DDD', 'EEE'], 'Pizza Type': ['pepperoni', 'sausage', 'cheese', 'cheese', 'hates pizza'], 'Salary': [55, 67, 34, 56, 76]})

# Stack both frames, then drop every row whose EmployeeID appears more than once
df = pd.concat([df_ef, df_ul], axis=0, ignore_index=True, sort=False)
df.drop_duplicates(subset=['EmployeeID'], keep=False, inplace=True)
print(df)
  EmployeeID  Benefits  Other Column Pizza Type  Salary
4        BBB       NaN           NaN    sausage    67.0
6        DDD       NaN           NaN     cheese    56.0
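Note that dropping both copies of each duplicate leaves the rows unique to *either* file. Since the question asks specifically for EF users who are NOT on the UL file, an anti-join on the ID column may be closer to what is wanted. A minimal sketch with made-up data (column names here are illustrative, not taken from the real files):

```python
import pandas as pd

# Hypothetical stand-ins for the EF and UL files
df_ef = pd.DataFrame({'EmployeeId': ['AAA', 'BBB', 'CCC', 'DDD'],
                      'Benefits': [1, 0, 1, 0]})
df_ul = pd.DataFrame({'EmployeeId': ['AAA', 'CCC'],
                      'Status': ['active', 'active']})

# Anti-join: keep only EF rows whose EmployeeId never appears in UL
only_ef = df_ef[~df_ef['EmployeeId'].isin(df_ul['EmployeeId'])]
print(only_ef)
```

This keeps only the BBB and DDD rows, and unlike the concat approach it never mixes UL-only rows (or UL-only columns) into the result.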