Raise exception in pandas when one column has duplicates but only for given rows

I have a dataframe read in from this Excel file. If you look at FNCL 2019 and 2018, you'll see that those years are duplicated (only in the Vintage column, not in Bal). How can I raise an exception to prevent that from happening? It's not that 2019 and 2018 cannot show up multiple times in the Vintage column overall, but rather that, within the FNCL cohort (or any other cohort, for that matter), each Vintage must not show up more than once.

CodePudding user response：

Assuming the first values are the correct ones, why not simply remove the rest? You can use the following to drop duplicate rows based on certain columns:

df_unique = df.drop_duplicates(subset=['Cohort', 'Vintage'])
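As a rough illustration of what that call keeps, here is a small sketch. The sample rows and balances are made up for the example; only the column names Cohort, Vintage and Bal come from the question.

```python
import pandas as pd

# Hypothetical sample data: FNCL 2019 and FNCL 2018 each appear twice.
df = pd.DataFrame({
    "Cohort": ["FNCL", "FNCL", "FNCL", "FNCL", "G2SF"],
    "Vintage": [2019, 2018, 2019, 2018, 2019],
    "Bal": [100.0, 200.0, 150.0, 250.0, 300.0],
})

# keep='first' is the default, so the first row of each (Cohort, Vintage) pair survives.
df_unique = df.drop_duplicates(subset=["Cohort", "Vintage"])
print(df_unique)
```

Note that the later Bal values for the repeated pairs are silently discarded, which is only appropriate if the first rows really are the correct ones.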

CodePudding user response：

There are two options, depending on the expected outcome. If you want to combine the Bal entries that share the same Cohort and Vintage using an aggregation function f, then

df.groupby(['Cohort', 'Vintage']).agg({'Bal': f}).reset_index()
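For a concrete picture of the aggregation, here is a minimal sketch that assumes f is a sum. That choice is only an assumption for illustration (a mean, max, or custom function would be passed the same way), and the sample rows are made up.

```python
import pandas as pd

# Hypothetical sample data: FNCL 2019 and FNCL 2018 each appear twice.
df = pd.DataFrame({
    "Cohort": ["FNCL", "FNCL", "FNCL", "FNCL", "G2SF"],
    "Vintage": [2019, 2018, 2019, 2018, 2019],
    "Bal": [100.0, 200.0, 150.0, 250.0, 300.0],
})

f = "sum"  # assumed aggregation; swap in whatever fits your data

# One row per (Cohort, Vintage) pair, with the Bal values of duplicates combined.
combined = df.groupby(["Cohort", "Vintage"]).agg({"Bal": f}).reset_index()
print(combined)
```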

Otherwise, if you just want to drop the duplicates and keep the first row of each pair, you can use

df.drop_duplicates(subset=['Cohort', 'Vintage'], keep='first')
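Neither option raises anything. If the goal is to fail loudly when a Vintage repeats within a Cohort, as the question asks, one possible sketch uses DataFrame.duplicated on the same pair of columns. The helper name check_unique_vintages and the sample data are hypothetical, not part of pandas or of the original question.

```python
import pandas as pd


def check_unique_vintages(df, keys=("Cohort", "Vintage")):
    """Raise ValueError if any combination of the key columns occurs more than once."""
    keys = list(keys)
    # keep=False marks every row that belongs to a repeated pair, not just the later ones.
    mask = df.duplicated(subset=keys, keep=False)
    if mask.any():
        offenders = df.loc[mask, keys].drop_duplicates()
        raise ValueError(
            "Duplicate Cohort/Vintage pairs found:\n"
            + offenders.to_string(index=False)
        )
    return df


# Hypothetical sample data: FNCL 2019 and FNCL 2018 each appear twice.
df = pd.DataFrame({
    "Cohort": ["FNCL", "FNCL", "FNCL", "FNCL", "G2SF"],
    "Vintage": [2019, 2018, 2019, 2018, 2019],
    "Bal": [100.0, 200.0, 150.0, 250.0, 300.0],
})

try:
    check_unique_vintages(df)
except ValueError as exc:
    print(exc)
```

A Vintage that repeats across different cohorts (2019 in both FNCL and G2SF here) does not trigger the error, because the check is on the pair rather than on Vintage alone.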