calculate sum of rows in pandas dataframe grouped by date


I have a CSV that I loaded into a Pandas DataFrame.

I then select only the rows with duplicate dates in the DF:

df_dups = df[df.duplicated(['Date'])].copy()

I'm trying to get the sum of all the rows with the exact same date for 4 columns (all float values), like this:

df_sum = df_dups.groupby('Date')[["Received Quantity", "Sent Quantity", "Fee Amount", "Market Value"]].sum()

However, this does not give the desired result. When I examine the groupby's .groups mapping, I noticed that it does not include the first row for each date: for two rows with the same date, only one index appears in the groups object.

from pprint import pprint
pprint(df_dups.groupby('Date')[["Received Quantity", "Sent Quantity", "Fee Amount", "Market Value"]].groups)

I have no idea how to get the sum over all the duplicates.

I've also tried:

df_sum = df_dups.groupby('Date')[["Received Quantity", "Sent Quantity", "Fee Amount", "Market Value"]].apply(lambda x: x.sum())

This gives the same result, which makes sense I guess, since the indices in the groupby object are incomplete. What am I missing here?

CodePudding user response:

Check the documentation for the duplicated method. By default, duplicates are marked True except for the first occurrence, which is why the first row of each duplicated date is missing from your sums.
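A quick illustration of that default, using a hypothetical toy frame with a single Date column:

```python
import pandas as pd

# Hypothetical data: two rows share a date, one is unique.
df = pd.DataFrame({"Date": ["2021-01-01", "2021-01-01", "2021-01-02"]})

# Default keep='first': the first occurrence is NOT marked as a duplicate.
print(df.duplicated(["Date"]).tolist())              # [False, True, False]

# keep=False: every row whose date appears more than once is marked.
print(df.duplicated(["Date"], keep=False).tolist())  # [True, True, False]
```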

You only need to pass keep=False to duplicated to get the behaviour you want:

df_dups = df[df.duplicated(['Date'], keep=False)].copy()

After that, the sum is calculated correctly by the expression you already wrote:

df_sum = df_dups.groupby('Date')[["Received Quantity", "Sent Quantity", "Fee Amount", "Market Value"]].apply(lambda x: x.sum())
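Putting it together, here is an end-to-end sketch with invented sample values under the column names from the question (the data itself is made up for illustration). Note that calling .sum() on the selected columns is equivalent to the apply(lambda ...) form and a bit simpler:

```python
import pandas as pd

# Invented sample data using the question's column names.
df = pd.DataFrame({
    "Date": ["2021-01-01", "2021-01-01", "2021-01-02"],
    "Received Quantity": [1.0, 2.0, 5.0],
    "Sent Quantity": [0.5, 0.5, 1.0],
    "Fee Amount": [0.1, 0.2, 0.3],
    "Market Value": [10.0, 20.0, 50.0],
})

cols = ["Received Quantity", "Sent Quantity", "Fee Amount", "Market Value"]

# keep=False keeps every row whose date appears more than once.
df_dups = df[df.duplicated(["Date"], keep=False)].copy()

# Sum the four float columns per date.
df_sum = df_dups.groupby("Date")[cols].sum()
print(df_sum)
# Only 2021-01-01 survives, with both of its rows summed
# (Received Quantity = 3.0, Market Value = 30.0, ...).
```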
