Home > database >  Aggregating in pandas with two different identification columns
Aggregating in pandas with two different identification columns

Time:03-22

I am trying to aggregate a dataset with purchases, I have shortened the example in this post to keep it simple. The purchases are distinguished based on two different columns used to identify both customer and transaction. The reference refers to the same transaction, while the ID refers to the type of transaction.

I want to sum these records based on ID, however while keeping in mind the reference and not double-counting the size. The example I provide clears it up.

Let me know if you have any idea what I can do in pandas, or otherwise in Python.

This is basically what I have,

Name Reference Side Size ID
Alex 0 BUY 2400 0
Alex 0 BUY 2400 0
Alex 0 BUY 2400 0
Alex 1 BUY 3000 0
Alex 1 BUY 3000 0
Alex 1 BUY 3000 0
Alex 2 SELL 4500 1
Alex 2 SELL 4500 1
Sam 3 BUY 1500 2
Sam 3 BUY 1500 2
Sam 3 BUY 1500 2

What I am trying to achieve is the following,

Name Side Size ID
Alex BUY 5400 0
Alex SELL 4500 1
Sam BUY 1500 2

CodePudding user response:

Use drop_duplicates, groupby, and agg:

new_df = df.drop_duplicates().groupby(['Name', 'Side']).agg({'Size': 'sum', 'ID': 'first'}).reset_index()

Output:

>>> new_df
   Name  Side  Size  ID
0  Alex   BUY  5400   0
1  Alex  SELL  4500   1
2   Sam   BUY  1500   2

CodePudding user response:

Just drop duplicates first and then aggregate with a list

something like this should do (not tested)

I always like to reset the index after

i.e

df.drop_duplicates().groupby(["Name","Side","ID"]).sum()["Size"].reset_index()

or

# stops the double counts
df_dropped = df.drop_duplicates()
# groups by all the fields in your example
df_grouped = df_dropped.groupby(["Name","Side","ID"]).sum()["Size"]
# resets the 3 indexes created with above
df_reset = df_grouped.reset_index()

CodePudding user response:

Edit: richardec's solution is better as this will also sum the ID column.

This double groupby should achieve the output you want, as long as names are unique.

df.groupby(['Name', 'Reference']).max().groupby(['Name', 'Side']).sum()

Explanation: First we group by Name and Reference to get the following dataframe. The ".max()" could just as well be ".min()" or ".mean()" as it seems your data will have the same size per unique transaction:

Name Reference Side Size ID
Alex 0 BUY 2400 0
1 BUY 3000 0
2 SELL 4500 1
Sam 3 BUY 1500 2

Then we group this data by Name and Side with a ".sum()" operation to get the final result.

Name Side Size ID
Alex BUY 5400 0
SELL 4500 1
Sam BUY 1500 2
  • Related