I am trying to aggregate a dataset of purchases; I have shortened the example in this post to keep it simple. The purchases are distinguished by two columns: Reference identifies a single transaction (the same transaction may appear on several rows), while ID identifies the type of transaction.
I want to sum the sizes per ID while taking the Reference into account, so that duplicate rows of the same transaction are not double-counted. The example below should clear this up.
Let me know if you have any idea how I can do this in pandas, or otherwise in Python.
This is basically what I have:
Name | Reference | Side | Size | ID |
---|---|---|---|---|
Alex | 0 | BUY | 2400 | 0 |
Alex | 0 | BUY | 2400 | 0 |
Alex | 0 | BUY | 2400 | 0 |
Alex | 1 | BUY | 3000 | 0 |
Alex | 1 | BUY | 3000 | 0 |
Alex | 1 | BUY | 3000 | 0 |
Alex | 2 | SELL | 4500 | 1 |
Alex | 2 | SELL | 4500 | 1 |
Sam | 3 | BUY | 1500 | 2 |
Sam | 3 | BUY | 1500 | 2 |
Sam | 3 | BUY | 1500 | 2 |
What I am trying to achieve is the following:
Name | Side | Size | ID |
---|---|---|---|
Alex | BUY | 5400 | 0 |
Alex | SELL | 4500 | 1 |
Sam | BUY | 1500 | 2 |
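For anyone who wants to experiment, the sample data above can be reconstructed like this:

```python
import pandas as pd

# Sample data from the tables above: rows with the same Reference are
# exact duplicates of one transaction.
df = pd.DataFrame({
    "Name":      ["Alex"] * 8 + ["Sam"] * 3,
    "Reference": [0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3],
    "Side":      ["BUY"] * 6 + ["SELL"] * 2 + ["BUY"] * 3,
    "Size":      [2400] * 3 + [3000] * 3 + [4500] * 2 + [1500] * 3,
    "ID":        [0] * 6 + [1] * 2 + [2] * 3,
})
```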
CodePudding user response:
Use `drop_duplicates`, `groupby`, and `agg`:
new_df = df.drop_duplicates().groupby(['Name', 'Side']).agg({'Size': 'sum', 'ID': 'first'}).reset_index()
Output:
>>> new_df
Name Side Size ID
0 Alex BUY 5400 0
1 Alex SELL 4500 1
2 Sam BUY 1500 2
CodePudding user response:
Just drop the duplicates first and then aggregate, grouping by a list of columns. Something like this should do (not tested); I always like to reset the index afterwards, i.e.
df.drop_duplicates().groupby(["Name", "Side", "ID"])["Size"].sum().reset_index()
or
# stops the double counting
df_dropped = df.drop_duplicates()
# groups by all the identifying fields in your example
df_grouped = df_dropped.groupby(["Name", "Side", "ID"])["Size"].sum()
# resets the 3 index levels created above
df_reset = df_grouped.reset_index()
CodePudding user response:
Edit: richardec's solution is better, as this one will also sum the ID column.
This double groupby should achieve the output you want, as long as names are unique.
df.groupby(['Name', 'Reference']).max().groupby(['Name', 'Side']).sum()
Explanation: First we group by Name and Reference, which gives the following dataframe. The ".max()" could just as well be ".min()" or ".first()", since every duplicate row of a transaction carries the same size:
Name | Reference | Side | Size | ID |
---|---|---|---|---|
Alex | 0 | BUY | 2400 | 0 |
 | 1 | BUY | 3000 | 0 |
 | 2 | SELL | 4500 | 1 |
Sam | 3 | BUY | 1500 | 2 |
Then we group this data by Name and Side with ".sum()" to get the final result:
Name | Side | Size | ID |
---|---|---|---|
Alex | BUY | 5400 | 0 |
 | SELL | 4500 | 1 |
Sam | BUY | 1500 | 2 |
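Put together, a minimal runnable sketch of this two-step groupby (the sample frame is rebuilt from the question; `reset_index` is added at the end to flatten the result):

```python
import pandas as pd

# Sample data rebuilt from the question's table.
df = pd.DataFrame({
    "Name":      ["Alex"] * 8 + ["Sam"] * 3,
    "Reference": [0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3],
    "Side":      ["BUY"] * 6 + ["SELL"] * 2 + ["BUY"] * 3,
    "Size":      [2400] * 3 + [3000] * 3 + [4500] * 2 + [1500] * 3,
    "ID":        [0] * 6 + [1] * 2 + [2] * 3,
})

# Step 1: collapse duplicate rows of each transaction. max/min/first all
# work because duplicates of one Reference carry identical values.
per_txn = df.groupby(["Name", "Reference"]).max()

# Step 2: sum transaction sizes per Name/Side. Note that ID is summed
# too; it only happens to give the right answer here because the summed
# IDs coincide, so aggregating ID with "first" is safer in general.
result = per_txn.groupby(["Name", "Side"]).sum().reset_index()
```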