Aggregating in pandas with two different identification columns-CodePudding

I am trying to aggregate a dataset with purchases, I have shortened the example in this post to keep it simple. The purchases are distinguished based on two different columns used to identify both customer and transaction. The reference refers to the same transaction, while the ID refers to the type of transaction.

I want to sum these records based on ID, however while keeping in mind the reference and not double-counting the size. The example I provide clears it up.

Let me know if you have any idea what I can do in pandas, or otherwise in Python.

This is basically what I have,

Name	Reference	Side	Size	ID
Alex	0	BUY	2400	0
Alex	0	BUY	2400	0
Alex	0	BUY	2400	0
Alex	1	BUY	3000	0
Alex	1	BUY	3000	0
Alex	1	BUY	3000	0
Alex	2	SELL	4500	1
Alex	2	SELL	4500	1
Sam	3	BUY	1500	2
Sam	3	BUY	1500	2
Sam	3	BUY	1500	2

What I am trying to achieve is the following,

Name	Side	Size	ID
Alex	BUY	5400	0
Alex	SELL	4500	1
Sam	BUY	1500	2

CodePudding user response：

Use drop_duplicates, groupby, and agg:

new_df = df.drop_duplicates().groupby(['Name', 'Side']).agg({'Size': 'sum', 'ID': 'first'}).reset_index()

Output:

>>> new_df
   Name  Side  Size  ID
0  Alex   BUY  5400   0
1  Alex  SELL  4500   1
2   Sam   BUY  1500   2

CodePudding user response：

Just drop duplicates first and then aggregate with a list

something like this should do (not tested)

I always like to reset the index after

i.e

df.drop_duplicates().groupby(["Name","Side","ID"]).sum()["Size"].reset_index()

# stops the double counts
df_dropped = df.drop_duplicates()
# groups by all the fields in your example
df_grouped = df_dropped.groupby(["Name","Side","ID"]).sum()["Size"]
# resets the 3 indexes created with above
df_reset = df_grouped.reset_index()

CodePudding user response：

Edit: richardec's solution is better as this will also sum the ID column.

This double groupby should achieve the output you want, as long as names are unique.

df.groupby(['Name', 'Reference']).max().groupby(['Name', 'Side']).sum()

Explanation: First we group by Name and Reference to get the following dataframe. The ".max()" could just as well be ".min()" or ".mean()" as it seems your data will have the same size per unique transaction:

Name	Reference	Side	Size	ID
Alex	0	BUY	2400	0
	1	BUY	3000	0
	2	SELL	4500	1
Sam	3	BUY	1500	2

Then we group this data by Name and Side with a ".sum()" operation to get the final result.

Name	Side	Size	ID
Alex	BUY	5400	0
	SELL	4500	1
Sam	BUY	1500	2