I have this df and want to aggregate it so that the last 2 columns sum up and reduce duplicates per user id.
current
user_id | name | product | ...| purchase_flag | retention_flag
123 | John | book | ...| 0 | 1
123 | John | book | ...| 1 | 0
....
desired state
user_id | name | product | ...| purchase_flag | retention_flag
123 | John | book | ...| 1 | 1
....
I have a total of 100 columns, so doing a groupby manually in pandas will not be feasible. How do I group by all columns in the df and then sum by the purchase_flag and retention_flag?
I attempted:
df.groupby([how to put all cols here expect the flag columns?]).agg({'purchase_flag':'sum','retention_flag':'sum',})
How do I finish this?
CodePudding user response:
You can filter all columns names if not exist in dict
in list comprehension:
d = {'purchase_flag':'sum','retention_flag':'sum'}
df = df.groupby([c for c in df.columns if c not in d], as_index=False).agg(d)
print (df)
user_id name product purchase_flag retention_flag
0 123 John book 1 1