Home > front end >  How to aggregate dataframe and sum by boolean columns?
How to aggregate dataframe and sum by boolean columns?

Time:12-22

I have this df and want to aggregate it so that the last 2 columns sum up and reduce duplicates per user id.

current

user_id | name | product | ...| purchase_flag | retention_flag
123     | John | book    | ...| 0             | 1
123     | John | book    | ...| 1             | 0
....

desired state

user_id | name | product | ...| purchase_flag | retention_flag
123     | John | book    | ...| 1             | 1
....

I have a total of 100 columns, so doing a groupby manually in pandas will not be feasible. How do I group by all columns in the df and then sum by the purchase_flag and retention_flag?

I attempted:

df.groupby([how to put all cols here expect the flag columns?]).agg({'purchase_flag':'sum','retention_flag':'sum',})

How do I finish this?

CodePudding user response:

You can filter all columns names if not exist in dict in list comprehension:

d = {'purchase_flag':'sum','retention_flag':'sum'}
df = df.groupby([c for c in df.columns if c not in d], as_index=False).agg(d)
print (df)
   user_id  name product  purchase_flag  retention_flag
0      123  John    book              1               1
  • Related