Home > Software design >  Sum up values based condition, if does not match keep current values
Sum up values based condition, if does not match keep current values

Time:04-27

I am looking for a way to sum up the values > or < a certain threshold in a given column (here > 6 in days_install_to_event column).

I tried many different ways, such a loc, query or groupby, but it return only the values > 6 not the ones < 6.

Here some of the things I have tried:

df = pd.DataFrame({
                    'custom_action' : ['First_puchase', 'First_puchase', 'First_puchase', 'First_puchase',
                    'First_puchase', 'First_puchase', 'First_puchase', 'First_puchase'],
                    'days_install_to_event' : [1, 2, 3, 4, 5, 6, 7, 8],
                    'number_unique_users' : [1350, 250, 13, 2, 1, 2, 1, 2]})
df

custom_action days_install_to_event number_unique_users
0 First_puchase                     1                1350
1 First_puchase                     2                 250
2 First_puchase                     3                  13
3 First_puchase                     4                   2
4 First_puchase                     5                   1
5 First_puchase                     6                   2
6 First_puchase                     7                   1
7 First_puchase                     8                   2
8 First_puchase                     9                   3
9 First_puchase                     10                  2

df_1 = df.loc[df['days_install_to_event'] > 6].sum()

df_2 = df.query("days_install_to_event > 6")['number_unique_users'].sum()

df_1
df_2

Output:

custom_action            First_puchaseFirst_puchase
days_install_to_event                            34
number_unique_users                               8
8

Desired output:

custom_action days_install_to_event number_unique_users
0 First_puchase                     1                1350
1 First_puchase                     2                 250
2 First_puchase                     3                  13
3 First_puchase                     4                   2
4 First_puchase                     5                   1
5 First_puchase                     6                   2
6 First_puchase                     7                   8

In advance, sorry if a very similar question have been asked, I have been looking around for past 2 days but found nothing that could match exactly what I was looking for. It may be due to formulation.

Thanks for your help :)

CodePudding user response:

As far as I know there is no out-of-the-box solution for this but you can get this result by creating a helper grouper column:

# Set days_install_to_event = 7  if the value is larger than 6
grouper = df['days_install_to_event'].mask(df['days_install_to_event'] > 6, '7 ')

Then, with the help of this column, you can use groupby.agg:

In [27]: df.groupby(grouper).agg({
             'number_unique_users': 'sum', 
             'custom_action': 'first',
         }).reset_index()
Out[27]:
  days_install_to_event  number_unique_users  custom_action
0                     1                 1350  First_puchase
1                     2                  250  First_puchase
2                     3                   13  First_puchase
3                     4                    2  First_puchase
4                     5                    1  First_puchase
5                     6                    2  First_puchase
6                    7                     8  First_puchase
  • Related