count by id with dynamic criteria-CodePudding

I have a DataFrame that I want to add to it a new col that states for each id and for each year, how many times there were positive profits in the past. For example:

   id  year  profit
0   1  2018       0
1   1  2019      10
2   1  2020      20
3   1  2021       0
4   2  2018       0
5   2  2019      20
6   2  2020      10

the desired outcome should look like this:

   id  year  profit  past_profit
0   1  2018       0            0
1   1  2019      10            0
2   1  2020      20            1
3   1  2021       0            2
4   2  2018       0            0
5   2  2019      20            0
6   2  2020      10            1

Any ideas?

CodePudding user response：

We could use groupby shift to get past year's data and use groupby cumsum to get how many times a positive profit happened for each "id".

df['past_profit'] = (df['profit']>0).groupby(df['id']).shift().fillna(False).groupby(df['id']).cumsum()

Output:

   id  year  profit  past_profit
0   1  2018       0            0
1   1  2019      10            0
2   1  2020      20            1
3   1  2021       0            2
4   2  2018       0            0
5   2  2019      20            0
6   2  2020      10            1

Note that this assumes that the data is sorted by year. If not, we could first sort by id and year; then use the above code.

df = df.sort_values(by=['id','year'])

CodePudding user response：

With datar, a pandas wrapper that reimagines pandas APIs, it easy to do it like this:

>>> from datar.all import f, tibble, group_by, mutate, lag, cumsum, as_integer, coalesce
>>> 
>>> df = tibble(
...   id=[1,1,1,1,2,2,2], 
...   year=[2018,2019,2020,2021,2018,2019,2020], 
...   profit=[0,10,20,0,0,20,10]
... )
>>> 
>>> (
...   df 
...   >> group_by(f.id) 
...   >> mutate(
...     past_profit=cumsum(       # get the cumsum
...       as_integer(             # convert to integers
...         coalesce(             # replace NAs with 0
...           lag(f.profit > 0),  # shift the result
...           0
...         )
...       )
...     )
...   )
... )
       id    year  profit  past_profit
  <int64> <int64> <int64>      <int64>
0       1    2018       0            0
1       1    2019      10            0
2       1    2020      20            1
3       1    2021       0            2
4       2    2018       0            0
5       2    2019      20            0
6       2    2020      10            1
[TibbleGrouped: id (n=2)]