I have a DataFrame that I want to add to it a new col that states for each id and for each year, how many times there were positive profits in the past. For example:
id year profit
0 1 2018 0
1 1 2019 10
2 1 2020 20
3 1 2021 0
4 2 2018 0
5 2 2019 20
6 2 2020 10
the desired outcome should look like this:
id year profit past_profit
0 1 2018 0 0
1 1 2019 10 0
2 1 2020 20 1
3 1 2021 0 2
4 2 2018 0 0
5 2 2019 20 0
6 2 2020 10 1
Any ideas?
CodePudding user response:
We could use groupby
shift
to get past year's data and use groupby
cumsum
to get how many times a positive profit happened for each "id".
df['past_profit'] = (df['profit']>0).groupby(df['id']).shift().fillna(False).groupby(df['id']).cumsum()
Output:
id year profit past_profit
0 1 2018 0 0
1 1 2019 10 0
2 1 2020 20 1
3 1 2021 0 2
4 2 2018 0 0
5 2 2019 20 0
6 2 2020 10 1
Note that this assumes that the data is sorted by year
. If not, we could first sort by id and year; then use the above code.
df = df.sort_values(by=['id','year'])
CodePudding user response:
With datar
, a pandas wrapper that reimagines pandas APIs, it easy to do it like this:
>>> from datar.all import f, tibble, group_by, mutate, lag, cumsum, as_integer, coalesce
>>>
>>> df = tibble(
... id=[1,1,1,1,2,2,2],
... year=[2018,2019,2020,2021,2018,2019,2020],
... profit=[0,10,20,0,0,20,10]
... )
>>>
>>> (
... df
... >> group_by(f.id)
... >> mutate(
... past_profit=cumsum( # get the cumsum
... as_integer( # convert to integers
... coalesce( # replace NAs with 0
... lag(f.profit > 0), # shift the result
... 0
... )
... )
... )
... )
... )
id year profit past_profit
<int64> <int64> <int64> <int64>
0 1 2018 0 0
1 1 2019 10 0
2 1 2020 20 1
3 1 2021 0 2
4 2 2018 0 0
5 2 2019 20 0
6 2 2020 10 1
[TibbleGrouped: id (n=2)]