I have a multi-index dataset, given below:
import numpy as np
import pandas as pd

arr = np.array([12, 12, 12, 72, 72, 72, 26, 26, 26, 22, 22, 22, 46, 46, 46, 32, 32, 32])
df = pd.DataFrame({'date': ['1/1/2000', '1/1/2000', '1/1/2000',
                            '2/1/2000', '2/1/2000', '2/1/2000',
                            '3/1/2000', '3/1/2000', '3/1/2000',
                            '1/1/2000', '1/1/2000', '1/1/2000',
                            '2/1/2000', '2/1/2000', '2/1/2000',
                            '3/1/2000', '3/1/2000', '3/1/2000'],
                   'type': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A',
                            'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B'],
                   'lags': ['31/12/1999', '30/12/1999', '29/12/1999',
                            '1/1/2000', '31/12/1999', '30/12/1999',
                            '2/1/2000', '1/1/2000', '31/12/1999',
                            '31/12/1999', '30/12/1999', '29/12/1999',
                            '1/1/2000', '31/12/1999', '30/12/1999',
                            '2/1/2000', '1/1/2000', '31/12/1999']})
df["target"] = arr
df.set_index(['date', 'type', 'lags'], inplace=True)
I am trying to apply cumsum for each type, so I run this code:
for g_name, g_df in df.groupby("type"):
    df.loc[g_df.index, 'target'] = df.loc[g_df.index, 'target'].cumsum().astype('float64')
However, this applies cumsum row by row within each type, and the output is:
target
date type lags
1/1/2000 A 31/12/1999 12.0
30/12/1999 24.0
29/12/1999 36.0
2/1/2000 A 1/1/2000 108.0
31/12/1999 180.0
30/12/1999 252.0
3/1/2000 A 2/1/2000 278.0
1/1/2000 304.0
31/12/1999 330.0
1/1/2000 B 31/12/1999 22.0
30/12/1999 44.0
29/12/1999 66.0
2/1/2000 B 1/1/2000 112.0
31/12/1999 158.0
30/12/1999 204.0
3/1/2000 B 2/1/2000 236.0
1/1/2000 268.0
31/12/1999 300.0
The expected result is:
target
date type lags
1/1/2000 A 31/12/1999 12.0
30/12/1999 12.0
29/12/1999 12.0
2/1/2000 A 1/1/2000 84.0
31/12/1999 84.0
30/12/1999 84.0
3/1/2000 A 2/1/2000 110.0
1/1/2000 110.0
31/12/1999 110.0
1/1/2000 B 31/12/1999 22.0
30/12/1999 22.0
29/12/1999 22.0
2/1/2000 B 1/1/2000 68.0
31/12/1999 68.0
30/12/1999 68.0
3/1/2000 B 2/1/2000 100.0
1/1/2000 100.0
31/12/1999 100.0
How can I get the expected result and update the original data in a Pythonic way?
CodePudding user response:
You can group by date and type and take the first value of each group, then group by type to compute the cumulative sum. Finally, reindex the output against the original index to broadcast the values back to every row:
>>> df.groupby(['date', 'type']).first().groupby('type').cumsum().reindex(df.index)
target
date type lags
1/1/2000 A 31/12/1999 12
30/12/1999 12
29/12/1999 12
2/1/2000 A 1/1/2000 84
31/12/1999 84
30/12/1999 84
3/1/2000 A 2/1/2000 110
1/1/2000 110
31/12/1999 110
1/1/2000 B 31/12/1999 22
30/12/1999 22
29/12/1999 22
2/1/2000 B 1/1/2000 68
31/12/1999 68
30/12/1999 68
3/1/2000 B 2/1/2000 100
1/1/2000 100
31/12/1999 100
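Since the question also asks to update the original data, one possible way is to assign the broadcast result back into the target column. This is only a sketch reusing the expression above, and it assumes the reindex broadcasts over the full MultiIndex as shown; the intermediate name cum is just for illustration:
# Sketch: write the per-(date, type) cumulative sums back into the original frame
cum = df.groupby(['date', 'type']).first().groupby('type').cumsum()
df['target'] = cum.reindex(df.index)['target']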
CodePudding user response:
You could first use groupby with cumcount to create the groups, then use that grouping in a groupby followed by cumsum:
out = df.groupby([df.groupby(level=[0,1]).cumcount(), df.index.get_level_values(1)]).cumsum()
Output:
target
date type lags
1/1/2000 A 31/12/1999 12
30/12/1999 12
29/12/1999 12
2/1/2000 A 1/1/2000 84
31/12/1999 84
30/12/1999 84
3/1/2000 A 2/1/2000 110
1/1/2000 110
31/12/1999 110
1/1/2000 B 31/12/1999 22
30/12/1999 22
29/12/1999 22
2/1/2000 B 1/1/2000 68
31/12/1999 68
30/12/1999 68
3/1/2000 B 2/1/2000 100
1/1/2000 100
31/12/1999 100
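Here the inner cumcount numbers the rows 0, 1, 2 within each (date, type) group, so the outer groupby pairs the n-th lag row of each date with the n-th lag rows of the other dates for the same type, and the cumulative sum then runs across dates. Because out keeps the same MultiIndex as df, updating the original frame, as the question asks, can be a simple assignment; a minimal sketch (the float cast just mirrors the astype in the question):
# Sketch: out is aligned with df, so the column can be written back directly
df['target'] = out['target'].astype('float64')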