I want to calculate the cumulative sum of values from different columns based on month.
Code:
import pandas as pd
import numpy as np
data = {'month': ['April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December', 'January', 'February', 'March'],
'kpi': ['sales', 'sales', 'sales', 'sales', 'sales', 'sales', 'sales', 'sales', 'sales', 'sales', 'sales', 'sales'],
're': [10, 20, 30, 40, 50, 60, 70, 80, 90, 10, 10, 20],
're3 9': [10, 20, 30, 40, 50, 60, 70, 80, 90, 10, 10, 20],
're6 6': [10, 20, 30, 40, 50, 60, 70, 80, 90, 10, 10, 20],
're9 3': [10, 20, 30, 40, 50, 60, 70, 80, 90, 10, 10, 20]
}
# Create DataFrame
df = pd.DataFrame(data)
print(df)
df['re_final'] = np.where((df['month'] == 'April') | (df['month'] == 'May') | (df['month'] == 'June'),
df.groupby(['kpi'], sort=False)['re'].cumsum(), df['re_final'])
df['re_final'] = np.where((df['month'] == 'July') | (df['month'] == 'August') | (df['month'] == 'September'),
df.groupby(['kpi'], sort=False)['re'].cumsum(), df['re_final'])
df['re_final'] = np.where((df['month'] == 'October') | (df['month'] == 'November') | (df['month'] == 'December'),
df.groupby(['kpi'], sort=False)['re'].cumsum(), df['re_final'])
df['re_final'] = np.where((df['month'] == 'January') | (df['month'] == 'February') | (df['month'] == 'March'),
df.groupby(['kpi'], sort=False)['re'].cumsum(), df['re_final'])
print(df)
Here there is four columns named re
,re3 9
,re6 6
,re9 3
for taking the cumulative sum values.I want to calculate cumulative sum based on the below conditions:
- If the months are April,May and June, cumulative sum will be taken from column
re
- If the months are July,August and September , cumulative sum will be taken from
re3 9
- If the months are October,November and December , cumulative sum will be taken from
re6 6
- If the months are January,February and March,Cumulative sum will be taken from
re9 3
My expected output should be like this:
month kpi re re3 9 re6 6 re9 3 re_final
0 April sales 10 10 10 10 10
1 May sales 20 20 20 20 30
2 June sales 30 30 30 30 60
3 July sales 40 40 40 40 100
4 August sales 50 50 50 50 150
5 September sales 60 60 60 60 210
6 October sales 70 70 70 70 280
7 November sales 80 80 80 80 360
8 December sales 90 90 90 90 450
9 January sales 10 10 10 10 460
10 February sales 10 10 10 10 470
11 March sales 20 20 20 20 490
But when I use the above code, I got an error like below:
Traceback (most recent call last):
File "/home/anf/anaconda3/envs/pa/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3621, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 163, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 're_final'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/anf/pa/s.py", line 19, in <module>
df.groupby(['kpi'], sort=False)['re'].cumsum(), df['re_final'])
File "/home/anf/anaconda3/envs/pa/lib/python3.9/site-packages/pandas/core/frame.py", line 3505, in __getitem__
indexer = self.columns.get_loc(key)
File "/home/anf/anaconda3/envs/pa/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3623, in get_loc
raise KeyError(key) from err
KeyError: 're_final'
Can anyone suggest a solution to solve this issue?
CodePudding user response:
You can create re_final
column before your multiple np.where
df['re_final'] = -1
I would suggest you use np.select
for multiple conditions and Series.isin
for checking value in list, for example:
g = df.groupby(['kpi'], sort=False)
df['re_final'] = np.select(
[(df['month'].isin(['April', 'May', 'June']),
(df['month'].isin(['July', 'August', 'September'])),
], # conditions
[g['re'].cumsum(),
g['re 9'].cumsum()], # choices
-1, # default value
)
CodePudding user response:
I would use a dictionary to perform the month/column mapping, then use an indexing lookup combined with numpy.cumsum
:
d = {'re9 3': ['January', 'February', 'March'],
're': ['April', 'May', 'June'],
're3 9': ['July', 'August', 'September'],
're6 6': ['October', 'November', 'December'],
}
# reverse dictionary
d2 = {k:v for v,l in d.items() for k in l}
idx, cols = pd.factorize(df['month'].map(d2))
df['re_final'] = np.cumsum(df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx])
output:
month kpi re re3 9 re6 6 re9 3 re_final
0 April sales 10 10 10 10 10
1 May sales 20 20 20 20 30
2 June sales 30 30 30 30 60
3 July sales 40 40 40 40 100
4 August sales 50 50 50 50 150
5 September sales 60 60 60 60 210
6 October sales 70 70 70 70 280
7 November sales 80 80 80 80 360
8 December sales 90 90 90 90 450
9 January sales 10 10 10 10 460
10 February sales 10 10 10 10 470
11 March sales 20 20 20 20 490