How to calculate cumulative sum from different columns based on months using pandas dataframe?-CodePudding

I want to calculate the cumulative sum of values from different columns based on month.

Code:

import pandas as pd
import numpy as np

data = {'month': ['April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December', 'January', 'February', 'March'],
    'kpi': ['sales', 'sales', 'sales', 'sales', 'sales', 'sales', 'sales', 'sales', 'sales', 'sales', 'sales', 'sales'],
    're': [10, 20, 30, 40, 50, 60, 70, 80, 90, 10, 10, 20],
    're3 9': [10, 20, 30, 40, 50, 60, 70, 80, 90, 10, 10, 20],
    're6 6': [10, 20, 30, 40, 50, 60, 70, 80, 90, 10, 10, 20],
    're9 3': [10, 20, 30, 40, 50, 60, 70, 80, 90, 10, 10, 20]
    }

# Create DataFrame
df = pd.DataFrame(data)
print(df)
df['re_final'] = np.where((df['month'] == 'April') | (df['month'] == 'May') | (df['month'] == 'June'),
                                    df.groupby(['kpi'], sort=False)['re'].cumsum(), df['re_final'])
df['re_final'] = np.where((df['month'] == 'July') | (df['month'] == 'August') | (df['month'] == 'September'),
                                    df.groupby(['kpi'], sort=False)['re'].cumsum(), df['re_final'])
df['re_final'] = np.where((df['month'] == 'October') | (df['month'] == 'November') | (df['month'] == 'December'),
                                    df.groupby(['kpi'], sort=False)['re'].cumsum(), df['re_final'])
df['re_final'] = np.where((df['month'] == 'January') | (df['month'] == 'February') | (df['month'] == 'March'),
                                    df.groupby(['kpi'], sort=False)['re'].cumsum(), df['re_final'])
print(df)

Here there is four columns named re,re3 9,re6 6,re9 3 for taking the cumulative sum values.I want to calculate cumulative sum based on the below conditions:

If the months are April,May and June, cumulative sum will be taken from column re
If the months are July,August and September , cumulative sum will be taken from re3 9
If the months are October,November and December , cumulative sum will be taken from re6 6
If the months are January,February and March,Cumulative sum will be taken from re9 3

My expected output should be like this:

       month             kpi  re  re3 9  re6 6  re9 3     re_final
0       April           sales  10     10     10     10     10
1         May           sales  20     20     20     20     30
2        June           sales  30     30     30     30     60
3        July           sales  40     40     40     40     100
4      August           sales  50     50     50     50     150
5   September           sales  60     60     60     60     210
6     October           sales  70     70     70     70     280
7    November           sales  80     80     80     80     360    
8    December           sales  90     90     90     90     450
9     January           sales  10     10     10     10     460
10   February           sales  10     10     10     10     470
11      March           sales  20     20     20     20     490

But when I use the above code, I got an error like below:

Traceback (most recent call last):
File "/home/anf/anaconda3/envs/pa/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3621, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 163, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in   pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 're_final'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/anf/pa/s.py", line 19, in <module>
    df.groupby(['kpi'], sort=False)['re'].cumsum(), df['re_final'])
  File "/home/anf/anaconda3/envs/pa/lib/python3.9/site-packages/pandas/core/frame.py", line 3505, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/anf/anaconda3/envs/pa/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3623, in get_loc
    raise KeyError(key) from err
KeyError: 're_final'

Can anyone suggest a solution to solve this issue?

CodePudding user response：

You can create re_final column before your multiple np.where

df['re_final'] = -1

I would suggest you use np.select for multiple conditions and Series.isin for checking value in list, for example:

g = df.groupby(['kpi'], sort=False)

df['re_final'] = np.select(
    [(df['month'].isin(['April', 'May', 'June']),
     (df['month'].isin(['July', 'August', 'September'])),
     ], # conditions
    [g['re'].cumsum(),
     g['re 9'].cumsum()], # choices
    -1, # default value
)

CodePudding user response：

I would use a dictionary to perform the month/column mapping, then use an indexing lookup combined with numpy.cumsum:

d = {'re9 3': ['January', 'February', 'March'],
     're':    ['April', 'May', 'June'],
     're3 9': ['July', 'August', 'September'],
     're6 6': ['October', 'November', 'December'],
    }

# reverse dictionary
d2 = {k:v for v,l in d.items() for k in l}

idx, cols = pd.factorize(df['month'].map(d2))
df['re_final'] = np.cumsum(df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx])

output:

        month    kpi  re  re3 9  re6 6  re9 3  re_final
0       April  sales  10     10     10     10        10
1         May  sales  20     20     20     20        30
2        June  sales  30     30     30     30        60
3        July  sales  40     40     40     40       100
4      August  sales  50     50     50     50       150
5   September  sales  60     60     60     60       210
6     October  sales  70     70     70     70       280
7    November  sales  80     80     80     80       360
8    December  sales  90     90     90     90       450
9     January  sales  10     10     10     10       460
10   February  sales  10     10     10     10       470
11      March  sales  20     20     20     20       490