Home > OS >  How to perform Cumulative sum based on groupby() output in Python
How to perform Cumulative sum based on groupby() output in Python

Time:06-07

Below is the data in which I am facing the issue:

dict_1 = {'id' : [101, 101, 101, 101, 101, 102, 102, 102, 102, 103, 103, 103, 103, 103, 103, 104, 104, 104, 
                  104, 104, 104, 104, 104, 104, 105, 105, 105, 105, 105],
          'electronics' : ['Mobile', "Laptop", "Laptop", "Laptop", "TV", 'Mobile', "Laptop", 'Mobile', "TV",  
                           'Mobile', "Laptop", 'Mobile', "Laptop", "Mobile", "TV", 'Mobile', "Laptop", "TV", 
                           'Mobile', "Laptop", "TV",'Mobile', "Laptop", "TV", 'Mobile', "Laptop", "TV", "Laptop", "TV",],
          'date' : ['2022-05-30', '2022-05-30', '2022-05-30', '2022-05-30', '2022-05-30', '2022-05-31', '2022-05-31', 
                    '2022-05-31', '2022-05-31', '2022-06-01', '2022-06-01', '2022-06-01', '2022-06-01', '2022-06-01', 
                    '2022-06-01', '2022-06-02', '2022-06-02', '2022-06-02', '2022-06-02', '2022-06-02', '2022-06-02',
                    '2022-06-02', '2022-06-02', '2022-06-02', '2022-06-03', '2022-06-03', '2022-06-03','2022-06-03', '2022-06-03']}
test = pd.DataFrame(dict_1)
test.head(7)
id electronics date
101 Mobile 2022-05-30
101 Laptop 2022-05-30
101 Laptop 2022-05-30
101 Laptop 2022-05-30
101 TV 2022-05-30
102 Mobile 2022-05-31
102 Laptop 2022-05-31

I need to find Cumulative sum of ID(Count) based on Month i.e. if the month end then it should start with 0.
I have used below code to find cumulative sum of the above data

grp_by = test.groupby(['electronics', 'date'])['id'].count().groupby(level=0).cumsum().reset_index()
grp_by.head()
electronics date id
Laptop 2022-05-30 3
Laptop 2022-05-31 4
Laptop 2022-06-01 6
Laptop 2022-06-02 9
Laptop 2022-06-03 11

Till date = 2022-05-31 data is working correctly however for 2022-06-01 value should be 2(because new month and there are 2 Laptop) and for 2022-06-02 value should be 5 and so on.

CodePudding user response:

Let us try two groupby's

s = test.groupby(['electronics', 'date'], as_index=False).size()
s['size'] = s.groupby(['electronics', s['date'].str[:7]]).cumsum()

print(s)

   electronics        date  size
0       Laptop  2022-05-30     3
1       Laptop  2022-05-31     4
2       Laptop  2022-06-01     2
3       Laptop  2022-06-02     5
4       Laptop  2022-06-03     7
5       Mobile  2022-05-30     1
6       Mobile  2022-05-31     3
7       Mobile  2022-06-01     3
8       Mobile  2022-06-02     6
9       Mobile  2022-06-03     7
10          TV  2022-05-30     1
11          TV  2022-05-31     2
12          TV  2022-06-01     1
13          TV  2022-06-02     4
14          TV  2022-06-03     6

CodePudding user response:

you can create a month column and then include in the groupby. you can always drop the month column afterwards

test['yearmonth'] = test['date'].astype('datetime64').dt.strftime('%Y%m')
test.groupby(['electronics', 'yearmonth', 'date'])['id'].count().groupby(level=1).cumsum().reset_index()
    electronics     month   date    id
0   Laptop  202205  2022-05-30  3
1   Laptop  202205  2022-05-31  4
2   Laptop  202206  2022-06-01  2
3   Laptop  202206  2022-06-02  5
4   Laptop  202206  2022-06-03  7

thanks you @creanion, you have a valid point, i updated it

  • Related