Home > Software engineering >  How to calculate cumulative weighted average using pandas
How to calculate cumulative weighted average using pandas

Time:10-18

I would like to ask you about how to find the cumulative average of a specific column value in a pandas dataframe. First, the data looks like this:

firm date reviewer rate
A 2021-01-01 a 5
A 2021-01-01 b 1
A 2021-01-01 c 2
A 2021-01-02 d 3
A 2021-01-02 e 4
A 2021-01-03 f 3
A 2021-01-04 g 5
B 2021-01-01 h 5
B 2021-01-01 i 2
B 2021-01-02 j 3
B 2021-01-02 k 4
B 2021-01-03 a 3
B 2021-01-04 b 5

What I want to find is to get the average rating of a specific company by date, and add a column to find the cumulative average rating including today's average rating.
I want to make it into a dataframe like the one below.

firm date reviewer rate cum_avg_rate
A 2021-01-01 a 5 2.667
A 2021-01-01 b 1 2.667
A 2021-01-01 c 2 2.667
A 2021-01-02 d 3 3
A 2021-01-02 e 4 3
A 2021-01-03 f 3 3
A 2021-01-04 g 5 3.286
B 2021-01-01 h 5 3.5
B 2021-01-01 i 2 3.5
B 2021-01-02 j 3 3.5
B 2021-01-02 k 4 3.5
B 2021-01-03 a 3 3.4
B 2021-01-04 b 5 3.667

The method I have tried so far is to create a new data frame that calculates the average score and the number of reviews using the 'groupby' method with firm and date, and use this to create a cumulative average for each day.
The code is below.

firm_gp=avg_mean_rate.groupby(['firm','date'])['mean']
firm_gp_count=avg_mean_rate.groupby(['firm','date'])['count']
avg_mean_rate['new_avg_grade']=( (firm_gp * firm_gp_count).cumsum())/firm_gp_count.cumsum()

However, the problem is that the following error occurs in the process of calculating the cumulative average for each day.
TypeError: unsupported operand type(s) for *: 'SeriesGroupBy' and 'method'

As the second method, I tried the following method using numpy.

def w_cum_avg(avg_mean_rate,mean,count):
    d=avg_mean_rate['mean']
    w= avg_mean_rate['count']
    return(d*w).cumsum() / w.cumsum()
avg_mean_rate.groupby(['firm','date']).apply(w_cum_avg,'mean','count')

But this doesn’t work well, what I expected.

I would appreciate it if you could teach me how to get results.

Thank you in advance.

CodePudding user response:

We could compute the daily sum and count per firm with groupby aggregate then groupby cumsum to get the daily cumulative total per firm. Compute the mean by dividing and join back to the DataFrame:

g = (
    df.groupby(['firm', 'date'])['rate']
        .agg(['sum', 'count'])
        .groupby(level='firm').cumsum()
)

df = df.join(
    g['sum'].div(g['count']).rename('cum_avg_rate'),
    on=['firm', 'date']  # align index on columns
)

df:

   firm        date reviewer  rate  cum_avg_rate
0     A  2021-01-01        a     5      2.666667
1     A  2021-01-01        b     1      2.666667
2     A  2021-01-01        c     2      2.666667
3     A  2021-01-02        d     3      3.000000
4     A  2021-01-02        e     4      3.000000
5     A  2021-01-03        f     3      3.000000
6     A  2021-01-04        g     5      3.285714
7     B  2021-01-01        h     5      3.500000
8     B  2021-01-01        i     2      3.500000
9     B  2021-01-02        j     3      3.500000
10    B  2021-01-02        k     4      3.500000
11    B  2021-01-03        a     3      3.400000
12    B  2021-01-04        b     5      3.666667

Setup:

import pandas as pd

df = pd.DataFrame({
    'firm': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
    'date': ['2021-01-01', '2021-01-01', '2021-01-01', '2021-01-02',
             '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-01',
             '2021-01-01', '2021-01-02', '2021-01-02', '2021-01-03',
             '2021-01-04'],
    'reviewer': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'a',
                 'b'],
    'rate': [5, 1, 2, 3, 4, 3, 5, 5, 2, 3, 4, 3, 5]
})
  • Related