Home > Software design >  conditional groupby mean with dynamic dates
conditional groupby mean with dynamic dates

Time:10-06

I have a dataframe which I need to calculate groupby(class_1 & class_2) mean between dynamic start and end dates.

The end date being 1 day before the DATE and start date being 1yr before the end date. Is it possible to get mean of (example): class_1: math, class_2: math where placement is 1 between the start and end dates.

sample df

 ---------- --------- --------- ----------- ---------- ---------- 
|   date   | class_1 | class_2 | placement |   end    |   1yr    |
 ---------- --------- --------- ----------- ---------- ---------- 
| 12/05/21 | math    | math    |         1 | 12/04/21 | 12/04/20 |
| 12/04/21 | math    | eng     |         3 | 12/03/21 | 12/03/20 |
| 12/03/21 | eng     | math    |         4 | 12/02/21 | 12/02/20 |
| 12/02/21 | math    | math    |         4 | 12/01/21 | 12/01/20 |
| 12/01/21 | math    | math    |         1 | 11/30/21 | 11/30/20 |
| 11/30/21 | math    | math    |         2 | 11/29/21 | 11/29/20 |
 ---------- --------- --------- ----------- ---------- ---------- 

sample output

 ---------- --------- --------- ----------- ---------- ---------- -------- 
|   date   | class_1 | class_2 | placement |   end    |   1yr    |  Mean  |
 ---------- --------- --------- ----------- ---------- ---------- -------- 
| 12/05/21 | math    | math    |         1 | 12/04/21 | 12/04/20 | 0.3333 |
| 12/04/21 | math    | eng     |         3 | 12/03/21 | 12/03/20 | 0      |
| 12/03/21 | eng     | math    |         4 | 12/02/21 | 12/02/20 | 0      |
| 12/02/21 | math    | math    |         4 | 12/01/21 | 12/01/20 | 0.5    |
| 12/01/21 | math    | math    |         1 | 11/30/21 | 11/30/20 | 0      |
| 11/30/21 | math    | math    |         2 | 11/29/21 | 11/29/20 | 0      |
 ---------- --------- --------- ----------- ---------- ---------- -------- 

for simplicity, i've used 11/30/21 to 12/05/21 for the above sample.

I was using the set_index(),groupby() ,assign() and reset_index() method for static dates, however, can't figure out how apply them with dynamic days.

Thanks in advance

CodePudding user response:

You could use apply to compute your score tow by row:

def compute_1yr_mean(row):
    # Get the rows belonging to the period and classes of interest
    curr_df = df[
        # select row in the time window computed starting from the current row
        (df['date'] >= row['1yr'])
        # or you can compute this date on the fly, as shown in the next line
        # (df['date'] >= row['date'] - datetime.timedelta(days=1) - relativedelta(years=1))
        & (df['date'] < row['date'])
        # select same classes
        & (df['Class_1'] == row['Class_1'])
        & (df['Class_2'] == row['Class_2'])
    ]

    # Compute the ratio between the number of rows with placement 1 and the number of all rows
    return (
            (curr_df['placement'] == 1).sum() / len(curr_df)
    ) if len(curr_df) else 0

df['score_mean_1yr'] = df.apply(compute_1yr_mean, axis=1)

Assuming that your dataframe of interest is:

        date Class_1 Class_2  placement        end        1yr
0 2021-12-05    math    math          1 2021-12-04 2020-12-04
1 2021-12-04    math     eng          3 2021-12-03 2020-12-03
2 2021-12-03     eng    math          4 2021-12-02 2020-12-02
3 2021-12-02    math    math          4 2021-12-01 2020-12-01
4 2021-12-01    math    math          1 2021-11-30 2020-11-30
5 2021-11-30    math    math          2 2021-11-29 2020-11-29

The result of applying this to your dataframe is:

        date Class_1 Class_2  placement        end        1yr  score_mean_1yr
0 2021-12-05    math    math          1 2021-12-04 2020-12-04        0.333333
1 2021-12-04    math     eng          3 2021-12-03 2020-12-03        0.000000
2 2021-12-03     eng    math          4 2021-12-02 2020-12-02        0.000000
3 2021-12-02    math    math          4 2021-12-01 2020-12-01        0.500000
4 2021-12-01    math    math          1 2021-11-30 2020-11-30        0.000000
5 2021-11-30    math    math          2 2021-11-29 2020-11-29        0.000000

One note: the function compute_1yr_mean assumes the existence of df in the scope where it is defined.

  • Related