conditional groupby mean with dynamic dates-CodePudding

I have a dataframe which I need to calculate groupby(class_1 & class_2) mean between dynamic start and end dates.

The end date being 1 day before the DATE and start date being 1yr before the end date. Is it possible to get mean of (example): class_1: math, class_2: math where placement is 1 between the start and end dates.

sample df

 ---------- --------- --------- ----------- ---------- ---------- 
|   date   | class_1 | class_2 | placement |   end    |   1yr    |
 ---------- --------- --------- ----------- ---------- ---------- 
| 12/05/21 | math    | math    |         1 | 12/04/21 | 12/04/20 |
| 12/04/21 | math    | eng     |         3 | 12/03/21 | 12/03/20 |
| 12/03/21 | eng     | math    |         4 | 12/02/21 | 12/02/20 |
| 12/02/21 | math    | math    |         4 | 12/01/21 | 12/01/20 |
| 12/01/21 | math    | math    |         1 | 11/30/21 | 11/30/20 |
| 11/30/21 | math    | math    |         2 | 11/29/21 | 11/29/20 |
 ---------- --------- --------- ----------- ---------- ----------

sample output

 ---------- --------- --------- ----------- ---------- ---------- -------- 
|   date   | class_1 | class_2 | placement |   end    |   1yr    |  Mean  |
 ---------- --------- --------- ----------- ---------- ---------- -------- 
| 12/05/21 | math    | math    |         1 | 12/04/21 | 12/04/20 | 0.3333 |
| 12/04/21 | math    | eng     |         3 | 12/03/21 | 12/03/20 | 0      |
| 12/03/21 | eng     | math    |         4 | 12/02/21 | 12/02/20 | 0      |
| 12/02/21 | math    | math    |         4 | 12/01/21 | 12/01/20 | 0.5    |
| 12/01/21 | math    | math    |         1 | 11/30/21 | 11/30/20 | 0      |
| 11/30/21 | math    | math    |         2 | 11/29/21 | 11/29/20 | 0      |
 ---------- --------- --------- ----------- ---------- ---------- --------

for simplicity, i've used 11/30/21 to 12/05/21 for the above sample.

I was using the set_index(),groupby() ,assign() and reset_index() method for static dates, however, can't figure out how apply them with dynamic days.

Thanks in advance

CodePudding user response：

You could use apply to compute your score tow by row:

def compute_1yr_mean(row):
    # Get the rows belonging to the period and classes of interest
    curr_df = df[
        # select row in the time window computed starting from the current row
        (df['date'] >= row['1yr'])
        # or you can compute this date on the fly, as shown in the next line
        # (df['date'] >= row['date'] - datetime.timedelta(days=1) - relativedelta(years=1))
        & (df['date'] < row['date'])
        # select same classes
        & (df['Class_1'] == row['Class_1'])
        & (df['Class_2'] == row['Class_2'])
    ]

    # Compute the ratio between the number of rows with placement 1 and the number of all rows
    return (
            (curr_df['placement'] == 1).sum() / len(curr_df)
    ) if len(curr_df) else 0

df['score_mean_1yr'] = df.apply(compute_1yr_mean, axis=1)

Assuming that your dataframe of interest is:

        date Class_1 Class_2  placement        end        1yr
0 2021-12-05    math    math          1 2021-12-04 2020-12-04
1 2021-12-04    math     eng          3 2021-12-03 2020-12-03
2 2021-12-03     eng    math          4 2021-12-02 2020-12-02
3 2021-12-02    math    math          4 2021-12-01 2020-12-01
4 2021-12-01    math    math          1 2021-11-30 2020-11-30
5 2021-11-30    math    math          2 2021-11-29 2020-11-29

The result of applying this to your dataframe is:

        date Class_1 Class_2  placement        end        1yr  score_mean_1yr
0 2021-12-05    math    math          1 2021-12-04 2020-12-04        0.333333
1 2021-12-04    math     eng          3 2021-12-03 2020-12-03        0.000000
2 2021-12-03     eng    math          4 2021-12-02 2020-12-02        0.000000
3 2021-12-02    math    math          4 2021-12-01 2020-12-01        0.500000
4 2021-12-01    math    math          1 2021-11-30 2020-11-30        0.000000
5 2021-11-30    math    math          2 2021-11-29 2020-11-29        0.000000

One note: the function compute_1yr_mean assumes the existence of df in the scope where it is defined.