I have a dataframe which I need to calculate groupby(class_1 & class_2) mean between dynamic start and end dates.
The end date being 1 day before the DATE and start date being 1yr before the end date. Is it possible to get mean of (example): class_1: math, class_2: math where placement is 1 between the start and end dates.
sample df
---------- --------- --------- ----------- ---------- ---------- | date | class_1 | class_2 | placement | end | 1yr | ---------- --------- --------- ----------- ---------- ---------- | 12/05/21 | math | math | 1 | 12/04/21 | 12/04/20 | | 12/04/21 | math | eng | 3 | 12/03/21 | 12/03/20 | | 12/03/21 | eng | math | 4 | 12/02/21 | 12/02/20 | | 12/02/21 | math | math | 4 | 12/01/21 | 12/01/20 | | 12/01/21 | math | math | 1 | 11/30/21 | 11/30/20 | | 11/30/21 | math | math | 2 | 11/29/21 | 11/29/20 | ---------- --------- --------- ----------- ---------- ----------
sample output
---------- --------- --------- ----------- ---------- ---------- -------- | date | class_1 | class_2 | placement | end | 1yr | Mean | ---------- --------- --------- ----------- ---------- ---------- -------- | 12/05/21 | math | math | 1 | 12/04/21 | 12/04/20 | 0.3333 | | 12/04/21 | math | eng | 3 | 12/03/21 | 12/03/20 | 0 | | 12/03/21 | eng | math | 4 | 12/02/21 | 12/02/20 | 0 | | 12/02/21 | math | math | 4 | 12/01/21 | 12/01/20 | 0.5 | | 12/01/21 | math | math | 1 | 11/30/21 | 11/30/20 | 0 | | 11/30/21 | math | math | 2 | 11/29/21 | 11/29/20 | 0 | ---------- --------- --------- ----------- ---------- ---------- --------
for simplicity, i've used 11/30/21 to 12/05/21 for the above sample.
I was using the set_index(),groupby() ,assign() and reset_index() method for static dates, however, can't figure out how apply them with dynamic days.
Thanks in advance
CodePudding user response:
You could use apply
to compute your score tow by row:
def compute_1yr_mean(row):
# Get the rows belonging to the period and classes of interest
curr_df = df[
# select row in the time window computed starting from the current row
(df['date'] >= row['1yr'])
# or you can compute this date on the fly, as shown in the next line
# (df['date'] >= row['date'] - datetime.timedelta(days=1) - relativedelta(years=1))
& (df['date'] < row['date'])
# select same classes
& (df['Class_1'] == row['Class_1'])
& (df['Class_2'] == row['Class_2'])
]
# Compute the ratio between the number of rows with placement 1 and the number of all rows
return (
(curr_df['placement'] == 1).sum() / len(curr_df)
) if len(curr_df) else 0
df['score_mean_1yr'] = df.apply(compute_1yr_mean, axis=1)
Assuming that your dataframe of interest is:
date Class_1 Class_2 placement end 1yr
0 2021-12-05 math math 1 2021-12-04 2020-12-04
1 2021-12-04 math eng 3 2021-12-03 2020-12-03
2 2021-12-03 eng math 4 2021-12-02 2020-12-02
3 2021-12-02 math math 4 2021-12-01 2020-12-01
4 2021-12-01 math math 1 2021-11-30 2020-11-30
5 2021-11-30 math math 2 2021-11-29 2020-11-29
The result of applying this to your dataframe is:
date Class_1 Class_2 placement end 1yr score_mean_1yr
0 2021-12-05 math math 1 2021-12-04 2020-12-04 0.333333
1 2021-12-04 math eng 3 2021-12-03 2020-12-03 0.000000
2 2021-12-03 eng math 4 2021-12-02 2020-12-02 0.000000
3 2021-12-02 math math 4 2021-12-01 2020-12-01 0.500000
4 2021-12-01 math math 1 2021-11-30 2020-11-30 0.000000
5 2021-11-30 math math 2 2021-11-29 2020-11-29 0.000000
One note: the function compute_1yr_mean
assumes the existence of df
in the scope where it is defined.