STUDY Teacher UPDATE_DATE COMPARE_DATE INTERVAL_DAYS
0 entertainment C 2022-03-02 2022-01-01 0.61
1 entertainment B, C 2022-03-02 2022-04-10 0.39
2 math A 2022-02-25 2022-01-01 0.56
3 math A, C 2022-02-25 2022-04-10 0.44
4 science D 2022-01-03 2022-01-01 0.02
5 science A 2022-01-03 2022-02-20 0.48
6 science A, B, E 2022-02-20 2022-04-10 0.49
7 technology E 2021-09-01 2022-09-01 0.00
Q. when there are more than 2 teacheres, I want to create a row by splitting based on "," and divide the value of the INTERVAL_DAYS field by the number of teachers to add. I thought it had a similar structure to the compare_date function you helped and tried to apply it, but the desired output did not come out, so I had to raise an additional question.
STUDY Teacher UPDATE_DATE COMPARE_DATE INTERVAL_DAYS
0 entertainment C 2022-03-02 2022-01-01 0.61
1 entertainment B 2022-03-02 2022-04-10 0.2
2 entertainment C 2022-03-02 2022-04-10 0.2
3 math A 2022-02-25 2022-01-01 0.56
4 math A 2022-02-25 2022-04-10 0.22
5 math C 2022-02-25 2022-04-10 0.22
6 science D 2022-01-03 2022-01-01 0.02
7 science A 2022-01-03 2022-02-20 0.48
8 science A 2022-02-20 2022-04-10 0.16
9 science B 2022-02-20 2022-04-10 0.16
10 science E 2022-02-20 2022-04-10 0.16 11 technology E 2021-09-01 2022-09-01 1.00
The final thing you want to get is the sum of interval days for each Teacher.
Teacher WORKING_COUNTS
0 A 2
1 B 0.36
2 C 1.03
3 D 0.02
4 E 1.16
CodePudding user response:
First split value by ,
and use DataFrame.explode
, divide rows by counts by GroupBy.transform
and last aggrgate sum
:
df = df.assign(Teacher = df['Teacher'].str.split(', ')).explode('Teacher')
df['INTERVAL_DAYS'] /= df.groupby(level=0)['INTERVAL_DAYS'].transform('size')
print (df)
STUDY Teacher UPDATE_DATE COMPARE_DATE INTERVAL_DAYS
0 entertainment C 2022-03-02 2022-01-01 0.610000
1 entertainment B 2022-03-02 2022-04-10 0.195000
1 entertainment C 2022-03-02 2022-04-10 0.195000
2 math A 2022-02-25 2022-01-01 0.560000
3 math A 2022-02-25 2022-04-10 0.220000
3 math C 2022-02-25 2022-04-10 0.220000
4 science D 2022-01-03 2022-01-01 0.020000
5 science A 2022-01-03 2022-02-20 0.480000
6 science A 2022-02-20 2022-04-10 0.163333
6 science B 2022-02-20 2022-04-10 0.163333
6 science E 2022-02-20 2022-04-10 0.163333
7 technology E 2021-09-01 2022-09-01 0.000000
out = df.groupby('Teacher')['INTERVAL_DAYS'].sum().reset_index(name='WORKING_COUNTS')
print (out)
Teacher WORKING_COUNTS
0 A 1.423333
1 B 0.358333
2 C 1.025000
3 D 0.020000
4 E 0.163333