I have a dataframe and need to do some custom calculations.
Protocol Visit Day
111 1 1
111 1 1
111 1 1
111 2 15
111 2 15
111 2 15
111 3 29
111 3 29
222 2 14
222 2 14
222 2 14
222 3 28
222 3 28
222 3 28
Reproducible input:
{'Protocol ': {0: 111, 1: 111, 2: 111, 3: 111, 4: 111, 5: 111, 6: 111, 7: 111, 8: 222, 9: 222, 10: 222, 11: 222, 12: 222, 13: 222, 14: 222}, 'Visit': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 3, 7: 3, 8: 1, 9: 2, 10: 2, 11: 2, 12: 3, 13: 3, 14: 3}, 'Day': {0: 1, 1: 1, 2: 1, 3: 15, 4: 15, 5: 15, 6: 29, 7: 29, 8: 1, 9: 14, 10: 14, 11: 14, 12: 28, 13: 28, 14: 28}}
What do I want:
- Calculate the difference between unique values between values in
Day
column in a certain way.
For instance, for protocol 111 it will be 15-1 = 14
and 29 - 15 = 14
. The same logic should be apply for all numbers in the column for a specific protocol, because I will have more for each protocol.
- Then I want to calculate the average of differences for each protocol.
14 14 = 28/3 = 9.3
. 3 means 3 unique values for calculating differences -1, 15, 29
.
For protocol 222 it will be 28-14 = 14
divided by 2 (the number of unique values). Result 7
.
Expected output
Protocol Average difference
111 9.3
222 7
CodePudding user response:
Use DataFrame.drop_duplicates
by both columns first and then aggregate lambda function for mean of differncies:
#remove traling whitespaces in columns names
df.columns = df.columns.str.strip()
df1 = (df.drop_duplicates(['Protocol','Day'])
.groupby('Protocol')['Day']
.agg(lambda x: x.diff().sum() / len(x))
.reset_index(name='Average difference'))
print (df1)
Protocol Average difference
0 111 9.333333
1 222 7.000000
Or:
df1 = (df.drop_duplicates(['Protocol','Day'])
.groupby('Protocol')['Day']
.agg(lambda x: x.diff().fillna(0).mean())
.reset_index(name='Average difference'))