Calculating the differences between unique values in the column and averaging by number of unique va-CodePudding

I have a dataframe and need to do some custom calculations.

Protocol    Visit   Day
111 1   1
111 1   1
111 1   1
111 2   15
111 2   15
111 2   15
111 3   29
111 3   29
222 2   14
222 2   14
222 2   14
222 3   28
222 3   28
222 3   28

Reproducible input:

{'Protocol ': {0: 111, 1: 111, 2: 111, 3: 111, 4: 111, 5: 111, 6: 111, 7: 111, 8: 222, 9: 222, 10: 222, 11: 222, 12: 222, 13: 222, 14: 222}, 'Visit': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 3, 7: 3, 8: 1, 9: 2, 10: 2, 11: 2, 12: 3, 13: 3, 14: 3}, 'Day': {0: 1, 1: 1, 2: 1, 3: 15, 4: 15, 5: 15, 6: 29, 7: 29, 8: 1, 9: 14, 10: 14, 11: 14, 12: 28, 13: 28, 14: 28}}

What do I want:

Calculate the difference between unique values between values in Day column in a certain way.

For instance, for protocol 111 it will be 15-1 = 14 and 29 - 15 = 14. The same logic should be apply for all numbers in the column for a specific protocol, because I will have more for each protocol.

Then I want to calculate the average of differences for each protocol. 14 14 = 28/3 = 9.3. 3 means 3 unique values for calculating differences - 1, 15, 29.

For protocol 222 it will be 28-14 = 14 divided by 2 (the number of unique values). Result 7.

Expected output

Protocol Average difference
111       9.3 
222       7

CodePudding user response：

Use DataFrame.drop_duplicates by both columns first and then aggregate lambda function for mean of differncies:

#remove traling whitespaces in columns names
df.columns = df.columns.str.strip()

df1 = (df.drop_duplicates(['Protocol','Day'])
         .groupby('Protocol')['Day']
         .agg(lambda x: x.diff().sum() / len(x))
         .reset_index(name='Average difference'))
print (df1)
   Protocol  Average difference
0       111            9.333333
1       222            7.000000

Or:

df1 = (df.drop_duplicates(['Protocol','Day'])
         .groupby('Protocol')['Day']
         .agg(lambda x: x.diff().fillna(0).mean())
         .reset_index(name='Average difference'))