I have a pandas dataframe that has a column that contains a list of attempt numbers and another column that contains the score achieved on those attempts. A simplified example is below:
scores = [[0,1,0], [0,0], [0,6,2]]
attempt_num = [[1,2,3], [2,4], [2,3,4]]
df = pd.DataFrame([attempt_num, scores]).T
df.columns = ['Attempt', 'Score']
Each row represents a different person, which for the purposes of this question, we can assume are unique. The data is incomplete, and so I have attempt number 1, 2 and 3 for the first person, 2 and 4 for the second and 2, 3 and 4 for the last. What I want to do is to get an average score per attempt. For example, attempt 1 only shows up once and so the average would be 0, the score achieved when it did show up. Attempt 2 shows up for all persons which gives an average of 0.33 ((1 0 0)/3) and so on. So the expected output would be:
Attempt_Number Average_Score
0 1 0.00
1 2 0.33
2 3 3.00
3 4 1.00
I could loop through every element of row of the dataframe and then through every element in the list in that row, append the score to an ordered list and calculate the average for every element in that list, but this would seem to be very inefficient. Is there a better way?
CodePudding user response:
Use DataFrame.explode
with aggregate mean
:
df = (df.explode(['Number','Score'])
.astype({'Score':int})
.groupby('Attempt', as_index=False)['Score']
.mean()
.rename(columns={'Attempt':'Attempt_Number','Score':'Average_Score'})
)
print (df)
Attempt_Number Average_Score
0 1 0.000000
1 2 0.333333
2 3 3.000000
3 4 1.000000
For oldier pandas versions use:
df = (df.apply(pd.Series.explode)
.astype({'Score':int})
.groupby('Attempt', as_index=False)['Score']
.mean()
.rename(columns={'Attempt':'Attempt_Number','Score':'Average_Score'})
)