I want to be able to calculate the mean of 4 week columns, but if the number in the Top x column is less than 4, I only want to use the maximum x values to calculate the mean (i.e. if Top x = 3, discard the lowest week value when calculating the mean.
Example dataframe:
df = pd.DataFrame({'week 1' : [1.0, 5.0, 7.0, 6.0, np.nan],
'week 2' : [3.0, np.nan, 9.0, 8.0, np.nan],
'week 3' : [1.0, 2.0, 2.0, 1.0, 6.0],
'week 4' : [np.nan, 4.0, 2.0, 7.0, 6.0],
'Top x' : [3.0, 2.0, 4.0, 3.0, 3.0]})
week 1 week 2 week 3 week 4 Top x
0 1.0 3.0 1.0 NaN 3.0
1 5.0 NaN 2.0 4.0 2.0
2 7.0 9.0 2.0 2.0 4.0
3 6.0 8.0 1.0 7.0 3.0
4 NaN NaN 6.0 6.0 3.0
Expected output:
week 1 week 2 week 3 week 4 Top x Mean
0 1.0 3.0 1.0 NaN 3.0 1.666667
1 5.0 NaN 2.0 4.0 2.0 4.500000
2 7.0 9.0 2.0 2.0 4.0 5.000000
3 6.0 8.0 1.0 7.0 3.0 7.000000
4 NaN NaN 6.0 6.0 3.0 6.000000
I don't know if there's a way to add a function to pandas.mean()
function, or if it's simpler to sum the top x of the week columns (maybe turning each row into a list?) and divide by the Top x column.
CodePudding user response:
Use DataFrame.melt
with DataFrame.sort_values
first and then compare counter by GroupBy.cumcount
,fompare by Top x
and filter, last aggregate mean
:
df1 = df.melt('Top x', ignore_index=False).sort_values('value', ascending=False)
df['Mean'] = (df1[df1.groupby(level=0).cumcount().lt(df1['Top x'])]
.groupby(level=0)['value'].mean())
print (df)
week 1 week 2 week 3 week 4 Top x Mean
0 1.0 3.0 1.0 NaN 3.0 1.666667
1 5.0 NaN 2.0 4.0 2.0 4.500000
2 7.0 9.0 2.0 2.0 4.0 5.000000
3 6.0 8.0 1.0 7.0 3.0 7.000000
4 NaN NaN 6.0 6.0 3.0 6.000000
Alternative solution (should be faster in large Dataframes) is test sorted values by DataFrame.rank
with DataFrame.where
:
df1 = df.drop('Top x', axis=1)
df['Mean'] = (df1.where(df1.rank(axis=1, method='first', ascending=False)
.le(df['Top x'], axis=0))
.mean(axis=1))
print (df)
week 1 week 2 week 3 week 4 Top x Mean
0 1.0 3.0 1.0 NaN 3.0 1.666667
1 5.0 NaN 2.0 4.0 2.0 4.500000
2 7.0 9.0 2.0 2.0 4.0 5.000000
3 6.0 8.0 1.0 7.0 3.0 7.000000
4 NaN NaN 6.0 6.0 3.0 6.000000