Pandas mean across highest x columns-CodePudding

I want to be able to calculate the mean of 4 week columns, but if the number in the Top x column is less than 4, I only want to use the maximum x values to calculate the mean (i.e. if Top x = 3, discard the lowest week value when calculating the mean.

Example dataframe:

df = pd.DataFrame({'week 1' : [1.0, 5.0, 7.0, 6.0, np.nan],
               'week 2' : [3.0, np.nan, 9.0, 8.0, np.nan],
               'week 3' : [1.0, 2.0, 2.0, 1.0, 6.0],
               'week 4' : [np.nan, 4.0, 2.0, 7.0, 6.0],
               'Top x' : [3.0, 2.0, 4.0, 3.0, 3.0]})

     week 1  week 2  week 3  week 4  Top x
0     1.0     3.0     1.0     NaN    3.0
1     5.0     NaN     2.0     4.0    2.0
2     7.0     9.0     2.0     2.0    4.0
3     6.0     8.0     1.0     7.0    3.0
4     NaN     NaN     6.0     6.0    3.0

Expected output:

     week 1  week 2  week 3  week 4  Top x   Mean
0     1.0     3.0     1.0     NaN    3.0  1.666667
1     5.0     NaN     2.0     4.0    2.0  4.500000
2     7.0     9.0     2.0     2.0    4.0  5.000000
3     6.0     8.0     1.0     7.0    3.0  7.000000
4     NaN     NaN     6.0     6.0    3.0  6.000000

I don't know if there's a way to add a function to pandas.mean() function, or if it's simpler to sum the top x of the week columns (maybe turning each row into a list?) and divide by the Top x column.

CodePudding user response：

Use DataFrame.melt with DataFrame.sort_values first and then compare counter by GroupBy.cumcount,fompare by Top x and filter, last aggregate mean:

df1 = df.melt('Top x', ignore_index=False).sort_values('value', ascending=False)

df['Mean'] = (df1[df1.groupby(level=0).cumcount().lt(df1['Top x'])]
                     .groupby(level=0)['value'].mean())
print (df)
   week 1  week 2  week 3  week 4  Top x      Mean
0     1.0     3.0     1.0     NaN    3.0  1.666667
1     5.0     NaN     2.0     4.0    2.0  4.500000
2     7.0     9.0     2.0     2.0    4.0  5.000000
3     6.0     8.0     1.0     7.0    3.0  7.000000
4     NaN     NaN     6.0     6.0    3.0  6.000000

Alternative solution (should be faster in large Dataframes) is test sorted values by DataFrame.rank with DataFrame.where:

df1 = df.drop('Top x', axis=1)
df['Mean'] = (df1.where(df1.rank(axis=1, method='first', ascending=False)
                           .le(df['Top x'], axis=0))
                 .mean(axis=1))
print (df)
   week 1  week 2  week 3  week 4  Top x      Mean
0     1.0     3.0     1.0     NaN    3.0  1.666667
1     5.0     NaN     2.0     4.0    2.0  4.500000
2     7.0     9.0     2.0     2.0    4.0  5.000000
3     6.0     8.0     1.0     7.0    3.0  7.000000
4     NaN     NaN     6.0     6.0    3.0  6.000000