Take the mean of the top 10% values according to the date-CodePudding

I have a dataframe that consist of the day column and a score column. So per day, there are many values. So I need to get the mean of the top 10% values per day. Simply means I need the output as a day column and mean of the top 10% values of that day.

this is a example dataset screenshot

`{'Date': 
 [
  datetime.date(2021, 4, 1),
  datetime.date(2021, 4, 1),
  datetime.date(2021, 4, 1),
  datetime.date(2021, 4, 1),
  datetime.date(2021, 4, 1),
  datetime.date(2021, 4, 1),
  datetime.date(2021, 4, 1),
  datetime.date(2021, 4, 1),
  datetime.date(2021, 4, 1),
  datetime.date(2021, 4, 1),
  datetime.date(2021, 4, 1),
  datetime.date(2021, 4, 1),
  datetime.date(2021, 4, 1),
  datetime.date(2021, 4, 1),
  datetime.date(2021, 4, 1),
  datetime.date(2021, 4, 1),
  datetime.date(2021, 4, 1),
  datetime.date(2021, 4, 1),
  datetime.date(2021, 4, 1),
  datetime.date(2021, 4, 1),
  datetime.date(2021, 4, 1),
  datetime.date(2021, 4, 1),
  datetime.date(2021, 4, 1),
  datetime.date(2021, 4, 1),
  datetime.date(2021, 4, 1),
  datetime.date(2021, 4, 1),
  datetime.date(2021, 4, 1),
  datetime.date(2021, 4, 1),
  datetime.date(2021, 4, 1),
  datetime.date(2021, 4, 1)],
 'value': [     
  3.35,
  1.85,
  1.3,
  1.85,
  1.85,
  1.17,
  1.17,
  2.8,
  1.43,
  2.54,
  1.22,
  2.54,
  1.17,
  1.17,
  2.71,
  5.98,
  1.39,
  1.48,
  16.46,
  1.43,
  8.39,
  33.99,
  2.54,
  11.8,
  2.13,
  2.24,
  2.92,
  1.35,
  1.54,
  2.52]}`

CodePudding user response：

Should be pretty simple -

*Assuming you're using Pandas, and this is a pandas dataframe called df with columns date and value

Creating a demo dataframe and importing required packages, you would probably import your table as a dataframe!

import pandas as pd
import math
import statistics

df = pd.DataFrame({'date': ['2021-04-01','2021-04-01','2021-04-01','2021-04-01','2021-04-01','2021-04-02','2021-04-02','2021-04-02','2021-04-02','2021-04-02','2021-04-02','2021-04-02'],
                   'value': [12,32,12,23,12,14,15,54,43,64,21,15]})

#If you need to save results as a DataFrame later on
res = pd.DataFrame(columns = ['date','top_10p_mean'])

Filter based on dates

Basically getting a list of different dates and iterating through them to get values in a list

for date in df['date'].unique():
   temp = list(df['value'][df['date'] == date])

Sort filtered set by values

Sorting the list in reverse order to have highest values at the top or you can just omit the reverse=True part to keep values as is

   temp.sort(reverse=True)

Take mean of top 10% values

This will calculate the number of items in top 10% of the list (the index is rounded up to the next integer), take those values and calculate the mean.

Further explanation of the functions for beginners -

First "round_up_to_next_integer(total_number_of_items(in_list) * 10%)"

Then "give_me_mean_of(list_items[from_index_0 : the_number_I_got_from_the_percentage_calculation])"

   avg = statistics.mean(temp[0:math.ceil(len(temp)*0.1)])

Print it or save in a new DataFrame

Printing the results and appending it to the previously created empty DataFrame

   print('Mean value on '   str(date)   ' = '   str(avg))
   res = res.append({'date': date, 'top_10p_mean': avg}, ignore_index=True)

So in total it should work something like this -

import pandas as pd
import math
import statistics

df = pd.DataFrame({'date': ['2021-04-01','2021-04-01','2021-04-01','2021-04-01','2021-04-01','2021-04-02','2021-04-02','2021-04-02','2021-04-02','2021-04-02','2021-04-02','2021-04-02'],
                   'value': [12,32,12,23,12,14,15,54,43,64,21,15]})

df
Out[]:
          date  value
0   2021-04-01     12
1   2021-04-01     32
2   2021-04-01     12
3   2021-04-01     23
4   2021-04-01     12
5   2021-04-02     14
6   2021-04-02     15
7   2021-04-02     54
8   2021-04-02     43
9   2021-04-02     64
10  2021-04-02     21
11  2021-04-02     15

res = pd.DataFrame(columns = ['date','top_10p_mean'])

for date in df['date'].unique():
   temp = list(df['value'][df['date'] == date])
   temp.sort(reverse=True)
   print(temp)                                            #Just to show what it looks like
   avg = statistics.mean(temp[0:math.ceil(len(temp)*0.1)])
   print('\nMean value on '   str(date)   ' = '   str(avg)   '\n')
   res = res.append({'date': date, 'top_10p_mean': avg}, ignore_index=True)

Out[]:
[32, 23, 12, 12, 12]

Mean value on 2021-04-01 = 32

[64, 54, 43, 21, 15, 15, 14]

Mean value on 2021-04-02 = 64

res
Out[]: 
         date top_10p_mean
0  2021-04-01           32
1  2021-04-02           64

CodePudding user response：

df.nlargest is what you want. First determine how many values correspond with 10% by running (df being your dataframe):

highest10p = 0.1*len(df)

then you can select the 10 largest values in the value column by using

df.nlargest(highest10p, 'value')

So if you want the mean, you use .mean() function:

df.nlargest(highest10p, 'value').mean()