I have a dataframe that consist of the day column and a score column. So per day, there are many values. So I need to get the mean of the top 10% values per day. Simply means I need the output as a day column and mean of the top 10% values of that day.
this is a example dataset screenshot
`{'Date':
[
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1)],
'value': [
3.35,
1.85,
1.3,
1.85,
1.85,
1.17,
1.17,
2.8,
1.43,
2.54,
1.22,
2.54,
1.17,
1.17,
2.71,
5.98,
1.39,
1.48,
16.46,
1.43,
8.39,
33.99,
2.54,
11.8,
2.13,
2.24,
2.92,
1.35,
1.54,
2.52]}`
CodePudding user response:
Should be pretty simple -
*Assuming you're using Pandas, and this is a pandas dataframe called df
with columns date
and value
Creating a demo dataframe and importing required packages, you would probably import your table as a dataframe!
import pandas as pd
import math
import statistics
df = pd.DataFrame({'date': ['2021-04-01','2021-04-01','2021-04-01','2021-04-01','2021-04-01','2021-04-02','2021-04-02','2021-04-02','2021-04-02','2021-04-02','2021-04-02','2021-04-02'],
'value': [12,32,12,23,12,14,15,54,43,64,21,15]})
#If you need to save results as a DataFrame later on
res = pd.DataFrame(columns = ['date','top_10p_mean'])
- Filter based on dates
Basically getting a list of different dates and iterating through them to get values in a list
for date in df['date'].unique():
temp = list(df['value'][df['date'] == date])
- Sort filtered set by values
Sorting the list in reverse order to have highest values at the top or you can just omit the reverse=True
part to keep values as is
temp.sort(reverse=True)
- Take mean of top 10% values
This will calculate the number of items in top 10% of the list (the index is rounded up to the next integer), take those values and calculate the mean.
Further explanation of the functions for beginners -
First "round_up_to_next_integer(total_number_of_items(in_list) * 10%)"
Then "give_me_mean_of(list_items[from_index_0 : the_number_I_got_from_the_percentage_calculation])"
avg = statistics.mean(temp[0:math.ceil(len(temp)*0.1)])
- Print it or save in a new DataFrame
Printing the results and appending it to the previously created empty DataFrame
print('Mean value on ' str(date) ' = ' str(avg))
res = res.append({'date': date, 'top_10p_mean': avg}, ignore_index=True)
So in total it should work something like this -
import pandas as pd
import math
import statistics
df = pd.DataFrame({'date': ['2021-04-01','2021-04-01','2021-04-01','2021-04-01','2021-04-01','2021-04-02','2021-04-02','2021-04-02','2021-04-02','2021-04-02','2021-04-02','2021-04-02'],
'value': [12,32,12,23,12,14,15,54,43,64,21,15]})
df
Out[]:
date value
0 2021-04-01 12
1 2021-04-01 32
2 2021-04-01 12
3 2021-04-01 23
4 2021-04-01 12
5 2021-04-02 14
6 2021-04-02 15
7 2021-04-02 54
8 2021-04-02 43
9 2021-04-02 64
10 2021-04-02 21
11 2021-04-02 15
res = pd.DataFrame(columns = ['date','top_10p_mean'])
for date in df['date'].unique():
temp = list(df['value'][df['date'] == date])
temp.sort(reverse=True)
print(temp) #Just to show what it looks like
avg = statistics.mean(temp[0:math.ceil(len(temp)*0.1)])
print('\nMean value on ' str(date) ' = ' str(avg) '\n')
res = res.append({'date': date, 'top_10p_mean': avg}, ignore_index=True)
Out[]:
[32, 23, 12, 12, 12]
Mean value on 2021-04-01 = 32
[64, 54, 43, 21, 15, 15, 14]
Mean value on 2021-04-02 = 64
res
Out[]:
date top_10p_mean
0 2021-04-01 32
1 2021-04-02 64
CodePudding user response:
df.nlargest is what you want.
First determine how many values correspond with 10% by running (df
being your dataframe):
highest10p = 0.1*len(df)
then you can select the 10 largest values in the value
column by using
df.nlargest(highest10p, 'value')
So if you want the mean, you use .mean()
function:
df.nlargest(highest10p, 'value').mean()