I would like to find out the most utilized location for the date of 2/1/2022.
Data
ID location total marks_free marks_utilized date
1 NY 6 5 1 2/1/2022
2 NY 10 5 5 2/1/2022
3 NY 2 1 1 2/1/2022
4 CA 5 4 1 2/1/2022
5 CA 6 5 1 2/1/2022
6 CA 10 10 0 2/1/2022
7 NY 6 6 0 3/1/2022
8 NY 10 10 0 3/1/2022
9 NY 2 1 1 3/1/2022
10 CA 5 4 1 3/1/2022
11 CA 6 5 1 3/1/2022
12 CA 10 10 0 3/1/2022
Desired
location marks_utilized date
NY 38% 2/1/2022
Logic
filter to 2/1/2022, groupby location
For instance, let's take NY:
sum(marks_utilized) / sum(total) * 100
7 / 18 * 100 ≈ 38.9% (shown as 38% above)
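The logic above can be sketched in pandas as follows. This is only a minimal illustration: it assumes the table is already loaded into a DataFrame named df with the date stored as a plain string, and the names target, sums and pct are made up for the example.

```python
import pandas as pd

# Sample rows copied from the question's table (ID and marks_free omitted).
df = pd.DataFrame({
    "location": ["NY"] * 3 + ["CA"] * 3 + ["NY"] * 3 + ["CA"] * 3,
    "total": [6, 10, 2, 5, 6, 10, 6, 10, 2, 5, 6, 10],
    "marks_utilized": [1, 5, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0],
    "date": ["2/1/2022"] * 6 + ["3/1/2022"] * 6,
})

# Keep only the target date, then sum both columns per location.
target = df[df["date"] == "2/1/2022"]
sums = target.groupby("location")[["marks_utilized", "total"]].sum()

# Utilization is the ratio of the sums, not the mean of per-row ratios.
pct = sums["marks_utilized"] / sums["total"] * 100

best_location = pct.idxmax()  # index label of the highest percentage
best_pct = pct.max()
print(best_location, round(best_pct, 1))
```

Filtering before grouping keeps the 3/1/2022 rows out of the sums, so NY is compared only against CA for the requested date.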
Doing
# filter to 2/1/2022
df1 = df.groupby(['location', 'date']).agg({'marks_utilized': 'sum', 'total': 'sum'})
df1['marks_utilized'] = df['marks_utilized'] / df['total'] * 100
Still researching this. Any suggestion is appreciated.
CodePudding user response:
Your attempt needs only one small change to work. The line
df1['marks_utilized'] = df['marks_utilized'] / df['total'] * 100
should be
df1['marks_utilized'] = df1['marks_utilized'] / df1['total'] * 100
because the sums live in the grouped frame df1, not in the original df, and the two frames do not share an index.
If you only want the result for 2/1/2022, you could filter df first and do the groupby afterwards. You could also use df1.to_string(formatters={'marks_utilized': '{:,.2f}%'.format}) to render the float as a percentage string.
The example below reads the following sample from test.csv:
ID,location,total,marks_free,marks_utilized,date
1,NY,6,5,1,2/1/2022
2,NY,10,5,5,2/1/2022
3,NY,2,1,1,2/1/2022
4,CA,5,4,1,3/1/2022
5,CA,6,5,1,3/1/2022
6,CA,10,10,0,3/1/2022
import pandas as pd
df = pd.read_csv("test.csv")
df1 = df.groupby(['location', 'date']).agg({'marks_utilized': 'sum', 'total': 'sum'})
df1['marks_utilized'] = df1['marks_utilized'] / df1['total']
max_row = df1.loc[df1['marks_utilized'].idxmax()]
print(max_row)
marks_utilized 0.388889
total 18.000000
Name: (NY, 2/1/2022), dtype: float64
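The two suggestions above (filter first, then format as a percentage) might look like the sketch below. It is an assumption-laden variant, not the code from this answer: it builds the DataFrame inline instead of reading test.csv, uses the question's six rows for 2/1/2022, and formats the fraction with '{:.1%}' rather than multiplying by 100 first.

```python
import pandas as pd

# Rows for 2/1/2022 copied from the question's table.
df = pd.DataFrame({
    "location": ["NY", "NY", "NY", "CA", "CA", "CA"],
    "total": [6, 10, 2, 5, 6, 10],
    "marks_utilized": [1, 5, 1, 1, 1, 0],
    "date": ["2/1/2022"] * 6,
})

# Filter to the date of interest before grouping.
df1 = (
    df[df["date"] == "2/1/2022"]
    .groupby(["location", "date"])
    .agg({"marks_utilized": "sum", "total": "sum"})
)
df1["marks_utilized"] = df1["marks_utilized"] / df1["total"]

# Keep the single best row and arrange the columns like the desired output.
report = df1.reset_index().nlargest(1, "marks_utilized")
report = report[["location", "marks_utilized", "date"]]

# '{:.1%}' multiplies the fraction by 100 and appends a percent sign.
text = report.to_string(
    index=False, formatters={"marks_utilized": "{:.1%}".format}
)
print(text)
```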
CodePudding user response:
We could try
(df.groupby(['location', 'date'])
   .apply(lambda x: x['marks_utilized'].sum() / x['total'].sum())
   .mul(100)
   .reset_index(name='marks_utilized'))
Out[279]:
location date marks_utilized
0 CA 3/1/2022 9.523810
1 NY 2/1/2022 38.888889
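As a possible alternative to apply with a lambda, the same numbers can be obtained by summing both columns once and dividing the resulting Series. This is a sketch under the assumption that the data matches the six-row test.csv sample from the first answer; the names sums and out are illustrative.

```python
import pandas as pd

# Same six rows as the test.csv sample shown in the first answer.
df = pd.DataFrame({
    "location": ["NY", "NY", "NY", "CA", "CA", "CA"],
    "total": [6, 10, 2, 5, 6, 10],
    "marks_utilized": [1, 5, 1, 1, 1, 0],
    "date": ["2/1/2022"] * 3 + ["3/1/2022"] * 3,
})

# One grouped sum over both columns, then a vectorized division.
sums = df.groupby(["location", "date"])[["marks_utilized", "total"]].sum()
out = (
    sums["marks_utilized"]
    .div(sums["total"])
    .mul(100)
    .reset_index(name="marks_utilized")
)
print(out)
```

This avoids calling a Python function once per group, which tends to matter only when there are many groups.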