Determine most utilized location for a specific date using Pandas

I would like to find out the most utilized location for the date of 2/1/2022.

Data

ID  location    total   marks_free  marks_utilized  date
1   NY          6       5           1               2/1/2022
2   NY          10      5           5               2/1/2022
3   NY          2       1           1               2/1/2022
4   CA          5       4           1               2/1/2022
5   CA          6       5           1               2/1/2022
6   CA          10      10          0               2/1/2022
7   NY          6       6           0               3/1/2022
8   NY          10      10          0               3/1/2022
9   NY          2       1           1               3/1/2022
10  CA          5       4           1               3/1/2022
11  CA          6       5           1               3/1/2022
12  CA          10      10          0               3/1/2022

Desired

location    marks_utilized  date        
NY          38%             2/1/2022

Logic

Filter to 2/1/2022, then group by location.
For instance, let's take NY:
sum(marks_utilized) / sum(total) * 100
7 / 18 * 100 = 38.9% (38% after truncating)

Doing

# filter to 2/1/2022
df1 = df.groupby(['location', 'date']).agg({'marks_utilized': 'sum', 'total': 'sum'})
df1['marks_utilized'] = df['marks_utilized'] / df['total'] * 100

Still researching this. Any suggestion is appreciated.

CodePudding user response：

Your attempt needs only a small modification to work. The line df1['marks_utilized'] = df['marks_utilized'] / df['total'] * 100 should be df1['marks_utilized'] = df1['marks_utilized'] / df1['total'] * 100, so that the division uses the aggregated sums in df1 rather than the original rows in df.

If you only want the result for 2/1/2022, you could filter df first and do the groupby afterwards. Also, when the column holds a fraction (as in the code below), you could use df1.to_string(formatters={'marks_utilized': '{:.2%}'.format}) to render the float as a percentage string.

ID,location,total,marks_free,marks_utilized,date
1,NY,6,5,1,2/1/2022
2,NY,10,5,5,2/1/2022
3,NY,2,1,1,2/1/2022
4,CA,5,4,1,3/1/2022
5,CA,6,5,1,3/1/2022
6,CA,10,10,0,3/1/2022

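import pandas as pd

# test.csv holds the sample rows shown above
df = pd.read_csv("test.csv")
# sum utilized marks and totals per (location, date) group
df1 = df.groupby(['location', 'date']).agg({'marks_utilized': 'sum', 'total': 'sum'})
# utilization as a fraction of the summed totals
df1['marks_utilized'] = df1['marks_utilized'] / df1['total']
# row with the highest utilization
max_row = df1.loc[df1['marks_utilized'].idxmax()]
print(max_row)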

marks_utilized     0.388889
total             18.000000
Name: (NY, 2/1/2022), dtype: float64
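
The filter-first variant mentioned above could look roughly like the sketch below. It rebuilds the full table from the question inline instead of reading test.csv, and the names target_date, sums, utilization, best_location and best_share are only illustrative. It also assumes the date column holds plain strings, so the filter is a string comparison.

```python
import pandas as pd

# Data copied from the question's table.
df = pd.DataFrame({
    "ID": range(1, 13),
    "location": ["NY"] * 3 + ["CA"] * 3 + ["NY"] * 3 + ["CA"] * 3,
    "total": [6, 10, 2, 5, 6, 10, 6, 10, 2, 5, 6, 10],
    "marks_free": [5, 5, 1, 4, 5, 10, 6, 10, 1, 4, 5, 10],
    "marks_utilized": [1, 5, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0],
    "date": ["2/1/2022"] * 6 + ["3/1/2022"] * 6,
})

target_date = "2/1/2022"  # assumption: dates are stored as strings

# Keep only the rows for the target date, then sum per location.
sums = (
    df[df["date"] == target_date]
    .groupby("location")[["marks_utilized", "total"]]
    .sum()
)

# Utilization as a fraction of the summed totals.
utilization = sums["marks_utilized"] / sums["total"]

best_location = utilization.idxmax()  # index label of the largest value
best_share = utilization.max()

# The '%' format spec multiplies by 100 and appends the percent sign.
print(f"{best_location} {best_share:.2%} {target_date}")
# prints: NY 38.89% 2/1/2022
```

Filtering before grouping keeps the other dates out of the comparison entirely, so idxmax can only return a location from 2/1/2022.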

CodePudding user response：

We could try computing the ratio per group with groupby and apply:

df.groupby(['location','date']).apply(lambda x : x['marks_utilized'].sum()/x['total'].sum()).\
    mul(100).reset_index(name = 'marks_utilized')
Out[279]: 
  location      date  marks_utilized
0       CA  3/1/2022        9.523810
1       NY  2/1/2022       38.888889
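
If the top location is wanted for every date at once, one possible extension of the result above is sketched here. It uses the question's full table rather than the shortened CSV, sums the two columns first instead of calling apply, and the names sums and best_per_date are only illustrative.

```python
import pandas as pd

# Data copied from the question's table.
df = pd.DataFrame({
    "ID": range(1, 13),
    "location": ["NY"] * 3 + ["CA"] * 3 + ["NY"] * 3 + ["CA"] * 3,
    "total": [6, 10, 2, 5, 6, 10, 6, 10, 2, 5, 6, 10],
    "marks_free": [5, 5, 1, 4, 5, 10, 6, 10, 1, 4, 5, 10],
    "marks_utilized": [1, 5, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0],
    "date": ["2/1/2022"] * 6 + ["3/1/2022"] * 6,
})

# Sum both columns per (location, date) pair, keeping the keys as columns.
sums = df.groupby(["location", "date"], as_index=False)[
    ["marks_utilized", "total"]
].sum()

# Utilization in percent of the summed totals.
sums["marks_utilized"] = sums["marks_utilized"] / sums["total"] * 100

# For each date, pick the row label with the highest utilization.
best_rows = sums.groupby("date")["marks_utilized"].idxmax()
best_per_date = sums.loc[
    best_rows, ["location", "marks_utilized", "date"]
].reset_index(drop=True)

print(best_per_date)
```

With the question's data this selects NY for 2/1/2022 (about 38.89) and CA for 3/1/2022 (about 9.52).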