I have a DataFrame that looks something like this:
Place| University | Field | Number of students
| New York | NYU | Fashion | Female | 1000
| New York | NYU | Fashion | Male | 500 |
| New York | NYU | CS | Female | 300 |
| New York | NYU | CS | Male |3050 |
| New York | NYU | Bio | Female | 350 |
| New York | NYU | Bio | Male | 400 |
| Texas | UT | Fashion | Female | 600 |
| Texas | UT | Fashion | Male | 100 |
| Texas | UT | CS | Female | 500 |
| Texas | UT | CS | Male |5050 |
| Texas | UT | Bio | Female | 650 |
| Texas | UT | Bio | Male | 700 |
Using only Pandas, I need to find all the unique fields of study then for each one, find the state with the most students:
| Field | Place | No. of Students|
| Fashion | New York | 1500 |
| CS | Texas | 5550 |
| Bio | Texas | 1350 |
I know we have to groupby the field but i'm unable to get the sum of students and then the one with the max number of students (col 2 and 3). I've tried agg and count but I might be using them in the wrong way. Please help guide me on this. TIA!
CodePudding user response:
You can do this
>>> df = df.groupby(['Place','Field'])['Number of students'].sum().reset_index()
Place Field Number of students
0 New York CS 500
1 New York Fashion 1500
2 Texas Bio 200
3 Texas CS 700
>>> df.sort_values('Number of students',ascending=False)
Place Field Number of students
1 New York Fashion 1500
3 Texas CS 700
0 New York CS 500
2 Texas Bio 200
CodePudding user response:
df.groupby(['State', 'Place', 'Field']).sum('Nos').groupby('Field').max()