Let's assume that we've got a dataframe made up of the following variables, among others:
Institution (name of the university)
Country (name of the country of the institution)
Year (integer, year in which that university was scored)
World_rank (integer, position in the world rank)
Alumni_employment (integer, number of alumni placements)
We want to select all US-based universities that were ranked <= 500 in 2015 and that share the same value for Alumni_employment.
While the first 3 requirements are easy to meet, I got stuck on the last one.
Here is my attempt:
import pandas as pd
import numpy as np
data = pd.read_csv("data/cwurData.csv")
americanuniv = data[(data.country == 'USA') & (data.year == 2015) & (data.world_rank <= 500)]
for x in data.alumni_employment:
    for y in data.alumni_employment:
        if x == y:
            print(americanuniv['institution'])
Naturally, it didn't work. To be honest, I don't know how to move forward with this last requirement. Do you have any thoughts on this?
Thank you so much!
CodePudding user response:
americanuniv = data.loc[(data["country"] == 'USA') & (data["year"] == 2015) & (data["world_rank"] <= 500)]
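# the column name is lowercase in the data; apply(list) collects the institutions in each group
americanuniv.groupby(by="alumni_employment")["institution"].apply(list)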
CodePudding user response:
For simplicity, let's use the following dataframe:
import pandas as pd
df = pd.DataFrame({'institution': ['A', 'B', 'C', 'D', 'E', 'F', 'G'],
                   'alumni_employment': [10, 20, 10, 30, 20, 5, 20]})
To get institutions with the same 'alumni_employment', use groupby. Then, filter to eliminate the ones in groups of size 1.
g = df.groupby('alumni_employment')
final = g.filter(lambda x: len(x) > 1)
The result is:
  institution  alumni_employment
0           A                 10
1           B                 20
2           C                 10
4           E                 20
6           G                 20
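As an alternative to groupby plus filter, the same rows can probably be selected with `duplicated` and `keep=False`, which flags every row whose value occurs more than once. This is only a minimal sketch on the same toy dataframe; the names `mask` and `final_alt` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'institution': ['A', 'B', 'C', 'D', 'E', 'F', 'G'],
                   'alumni_employment': [10, 20, 10, 30, 20, 5, 20]})

# keep=False marks every member of a repeated value, not only the later repeats
mask = df.duplicated(subset='alumni_employment', keep=False)

# rows D (30) and F (5) have unique values, so they are dropped
final_alt = df[mask]
print(final_alt)
```

This avoids calling a Python lambda once per group, which may matter on larger dataframes.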
If you want the ones with the same 'alumni_employment' to be printed together, you can do:
final = final.sort_values('alumni_employment')
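Putting it together with the filters from the question, the whole thing might look like the sketch below. The rows are hypothetical and only mimic the columns described in the question (with the lowercase names used in the question's code); with the real file you would start from `pd.read_csv` instead.

```python
import pandas as pd

# Hypothetical rows, NOT taken from cwurData.csv; they only mimic its columns
data = pd.DataFrame({
    'institution': ['U1', 'U2', 'U3', 'U4', 'U5', 'U6'],
    'country': ['USA', 'USA', 'USA', 'USA', 'Canada', 'USA'],
    'year': [2015, 2015, 2015, 2014, 2015, 2015],
    'world_rank': [1, 40, 300, 2, 10, 700],
    'alumni_employment': [9, 9, 15, 9, 9, 9],
})

# First three requirements: US-based, scored in 2015, ranked 500 or better
americanuniv = data[(data['country'] == 'USA')
                    & (data['year'] == 2015)
                    & (data['world_rank'] <= 500)]

# Last requirement: keep only groups where the value is shared by 2+ universities
shared = (americanuniv
          .groupby('alumni_employment')
          .filter(lambda g: len(g) > 1)
          .sort_values('alumni_employment'))

print(shared[['institution', 'alumni_employment']])
```

Note that the grouping is done on `americanuniv`, so a value only counts as shared if at least two universities that pass the first three filters have it.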