Pandas: How to filter repeated values of an axis?

Time:11-17

Let's assume that we have a dataframe containing the following variables, among others:

Institution (name of the university)
Country (name of the country of the institution)
Year (integer, year in which that university was scored)
World_rank (integer, position in the world rank)
Alumni_employment (integer, number of alumni placements)

We want to filter all American-based universities, which were ranked <= 500 in 2015 and share the same value for Alumni_employment.

While the first three requirements are easy to meet, I got stuck on the last one.

Here is my attempt:

import pandas as pd
import numpy as np
data = pd.read_csv("data/cwurData.csv")

americanuniv = data[(data.country == 'USA') & (data.year == 2015) & (data.world_rank <= 500)]
for x in data.alumni_employment:
    for y in data.alumni_employment:
        if x == y:   
            print(americanuniv['institution'])

Naturally, it didn't work. To be honest, I don't know how to move forward with the last requirement. Do you have any thoughts about this?

Thank you so much!

CodePudding user response:

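    americanuniv = data.loc[(data["country"] == 'USA') & (data["year"] == 2015) & (data["world_rank"] <= 500)]
    # Column names are lowercase in the question's code; list the institutions sharing each value
    americanuniv.groupby(by="alumni_employment")["institution"].apply(list)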

CodePudding user response:

For simplicity, let's use the following dataframe:

df = pd.DataFrame({'institution': ['A', 'B', 'C', 'D', 'E', 'F', 'G'], 
  'alumni_employment': [10, 20, 10, 30, 20, 5, 20]})

To get institutions with the same 'alumni_employment', use groupby. Then, filter to eliminate the ones in groups of size 1.

g = df.groupby('alumni_employment')
final = g.filter(lambda x: len(x) > 1)

The result is:


    institution alumni_employment
0   A           10
1   B           20
2   C           10
4   E           20
6   G           20
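
As a possible alternative to the groupby approach, the same rows can probably be selected with a boolean mask from `DataFrame.duplicated`. This is only a sketch on the toy dataframe above; the names `mask` and `final_alt` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'institution': ['A', 'B', 'C', 'D', 'E', 'F', 'G'],
    'alumni_employment': [10, 20, 10, 30, 20, 5, 20],
})

# keep=False flags every row whose value occurs more than once,
# not only the second and later occurrences
mask = df.duplicated(subset='alumni_employment', keep=False)
final_alt = df[mask]
print(final_alt)
```

This avoids calling a Python lambda once per group, which may matter on larger dataframes.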

If you want the rows with the same 'alumni_employment' to appear next to each other, sort by that column:

final = final.sort_values('alumni_employment')
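
Putting the pieces together for the original question might look like the sketch below. The small `data` frame is a hypothetical stand-in for cwurData.csv (its rows are invented), and the lowercase column names are assumed from the question's code.

```python
import pandas as pd

# Hypothetical sample rows standing in for cwurData.csv
data = pd.DataFrame({
    'institution': ['U1', 'U2', 'U3', 'U4', 'U5', 'U6'],
    'country': ['USA', 'USA', 'USA', 'USA', 'United Kingdom', 'USA'],
    'year': [2015, 2015, 2015, 2014, 2015, 2015],
    'world_rank': [1, 2, 600, 3, 4, 10],
    'alumni_employment': [7, 7, 7, 7, 7, 9],
})

# First three requirements: country, year and rank
americanuniv = data[(data['country'] == 'USA')
                    & (data['year'] == 2015)
                    & (data['world_rank'] <= 500)]

# Last requirement: keep only values shared by more than one institution
shared = (americanuniv.groupby('alumni_employment')
          .filter(lambda g: len(g) > 1)
          .sort_values('alumni_employment'))
print(shared[['institution', 'alumni_employment']])
```

Note that the duplicate check runs on the already filtered `americanuniv`, so a value counts as shared only if another US university ranked in the top 500 in 2015 has it too.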