Can someone help me check whether this is a reasonable way to create a new variable from an integer variable, or whether there is a better way to do it?
I have a data frame of world university rankings and I want to group the universities by their ranking.
def create_category(ranking):
    # Map a single integer ranking to a tier label.
    if ranking >= 1 and ranking <= 100:
        return "First Tier Top University"
    elif ranking >= 101 and ranking <= 200:
        return "Second Tier Top University"
    elif ranking >= 201 and ranking <= 300:
        return "Third Tier Top University"
    return "Other Top University"
CodePudding user response:
The way you're doing it will work and will be efficient. However, it isn't as easy to maintain as it could be.
Process the logic within a loop instead. Then, if you ever need to change the boundaries, you just change a couple of lists, not the logic.
Of course, you could use numpy, but that seems heavy-handed for this.
bounds = [0, 100, 200, 300]
values = ['First Tier Top University', 'Second Tier Top University', 'Third Tier Top University']

def create_category(ranking):
    # Walk consecutive pairs of bounds: (0, 100], (100, 200], (200, 300]
    for i, (x, y) in enumerate(zip(bounds, bounds[1:])):
        if ranking > x and ranking <= y:
            return values[i]
    return 'Other Top University'
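To illustrate the maintenance point, here is a minimal sketch of the same loop with the two lists passed in as arguments, so that adding a tier means changing only the lists. The name `categorize`, its parameters, and the fourth tier are hypothetical and only there for illustration.

```python
# Sketch only: `categorize` and its parameter names are hypothetical,
# not part of the answer above.
def categorize(ranking, bounds, labels, default="Other Top University"):
    # Each label covers the interval (lower, upper]: lower excluded, upper included.
    for label, lower, upper in zip(labels, bounds, bounds[1:]):
        if lower < ranking <= upper:
            return label
    return default

bounds = [0, 100, 200, 300]
labels = ["First Tier Top University",
          "Second Tier Top University",
          "Third Tier Top University"]

# Adding a (made-up) fourth tier only touches the two lists, not the function.
bounds4 = bounds + [400]
labels4 = labels + ["Fourth Tier Top University"]

sample = [1, 100, 101, 300, 350, 401]
three_tier = [categorize(r, bounds, labels) for r in sample]
four_tier = [categorize(r, bounds4, labels4) for r in sample]
```

With a data frame, a function like this can be applied per value, for example `df['ranking'].apply(create_category)`, assuming the column is called `ranking`.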
CodePudding user response:
Your quickest way is to use np.where:
import numpy as np
df['cat'] = 'Other Top University' # default
df['cat'] = np.where((df['ranking']>=1) & (df['ranking']<=100), 'First Tier Top University', df['cat'])
df['cat'] = np.where((df['ranking']>=101) & (df['ranking']<=200), 'Second Tier Top University', df['cat'])
df['cat'] = np.where((df['ranking']>=201) & (df['ranking']<=300), 'Third Tier Top University', df['cat'])
CodePudding user response:
I find this style of code more readable for this problem:
import pandas as pd
import numpy as np

# One boolean Series per tier; np.select uses the first condition that matches.
conditions = [
    df['ranking'].ge(1) & df['ranking'].le(100),
    df['ranking'].ge(101) & df['ranking'].le(200),
    df['ranking'].ge(201) & df['ranking'].le(300),
    df['ranking'].ge(301),
]
choices = ["First Tier Top University", "Second Tier Top University",
           "Third Tier Top University", "Other Top University"]
df['university_ranking'] = np.select(conditions, choices, default="")
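For a quick check, the following is a small runnable sketch of the `np.select` approach on made-up data. It assumes the column is named `ranking`, swaps the `ge`/`le` pairs for `Series.between` (inclusive on both ends by default), and lets `default` carry the "Other" label instead of a fourth condition, so it is a variation on the code above rather than the same thing.

```python
import numpy as np
import pandas as pd

# Made-up sample data; the column name 'ranking' is an assumption.
df = pd.DataFrame({"ranking": [1, 100, 101, 250, 301, 0]})

# Series.between(a, b) is inclusive on both ends by default.
conditions = [
    df["ranking"].between(1, 100),
    df["ranking"].between(101, 200),
    df["ranking"].between(201, 300),
]
choices = [
    "First Tier Top University",
    "Second Tier Top University",
    "Third Tier Top University",
]

# Rows that match no condition (here 301 and 0) receive the default label.
df["university_ranking"] = np.select(
    conditions, choices, default="Other Top University"
)
```

Note that in this variation a ranking of 0 also ends up as "Other Top University", whereas the version with an explicit `ge(301)` condition would leave it as the empty default.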
CodePudding user response:
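Use pandas.cut. By default each bin includes its right edge, so the bins below cover (0, 100], (100, 200], (200, 300] and (300, inf):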
import pandas as pd
import numpy as np
# You can use float('inf') instead of np.inf if you don't want to import numpy.
bins = [0, 100, 200, 300, np.inf]
labels = ["First Tier Top University", "Second Tier Top University", "Third Tier Top University", "Other Top University"]
df['group'] = pd.cut(df['ranking'], bins=bins, labels=labels)
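A small sketch on made-up data shows how the bin edges behave. The sample rankings and the column name `ranking` are assumptions. Values that fall outside every bin, such as 0 here, come back as NaN rather than as "Other".

```python
import numpy as np
import pandas as pd

# Made-up sample data; the column name 'ranking' is an assumption.
df = pd.DataFrame({"ranking": [1, 100, 101, 300, 301, 0]})

bins = [0, 100, 200, 300, np.inf]
labels = [
    "First Tier Top University",
    "Second Tier Top University",
    "Third Tier Top University",
    "Other Top University",
]

# Default right=True: intervals are (0, 100], (100, 200], (200, 300], (300, inf].
df["group"] = pd.cut(df["ranking"], bins=bins, labels=labels)

# 100 lands in the first tier and 300 in the third (right edge included);
# 0 is outside (0, 100] and therefore becomes NaN.
```

The result is a categorical column, which keeps the tier order and is handy for sorting or grouping. If missing or zero rankings are possible, they would need separate handling.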