Right way to categorize an int variable with Python?

Can someone help me check whether this is the right way to create a new categorical variable from an integer variable, or whether there is a better way to do it?

I have a data frame of world university rankings, and I want to group the universities by their ranking.

def create_category(ranking):
    if (ranking >= 1) & (ranking <= 100):
        return "First Tier Top University"
    elif (ranking >= 101) & (ranking <= 200):
        return "Second Tier Top University"
    elif (ranking >= 201) & (ranking <= 300):
        return "Third Tier Top University"
    return "Other Top University"

CodePudding user response：

The way you're doing it will work and will be efficient. However, it isn't as easy to maintain as it could be.

Put the boundaries and labels in lists and process them in a loop; then, if you ever need to change the boundaries, you only edit the lists, not the logic.

Of course, you could use NumPy, but that seems heavy-handed for this.

bounds = [0, 100, 200, 300]
values = ['First Tier Top University', 'Second Tier Top University', 'Third Tier Top University']

def create_category(ranking):
    # Pair each bound with the next one: (0, 100), (100, 200), (200, 300)
    for i, (x, y) in enumerate(zip(bounds, bounds[1:])):
        if x < ranking <= y:
            return values[i]
    return 'Other Top University'
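
If the list of tiers grows, the same table-driven idea could also be sketched with the standard-library bisect module instead of an explicit loop. This is only a sketch: the names upper_bounds, labels and create_category_bisect are made up for illustration, and it assumes integer rankings where anything below 1 or above 300 falls into the catch-all label.

```python
from bisect import bisect_left

# Hypothetical names; the tier boundaries mirror the ones used in this thread.
upper_bounds = [100, 200, 300]  # inclusive upper edge of each tier
labels = [
    "First Tier Top University",
    "Second Tier Top University",
    "Third Tier Top University",
    "Other Top University",  # catch-all for everything past the last bound
]

def create_category_bisect(ranking):
    # Rankings below 1 are not valid positions, so send them to the catch-all.
    if ranking < 1:
        return labels[-1]
    # bisect_left returns the index of the first bound that is >= ranking,
    # or len(upper_bounds) when ranking exceeds every bound.
    return labels[bisect_left(upper_bounds, ranking)]
```

Changing a boundary or adding a tier then means editing the two lists only, the same as with the loop version.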

CodePudding user response：

The quickest way is to use np.where:

import numpy as np
df['cat'] = 'Other Top University'  # default
df['cat'] = np.where((df['ranking'] >= 1) & (df['ranking'] <= 100), 'First Tier Top University', df['cat'])
df['cat'] = np.where((df['ranking'] >= 101) & (df['ranking'] <= 200), 'Second Tier Top University', df['cat'])
df['cat'] = np.where((df['ranking'] >= 201) & (df['ranking'] <= 300), 'Third Tier Top University', df['cat'])

CodePudding user response：

I find this style of code more readable for this kind of problem:

import pandas as pd
import numpy as np
conditions = [
    df['ranking'].ge(1) & df['ranking'].le(100),
    df['ranking'].ge(101) & df['ranking'].le(200),
    df['ranking'].ge(201) & df['ranking'].le(300),
    df['ranking'].ge(301)
]

choices = ["First Tier Top University", "Second Tier Top University",
           "Third Tier Top University", "Other Top University"]

df['university_ranking'] = np.select(conditions, choices, default="")
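
A possible variation, assuming the column is named ranking as in the snippets in this thread: Series.between is inclusive on both ends by default, and the catch-all label can be passed as default instead of being written as a fourth condition. The sample frame below is made up for illustration. Note that with this variant any row matching none of the conditions (for example a ranking below 1) also gets the catch-all label rather than an empty string.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the 'ranking' column name comes from the thread.
df = pd.DataFrame({"ranking": [50, 100, 150, 250, 999]})

# between() includes both endpoints by default.
conditions = [
    df["ranking"].between(1, 100),
    df["ranking"].between(101, 200),
    df["ranking"].between(201, 300),
]
choices = [
    "First Tier Top University",
    "Second Tier Top University",
    "Third Tier Top University",
]

# Rows that satisfy none of the conditions receive the default label.
df["university_ranking"] = np.select(conditions, choices, default="Other Top University")
print(df)
```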

CodePudding user response：

Use pandas.cut, which assigns each value to a bin and labels it:

import pandas as pd
import numpy as np 

# You can use float('inf') instead of np.inf if you don't want to import numpy.
bins = [0, 100, 200, 300, np.inf]
labels = ["First Tier Top University", "Second Tier Top University", "Third Tier Top University", "Other Top University"]

df['group'] = pd.cut(df['ranking'], bins=bins, labels=labels)
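
To check how the bin edges behave, here is a small worked sketch on made-up data (the sample rankings are an assumption; the bins and column names follow the snippet above). With the default right=True, each bin is open on the left and closed on the right, so 100 lands in the first tier and 101 in the second. Values outside every bin, such as a ranking of 0, would come back as NaN.

```python
import pandas as pd

# Hypothetical sample data chosen to sit on the bin edges.
df = pd.DataFrame({"ranking": [1, 100, 101, 200, 201, 300, 301]})

bins = [0, 100, 200, 300, float("inf")]
labels = [
    "First Tier Top University",
    "Second Tier Top University",
    "Third Tier Top University",
    "Other Top University",
]

# right=True is the default: bins are (0, 100], (100, 200], (200, 300], (300, inf].
df["group"] = pd.cut(df["ranking"], bins=bins, labels=labels, right=True)
print(df)
```

The resulting column has a categorical dtype whose categories follow the order of labels, which can be handy for sorting or grouping by tier.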