Right way to categorize an int variable with python?

Time:08-07

Can someone help me check whether this is the right way to create a new variable from an integer variable, or whether there is a better way to do it?

I have a data frame of world university rankings and I want to group the universities by their ranking.

    def create_category(ranking):
        if (ranking >= 1) & (ranking <= 100):
            return "First Tier Top University"
        elif (ranking >= 101) & (ranking <= 200):
            return "Second Tier Top University"
        elif (ranking >= 201) & (ranking <= 300):
            return "Third Tier Top University"
        return "Other Top University"

CodePudding user response:

The way you're doing it will work and will be efficient. However, it isn't as easy to maintain as it could be.

If you process the logic in a loop, then whenever you need to change the boundaries you only change a couple of lists, not the logic.

You could also use numpy, but that seems heavy-handed for this.

bounds = [0, 100, 200, 300]
values = ['First Tier Top University', 'Second Tier Top University', 'Third Tier Top University']

def create_category(ranking):
    for i, (x, y) in enumerate(zip(bounds, bounds[1:])):
        if ranking > x and ranking <= y:
            return values[i]
    return 'Other Top University'
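
A variation on the same idea, offered only as a sketch: if the tiers are contiguous and the rankings are integers, the standard-library `bisect` module can look up the tier index without an explicit loop. The names `UPPER_BOUNDS`, `LABELS` and `create_category_bisect` are made up for this illustration and are not part of the answer above.

```python
from bisect import bisect_left

# Inclusive upper bound of each tier; anything above the last bound is "Other".
UPPER_BOUNDS = [100, 200, 300]
LABELS = [
    "First Tier Top University",
    "Second Tier Top University",
    "Third Tier Top University",
    "Other Top University",
]

def create_category_bisect(ranking):
    # Rankings below 1 fall through to "Other", as in the loop version.
    if ranking < 1:
        return LABELS[-1]
    # bisect_left returns the index of the first bound that is >= ranking,
    # or len(UPPER_BOUNDS) when ranking is above every bound.
    return LABELS[bisect_left(UPPER_BOUNDS, ranking)]

# Check the values that sit on the tier boundaries.
for r in (1, 100, 101, 300, 301):
    print(r, create_category_bisect(r))
```

As with the loop, changing the tiers only means editing the two lists.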

CodePudding user response:

The quickest way is to use np.where:

import numpy as np
df['cat'] = 'Other Top University'  # default
df['cat'] = np.where((df['ranking'] >= 1) & (df['ranking'] <= 100), 'First Tier Top University', df['cat'])
df['cat'] = np.where((df['ranking'] >= 101) & (df['ranking'] <= 200), 'Second Tier Top University', df['cat'])
df['cat'] = np.where((df['ranking'] >= 201) & (df['ranking'] <= 300), 'Third Tier Top University', df['cat'])

CodePudding user response:

I find this kind of code more readable for this problem:

import pandas as pd
import numpy as np
conditions = [
    df['ranking'].ge(1) & df['ranking'].le(100),
    df['ranking'].ge(101) & df['ranking'].le(200),
    df['ranking'].ge(201) & df['ranking'].le(300),
    df['ranking'].ge(301)
]

choices = ["First Tier Top University", "Second Tier Top University",
           "Third Tier Top University", "Other Top University"]

df['university_ranking'] = np.select(conditions, choices, default="")

CodePudding user response:

Use pandas.cut

import pandas as pd
import numpy as np 

# You can use float('inf') instead of np.inf if you don't want to import numpy.
bins = [0, 100, 200, 300, np.inf]
labels = ["First Tier Top University", "Second Tier Top University", "Third Tier Top University", "Other Top University"]

df['group'] = pd.cut(df['ranking'], bins=bins, labels=labels)
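
To see how the bin edges are treated, here is a small sketch on made-up rankings (the sample data is an assumption, not taken from the question). With the default `right=True`, each bin includes its upper edge, so the bins are (0, 100], (100, 200], (200, 300] and (300, inf].

```python
import numpy as np
import pandas as pd

# Made-up rankings chosen to sit on the tier boundaries.
df = pd.DataFrame({"ranking": [1, 100, 101, 200, 201, 300, 301]})

bins = [0, 100, 200, 300, np.inf]
labels = ["First Tier Top University", "Second Tier Top University",
          "Third Tier Top University", "Other Top University"]

# right=True is the default; it is spelled out here to make the
# inclusive upper edge of each bin explicit.
df["group"] = pd.cut(df["ranking"], bins=bins, labels=labels, right=True)

print(df)
print(df["group"].dtype)  # the new column has a categorical dtype
```

Note that a ranking outside every bin (for example 0) would come back as NaN rather than as one of the labels.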