Home > database >  Pandas: Find ratio of values for a column and then groupby on another column
Pandas: Find ratio of values for a column and then groupby on another column

Time:07-02

I am working with the Census Income dataset. You can find the dataset file adult.data in the "Data Folder".

To quickly reproduce the problem, here is how to load it:

training_df = pd.read_csv('adult.data', header = None, skipinitialspace = True)
columns = ['age','workclass','fnlwgt','education','education-num','marital-status',
           'occupation','relationship','race','sex','capital-gain','capital-loss',
           'hours-per-week','native-country','income']
training_df.columns = columns

I am trying to find out income imbalance for each native_country using a simple ratio:

income imbalance = population with <=50K income / population with >50K income

Here is my most naive and non-Pythonic and non-Pandas way to do it:

def native_country_income_imbalance():
    income_dict = {}
    for index, data in training_df.iterrows():
        native_country = data['native-country']
        income = data['income']

        # 1st number will store count of >50K and second <=50K
        income_count = [0,0]
        if not income_dict.get(native_country, False):
            if income == '>50K':
                income_count[0]  = 1
                income_dict[native_country] = income_count
            else:
                income_count[1]  = 1
                income_dict[native_country] = income_count
        else: 
            if income == '>50K':
                income_dict[native_country][0]  = 1
            else:
                income_dict[native_country][1]  = 1
    
    for country, incomes in income_dict.items():
            # For a native_country where there is no one with >50K
            # income, we'll make proportion 0 as a special case
            if incomes[0] != 0:
                proportion = round(incomes[1] / incomes[0], 2)
            else:
                proportion = 0
            income_dict[country] = proportion
    income_dict = dict(sorted(income_dict.items(), key=lambda item: item[1],reverse=True))
    
    return income_dict

call the function

native_country_income_imbalance()

returns the output correctly as

{'Dominican-Republic': 34.0,
 'Columbia': 28.5,
 'Guatemala': 20.33,
 'Mexico': 18.48,
 'Honduras': 12.0,
 .
 .
 'Taiwan': 1.55,
 'India': 1.5,
 'France': 1.42,
 'Iran': 1.39,
 'Outlying-US(Guam-USVI-etc)': 0,
 'Holand-Netherlands': 0}

This is clearly verbose and not something which is utilising Pandas' true power (vectorization groupby transform). How do I improve this?

Note: Please feel free to improve the question title.

CodePudding user response:

Pandas solution

Create a frequency table with crosstab, then mask the values where count is zero, then use eval to divide <=50K by >50K column to calculate ratio

s = pd.crosstab(df['native-country'], df['income'])
result = s[s != 0].eval('`<=50K` / `>50K`').fillna(0)
result = result.round(decimals = 2)
result = result.sort_values(ascending=False)
result

native-country
Dominican-Republic            34.00
Columbia                      28.50
Guatemala                     20.33
Mexico                        18.48
Nicaragua                     16.00
.
.
India                          1.50
France                         1.42
Iran                           1.39
Outlying-US(Guam-USVI-etc)     0.00
Holand-Netherlands             0.00
dtype: float64
  • Related