I am working with the Census Income dataset. You can find the dataset file adult.data in the "Data Folder".
To quickly reproduce the problem, here is how to load it:
training_df = pd.read_csv('adult.data', header = None, skipinitialspace = True)
columns = ['age','workclass','fnlwgt','education','education-num','marital-status',
'occupation','relationship','race','sex','capital-gain','capital-loss',
'hours-per-week','native-country','income']
training_df.columns = columns
I am trying to find out income imbalance for each native_country using a simple ratio:
income imbalance = population with <=50K income / population with >50K income
Here is my most naive and non-Pythonic and non-Pandas way to do it:
def native_country_income_imbalance():
income_dict = {}
for index, data in training_df.iterrows():
native_country = data['native-country']
income = data['income']
# 1st number will store count of >50K and second <=50K
income_count = [0,0]
if not income_dict.get(native_country, False):
if income == '>50K':
income_count[0] = 1
income_dict[native_country] = income_count
else:
income_count[1] = 1
income_dict[native_country] = income_count
else:
if income == '>50K':
income_dict[native_country][0] = 1
else:
income_dict[native_country][1] = 1
for country, incomes in income_dict.items():
# For a native_country where there is no one with >50K
# income, we'll make proportion 0 as a special case
if incomes[0] != 0:
proportion = round(incomes[1] / incomes[0], 2)
else:
proportion = 0
income_dict[country] = proportion
income_dict = dict(sorted(income_dict.items(), key=lambda item: item[1],reverse=True))
return income_dict
call the function
native_country_income_imbalance()
returns the output correctly as
{'Dominican-Republic': 34.0,
'Columbia': 28.5,
'Guatemala': 20.33,
'Mexico': 18.48,
'Honduras': 12.0,
.
.
'Taiwan': 1.55,
'India': 1.5,
'France': 1.42,
'Iran': 1.39,
'Outlying-US(Guam-USVI-etc)': 0,
'Holand-Netherlands': 0}
This is clearly verbose and not something which is utilising Pandas' true power (vectorization groupby transform). How do I improve this?
Note: Please feel free to improve the question title.
CodePudding user response:
Pandas solution
Create a frequency table with crosstab
, then mask the values where count is zero, then use eval to divide <=50K
by >50K
column to calculate ratio
s = pd.crosstab(df['native-country'], df['income'])
result = s[s != 0].eval('`<=50K` / `>50K`').fillna(0)
result = result.round(decimals = 2)
result = result.sort_values(ascending=False)
result
native-country
Dominican-Republic 34.00
Columbia 28.50
Guatemala 20.33
Mexico 18.48
Nicaragua 16.00
.
.
India 1.50
France 1.42
Iran 1.39
Outlying-US(Guam-USVI-etc) 0.00
Holand-Netherlands 0.00
dtype: float64