In a table that gives the demographics of a certain population, I want to find what is the proportion of German citizens. I was wondering if there is a feature in Pandas to find out what proportion of rows have a certain label, or in this case, what proportion of rows had "Germany" in the "native-country" column.
The data is from mlcourse.ai: https://raw.githubusercontent.com/Yorko/mlcourse.ai/master/data/
I tried to use the "value counts" function to see if I could print out the number of rows that had :Germany" under the "native-country" column. Setting normalize to true, I would just need to multiply the result by 100 to get the proportion of people who had Germany as their native country.
data[data["native-country"]=="Germany"].value_counts(normalize=True)
output:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
18 ? 85154 12th 8 Never-married ? Own-child Asian-Pac-Islander Female 0 0 24 Germany <=50K 0.007299
46 Private 35961 Assoc-acdm 12 Divorced Sales Not-in-family White Female 0 0 25 Germany <=50K 0.007299
45 Private 161954 Bachelors 13 Never-married Prof-specialty Not-in-family White Female 0 0 40 Germany <=50K 0.007299
174794 Bachelors 13 Separated Prof-specialty Unmarried White Female 0 0 56 Germany <=50K 0.007299
204057 Bachelors 13 Divorced Adm-clerical Unmarried White Female 0 0 40 Germany <=50K 0.007299
...
30 Private 318749 Assoc-voc 11 Married-civ-spouse Tech-support Wife White Female 0 0 35 Germany <=50K 0.007299
116508 HS-grad 9 Married-civ-spouse Craft-repair Husband White Male 0 0 40 Germany <=50K 0.007299
111415 HS-grad 9 Married-civ-spouse Other-service Husband White Male 0 0 55 Germany <=50K 0.007299
77143 Bachelors 13 Never-married Exec-managerial Own-child Black Male 0 0 40 Germany <=50K 0.007299
74 Self-emp-not-inc 199136 Bachelors 13 Widowed Craft-repair Not-in-family White Male 15831 0 8 Germany >50K 0.007299
Length: 137, dtype: float64
This did not seem to work and returned a the segments of the table that consisted of people with Germany as their native country. I could use this to get my answer, but I am looking for a simpler method, if there is one.
CodePudding user response:
What's happening is that you're currently filtering for rows that have ["native-country"]=="Germany"
and then running value counts on the whole resulting DataFrame. This will give you counts of 1 because each row is unique when all attributes are taken into account.
What you should do instead is isolate the native-country
column and then doing counts based on that.
The code would look like this:
native_countries = data["native-country"]
native_countries_count = native_countries.value_counts(normalize=True)
print(native_countries_count["Germany"])
I've created the following toy dataset to demonstrate:
df = pd.DataFrame({
"age": [12, 23, 34, 45],
"native-country": ["Germany", "Germany", "Germany", "France"]})
print(df)
# age native-country
# 0 12 Germany
# 1 23 Germany
# 2 34 Germany
# 3 45 France
native_countries = df["native-country"]
print(native_countries)
# 0 Germany
# 1 Germany
# 2 Germany
# 3 France
# Name: native-country, dtype: object
native_countries_count = native_countries.value_counts(normalize=True)
print(native_countries_count)
# Germany 0.75
# France 0.25
# Name: native-country, dtype: float64
print(native_countries_count["Germany"])
# 0.75
CodePudding user response:
Try this:
import pandas as pd
data = pd.DataFrame(
{'native-country':['France', 'Australia', 'South Africa', 'Germany', 'France', 'Australia', 'South Africa', 'Germany', 'France', 'Australia', 'South Africa', 'Germany', 'France', 'Australia', 'South Africa', 'Germany'],
'age':[21,22,23,24,25,26,27,28,29,30,29,28,27,26,25,24]})
print(data[data["native-country"]=="Germany"].shape[0] / data.shape[0])
Output:
0.25