What is the easiest way in Pandas to find what proportion of rows have a particular label?-CodePudding

In a table that gives the demographics of a certain population, I want to find what is the proportion of German citizens. I was wondering if there is a feature in Pandas to find out what proportion of rows have a certain label, or in this case, what proportion of rows had "Germany" in the "native-country" column.

The data is from mlcourse.ai: https://raw.githubusercontent.com/Yorko/mlcourse.ai/master/data/

I tried to use the "value counts" function to see if I could print out the number of rows that had :Germany" under the "native-country" column. Setting normalize to true, I would just need to multiply the result by 100 to get the proportion of people who had Germany as their native country.

data[data["native-country"]=="Germany"].value_counts(normalize=True)

output:

age  workclass         fnlwgt  education   education-num  marital-status      occupation       relationship   race                sex     capital-gain  capital-loss  hours-per-week  native-country  salary
18   ?                 85154   12th        8              Never-married       ?                Own-child      Asian-Pac-Islander  Female  0             0             24              Germany         <=50K     0.007299
46   Private           35961   Assoc-acdm  12             Divorced            Sales            Not-in-family  White               Female  0             0             25              Germany         <=50K     0.007299
45   Private           161954  Bachelors   13             Never-married       Prof-specialty   Not-in-family  White               Female  0             0             40              Germany         <=50K     0.007299
                       174794  Bachelors   13             Separated           Prof-specialty   Unmarried      White               Female  0             0             56              Germany         <=50K     0.007299
                       204057  Bachelors   13             Divorced            Adm-clerical     Unmarried      White               Female  0             0             40              Germany         <=50K     0.007299
                                                                                                                                                                                                                  ...   
30   Private           318749  Assoc-voc   11             Married-civ-spouse  Tech-support     Wife           White               Female  0             0             35              Germany         <=50K     0.007299
                       116508  HS-grad     9              Married-civ-spouse  Craft-repair     Husband        White               Male    0             0             40              Germany         <=50K     0.007299
                       111415  HS-grad     9              Married-civ-spouse  Other-service    Husband        White               Male    0             0             55              Germany         <=50K     0.007299
                       77143   Bachelors   13             Never-married       Exec-managerial  Own-child      Black               Male    0             0             40              Germany         <=50K     0.007299
74   Self-emp-not-inc  199136  Bachelors   13             Widowed             Craft-repair     Not-in-family  White               Male    15831         0             8               Germany         >50K      0.007299
Length: 137, dtype: float64

This did not seem to work and returned a the segments of the table that consisted of people with Germany as their native country. I could use this to get my answer, but I am looking for a simpler method, if there is one.

CodePudding user response：

What's happening is that you're currently filtering for rows that have ["native-country"]=="Germany" and then running value counts on the whole resulting DataFrame. This will give you counts of 1 because each row is unique when all attributes are taken into account.

What you should do instead is isolate the native-country column and then doing counts based on that.

The code would look like this:

native_countries = data["native-country"]
native_countries_count = native_countries.value_counts(normalize=True)
print(native_countries_count["Germany"])

I've created the following toy dataset to demonstrate:

df = pd.DataFrame({
"age": [12, 23, 34, 45],
"native-country": ["Germany", "Germany", "Germany", "France"]})

print(df)
#    age native-country
# 0   12        Germany
# 1   23        Germany
# 2   34        Germany
# 3   45         France

native_countries = df["native-country"]

print(native_countries)
# 0    Germany
# 1    Germany
# 2    Germany
# 3     France
# Name: native-country, dtype: object

native_countries_count = native_countries.value_counts(normalize=True)
print(native_countries_count)
# Germany    0.75
# France     0.25
# Name: native-country, dtype: float64

print(native_countries_count["Germany"])
# 0.75

CodePudding user response：

Try this:

import pandas as pd
data = pd.DataFrame(
    {'native-country':['France', 'Australia', 'South Africa', 'Germany', 'France', 'Australia', 'South Africa', 'Germany', 'France', 'Australia', 'South Africa', 'Germany', 'France', 'Australia', 'South Africa', 'Germany'],
    'age':[21,22,23,24,25,26,27,28,29,30,29,28,27,26,25,24]})

print(data[data["native-country"]=="Germany"].shape[0] / data.shape[0])

Output:

0.25