How to calculate percentage per year of a dataframe column with qualitative data?

Time:01-04

Suppose my dataframe looks as follows:

Build year  Brand
2010        Mercedes
2010        Mercedes
2010        BMW
2010        Kia
2011        Toyota
2011        Mercedes
2011        Mercedes
2012        Tesla

I want to find all unique combinations of build year and brand, then count the values and calculate the percentage for each brand per year. Currently I have this:

df.groupby(["Build year", "Brand"]).count()

Is there an easy way to convert this to percentage per year? The desired output is:

Build year  Brand     Count  Percentage of annual count
2010        Mercedes  2      0.5
2010        BMW       1      0.25
2010        Kia       1      0.25
2011        Toyota    1      0.33
2011        Mercedes  2      0.66
2012        Tesla     1      1

CodePudding user response:

Instead of grouping by both Build year and Brand, you only have to group by Build year and use .value_counts, which counts the remaining columns (here Brand) within each group:

import pandas as pd

df = pd.read_clipboard() # Your df here
groups = df.groupby("Build year")

count = groups.value_counts()
percentage = groups.value_counts(normalize=True)

out = pd.concat([count, percentage], axis=1, keys=["Count", "Percentage of annual count"])
                     Count  Percentage of annual count
Build year Brand
2010       Mercedes      2                    0.500000
           BMW           1                    0.250000
           Kia           1                    0.250000
2011       Mercedes      2                    0.666667
           Toyota        1                    0.333333
2012       Tesla         1                    1.000000
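If you prefer a flat table like the desired output in the question, the MultiIndex result can be flattened with .reset_index(). A small sketch extending the answer above, using an inline copy of the sample data in place of pd.read_clipboard():

```python
import pandas as pd

# Inline copy of the sample data (stands in for pd.read_clipboard()).
df = pd.DataFrame({
    "Build year": [2010, 2010, 2010, 2010, 2011, 2011, 2011, 2012],
    "Brand": ["Mercedes", "Mercedes", "BMW", "Kia",
              "Toyota", "Mercedes", "Mercedes", "Tesla"],
})

groups = df.groupby("Build year")

out = pd.concat(
    [groups.value_counts(), groups.value_counts(normalize=True)],
    axis=1,
    keys=["Count", "Percentage of annual count"],
).reset_index()  # turn the (Build year, Brand) index back into columns

print(out)
```

Note that DataFrameGroupBy.value_counts requires pandas 1.4 or newer.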

CodePudding user response:

You can calculate the Percentage of annual count by dividing each combination's count by the annual total, computed with a grouped transform, see below:

grouped_df = df.groupby(["Build year", "Brand"])

counts = grouped_df.size().reset_index(name='Count')

counts['Percentage of annual count'] = (100 * counts['Count'] / counts.groupby('Build year')['Count'].transform('sum')).round(2)

Example of full code

import pandas as pd

data = {'Build year': [2010, 2010, 2010, 2010, 2011, 2011, 2011, 2012],
        'Brand': ['Mercedes', 'Mercedes', 'BMW', 'Kia', 'Toyota', 'Mercedes', 'Mercedes', 'Tesla']}

df = pd.DataFrame(data)

grouped_df = df.groupby(["Build year", "Brand"])
counts = grouped_df.size().reset_index(name='Count')
counts['Percentage of annual count'] = (100 * counts['Count'] / counts.groupby('Build year')['Count'].transform('sum')).round(2)

print(counts)

Output

   Build year     Brand  Count  Percentage of annual count
0        2010       BMW      1                       25.00
1        2010       Kia      1                       25.00
2        2010  Mercedes      2                       50.00
3        2011  Mercedes      2                       66.67
4        2011    Toyota      1                       33.33
5        2012     Tesla      1                      100.00

Hope it helps
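For completeness (not part of the answer above), pd.crosstab with normalize="index" yields the same per-year shares in a wide layout, one column per brand, with 0 for brand/year combinations that never occur:

```python
import pandas as pd

data = {'Build year': [2010, 2010, 2010, 2010, 2011, 2011, 2011, 2012],
        'Brand': ['Mercedes', 'Mercedes', 'BMW', 'Kia', 'Toyota', 'Mercedes', 'Mercedes', 'Tesla']}
df = pd.DataFrame(data)

# normalize="index" divides each row (build year) by its row total,
# so every row of percentages sums to 1.
pct = pd.crosstab(df['Build year'], df['Brand'], normalize='index')
print(pct.round(2))
```

This wide form is convenient for plotting (e.g. stacked bar charts) but would need stacking to match the long format asked for in the question.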

CodePudding user response:

A method-chaining approach (assuming column names without spaces) could look like this:

(
    df.groupby(["build_year", "brand"])
      .agg(count=("build_year", "count"))
      .assign(Percentage_of_annual_count=lambda x:
              x["count"] / x.groupby("build_year")["count"].transform("sum"))
)

Full solution:

df = pd.DataFrame({
"build_year":[2010,2010,2010,2010,2011,2011,2011,2012,],
"brand":["Mercedes","Mercedes","BMW","Kia","Toyota","Mercedes","Mercedes","Tesla",]
})
(
    df.groupby(["build_year", "brand"])
      .agg(count=("build_year", "count"))
      .assign(Percentage_of_annual_count=lambda x:
              x["count"] / x.groupby("build_year")["count"].transform("sum"))
)

Output:

                     count  Percentage_of_annual_count
build_year brand                                      
2010       BMW           1                    0.250000
           Kia           1                    0.250000
           Mercedes      2                    0.500000
2011       Mercedes      2                    0.666667
           Toyota        1                    0.333333
2012       Tesla         1                    1.000000