How to calculate percentage per year of a dataframe column with qualitative data?

Suppose my dataframe looks as follows:

Build year	Brand
2010	Mercedes
2010	Mercedes
2010	BMW
2010	Kia
2011	Toyota
2011	Mercedes
2011	Mercedes
2012	Tesla

I want to find all unique combinations of build year and brand, then count the rows for each combination and calculate the percentage for each brand per year. Currently I have this:

df.groupby(["Build year", "Brand"]).count()

Is there an easy way to convert this to percentage per year? The desired output is:

Build year	Brand	Count	Percentage of annual count
2010	Mercedes	2	0.5
2010	BMW	1	0.25
2010	Kia	1	0.25
2011	Toyota	1	0.33
2011	Mercedes	2	0.66
2012	Tesla	1	1

CodePudding user response：

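Instead of grouping by both Build year and Brand, you only need to group by Build year and call .value_counts on the groups, once for the raw counts and once with normalize=True for the share within each year (DataFrameGroupBy.value_counts needs pandas 1.4 or later):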

import pandas as pd

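df = pd.read_clipboard() # Your df here
groups = df.groupby("Build year")

# Rows per brand within each year
count = groups.value_counts()
# Same counts divided by the year's total
percentage = groups.value_counts(normalize=True)

# Put both results side by side under the desired column names
out = pd.concat([count, percentage], axis=1, keys=["Count", "Percentage of annual count"])
print(out)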

                     Count  Percentage of annual count
Build year Brand
2010       Mercedes      2                    0.500000
           BMW           1                    0.250000
           Kia           1                    0.250000
2011       Mercedes      2                    0.666667
           Toyota        1                    0.333333
2012       Tesla         1                    1.000000
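If you would rather not rely on the clipboard, here is a minimal self-contained sketch of the same idea. It builds the frame from the sample data in the question, selects the Brand column explicitly (my own choice, so that extra columns in a real frame do not change what gets counted), and adds a reset_index() to get a flat table like the one asked for. The name flat is only illustrative.

```python
import pandas as pd

# Sample data copied from the question, so no clipboard is needed.
df = pd.DataFrame({
    "Build year": [2010, 2010, 2010, 2010, 2011, 2011, 2011, 2012],
    "Brand": ["Mercedes", "Mercedes", "BMW", "Kia",
              "Toyota", "Mercedes", "Mercedes", "Tesla"],
})

# Group by year and look only at the Brand column (assumption: Brand is
# the only column we want to count).
groups = df.groupby("Build year")["Brand"]

# Absolute counts and within-year shares, both indexed by (Build year, Brand).
count = groups.value_counts()
percentage = groups.value_counts(normalize=True)

# Combine side by side, then turn the index levels back into columns.
flat = (
    pd.concat([count, percentage], axis=1,
              keys=["Count", "Percentage of annual count"])
    .reset_index()
)
print(flat)
```

Multiply the last column by 100 if you prefer percentages over fractions.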

CodePudding user response：

You can count the rows for each (Build year, Brand) pair with size(), then divide each count by the total for its year:

counts = df.groupby(["Build year", "Brand"]).size().reset_index(name='Count')

annual_total = counts.groupby("Build year")['Count'].transform('sum')

counts['Percentage of annual count'] = (100 * counts['Count'] / annual_total).round(2)

Example of full code

import pandas as pd

data = {'Build year': [2010, 2010, 2010, 2010, 2011, 2011, 2011, 2012],
        'Brand': ['Mercedes', 'Mercedes', 'BMW', 'Kia', 'Toyota', 'Mercedes', 'Mercedes', 'Tesla']}

df = pd.DataFrame(data)

# Number of rows per (Build year, Brand) combination
counts = df.groupby(["Build year", "Brand"]).size().reset_index(name='Count')
# Total per year, repeated on every row belonging to that year
annual_total = counts.groupby("Build year")['Count'].transform('sum')
counts['Percentage of annual count'] = (100 * counts['Count'] / annual_total).round(2)

print(counts)

Output

   Build year     Brand  Count  Percentage of annual count
0        2010       BMW      1                       25.00
1        2010       Kia      1                       25.00
2        2010  Mercedes      2                       50.00
3        2011  Mercedes      2                       66.67
4        2011    Toyota      1                       33.33
5        2012     Tesla      1                      100.00

Hope it helps

CodePudding user response：

A method-chaining approach (assuming column names without spaces) could look like this:

(
    df.groupby(["build_year", "brand"])
    .agg(count=("build_year", "count"))
    .assign(
        Percentage_of_annual_count=lambda x: x["count"]
        / x.groupby("build_year")["count"].transform("sum")
    )
)

Full solution:

import pandas as pd

df = pd.DataFrame({
    "build_year": [2010, 2010, 2010, 2010, 2011, 2011, 2011, 2012],
    "brand": ["Mercedes", "Mercedes", "BMW", "Kia", "Toyota", "Mercedes", "Mercedes", "Tesla"],
})

out = (
    df.groupby(["build_year", "brand"])
    # count rows per (build_year, brand) pair
    .agg(count=("build_year", "count"))
    # divide each count by the total for its year
    .assign(
        Percentage_of_annual_count=lambda x: x["count"]
        / x.groupby("build_year")["count"].transform("sum")
    )
)
print(out)

Output:

                     count  Percentage_of_annual_count
build_year brand                                      
2010       BMW           1                    0.250000
           Kia           1                    0.250000
           Mercedes      2                    0.500000
2011       Mercedes      2                    0.666667
           Toyota        1                    0.333333
2012       Tesla         1                    1.000000
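The key step in the chain above is the transform on the per-year groups: it returns one value per input row (the total for that row's year) instead of one value per group, which is what makes the element-wise division line up. A small sketch of just that step, starting from an already-counted table; the names counts, annual_total and share are only illustrative:

```python
import pandas as pd

# Counts per (build_year, brand), as produced by the groupby/agg step.
counts = pd.DataFrame({
    "build_year": [2010, 2010, 2010, 2011, 2011, 2012],
    "brand": ["BMW", "Kia", "Mercedes", "Mercedes", "Toyota", "Tesla"],
    "count": [1, 1, 2, 2, 1, 1],
})

# transform("sum") keeps the original row count: every row receives the
# total of the year it belongs to.
annual_total = counts.groupby("build_year")["count"].transform("sum")
print(annual_total.tolist())  # [4, 4, 4, 3, 3, 1]

# Element-wise division now yields each brand's share within its year.
counts["share"] = counts["count"] / annual_total

# Sanity check: the shares within each year should add up to 1.
print(counts.groupby("build_year")["share"].sum())
```

The final groupby is a quick way to confirm that the percentages were computed per year rather than over the whole frame.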